# Activity: Performing regression over RFM segments

We have tried out regression techniques on the entire dataset in class, but how do the insights and performance change when **we focus on certain segments**?

1. Pick an RFM segment of focus using (a) results from rule-based RFM in `cc_rfm.csv` or (b) from k-means in the previous activity

2. Redo all of the steps to perform linear and logistic regression.

* You might want to adjust the spend requirement in logistic regression to accommodate your segment

* Some segments might not have enough points for ML training. Adjust your pick if your segment has < 10 customers

3. Compare your results with the results for the entire dataset. Did you achieve better/worse results? Why or why not?
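Before committing to a segment, it can help to count the customers in each one. The sketch below uses a small made-up table with the same `rfm_level` column as `combined_df.csv`; the helper name `viable_segments` and the toy values are assumptions for illustration, not part of the activity files.

```python
import pandas as pd

# Toy stand-in for the RFM table (values are made up for illustration)
toy_rfm = pd.DataFrame({
    "acct_num": list(range(1, 31)),
    "rfm_level": ["Top"] * 15 + ["Middle"] * 10 + ["Low"] * 5,
})

def viable_segments(rfm, col="rfm_level", min_customers=10):
    """Return the sizes of segments that have at least `min_customers` customers."""
    sizes = rfm[col].value_counts()
    return sizes[sizes >= min_customers]

segment_sizes = viable_segments(toy_rfm)
print(segment_sizes)
```

The same call should work with `col="cluster"` if you pick your segment from the k-means results instead.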

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plt.* plotting calls later on

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error, mean_absolute_error, classification_report, confusion_matrix

In [2]:
# Mount GDrive's folders
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# This code imports a library "os" that allows file navigation
import os
# This code sets the home directory
# Find your folder and put the path here as a string
os.chdir('/content/drive/MyDrive/my_workspace')

**Objective:**

Test if a customer's spending on certain categories over the past 3 quarters can predict their total spending in the current quarter.


Let's set the current quarter as 2021 Q3 (the latest complete quarter in the dataset).

## Read the dataset

In [15]:
df = pd.read_csv("Data/cc_clean.csv")
df.head()

Unnamed: 0,cc_num,gender,city,city_pop,job,dob,acct_num,acct_num2,trans_num,unix_time,category,amt,trans_datetime
0,676000000000.0,M,Dasmarinas,659019,Chartered loss adjuster,12/12/1958,798000000000.0,798000000000,a72eaa86b043eed95b25bbb25b3153a1,1581314011,shopping_net,68.88,2020-02-10 13:53:31
1,3520000000000000.0,M,Digos,169393,"Administrator, charities/voluntary organisations",31/08/1970,968000000000.0,968000000000,060d12f91c13871a13963041736a4702,1590902968,entertainment,50.06,2020-05-31 13:29:28
2,4.14e+18,M,Calapan,133893,Financial controller,23/07/1953,628000000000.0,628000000000,18aafb6098ab0923886c0ac83592ef8d,1585461157,food_dining,105.44,2020-03-29 13:52:37
3,4720000000000000.0,M,Laoag,111125,Dance movement psychotherapist,11/01/1954,257000000000.0,257000000000,c20ee88b451f637bc6893b7460e9fee0,1601282159,gas_transport,82.69,2020-09-28 16:35:59
4,3530000000000000.0,M,City of Paranaque,665822,"Engineer, water",31/07/1961,540000000000.0,540000000000,b389cc449c9c298e8c004024449f7a27,1594960430,shopping_net,363.49,2020-07-17 12:33:50


In [18]:
rfm_df = pd.read_csv("Data/combined_df.csv")
rfm_df.head()

Unnamed: 0,acct_num,recency,recency_score,frequency,frequency_score,total_amt,monetary_score,rfm_score,rfm_level,cluster
0,124000000000.0,24,3,931,3,66457.92,3,9,Top,0
1,169000000000.0,141,1,9,1,2814.6,1,3,Low,1
2,170000000000.0,24,3,890,3,64448.85,3,9,Top,0
3,201000000000.0,25,3,306,2,24489.46,2,7,Top,2
4,203800000000.0,111,1,12,1,8803.87,1,3,Low,1


In [19]:
# Convert to pandas datetimes
df['trans_datetime'] = pd.to_datetime(df['trans_datetime'])
# Convert acct_num to int
df['acct_num'] = df['acct_num'].astype(int)
df.head()

Unnamed: 0,cc_num,gender,city,city_pop,job,dob,acct_num,acct_num2,trans_num,unix_time,category,amt,trans_datetime
0,676000000000.0,M,Dasmarinas,659019,Chartered loss adjuster,12/12/1958,798000000000,798000000000,a72eaa86b043eed95b25bbb25b3153a1,1581314011,shopping_net,68.88,2020-02-10 13:53:31
1,3520000000000000.0,M,Digos,169393,"Administrator, charities/voluntary organisations",31/08/1970,968000000000,968000000000,060d12f91c13871a13963041736a4702,1590902968,entertainment,50.06,2020-05-31 13:29:28
2,4.14e+18,M,Calapan,133893,Financial controller,23/07/1953,628000000000,628000000000,18aafb6098ab0923886c0ac83592ef8d,1585461157,food_dining,105.44,2020-03-29 13:52:37
3,4720000000000000.0,M,Laoag,111125,Dance movement psychotherapist,11/01/1954,257000000000,257000000000,c20ee88b451f637bc6893b7460e9fee0,1601282159,gas_transport,82.69,2020-09-28 16:35:59
4,3530000000000000.0,M,City of Paranaque,665822,"Engineer, water",31/07/1961,540000000000,540000000000,b389cc449c9c298e8c004024449f7a27,1594960430,shopping_net,363.49,2020-07-17 12:33:50


In [11]:
combined = pd.read_csv("Data/combined_df.csv")
combined.head()

Unnamed: 0,acct_num,recency,recency_score,frequency,frequency_score,total_amt,monetary_score,rfm_score,rfm_level,cluster
0,124000000000.0,24,3,931,3,66457.92,3,9,Top,0
1,169000000000.0,141,1,9,1,2814.6,1,3,Low,1
2,170000000000.0,24,3,890,3,64448.85,3,9,Top,0
3,201000000000.0,25,3,306,2,24489.46,2,7,Top,2
4,203800000000.0,111,1,12,1,8803.87,1,3,Low,1


## Prepare the data

1. Filter to only selected categories and rfm segment

In [7]:
selected_categories = ['shopping_health','gas_transport','health_fitness']

In [None]:
# filter by category
 # add code here

In [None]:

 # add code here
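One possible way to apply both filters is sketched below on toy data. The column names follow `cc_clean.csv` and `combined_df.csv`, but the choice of the `'Top'` segment, the toy values, and the variable names `tx`, `rfm`, `segment_accts` and `filtered` are assumptions made for illustration.

```python
import pandas as pd

selected_categories = ["gas_transport", "health_fitness"]

# Toy transactions and RFM table (made-up values)
tx = pd.DataFrame({
    "acct_num": [1, 1, 2, 2, 3],
    "category": ["gas_transport", "shopping_net", "health_fitness",
                 "gas_transport", "gas_transport"],
    "amt": [10.0, 20.0, 30.0, 40.0, 50.0],
})
rfm = pd.DataFrame({"acct_num": [1, 2, 3], "rfm_level": ["Top", "Top", "Low"]})

# Accounts belonging to the chosen segment (assumed here to be 'Top')
segment_accts = rfm.loc[rfm["rfm_level"] == "Top", "acct_num"]

# Keep only rows in the selected categories AND in the chosen segment
filtered = tx[tx["category"].isin(selected_categories)
              & tx["acct_num"].isin(segment_accts)]
print(filtered)
```

In the real data `acct_num` is stored as a float in the RFM table and as an int in the transactions table, so casting both to the same type before matching is a sensible precaution.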

2. Filter to inclusive dates

In [13]:
# Inspect the RFM table without overwriting `df`, which must keep the transactions
combined

Unnamed: 0,acct_num,recency,recency_score,frequency,frequency_score,total_amt,monetary_score,rfm_score,rfm_level,cluster
0,1.240000e+11,24,3,931,3,66457.92,3,9,Top,0
1,1.690000e+11,141,1,9,1,2814.60,1,3,Low,1
2,1.700000e+11,24,3,890,3,64448.85,3,9,Top,0
3,2.010000e+11,25,3,306,2,24489.46,2,7,Top,2
4,2.038000e+11,111,1,12,1,8803.87,1,3,Low,1
...,...,...,...,...,...,...,...,...,...,...
83,9.690000e+11,26,3,291,1,20507.08,1,5,Middle,2
84,9.710000e+11,25,3,907,3,61747.38,3,9,Top,0
85,9.890000e+11,25,3,301,1,19471.66,1,5,Middle,2
86,9.940000e+11,25,3,608,2,39818.18,2,7,Top,2


In [14]:
df["month"] = df.trans_datetime.dt.month #01
df["month_abbr"] = df.trans_datetime.dt.strftime('%b')
# Months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3, 10-12 to Q4
df['quarter'] = 'Q' + ((df['month'] - 1) // 3 + 1).astype(str)

In [None]:
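start_date = pd.to_datetime('2020-10-01')
# Exclusive upper bound so that every transaction on 2021-09-30 is included
end_date = pd.to_datetime('2021-10-01')
# .copy() avoids a SettingWithCopyWarning when adding columns later
df = df[(df['trans_datetime'] >= start_date) & (df['trans_datetime'] < end_date)].copy()
df['trans_datetime'].min(), df['trans_datetime'].max()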

3. Get total spending in quarter = `'2021Q3'` for each customer

In [None]:
df['quarter'] = df['trans_datetime'].dt.to_period('Q').astype(str)
df.head()

In [None]:
total_df = # add code here
total_df

4. Get total spending for selected categories for quarters=`['2020Q4','2021Q1','2021Q2']` for each customer



In [None]:
# Exclusive upper bound so that every transaction on 2021-06-30 is included
cutoff_date = pd.to_datetime('2021-07-01')
data = df[df['trans_datetime'] < cutoff_date]

In [None]:
category_df = data.groupby(['acct_num','category','quarter'])['amt'].agg(['count','sum'])
category_df = category_df.reset_index()
category_df = category_df.rename(columns={'sum':'total'})
category_df

5. Reshape table so each category has its own column

In [None]:
pivot_category_df = category_df.pivot(index='acct_num', columns=['category','quarter'], values=['count','total']).fillna(0)
pivot_category_df

In [None]:
#flatten columns
pivot_category_df.columns = ['_'.join(col) for col in pivot_category_df.columns]
pivot_category_df

6. Join total spending table with total spending per category table

In [None]:
print(len(total_df), len(pivot_category_df))
# use join instead of merge if you have used a non-default index
total_df = total_df.join(pivot_category_df, how='inner')
print(len(total_df))
total_df.head()

In [None]:
total_df.hist(figsize=(12,10))

## 1. Predict using Linear Regression

> Can we predict the total spending amount for 2021Q3 given spending from selected categories?


In [None]:
# Declare columns to use as features (input)
feature_cols = ['count_gas_transport_2020Q4', 'count_gas_transport_2021Q1',
       'count_gas_transport_2021Q2', 'count_health_fitness_2020Q4',
       'count_health_fitness_2021Q1', 'count_health_fitness_2021Q2',
       'total_gas_transport_2020Q4', 'total_gas_transport_2021Q1',
       'total_gas_transport_2021Q2', 'total_health_fitness_2020Q4',
       'total_health_fitness_2021Q1', 'total_health_fitness_2021Q2']

In [None]:
# Declare input and target variables
X = total_df[feature_cols]
y = total_df['total_amt']

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
# Initialize model
model = LinearRegression()

In [None]:
# Fit the model to the training data
model.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = model.predict(X_test)

In [None]:
# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')

In [None]:
# Helper function to calculate errors
def calculate_errors(y_test, y_pred):
  mse = mean_squared_error(y_test, y_pred)
  rmse = np.sqrt(mse)
  mae = mean_absolute_error(y_test, y_pred)
  mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
  return rmse, mae, mape

In [None]:
# Evaluate the model's performance
rmse, mae, mape = calculate_errors(y_test, y_pred)
print(f'Root Mean Square Error: {rmse:.2f}')
print(f'Mean Absolute Error: {mae:.2f}')
print(f'Mean Absolute Percentage Error: {mape:.2f}')

In [None]:
# View the slopes (coefficients) for each feature
coefficients = model.coef_
print("Slopes (coefficients) for each feature:")
for i, coef in enumerate(coefficients):
    print(f"Feature {feature_cols[i]}: {coef:.4f}")

In [None]:
# plot actual and predicted
plt.scatter(y_test,y_pred,s=20)
plt.xlabel('Actual Spending')
plt.ylabel('Predicted Spending')
# y = x reference line
plt.plot(range(0,15000,1000),range(0,15000,1000), ls='--')

## 2. Predict using Logistic Regression
> Can we predict whether the total spending for 2021Q3 will exceed the spend requirement for a promo, given spending from selected categories?

Let's set the spend requirement as 8,000 USD


In [None]:
# create total_amt_cat column
total_df['total_amt_cat'] =  (total_df['total_amt']>8000).astype(int)
total_df.head()
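A fixed cut-off may leave a segment with only one class (for example, nobody in a low-spending segment reaching 8,000 USD), and scikit-learn's `LogisticRegression` cannot be fitted when the training labels contain a single class. The sketch below shows one possible adjustment, using the segment's median as the requirement. This is an assumption for illustration rather than a prescribed rule, and `toy_totals` holds made-up values.

```python
import pandas as pd

# Made-up 2021Q3 totals for a low-spending segment
toy_totals = pd.Series([500.0, 900.0, 1200.0, 1500.0, 2500.0, 3000.0],
                       name="total_amt")

# Share of customers that meet the fixed requirement
fixed_requirement = 8000
share_fixed = (toy_totals > fixed_requirement).mean()
print(f"Share above fixed requirement: {share_fixed:.2f}")

# Alternative (an assumption): use the segment median as the requirement,
# which splits the segment into two roughly equal classes
segment_requirement = toy_totals.median()
labels = (toy_totals > segment_requirement).astype(int)
print(segment_requirement, labels.value_counts().to_dict())
```

Whatever requirement you choose, checking `value_counts()` on the resulting label column before splitting is a quick way to confirm that both classes are present.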

In [None]:
# Declare input and target variables
X = total_df[feature_cols]
y = total_df['total_amt_cat']

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
# Initialize model
model = LogisticRegression()

In [None]:
# Fit the model to the training data
model.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = model.predict(X_test)

In [None]:
# Confusion matrix
print("Confusion Matrix:")
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

In [None]:
plt.figure(figsize=(4,4))
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d", xticklabels=['below reqts', 'met reqts'], yticklabels=['below reqts', 'met reqts'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')

In [None]:
# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
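For the comparison in step 3 of the activity, it may help to run the same procedure on the full table and on each segment and put the scores side by side. The sketch below does this for linear regression on synthetic data. The helper `fit_and_score`, the `segment` column, the fixed `random_state`, and all values are assumptions for illustration; with real data the segment label would come from `rfm_level` or `cluster`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(data, feature_cols, target_col, seed=0):
    """Fit a linear regression on a 70/30 split and return the test R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[feature_cols], data[target_col], test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Synthetic data: the target is a noisy linear function of one feature
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "total_gas_transport_2021Q2": rng.uniform(0, 1000, n),
    "segment": rng.choice(["Top", "Low"], n),
})
toy["total_amt"] = 3 * toy["total_gas_transport_2021Q2"] + rng.normal(0, 50, n)

features = ["total_gas_transport_2021Q2"]
results = {"all": fit_and_score(toy, features, "total_amt")}
for name, group in toy.groupby("segment"):
    results[name] = fit_and_score(group, features, "total_amt")

comparison = pd.Series(results, name="r2")
print(comparison)
```

Fixing `random_state` keeps the train/test split the same across runs, so differences between rows reflect the data rather than the split. With small segments the test score can still vary a lot from one seed to another.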