### Feature Engineering

**RFM** - the approach that underlies many feature engineering methods\
**Recency** - time since the customer's last transaction\
**Frequency** - number of purchases in the observed period\
**Monetary** - total amount spent in the observed period
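
To make the three definitions concrete, here is a minimal plain-Python sketch that computes them for a handful of made-up transactions. The rows, the `rfm` helper and the variable names are illustrative assumptions and are not taken from `online.csv`; the notebook itself does the same aggregation with pandas.

```python
from datetime import date

# Hypothetical transactions: (customer_id, invoice_no, invoice_date, amount)
transactions = [
    ("A", "inv1", date(2011, 9, 1), 20.0),
    ("A", "inv2", date(2011, 10, 15), 35.5),
    ("A", "inv2", date(2011, 10, 15), 4.5),  # second line item of the same invoice
    ("B", "inv3", date(2011, 6, 30), 12.0),
]
snapshot = date(2011, 11, 1)  # reference date used to measure recency


def rfm(rows, snapshot_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen, invoices, spend = {}, {}, {}
    for cust, invoice, day, amount in rows:
        # Recency needs the most recent purchase date per customer
        last_seen[cust] = max(last_seen.get(cust, day), day)
        # Frequency counts distinct invoices, not line items
        invoices.setdefault(cust, set()).add(invoice)
        # Monetary is the total spend in the observed period
        spend[cust] = spend.get(cust, 0.0) + amount
    return {
        cust: ((snapshot_date - last_seen[cust]).days, len(invoices[cust]), spend[cust])
        for cust in last_seen
    }


rfm_table = rfm(transactions, snapshot)
for cust, (recency, frequency, monetary) in sorted(rfm_table.items()):
    print(cust, recency, frequency, monetary)
```

Counting distinct invoice numbers rather than rows keeps a multi-item invoice from inflating frequency.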

### Prepare the data for modeling - Data Wrangling

In [1]:
# read the file and clean up the column name
import pandas as pd
online = pd.read_csv('online.csv')
online = online.rename(columns={'  TotalSum': 'TotalSum'})

In [2]:
# Exclude the target month (2011-11) from the data used to build features
online_X = online[online['InvoiceMonth']!= '2011-11']
online_X = online_X.reset_index(drop = True)

In [3]:
# Define snapshot date
import datetime as dt
NOW = dt.datetime(2011,11,1)

In [4]:
# change InvoiceDate to date type
online_X['InvoiceDate'] = online_X['InvoiceDate'].astype('datetime64[ns]')

In [5]:
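# feature engineering: one row per customer with RFM and quantity features
import numpy as np
features = online_X.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (NOW - x.max()).days,  # recency: days since last purchase
    'InvoiceNo': 'nunique',                         # frequency: distinct invoices
    'TotalSum': 'sum',                              # monetary: total spend
    'Quantity': ['mean', 'sum']                     # average and total quantity
}).reset_index()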

In [6]:
# rename the aggregated columns to readable feature names
features.columns = ['CustomerID', 'recency', 'frequency', 'monetary',
                    'quantity_avg', 'quantity_total']

In [7]:
# show feature head
features.head()

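|   | CustomerID | recency | frequency | monetary | quantity_avg | quantity_total |
|---|---|---|---|---|---|---|
| 0 | 12748 | 110 | 1 | 4.15 | 1 | 1 |
| 1 | 12867 | 225 | 1 | 5.9 | 2 | 2 |
| 2 | 12902 | 226 | 1 | 5.04 | 12 | 12 |
| 3 | 12952 | 40 | 1 | 12.48 | 6 | 6 |
| 4 | 12963 | 15 | 1 | 17.0 | 4 | 4 |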


In [8]:
# create a pivot of monthly purchases
cust_month_tx = pd.pivot_table(data = online, index = ['CustomerID'],
                              values = 'InvoiceNo', columns = ['InvoiceMonth'],
                              aggfunc = pd.Series.nunique, fill_value = 0)
# pivot table of unique customer purchases per month
cust_month_tx.head()

| CustomerID / InvoiceMonth | 2010-12 | 2011-01 | 2011-02 | 2011-03 | 2011-04 | 2011-05 | 2011-06 | 2011-07 | 2011-08 | 2011-09 | 2011-10 | 2011-11 | 2011-12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12748 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 12867 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12902 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12952 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12963 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
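
As a rough illustration of what the pivot above computes, the following plain-Python sketch counts distinct invoices per customer and month and fills missing combinations with 0. The sample rows and the `monthly_invoice_counts` helper are hypothetical; the notebook relies on `pd.pivot_table` for the real data.

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, invoice_month, invoice_no)
rows = [
    (12748, "2011-07", "A1"),
    (12748, "2011-07", "A1"),  # same invoice, second line item
    (12867, "2011-03", "B1"),
    (12867, "2011-11", "B2"),
]


def monthly_invoice_counts(rows):
    """Return (months, {customer_id: [distinct invoice count per month]})."""
    months = sorted({month for _, month, _ in rows})
    customers = sorted({cust for cust, _, _ in rows})
    seen = defaultdict(set)
    for cust, month, invoice in rows:
        seen[(cust, month)].add(invoice)  # a set keeps only distinct invoices
    # Months with no purchases get a count of 0, like fill_value = 0
    table = {c: [len(seen.get((c, m), ())) for m in months] for c in customers}
    return months, table


months, table = monthly_invoice_counts(rows)
print(months)
for cust, counts in table.items():
    print(cust, counts)
```

The target used later in the notebook is simply one column of this table, the counts for `2011-11`.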


In [22]:
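# column names for the customer key and the target month
custid = ['CustomerID']
target = ['2011-11']

In [23]:
# target: number of distinct invoices per customer in 2011-11
Y = cust_month_tx[target]

In [20]:
# join the target with the features on CustomerID (inner join keeps customers present in both)
X = pd.merge(Y, features, left_index=True, right_on='CustomerID')

In [32]:
# take the target back out of the merged frame so its rows stay aligned with X
Y = X[['2011-11','CustomerID']]

In [35]:
Y = Y.set_index(['CustomerID'])

In [39]:
# keep only the feature columns in X
X = X.drop(['2011-11','CustomerID'],axis = 1)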

In [40]:
# train_test_split using the cleaned up data
from sklearn.model_selection import train_test_split
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size = 0.25,
                                                   random_state = 99)

In [50]:
test_X

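|   | recency | frequency | monetary | quantity_avg | quantity_total |
|---|---|---|---|---|---|
| 21 | 7 | 1 | 12.6 | 6 | 6 |
| 33 | 246 | 1 | 12.6 | 6 | 6 |
| 24 | 163 | 1 | 7.9 | 1 | 1 |
| 22 | 236 | 1 | 9.9 | 6 | 6 |
| 38 | 61 | 1 | 15.0 | 4 | 4 |
| 37 | 226 | 1 | 0.85 | 1 | 1 |
| 39 | 111 | 1 | 13.68 | 72 | 72 |
| 45 | 331 | 1 | 2.1 | 1 | 1 |
| 13 | 54 | 1 | 17.0 | 4 | 4 |
| 19 | 334 | 1 | 15.12 | 36 | 36 |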


In [49]:
test_Y

| CustomerID | 2011-11 |
|---|---|
| 14286 | 0 |
| 16187 | 0 |
| 14546 | 0 |
| 14334 | 0 |
| 16729 | 0 |
| 16710 | 0 |
| 16843 | 0 |
| 17796 | 0 |
| 13769 | 0 |
| 14060 | 0 |


### Build the OLS model

In [42]:
# Import the linear regression module
from sklearn.linear_model import LinearRegression

# Initialize the regression instance
linreg = LinearRegression()

# Fit model on the training data
linreg.fit(train_X, train_Y)

# Predict values on both training and testing data
train_pred_Y = linreg.predict(train_X)
test_pred_Y = linreg.predict(test_X)

In [43]:
# Import performance measurement functions
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Calculate metrics for training data
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
mae_train = mean_absolute_error(train_Y, train_pred_Y)

# Calculate metrics for testing data
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
mae_test = mean_absolute_error(test_Y, test_pred_Y)

# Print performance metrics
print('RMSE train: {:.3f}; RMSE test: {:.3f}\nMAE train: {:.3f}, MAE test: {:.3f}'.format(
    rmse_train, rmse_test, mae_train, mae_test))

RMSE train: 0.207; RMSE test: 0.355
MAE train: 0.102, MAE test: 0.171
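
For reference, the two metrics reported above can be written out by hand. This is a small sketch on made-up numbers, not a recomputation of the notebook's results; the `rmse` and `mae` helpers and the sample values are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical purchase counts and model predictions
actual = [0, 0, 1, 0]
predicted = [0.0, 0.5, 0.5, 0.0]

print('RMSE: {:.3f}, MAE: {:.3f}'.format(rmse(actual, predicted), mae(actual, predicted)))
```

RMSE is never smaller than MAE on the same errors, and a wide gap between the two points to a few large misses rather than many small ones.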


In [45]:
# Import the library
import statsmodels.api as sm

# Convert the target variable to a NumPy array
train_Y = np.array(train_Y)

# Initialize and fit the model
# (sm.OLS does not add an intercept automatically; only the columns of train_X are used)
olsreg = sm.OLS(train_Y, train_X)
olsreg = olsreg.fit()

# Print model summary
print(olsreg.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.141
Model:                            OLS   Adj. R-squared:                  0.065
Method:                 Least Squares   F-statistic:                     1.856
Date:                Sat, 04 Jan 2020   Prob (F-statistic):              0.156
Time:                        21:04:24   Log-Likelihood:                 5.9334
No. Observations:                  38   AIC:                            -3.867
Df Residuals:                      34   BIC:                             2.684
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
recency         9.228e-05      0.000      0.

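#### The model predicts poorly: R-squared is only 0.141, the F-test is not significant (p = 0.156), and the test RMSE (0.355) is well above the train RMSE (0.207). This is expected, because the data is just a small sample of the original data (38 training observations). In an ideal situation, the model would have a higher R-squared and would depend on recency, frequency and quantity_total.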
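
The gap between R-squared and adjusted R-squared in the summary above follows from the small sample. Here is a short sketch of the arithmetic, using only the figures printed in the summary (R-squared 0.141, 38 observations, 34 residual degrees of freedom). The `adjusted_r_squared` helper is illustrative, and the intercept-adjusted form is an assumption, chosen because it is the form that reproduces the reported value.

```python
def adjusted_r_squared(r2, n_obs, df_resid, has_constant=True):
    """Adjusted R-squared: penalizes R-squared for the parameters spent."""
    k = 1 if has_constant else 0  # assumption: the model is treated as having a constant
    return 1.0 - (n_obs - k) / df_resid * (1.0 - r2)


# Figures taken from the OLS summary in this notebook
adj = adjusted_r_squared(0.141, n_obs=38, df_resid=34)
print('Adj. R-squared: {:.3f}'.format(adj))
```

With only 34 residual degrees of freedom, the penalty removes more than half of an already low R-squared.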

#### The ultimate purpose of this model is to identify high-value customers from their purchase history. Acting on the result is less straightforward, because the marketing team still has to decide whom to target: existing high-value customers, to get more value out of them, or low-value customers, through a re-targeting campaign.