# Customer Purchase Prediction

### Data
This dataset records whether consumers purchased in response to a test mailing of a catalog and how much each consumer spent. It has two possible outcome variables: Purchase (0/1 value: whether or not a purchase was made) and Spending (numeric value: amount spent).

### Goal
Build numeric prediction models that predict Spending based on the other available customer information.

### Result
1. Explore the data.

2. Model on all data (linear regression, KNN, regression tree, SVM regression, ensembling models, neural network).

3. Model on the "restricted" dataset, which includes only purchase records (i.e., where Purchase = 1).

4. Compare performance between steps 2 and 3.

### Variable Description
US -- a US address? (binary, 1/0)

2-16 Source_* -- Source catalog for the record (binary, 1/0)

Freq. -- Number of transactions in last year at source catalog (numeric)

last_update_days_ago -- How many days ago was last update to cust. record (numeric)

1st_update_days_ago -- How many days ago was 1st update to cust. record (numeric)

Web_order -- Customer placed at least 1 order via web (binary, 1/0)

Gender=male -- Customer is male (binary, 1/0)

Address_is_res -- Address is a residence (binary, 1/0)

Purchase -- Person made purchase in test mailing (binary, 1/0)

Spending -- Amount spent by customer in test mailing ($) (numeric)

## Data Processing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, ElasticNet
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.ensemble import StackingRegressor, StackingClassifier, RandomForestRegressor

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor, KerasClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.utils import to_categorical
import keras_tuner as kt

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_excel('HW3.xlsx')
data.drop('sequence_number', axis=1, inplace=True)
data.head()

Unnamed: 0,US,source_a,source_c,source_b,source_d,source_e,source_m,source_o,source_h,source_r,...,source_x,source_w,Freq,last_update_days_ago,1st_update_days_ago,Web order,Gender=male,Address_is_res,Purchase,Spending
0,1,0,0,1,0,0,0,0,0,0,...,0,0,2,3662,3662,1,0,1,1,127.87
1,1,0,0,0,0,1,0,0,0,0,...,0,0,0,2900,2900,1,1,0,0,0.0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,2,3883,3914,0,0,0,1,127.48
3,1,0,1,0,0,0,0,0,0,0,...,0,0,1,829,829,0,1,0,0,0.0
4,1,0,1,0,0,0,0,0,0,0,...,0,0,1,869,869,0,0,0,0,0.0


## Exploratory Data Analysis

In [3]:
data.describe()

Unnamed: 0,US,source_a,source_c,source_b,source_d,source_e,source_m,source_o,source_h,source_r,...,source_x,source_w,Freq,last_update_days_ago,1st_update_days_ago,Web order,Gender=male,Address_is_res,Purchase,Spending
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.8245,0.1265,0.056,0.06,0.0415,0.151,0.0165,0.0335,0.0525,0.0685,...,0.018,0.1375,1.417,2155.101,2435.6015,0.426,0.5245,0.221,0.5,102.560745
std,0.380489,0.332495,0.229979,0.237546,0.199493,0.358138,0.12742,0.179983,0.223089,0.252665,...,0.132984,0.344461,1.405738,1141.302846,1077.872233,0.494617,0.499524,0.415024,0.500125,186.749816
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1133.0,1671.25,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,2280.0,2721.0,0.0,1.0,0.0,0.5,1.855
75%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,3139.25,3353.0,1.0,1.0,0.0,1.0,152.5325
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,15.0,4188.0,4188.0,1.0,1.0,1.0,1.0,1500.06


In [4]:
n_samples, n_features = data.shape
print ('The dimensions of the data set are', n_samples, 'by', n_features)

The dimensions of the data set are 2000 by 24


In [5]:
X = data.drop(['Spending','Purchase'],axis=1)
y = data[['Spending']]

In [6]:
# categorical (binary indicator) columns
cat_col = ['US', 'source_a', 'source_c', 'source_b', 'source_d', 'source_e', 'source_m', 'source_o', 'source_h', 'source_r', 'source_s', 'source_t',
           'source_u', 'source_p', 'source_x', 'source_w', 'Web order', 'Gender=male', 'Address_is_res']

# Convert all indicator columns to the category dtype in one assignment
X[cat_col] = X[cat_col].astype("category")

In [7]:
# numeric
num_col=["Freq", "last_update_days_ago", "1st_update_days_ago"]

## Normalization

In [8]:
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X[num_col])
X_normalized = pd.DataFrame(X_normalized, columns=num_col)

In [10]:
# Keep only the categorical columns
X = X.drop(columns=num_col)
# Recombine: categorical + scaled numeric columns
X = pd.concat([X, X_normalized], axis=1)
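
The scaler above is fit on all rows before the train/test split, so the test rows contribute to the scaling statistics. A minimal sketch of one way to keep scaling inside the training data, assuming scikit-learn's `ColumnTransformer` and `Pipeline`; the frame `toy` and the names `preprocess` and `model` are hypothetical and the values are synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the catalog data (made-up values, illustration only).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "US": rng.integers(0, 2, 200),
    "Freq": rng.integers(0, 10, 200),
    "last_update_days_ago": rng.integers(1, 4000, 200),
    "Spending": rng.gamma(2.0, 50.0, 200),
})
num_col = ["Freq", "last_update_days_ago"]

X_toy = toy.drop(columns="Spending")
y_toy = toy["Spending"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Scale only the numeric columns; binary columns pass through unchanged.
preprocess = ColumnTransformer(
    [("num", StandardScaler(), num_col)],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("reg", ElasticNet(random_state=42))])

# The scaler's mean and variance are learned from the training rows only.
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
```

Passing a pipeline like this to `GridSearchCV` would also refit the scaler inside every cross-validation fold.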

# Modeling (a): all data

## Split training and testing set

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [12]:
for col in cat_col:
    X_train[col] = X_train[col].astype("float")
    X_test[col]= X_test[col].astype("float")

## Score

In [13]:
def MSE(y_test, y_pred):
    mse = mean_squared_error(y_test, y_pred)
    return mse

def score():
    return make_scorer(MSE, greater_is_better=False)
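
Because the scorer is built with `greater_is_better=False`, grid search reports the negative MSE, which is why the cells below negate `best_score_` before taking the square root. scikit-learn also offers the same metric as the built-in scoring string `'neg_mean_squared_error'`. The sketch below checks on synthetic data that the two agree; `custom_scorer` and the toy arrays are illustrative names only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=120)

cv = KFold(n_splits=4, shuffle=True, random_state=42)
custom_scorer = make_scorer(mean_squared_error, greater_is_better=False)

custom = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv, scoring=custom_scorer)
builtin = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv,
                          scoring="neg_mean_squared_error")

# Both arrays hold negative MSE per fold; negate before the square root to get RMSE.
rmse_per_fold = np.sqrt(-builtin)
```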

## Cross-validation for inner and outer loops

In [14]:
i = 42
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)
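
With two splitters, hyperparameters are tuned on the inner folds and the tuned model is then scored on outer folds that the tuning step has not seen. A minimal sketch of that nested pattern on synthetic data; the toy arrays and the name `tuned_knn` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustration only).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(160, 4))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.2, size=160)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=42)

# Inner loop: choose n_neighbors by negative MSE.
tuned_knn = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(2, 10))},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Outer loop: each outer fold reruns the whole grid search on its training part.
outer_scores = cross_val_score(tuned_knn, X_toy, y_toy, cv=outer_cv,
                               scoring="neg_mean_squared_error")
outer_rmse = np.sqrt(-outer_scores.mean())
```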

## Linear regression (ElasticNet)

In [15]:
lr = ElasticNet(random_state = 42)
ratio = [0.1, 0.3, 0.5, 0.7, 0.9]
lr_grid = {'l1_ratio':ratio}
         
lr_clf = GridSearchCV(estimator=lr, param_grid=lr_grid,cv=inner_cv, scoring = score())
lr_pred = lr_clf.fit(X_train, y_train)

print('best params: ', lr_pred.best_params_)
print('rmse: ', np.sqrt(-lr_pred.best_score_))

best params:  {'l1_ratio': 0.9}
rmse:  129.62258077177015


## K-NN

In [16]:
k = list(range(2,10))
knn_grid = {'n_neighbors':k}

knn = KNeighborsRegressor()
knn_clf = GridSearchCV(knn, knn_grid, cv=inner_cv, scoring=score())
knn_pred = knn_clf.fit(X_train, y_train)

print('best params: ', knn_pred.best_params_)
print('rmse: ', np.sqrt(-knn_pred.best_score_))

best params:  {'n_neighbors': 8}
rmse:  130.3426530853851


## Regression tree

In [17]:
depth = list(range(2,6))
rt_grid = {'max_depth':depth}

tree = DecisionTreeRegressor()
rt_clf = GridSearchCV(tree, rt_grid, cv=inner_cv, scoring=score())
# Note: unlike the other models, this search is fit on the full X, y rather than X_train, y_train.
rt_pred = rt_clf.fit(X, y)

print('best params: ', rt_pred.best_params_)
print('rmse: ', np.sqrt(-rt_pred.best_score_))

best params:  {'max_depth': 3}
rmse:  137.8310639730004


## SVM regression

In [18]:
kernel = ['rbf']
c = [1, 10, 100]
g =  ['scale', 'auto']

svr_grid = {'kernel':kernel, 'C':c, 'gamma':g} 
svr = SVR()
svr_clf = GridSearchCV(svr, svr_grid, cv=inner_cv, scoring=score())
svr_pred = svr_clf.fit(X_train, y_train)

print('best params: ', svr_pred.best_params_)
print('rmse:' , np.sqrt(-svr_pred.best_score_))

best params:  {'C': 100, 'gamma': 'auto', 'kernel': 'rbf'}
rmse: 134.96664056364997


## Ensembling models

Stack linear regression, regression tree, KNN and SVR as base learners, with a random forest as the final estimator, and tune the `max_depth` hyperparameter of the regression tree.

In [19]:
estimator = [('linear regression', ElasticNet(l1_ratio= 0.9)),
             ('regression tree', DecisionTreeRegressor()),
             ('knn', KNeighborsRegressor(n_neighbors=8)),
             ('svr', SVR(C=100, gamma= 'auto', kernel= 'rbf'))]

srlf = StackingRegressor(estimators= estimator, final_estimator=RandomForestRegressor())
grid = {'regression tree__max_depth':list(range(2,6))}
search=GridSearchCV(estimator=srlf, param_grid=grid,cv = inner_cv, scoring = score())
result=search.fit(X_train,y_train)

print('best params: ', result.best_params_)
print('rmse:' , np.sqrt(-result.best_score_))

best params:  {'regression tree__max_depth': 4}
rmse: 128.43318879681942


## Neural Network

In [20]:
def create_model(activation, nb_hidden):
    model = Sequential()
    model.add(Dense(nb_hidden, input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
    return model

act = ['relu','tanh']
hidden = [64, 128, 256]
epoch=list(range(3, 10))

NN_grid = {'activation':act,'nb_hidden':hidden, 'epochs':epoch}
NN = KerasRegressor(build_fn=create_model, batch_size=256, verbose=0)

In [21]:
NN_clf = GridSearchCV(estimator=NN, param_grid=NN_grid, scoring = score(), cv=5)
NN_pred=NN_clf.fit(X_train, y_train)

print('best params: ', NN_pred.best_params_)
print('rmse:' , np.sqrt(-NN_pred.best_score_))

best params:  {'activation': 'tanh', 'epochs': 9, 'nb_hidden': 256}
rmse: 211.42966178602362


## Compare scores between models

In [22]:
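# Outer-loop scores; with no `scoring` argument, each GridSearchCV is scored with its own scorer (negative MSE).
# Note: X_normalized here holds only the three scaled numeric columns; the combined feature matrix is X.
lr_score = cross_val_score(lr_clf, X=X_normalized, y=y, cv=outer_cv)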
knn_score = cross_val_score(knn_clf, X=X_normalized, y=y, cv=outer_cv)
rt_score = cross_val_score(rt_clf , X=X_normalized, y=y, cv=outer_cv)
svm_score = cross_val_score(svr_clf, X=X_normalized, y=y, cv=outer_cv)
stack_score = cross_val_score(search, X=X_normalized, y=y, cv=outer_cv)

In [23]:
# Mean outer-fold score per model (named `scores` so it does not shadow the score() helper)
scores = {}
scores['linear regression'] = lr_score.mean()
scores['KNN'] = knn_score.mean()
scores['Regression tree'] = rt_score.mean()
scores['SVR'] = svm_score.mean()
scores['Stack'] = stack_score.mean()
scores

{'linear regression': -18335.41238831878,
 'KNN': -18087.422392807333,
 'Regression tree': -20912.865511454293,
 'SVR': -20894.365836420966,
 'Stack': -20375.502124324896}

The KNN model has the lowest MSE (about 18,087). The scores above are negative MSE, so the value closest to zero is best.
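
To read these scores in dollars, they can be negated and square-rooted into RMSE. A short sketch using the mean scores printed above (the dictionary names are illustrative):

```python
import math

# Mean outer-fold scores reported above (negative MSE).
neg_mse = {
    "linear regression": -18335.41238831878,
    "KNN": -18087.422392807333,
    "Regression tree": -20912.865511454293,
    "SVR": -20894.365836420966,
    "Stack": -20375.502124324896,
}

# Negate to get MSE, then take the square root to get RMSE.
rmse = {name: math.sqrt(-s) for name, s in neg_mse.items()}
best = min(rmse, key=rmse.get)

# Print models from lowest to highest RMSE.
for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.1f}")
```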

# Modeling (b): Purchase = 1

In [26]:
datab = data[data['Purchase'] == 1]

In [27]:
X = datab.drop(['Spending','Purchase'],axis=1)
y = datab[['Spending']]

In [28]:
# categorical (binary indicator) columns
cat_col = ['US', 'source_a', 'source_c', 'source_b', 'source_d', 'source_e', 'source_m', 'source_o', 'source_h', 'source_r', 'source_s', 'source_t',
           'source_u', 'source_p', 'source_x', 'source_w', 'Web order', 'Gender=male', 'Address_is_res']

# Convert all indicator columns to the category dtype in one assignment
X[cat_col] = X[cat_col].astype("category")
    
X.reset_index(inplace=True)
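
Note that `reset_index` without `drop=True` keeps the old row labels as a new column named `index`, which then stays in `X` as a feature. A small sketch with a toy frame (`toy`, `buyers`, `kept` and `dropped` are hypothetical names) shows the difference:

```python
import pandas as pd

# Toy frame standing in for the catalog data (illustration only).
toy = pd.DataFrame({"Purchase": [1, 0, 1, 1], "Freq": [2, 0, 1, 3]})
buyers = toy[toy["Purchase"] == 1]

kept = buyers.reset_index()              # old row labels become an 'index' column
dropped = buyers.reset_index(drop=True)  # old row labels are discarded

print(list(kept.columns))     # ['index', 'Purchase', 'Freq']
print(list(dropped.columns))  # ['Purchase', 'Freq']
```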

In [29]:
# numeric
num_col=["Freq", "last_update_days_ago", "1st_update_days_ago"]

## Normalization

In [30]:
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X[num_col])
X_normalized = pd.DataFrame(X_normalized, columns=num_col)

In [31]:
# Keep only the non-numeric columns
X = X.drop(columns=num_col)
# Recombine: non-numeric + scaled numeric columns
X_normalized = pd.concat([X, X_normalized], axis=1)

## Split training and testing set

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)

## Score

In [33]:
def MSE(y_test, y_pred):
    mse = mean_squared_error(y_test, y_pred)
    return mse

def score():
    return make_scorer(MSE, greater_is_better=False)

## Cross-validation for inner and outer loops

In [37]:
i = 42
inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

## Linear regression (ElasticNet)

In [35]:
lr = ElasticNet(random_state = 42)
ratio = [0.1, 0.3, 0.5, 0.7, 0.9]
lr_grid = {'l1_ratio':ratio}
         
lr_clf = GridSearchCV(estimator=lr, param_grid=lr_grid,cv=inner_cv, scoring = score())
lr_pred = lr_clf.fit(X_train, y_train)

print('best params: ', lr_pred.best_params_)
print('rmse: ', np.sqrt(-lr_pred.best_score_))

best params:  {'l1_ratio': 0.9}
rmse:  160.9491083348334


## K-NN

In [36]:
k = list(range(2,10))
knn_grid = {'n_neighbors':k}

knn = KNeighborsRegressor()
knn_clf = GridSearchCV(knn, knn_grid, cv=inner_cv, scoring=score())
knn_pred = knn_clf.fit(X_train, y_train)

print('best params: ', knn_pred.best_params_)
print('rmse: ', np.sqrt(-knn_pred.best_score_))

best params:  {'n_neighbors': 9}
rmse:  221.7687165879512


## Regression tree

In [38]:
depth = list(range(2,6))
rt_grid = {'max_depth':depth}

tree = DecisionTreeRegressor()
rt_clf = GridSearchCV(tree, rt_grid, cv=inner_cv, scoring=score())
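# Note: this search is fit on X, which at this point holds only the non-numeric columns, rather than on X_train, y_train.
rt_pred = rt_clf.fit(X, y)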

print('best params: ', rt_pred.best_params_)
print('rmse: ', np.sqrt(-rt_pred.best_score_))

best params:  {'max_depth': 2}
rmse:  227.68424077233254


## SVM regression

In [39]:
kernel = ['rbf']
c = [1, 10, 100]
g =  ['scale', 'auto']

svr_grid = {'kernel':kernel, 'C':c, 'gamma':g} 
svr = SVR()
svr_clf = GridSearchCV(svr, svr_grid, cv=inner_cv, scoring=score())
svr_pred = svr_clf.fit(X_train, y_train)

print('best params: ', svr_pred.best_params_)
print('rmse:' , np.sqrt(-svr_pred.best_score_))

best params:  {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
rmse: 212.662801951569


## Ensembling models

In [40]:
estimator = [('linear regression', ElasticNet(l1_ratio= 0.9)),
             ('regression tree', DecisionTreeRegressor()),
             ('knn', KNeighborsRegressor(n_neighbors=9)),
             ('svr', SVR(C=1, gamma= 'scale', kernel= 'rbf'))]

srlf = StackingRegressor(estimators= estimator, final_estimator=RandomForestRegressor())
grid = {'regression tree__max_depth':list(range(2,6))}
search=GridSearchCV(estimator=srlf, param_grid=grid,cv = inner_cv, scoring = score())
result=search.fit(X_train,y_train)

print('best params: ', result.best_params_)
print('rmse:' , np.sqrt(-result.best_score_))

best params:  {'regression tree__max_depth': 3}
rmse: 164.56686855550615


## Neural Network

In [41]:
def create_model(activation, nb_hidden):
    model = Sequential()
    model.add(Dense(nb_hidden, input_dim=X_train.shape[1], activation=activation))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
    return model

act = ['relu','tanh']
hidden = [64, 128, 256]
epoch=list(range(3, 10))

NN_grid = {'activation':act,'nb_hidden':hidden, 'epochs':epoch}
NN = KerasRegressor(build_fn=create_model, batch_size=256, verbose=0)

In [42]:
NN_clf = GridSearchCV(estimator=NN, param_grid=NN_grid, scoring = score(), cv=inner_cv)
NN_pred=NN_clf.fit(X_train, y_train)

print('best params: ', NN_pred.best_params_)
print('rmse:' , np.sqrt(-NN_pred.best_score_))

best params:  {'activation': 'relu', 'epochs': 4, 'nb_hidden': 256}
rmse: 230.27771737924004


## Compare scores between models

In [43]:
lr_score = cross_val_score(lr_clf, X=X_normalized, y=y, cv=outer_cv)
knn_score = cross_val_score(knn_clf, X=X_normalized, y=y, cv=outer_cv)
rt_score = cross_val_score(rt_clf , X=X_normalized, y=y, cv=outer_cv)
svm_score = cross_val_score(svr_clf, X=X_normalized, y=y, cv=outer_cv)
stack_score = cross_val_score(search, X=X_normalized, y=y, cv=outer_cv)

In [44]:
# Mean outer-fold score per model (named `scores` so it does not shadow the score() helper)
scores = {}
scores['linear regression'] = lr_score.mean()
scores['KNN'] = knn_score.mean()
scores['Regression tree'] = rt_score.mean()
scores['SVR'] = svm_score.mean()
scores['Stack'] = stack_score.mean()
scores

{'linear regression': -27486.51821305923,
 'KNN': -53244.165965188884,
 'Regression tree': -32878.97117092854,
 'SVR': -51678.047770637604,
 'Stack': -29872.83210105402}

The linear regression model has the lowest MSE (about 27,487).

# (c) Comparison

Overall, the negative-MSE scores of all five models drop from (a) to (b), which means the MSE increases:

Linear regression: -18335 -> -27486

KNN: -18087 -> -53244

Regression tree: -20912 -> -32878

SVR: -20894 -> -51678

Stack: -20375 -> -29872

In (a), which includes both Purchase = 1 and Purchase = 0 records, the KNN model has the lowest MSE. In (b), which contains only Purchase = 1 records, the linear regression model has the lowest MSE. One possible reason for the higher error in (b) is that the sample is smaller and that customers who made a purchase vary more across the other columns (source, web order, and so on).
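
The size of the change can be made concrete by dividing each model's MSE in (b) by its MSE in (a). A short sketch using the rounded scores listed above (the dictionary names are illustrative); since the two datasets have different target distributions, the ratios are only a rough guide:

```python
# Mean negative-MSE scores reported for (a) all records and (b) purchasers only.
scores_a = {"linear regression": -18335, "KNN": -18087, "Regression tree": -20912,
            "SVR": -20894, "Stack": -20375}
scores_b = {"linear regression": -27486, "KNN": -53244, "Regression tree": -32878,
            "SVR": -51678, "Stack": -29872}

# Ratio of MSE in (b) to MSE in (a); the signs cancel, and values above 1 mean the error grew.
mse_ratio = {name: scores_b[name] / scores_a[name] for name in scores_a}
worst = max(mse_ratio, key=mse_ratio.get)

# Print models from smallest to largest increase.
for name, ratio in sorted(mse_ratio.items(), key=lambda kv: kv[1]):
    print(f"{name}: MSE x{ratio:.2f}")
```

On these numbers, KNN and SVR degrade the most, while linear regression and the stacked model degrade the least.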