# Purchase Model

The objective of this notebook is to make a supervised machine learning model that will take as inputs the 5 main features we have available, and predict the expected value of the revenue for that user.

From previous analysis we know that, among users who made a purchase, it is essentially impossible to predict the purchase amount (our hypothesis is that once the generating code decides a user will make a purchase, it simply uses a random number generator to decide the magnitude of the purchase). It follows that predicting the *magnitude* of a purchase matters less than predicting the *probability* of a purchase. This gives us two options for estimating revenue.

The naive way is to regress the column labeled "profit" (which actually lists the revenue from the purchase) on our features.

The alternative way is to first calculate the mean purchase amount $\bar{r}$, and then train a model on the features against the binary column "purchase". This should allow us to predict, for a given user $u_i$, the probability $P(p_i)$ that they will make a purchase. The expected revenue for that user is then just the probability of a purchase times the expected value of a purchase: $r_i = P(p_i) \cdot \bar{r}$
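As a minimal sketch of this second approach, the snippet below multiplies per-user purchase probabilities by a mean purchase amount. The probabilities and the mean are made-up illustrative values, and `expected_revenue` is a hypothetical helper, not part of this notebook's `utils` module.

```python
def expected_revenue(purchase_probs, mean_purchase_amt):
    """Return r_i = P(p_i) * r_bar for each user."""
    return [p * mean_purchase_amt for p in purchase_probs]

# Illustrative values only (not taken from the data set).
probs = [0.10, 0.25, 0.50]
r_bar = 10.0

revenues = expected_revenue(probs, r_bar)
print(revenues)
```

A user with a 25% chance of purchasing is therefore credited with a quarter of the mean purchase amount, whether or not they end up buying.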

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn import linear_model, neighbors, ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error


from querents import Querent
import utils


data_fp = '/mnt/c/data/b2w'
historic_fp = data_fp + '/past_bids.csv'


## Computation

Start by splitting data into Train and Test sets.

In [2]:
historic = pd.read_csv(historic_fp)
historic = historic.fillna({'profit': 0})

train, test = train_test_split(historic, test_size = 0.2)

Next we compute the expected revenue given a purchase ($\bar{r}$), as described above.

In [3]:
mean_purchase_amt = train.loc[train.purchase == True, 'profit'].mean()
mean_purchase_amt

10.130785791173304

The 5 available features (gender, marital status, age, income, and day of week) need to be put into the proper form before they can be used for modelling. The `frame_to_features()` function in the `utils` module does these computations. First, age and income are centered and scaled ($x' = (x - \bar{x}) / \sigma$), then gender and marital status are each converted to a binary feature, and finally the day of week is one-hot encoded.
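The source of `utils.frame_to_features()` is not shown in this notebook, so the following is only a sketch of what such a transformation could look like. The input column names (`day`, `gender`, `married`) and the function name `frame_to_features_sketch` are assumptions made for illustration; the output column names follow the correlation matrix further down.

```python
import pandas as pd

DAYS = ['sunday', 'monday', 'tuesday', 'wednesday',
        'thursday', 'friday', 'saturday']

def frame_to_features_sketch(df):
    """Hypothetical stand-in for utils.frame_to_features (input columns are assumed)."""
    out = pd.DataFrame(index=df.index)
    # One-hot encode the day of week: one 0/1 column per day.
    for day in DAYS:
        out[day] = (df['day'] == day).astype(int)
    # Binary indicators for gender and marital status.
    out['female'] = (df['gender'] == 'F').astype(int)
    out['marital_status'] = df['married'].astype(int)
    # Center and scale the continuous columns: x' = (x - mean) / std.
    for col in ['age', 'income']:
        out[col] = (df[col] - df[col].mean()) / df[col].std()
    return out

# Toy input, not the real data set.
toy = pd.DataFrame({
    'day': ['monday', 'friday', 'sunday', 'monday'],
    'gender': ['F', 'M', 'F', 'M'],
    'married': [True, False, False, True],
    'age': [25, 35, 45, 55],
    'income': [30000, 50000, 70000, 90000],
})
features = frame_to_features_sketch(toy)
print(features.round(3))
```

This sketch computes the mean and standard deviation from whatever frame it is given. When train and test sets are transformed separately, it is usually safer to reuse the training-set statistics for the test set; whether `utils.frame_to_features()` does so cannot be told from this notebook.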

Lastly, we create a data frame to hold the results of our computations, starting with our two target variables, purchase and profit (a.k.a. revenue).

In [4]:
train_feat = utils.frame_to_features(train)
test_feat = utils.frame_to_features(test)

res = pd.DataFrame(index = test.index)
res['purchase'] = test.purchase
res['profit'] = test.profit

Before proceeding, we check the correlation matrix to make sure there are no collinearities that might throw our linear models off.

In [22]:
train_feat.corr()

Unnamed: 0,sunday,monday,tuesday,wednesday,thursday,friday,saturday,female,marital_status,age,income
sunday,1.0,-0.167028,-0.166686,-0.167934,-0.167678,-0.168173,-0.166378,-0.011858,-0.001993,0.00378,0.002538
monday,-0.167028,1.0,-0.165501,-0.16674,-0.166486,-0.166978,-0.165195,0.000703,0.000593,-0.003005,-0.000299
tuesday,-0.166686,-0.165501,1.0,-0.166398,-0.166145,-0.166635,-0.164856,0.002767,-0.004253,-0.004611,0.004621
wednesday,-0.167934,-0.16674,-0.166398,1.0,-0.167389,-0.167883,-0.166091,0.003616,-0.0032,-0.004158,0.005371
thursday,-0.167678,-0.166486,-0.166145,-0.167389,1.0,-0.167627,-0.165837,-0.004835,0.000291,0.002299,-0.001736
friday,-0.168173,-0.166978,-0.166635,-0.167883,-0.167627,1.0,-0.166327,0.009517,0.008373,0.00385,-0.002589
saturday,-0.166378,-0.165195,-0.164856,-0.166091,-0.165837,-0.166327,1.0,0.00011,0.000163,0.001815,-0.007937
female,-0.011858,0.000703,0.002767,0.003616,-0.004835,0.009517,0.00011,1.0,0.009797,0.002855,-0.007109
marital_status,-0.001993,0.000593,-0.004253,-0.0032,0.000291,0.008373,0.000163,0.009797,1.0,0.012039,-6.4e-05
age,0.00378,-0.003005,-0.004611,-0.004158,0.002299,0.00385,0.001815,0.002855,0.012039,1.0,0.003588
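The day-of-week columns correlate with one another at roughly -0.167. That value is what one-hot encoding produces by construction: for k equally likely categories, any two indicator columns have correlation -1/(k-1), which is -1/6 for seven days. The sketch below checks this on simulated data (uniformly random days, not the notebook's data set). It also shows that each row of a full one-hot block sums to 1, so the seven columns together are linearly dependent with an intercept term, something a pairwise correlation matrix does not reveal.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 7                      # number of categories (days of the week)
n = 70_000                 # simulated rows, spread uniformly over the days
days = rng.integers(0, k, size=n)
one_hot = np.eye(k)[days]  # shape (n, k), exactly one 1 per row

# Pairwise correlations between the indicator columns.
corr = np.corrcoef(one_hot, rowvar=False)
off_diag = corr[~np.eye(k, dtype=bool)]
print(off_diag.mean(), -1 / (k - 1))

# Every row sums to 1, so the columns together duplicate an intercept.
row_sums = one_hot.sum(axis=1)
print(row_sums.min(), row_sums.max())
```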


Three models are included in this notebook. Several more complicated models were given a cursory test, but were found to perform no better than the linear models and so are not included at this time.
1. A typical least-squares regression, targeting revenue
2. A logistic regression, predicting whether a user will make a purchase or not
3. A Bayesian ridge regression (`BayesianRidge`), targeting revenue

In [24]:
log_mod = linear_model.LogisticRegression().fit(train_feat, train.purchase)
lin_mod = linear_model.LinearRegression().fit(train_feat, train.profit)
#gbm_mod = ensemble.GradientBoostingRegressor().fit(train_feat, train.profit)
bayes_mod = linear_model.BayesianRidge().fit(train_feat, train.profit)

In [25]:
res['log_class'] = log_mod.predict(test_feat)
res['log_purchase'] = log_mod.predict_proba(test_feat)[:, 1]
res['log_profit'] = res['log_purchase'] * mean_purchase_amt
res['pred_profit'] = lin_mod.predict(test_feat)
res['bayes_profit'] = bayes_mod.predict(test_feat)

In [26]:
res.head(5)

Unnamed: 0,purchase,profit,log_class,log_purchase,log_profit,pred_profit,bayes_profit
26167,False,0.0,False,0.236517,2.396105,2.585693,2.590452
42821,False,0.0,False,0.133979,1.357314,1.318604,1.323761
11719,False,0.0,False,0.391573,3.966939,3.97876,3.980486
38790,False,0.0,False,0.073132,0.740883,0.003418,-0.000881
5328,False,0.0,True,0.554318,5.615678,5.30127,5.294451


In [27]:
mean_squared_error(res.profit, res.log_profit)

23.60138764761442

In [28]:
mean_squared_error(res.profit, res.pred_profit)

23.656620061516762

In [29]:
mean_squared_error(res.profit, res.bayes_profit)

23.655102379355121

In [32]:
confusion_matrix(res.purchase, res.log_class, labels = [True, False])

array([[ 350, 2328],
       [ 333, 6989]])
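To make the confusion matrix easier to read, the sketch below derives accuracy, precision, and recall from the four counts above. With `labels = [True, False]`, rows are the actual classes and columns are the predicted classes, so the first row holds the actual purchasers.

```python
# Counts copied from the confusion matrix above (labels = [True, False]).
tp, fn = 350, 2328   # actual purchasers: predicted True / predicted False
fp, tn = 333, 6989   # actual non-purchasers: predicted True / predicted False

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted purchasers who actually purchased
recall = tp / (tp + fn)      # share of actual purchasers the model identified

print(f"n={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
# prints: n=10000 accuracy=0.734 precision=0.512 recall=0.131
```

At the default 0.5 threshold the classifier finds only about 13% of the actual purchasers, and about half of its positive predictions are correct.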

The above analysis of our model is useful, but we may be able to get better results by dropping users we are uncertain about. Unfortunately, the model does not seem to be very certain about anyone: our max score is 0.66 out of 1.00.

In [None]:
# `score` is assumed to mean the predicted purchase probability (res.log_purchase)
low = res.log_purchase < 0.25
test.loc[low, 'profit'].mean(), test.loc[low, 'profit'].count()

In [None]:
mid = (0.25 < res.log_purchase) & (res.log_purchase < 0.5)
test.loc[mid, 'profit'].mean(), test.loc[mid, 'profit'].count()

In [None]:
high = 0.5 < res.log_purchase
test.loc[high, 'profit'].mean(), test.loc[high, 'profit'].count()
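The three threshold cells above can also be written as one grouped summary. The sketch below does this with `pd.cut` on synthetic data; the `score` and `profit` values are simulated stand-ins, not results from this notebook, and the bin edges only approximate the strict inequalities used above (boundary values fall into the lower bin).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: `score` mimics a predicted purchase probability,
# `profit` is zero unless a simulated purchase happens.
score = rng.uniform(0.0, 0.66, size=1_000)
purchased = rng.uniform(size=1_000) < score
profit = np.where(purchased, 10.0, 0.0)
toy = pd.DataFrame({'score': score, 'profit': profit})

# Three score bands: up to 0.25, 0.25 to 0.5, and above 0.5.
bands = pd.cut(toy['score'], bins=[0.0, 0.25, 0.5, 1.0], include_lowest=True)
summary = toy.groupby(bands, observed=False)['profit'].agg(['mean', 'count'])
print(summary)
```

Each row of `summary` gives the mean profit and the number of users in one score band, which is the same pair of numbers each cell above returns.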

Since no clear winner is present, we go with the logistic regression: it gives our AI agents both a probability of purchase and an estimated revenue, while the other models offer only the latter.

In [None]:
# Use a context manager so the file is closed once the model is written.
with open(data_fp + '/model.p', 'wb') as f:
    pickle.dump(log_mod, f)

In [None]:
# Assumes `pred` refers to the logistic class predictions (res.log_class).
res.log_class.any()