# Determining Consumer Spend

Now that we have our demographics split out in the previous notebook we can focus on trying to solve our core problem of predicting consumer spend. 

To perform the prediction I will use a supervised learning technique called XGBoost. This method has recently been used very successfully both in industry and by Kaggle data scientists to solve complex problems. It has become popular due to its high accuracy and low computation cost, but the downside is that it has a lot of hyperparameters to tune.

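XGBoost is a decision-tree-based algorithm, meaning that each tree splits the data on the feature values that give the largest gain in order to group similar values. Tree ensembles such as random forests use bagging, which means many trees are trained independently on resampled data and combined in an attempt to reduce overfitting. XGBoost instead uses boosting: trees are added one at a time, each fitted to the errors left by the trees before it.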
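Boosting, the approach XGBoost takes in place of bagging, can be sketched in a few lines for a squared-error objective. This is only a toy under stated assumptions: the data is made up, the weak learner is a single-split "stump", and the helper names `fit_stump` and `boost` are hypothetical. XGBoost itself adds regularisation and second-order gradient information on top of this idea.

```python
# Toy gradient boosting for squared error with one-split "stumps".
# fit_stump, boost and the data are illustrative only, not XGBoost's implementation.

def fit_stump(xs, residuals):
    """Find the threshold on xs that best splits the residuals into two means."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave rows on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def boost(xs, ys, n_rounds=20, eta=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # each new stump only contributes a fraction (eta) of its correction
        preds = [p + eta * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]                      # made-up feature
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]      # made-up target
base, stumps, preds = boost(xs, ys)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, preds)
print(mse_before, mse_after)
```

The mean-only prediction plays the same role as the baseline used later in this notebook; every extra round should move the training error further below it.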

As with the k-means modelling, I will first perform some preprocessing, this time to combine our demographics data with the spend-by-day data. Then I will split the data into training, testing and validation datasets in order to fully test our model. The model will aim to predict spend for the demographic as a whole, as this should be easier than predicting for individuals at this point. The prediction will be based on the number of consumers, the offers they received, their demographic and other factors that may become apparent throughout the analysis.

### Imports

In [1]:
# Install xgboost if needed using the line of code below
# ! pip install xgboost

In [2]:
# import general functions
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import math

# import functions for modelling
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Saving the ML model
import joblib

# import the cleaning package
import sbpkg as sb

### Functions

### Global Variables

In [3]:
# read in the different datasources
portfolio_df = pd.read_json('data/portfolio.json', lines=True)
profile_df = pd.read_json('data/profile.json', lines=True)
transcript_df = pd.read_json('data/transcript.json', lines=True)

### Run Cleaning Functions

In [4]:
# run the initial cleaning on each dataset
clean_port_df = sb.clean_portfolio_data(portfolio_df)
clean_prof_df = sb.clean_profile_data(profile_df)
clean_trans_df = sb.clean_transcript_data(transcript_df)

# calculates the uninfluenced transactions for the modeling
uninflunced_trans = sb.norm_transactions(clean_trans_df, clean_port_df)

# process the user data to create the modeling input
user_data = sb.user_transactions(clean_prof_df, uninflunced_trans)

# load in the user spend by day
spd = sb.spend_per_day(clean_trans_df, clean_port_df)
spd.reset_index(inplace=True)

# predict the demographic of all the users in the data
predictions = sb.predict_demographic(user_data)
predictions.head()

Unnamed: 0,female,male,other,unknown gender,age,member joined,person,income,total transactions,total spend,spend per trans,spend per day,membership length,demographic
0,0,0,0,1,0,2017-02-12,68be06ca386d4c31939f3a4f0e3dd783,0.0,9,20.4,2.266667,0.68,76.0,2
1,1,0,0,0,55,2017-07-15,0610b486422d4921ae7d2bf64640c50b,112000.0,3,77.01,25.67,2.567,54.0,1
2,0,0,0,1,0,2018-07-12,38fe809add3b4fcf9315a9694bb96ff5,0.0,5,10.21,2.042,0.340333,2.0,2
3,1,0,0,0,75,2017-05-09,78afa995795e4d85b5d9ceeca43f5fef,100000.0,4,89.99,22.4975,2.999667,63.0,1
4,0,0,0,1,0,2017-08-04,a03223e636434f42ac4c3df47e8bac43,0.0,3,4.65,1.55,0.155,51.0,2


### Data Preparation

In [5]:
# drop everything except the demographic from the prediction data
demographics = predictions[['person','demographic']]

# merge the two datasets so that we have the demographic data for each person
input_data = spd.merge(demographics, on=['person'])
input_data.head()

# only keep the first 23 days so that the last 7 can be used for modeling
input_data = input_data[input_data['transaction_time'] < 24]

# sum the spend & number of offers for each day and person
input_data = input_data.groupby(['transaction_time','person']).sum()
input_data.tail()

transaction_time,person,spend,0b1e1539f2cc45b7b9fa7c272da2e1d7,2298d6c36e964ae4a3e7e9706d1fb8c2,2906b810c7d4411798c6938adc9daaa5,3f207df678b143eea3cee63160fa8bed,4d5c57ea9a6940dd891ad53e9dbe8da0,5a8bc65990b245e5a138643cd4eb9837,9b98b8c7a33c4b65b9aebfe6a799e6d9,ae264e3637204a6fb9bb56bc8210ddfd,f19421c1d4aa40978ebb69ca19b0e20d,fafdcd668e3743c1bb461111dcafc2a4,demographic
23.0,fff29fb549084123bd046dbc5ceb4faa,32.57,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
23.0,fff3ba4757bd42088c044ca26d73817a,414.31,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1
23.0,fff7576017104bcc8677a8d63322b5e1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
23.0,fffad4f4828548d1b5583907f2e9906b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
23.0,ffff82501cea40309d5fdd7edcca4a07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
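The day filter above keeps days up to 23 and discards the rest. As a rough sketch, the later days could instead be set aside in the same step so they are available as a hold-out period. The helper name `split_by_day` and the toy rows are assumptions made for illustration; only the `transaction_time` column name and the day-23 cut-off come from this notebook.

```python
def split_by_day(rows, last_training_day=23):
    """Separate rows into the training period and a later hold-out period."""
    train = [r for r in rows if r["transaction_time"] <= last_training_day]
    holdout = [r for r in rows if r["transaction_time"] > last_training_day]
    return train, holdout


# made-up rows that only mimic the shape of the spend-per-day data
rows = [
    {"transaction_time": 1.0, "person": "a", "spend": 3.5},
    {"transaction_time": 23.0, "person": "a", "spend": 0.0},
    {"transaction_time": 24.0, "person": "a", "spend": 7.2},
    {"transaction_time": 30.0, "person": "b", "spend": 1.1},
]
train_rows, holdout_rows = split_by_day(rows)
print(len(train_rows), len(holdout_rows))
```

Splitting on time rather than at random keeps later behaviour out of the training data, which matters if the models are later judged on the final week.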


In [6]:
def create_dummy_days(data):
    """
    this function creates dummy columns for the day of the week
    (this is useful as users' behaviour will be different on a weekday vs weekend)
    """
    # add dummies for the days of the week 
    day_of_week = pd.DataFrame(list(data['transaction_time']))
    for n in [1,2,3,4,5,6,7]:
        day_of_week = day_of_week.replace((n+7), n).replace((n+7*2), n).replace((n+7*3), n).replace((n+7*4), n)   
    day_of_week = pd.DataFrame([str(x) for x in day_of_week.iloc[:,0]])
    input_data_test = pd.concat([data, pd.get_dummies(day_of_week)], axis=1, join='inner')
    
    return input_data_test
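The chain of replace calls in create_dummy_days folds day numbers 8 to 35 back onto 1 to 7. Assuming transaction_time holds whole day numbers starting at 1 (an assumption about the data, not something checked here), the same mapping can be written with modular arithmetic. Both function names below are hypothetical; the second one only mirrors the replace chain so the two can be compared.

```python
def day_of_week(day_number):
    """Map a 1-based day number (1, 2, ..., 35) onto a position 1-7 in the week."""
    return (int(day_number) - 1) % 7 + 1


def day_of_week_by_replace(day_number):
    """Mirror of the replace chain: n+7, n+14, n+21 and n+28 all become n."""
    for n in range(1, 8):
        if day_number in (n + 7, n + 14, n + 21, n + 28):
            return n
    return day_number


# the two approaches agree on every day the replace chain covers
matches = all(day_of_week(d) == day_of_week_by_replace(d) for d in range(1, 36))
print(matches)
```

Note that the labels 1 to 7 are positions in a seven-day cycle counted from the first day of the data; nothing here establishes which of them is a Monday.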

In [7]:
# To perform testing on the data I will only keep one demographic for now
data = input_data[input_data['demographic'] == 0]

# reset the index
data.reset_index(inplace=True)

# add dummies for the days of the week 
input_data_test = create_dummy_days(data)

# keep only the columns needed
input_data_test = input_data_test.drop(columns=['transaction_time','person','demographic'])
input_data_test.head()

Unnamed: 0,spend,0b1e1539f2cc45b7b9fa7c272da2e1d7,2298d6c36e964ae4a3e7e9706d1fb8c2,2906b810c7d4411798c6938adc9daaa5,3f207df678b143eea3cee63160fa8bed,4d5c57ea9a6940dd891ad53e9dbe8da0,5a8bc65990b245e5a138643cd4eb9837,9b98b8c7a33c4b65b9aebfe6a799e6d9,ae264e3637204a6fb9bb56bc8210ddfd,f19421c1d4aa40978ebb69ca19b0e20d,fafdcd668e3743c1bb461111dcafc2a4,0_1.0,0_2.0,0_3.0,0_4.0,0_5.0,0_6.0,0_7.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0


In [8]:
# split the data again into testing and training data
X_train, X_test, y_train, y_test = train_test_split(input_data_test.iloc[:,1:],
                                                   input_data_test.iloc[:,0],
                                                   test_size=.1, 
                                                   random_state=42)

# change the input to a DMatrix to improve optimisation speed for testing
# Note: DMatrix and the native xgboost API were chosen instead of the standard SKlearn wrapper to reduce the training times
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# calculate the baseline RMSE error, this will be used for evaluation of the model
baseline_pred = np.ones(y_test.shape) * np.mean(y_train)
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline_pred))

print(f"Baseline RMSE is {rmse_baseline}")

Baseline RMSE is 20.075105668192464
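For reference, the baseline simply predicts the mean of the training target for every test row, and RMSE is the square root of the mean squared difference between actual and predicted values. A minimal worked sketch with made-up numbers (the `rmse` helper is hypothetical and stands in for the sklearn call used above):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


y_train = [0.0, 0.0, 10.0, 30.0]   # made-up training spend
y_test = [0.0, 20.0]               # made-up test spend

train_mean = sum(y_train) / len(y_train)       # the constant baseline prediction
baseline_pred = [train_mean] * len(y_test)
baseline_rmse = rmse(y_test, baseline_pred)
print(baseline_rmse)
```

A model is only useful here if its test RMSE comes in below this constant-prediction figure.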




### Modelling Spend

To start, I will set up the initial parameters for XGBoost.

In [10]:
# set up the initial parameters for the xgboost parameter tuning
params = {
    'max_depth':5,
    'min_child_weight': 1,
    'eta':.1,
    'subsample': 1,
    'colsample_bytree': 1,
    'gamma':0.1,
    'objective':'reg:squarederror',
}

First, I will test max_depth and min_child_weight. These are used to stop overfitting by limiting the number of times a decision tree will split the data.

In [11]:
# set the rmse value as a very high value
rmse = 1000

# set the best parameters as nothing
best_params = None

# sets values for max_depth & min_child_weight
testing_params = [
    (max_depth, min_child_weight)
    for max_depth in range(7,12)
    for min_child_weight in range(3,8)
]

# loop through both of the variables
for max_depth, min_child_weight in testing_params:
    # update the parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    
    # Run CV
    results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=14,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    
    # save the best rmse score
    best_rmse = results['test-rmse-mean'].min()
    
    # print out the results
    print(f"max_depth = {max_depth} | "
          f"min_child_weight = {min_child_weight} | "
          f"rmse = {best_rmse}")
    
    # set the best parameters if they improve on the current rmse
    if best_rmse < rmse:
        rmse = best_rmse
        best_params = (max_depth, min_child_weight)
        
print(f"Best params: "
      f"max_depth = {best_params[0]} | "
      f"min_child_weight = {best_params[1]} | "
      f"rmse = {rmse}")

max_depth = 7 | min_child_weight = 3 | rmse = 16.4535424
max_depth = 7 | min_child_weight = 4 | rmse = 16.3819562
max_depth = 7 | min_child_weight = 5 | rmse = 16.3737226
max_depth = 7 | min_child_weight = 6 | rmse = 16.3849294
max_depth = 7 | min_child_weight = 7 | rmse = 16.3849376
max_depth = 8 | min_child_weight = 3 | rmse = 16.4593948
max_depth = 8 | min_child_weight = 4 | rmse = 16.3831482
max_depth = 8 | min_child_weight = 5 | rmse = 16.376122
max_depth = 8 | min_child_weight = 6 | rmse = 16.384988999999997
max_depth = 8 | min_child_weight = 7 | rmse = 16.385275999999998
max_depth = 9 | min_child_weight = 3 | rmse = 16.467271
max_depth = 9 | min_child_weight = 4 | rmse = 16.3865814
max_depth = 9 | min_child_weight = 5 | rmse = 16.3782414
max_depth = 9 | min_child_weight = 6 | rmse = 16.388047399999998
max_depth = 9 | min_child_weight = 7 | rmse = 16.388620799999998
max_depth = 10 | min_child_weight = 3 | rmse = 16.466247799999998
max_depth = 10 | min_child_weight = 4 | rmse = 16

In [12]:
# set the new best parameters based on the testing
params['max_depth'] = 7
params['min_child_weight'] = 5
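The search pattern used above (try every combination, keep the one with the lowest score) can be wrapped in a small helper so it does not have to be rewritten for each pair of parameters. This is a sketch only: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a function that would run `xgb.cv` and return the minimum of `test-rmse-mean`.

```python
from itertools import product


def grid_search(base_params, grid, score_fn):
    """Score every combination in grid and return (best_score, best_combo)."""
    names = list(grid)
    best_score, best_combo = float("inf"), None
    for values in product(*(grid[name] for name in names)):
        combo = dict(zip(names, values))
        candidate = {**base_params, **combo}   # leave base_params untouched
        score = score_fn(candidate)
        if score < best_score:
            best_score, best_combo = score, combo
    return best_score, best_combo


def toy_score(p):
    # stand-in for cross-validated RMSE; lowest at max_depth=8, min_child_weight=5
    return (p["max_depth"] - 8) ** 2 + (p["min_child_weight"] - 5) ** 2 + 16.0


best_score, best_combo = grid_search(
    {"eta": 0.1},
    {"max_depth": range(7, 12), "min_child_weight": range(3, 8)},
    toy_score,
)
print(best_score, best_combo)
```

Building a fresh candidate dictionary for each combination also avoids mutating the shared params dictionary while the search is still running.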

Next, I will test subsample and colsample_bytree. These parameters add some randomness to the training by letting each tree see only a fraction of the rows and columns, which helps to reduce overfitting to noise in the data.
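As a rough picture of what these two settings do, the sketch below picks the rows and columns a single tree would be allowed to see. It is a simplification with a hypothetical helper name (`sample_for_tree`) that draws exact fractions without replacement; the library's own sampling may differ in detail. The 17 columns correspond to the 10 offer columns plus 7 day-of-week dummies in the data above, and the row count is made up.

```python
import random


def sample_for_tree(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices one tree would be trained on."""
    n_row_keep = max(1, int(n_rows * subsample))
    n_col_keep = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_row_keep))
    cols = sorted(rng.sample(range(n_cols), n_col_keep))
    return rows, cols


rng = random.Random(14)  # fixed seed so the draw is repeatable
rows, cols = sample_for_tree(n_rows=10, n_cols=17,
                             subsample=0.5, colsample_bytree=0.5, rng=rng)
print(rows, cols)
```

Because each tree sees a different slice of the data, no single noisy row or column can dominate every tree in the ensemble.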

In [13]:
# set the rmse value as a very high value
rmse = 1000

# set the best parameters as nothing
best_params = None

# sets values for subsample & colsample_bytree
testing_params = [
    (subsample, colsample_bytree)
    for subsample in [i/10. for i in range(5,11)]
    for colsample_bytree in [i/10. for i in range(5,11)]
]

# loop through both of the variables
for subsample, colsample_bytree in testing_params:
    # update the parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample_bytree
    
    # Run CV
    results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=14,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    
    # save the best rmse score
    best_rmse = results['test-rmse-mean'].min()
    
    # print out the results
    print(f"subsample = {subsample} | "
          f"colsample_bytree = {colsample_bytree} | "
          f"rmse = {best_rmse}")
    
    # set the best parameters if they improve on the current rmse
    if best_rmse < rmse:
        rmse = best_rmse
        best_params = (subsample, colsample_bytree)
        
print(f"Best params: "
      f"subsample = {best_params[0]} | "
      f"colsample_bytree = {best_params[1]} | "
      f"rmse = {rmse}")

subsample = 0.5 | colsample_bytree = 0.5 | rmse = 16.380117000000002
subsample = 0.5 | colsample_bytree = 0.6 | rmse = 16.3871046
subsample = 0.5 | colsample_bytree = 0.7 | rmse = 16.3911192
subsample = 0.5 | colsample_bytree = 0.8 | rmse = 16.413428200000002
subsample = 0.5 | colsample_bytree = 0.9 | rmse = 16.4288454
subsample = 0.5 | colsample_bytree = 1.0 | rmse = 16.403825
subsample = 0.6 | colsample_bytree = 0.5 | rmse = 16.385513000000003
subsample = 0.6 | colsample_bytree = 0.6 | rmse = 16.387587000000003
subsample = 0.6 | colsample_bytree = 0.7 | rmse = 16.395333400000002
subsample = 0.6 | colsample_bytree = 0.8 | rmse = 16.4155074
subsample = 0.6 | colsample_bytree = 0.9 | rmse = 16.4259086
subsample = 0.6 | colsample_bytree = 1.0 | rmse = 16.4002072
subsample = 0.7 | colsample_bytree = 0.5 | rmse = 16.383279199999997
subsample = 0.7 | colsample_bytree = 0.6 | rmse = 16.382112199999998
subsample = 0.7 | colsample_bytree = 0.7 | rmse = 16.386336800000002
subsample = 0.7 | cols

In [14]:
params['subsample'] = 1.0
params['colsample_bytree'] = 0.5

Finally, I will set the value of eta, which is essentially the learning rate of the model. This is the amount by which each new tree's contribution is scaled when the model updates itself between iterations.
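The trade-off behind eta can be seen with a little arithmetic. Under the idealised assumption that every new tree would correct the remaining error perfectly and only a fraction eta of that correction is applied, the error left after k rounds is (1 - eta)^k. Real trees are not perfect, so this is only an illustration, and both helper names are hypothetical.

```python
def remaining_error(eta, rounds):
    """Fraction of the initial error left after `rounds` idealised corrections."""
    return (1 - eta) ** rounds


def rounds_needed(eta, tolerance):
    """Smallest number of rounds that brings the remaining error within tolerance."""
    rounds = 0
    while remaining_error(eta, rounds) > tolerance:
        rounds += 1
    return rounds


for eta in [0.5, 0.3, 0.01]:
    print(eta, rounds_needed(eta, tolerance=0.1))
```

A smaller eta therefore needs many more boosting rounds to reach the same point, which is why the learning rate and the number of rounds are best tuned together.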

In [15]:
# set the rmse value as a very high value
rmse = 1000

# set the best parameters as nothing
best_params = None

# loop through the candidate learning rates
for eta in [.3, .2, .1, .05, .01, .005]:
    # update the parameters
    params['eta'] = eta
    
    # Run CV
    results = xgb.cv(
        params,
        dtrain,
        num_boost_round=999,
        seed=14,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    
    # save the best rmse score
    best_rmse = results['test-rmse-mean'].min()
    
    # print out the results
    print(f"eta = {eta} | "
          f"rmse = {best_rmse}")
    
    # set the best parameters if they improve on the current rmse
    if best_rmse < rmse:
        rmse = best_rmse
        best_params = eta
        
print(f"Best params: "
      f"eta = {best_params} | "
      f"rmse = {rmse}")

eta = 0.3 | rmse = 16.37307
eta = 0.2 | rmse = 16.370058800000002
eta = 0.1 | rmse = 16.3733332
eta = 0.05 | rmse = 16.3741284
eta = 0.01 | rmse = 16.3733182
eta = 0.005 | rmse = 16.374747399999997
Best params: eta = 0.2 | rmse = 16.370058800000002


In [16]:
params['eta'] = .2

The final thing I will do is use the native API's train function to determine the number of boosting rounds that need to be run to get the best result. This should speed up the training/fitting when running the model on all of the demographics through the SKlearn API.

In [17]:
# test the number of boosted rounds needed to get the best model
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=999,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

# test the predictions created above
pred = best_model.predict(dtest)
print(np.sqrt(mean_squared_error(pred, y_test)))

[0]	Test-rmse:20.4446
Will train until Test-rmse hasn't improved in 10 rounds.
[1]	Test-rmse:20.1855
[2]	Test-rmse:19.9577
[3]	Test-rmse:19.8482
[4]	Test-rmse:19.7614
[5]	Test-rmse:19.7012
[6]	Test-rmse:19.5985
[7]	Test-rmse:19.5395
[8]	Test-rmse:19.5354
[9]	Test-rmse:19.5265
[10]	Test-rmse:19.5117
[11]	Test-rmse:19.478
[12]	Test-rmse:19.4427
[13]	Test-rmse:19.4209
[14]	Test-rmse:19.4502
[15]	Test-rmse:19.4752
[16]	Test-rmse:19.4935
[17]	Test-rmse:19.4811
[18]	Test-rmse:19.488
[19]	Test-rmse:19.4943
[20]	Test-rmse:19.4967
[21]	Test-rmse:19.4939
[22]	Test-rmse:19.5035
[23]	Test-rmse:19.503
Stopping. Best iteration:
[13]	Test-rmse:19.4209

19.502780948927686
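The stopping behaviour in the log above can be reproduced with a few lines of plain Python. The scores are the Test-rmse values copied from the log; `early_stop` is a hypothetical helper that mimics the "hasn't improved in 10 rounds" rule and is not the library's own code.

```python
def early_stop(scores, patience):
    """Return (best_round, best_score, stopping_round) for a patience rule."""
    best_round, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return best_round, best_score, i   # no improvement for `patience` rounds
    return best_round, best_score, len(scores) - 1


# Test-rmse per round, copied from the training log above (rounds 0 to 23)
test_rmse = [20.4446, 20.1855, 19.9577, 19.8482, 19.7614, 19.7012, 19.5985,
             19.5395, 19.5354, 19.5265, 19.5117, 19.478, 19.4427, 19.4209,
             19.4502, 19.4752, 19.4935, 19.4811, 19.488, 19.4943, 19.4967,
             19.4939, 19.5035, 19.503]

result = early_stop(test_rmse, patience=10)
print(result)
```

Note that the RMSE printed at the end of the cell (about 19.503) matches round 23 rather than the best round 13 (19.4209), which suggests the prediction was made with all of the trained trees rather than only those up to the best iteration.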


Now that I have the optimum parameters for a model, I can loop through each of the demographics and create a model for each. Ideally I would repeat the tuning above individually for each of the demographics to obtain the best results for each one. However, due to time limitations on this project I'm going to take all of the parameters above and assume they will be similarly effective for all of the other demographics.

For this I will use the SKlearn API, as I don't need all the performance enhancements above and it makes the process much simpler to use with pandas, since I can do it without converting to a DMatrix etc. Some of the parameter names differ, though, as follows:
- eta becomes learning_rate
- num_boost_round becomes n_estimators

Now I will create a function to loop through each of the demographics below.

In [18]:
params

{'max_depth': 7,
 'min_child_weight': 5,
 'eta': 0.2,
 'subsample': 1.0,
 'colsample_bytree': 0.5,
 'gamma': 0.1,
 'objective': 'reg:squarederror'}
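The renaming described above can be sketched as a small translation step from the native parameter dictionary to keyword arguments for the SKlearn-style wrapper. The helper `to_sklearn_kwargs` and the mapping constant are hypothetical names, and the round count of 13 is taken from the best iteration reported in the training log.

```python
# native-API name -> SKlearn-wrapper name (only the names that differ)
NATIVE_TO_SKLEARN = {"eta": "learning_rate"}


def to_sklearn_kwargs(native_params, num_boost_round):
    """Rename native xgboost parameters and add the number of rounds."""
    kwargs = {NATIVE_TO_SKLEARN.get(name, name): value
              for name, value in native_params.items()}
    kwargs["n_estimators"] = num_boost_round   # num_boost_round becomes n_estimators
    return kwargs


params = {
    "max_depth": 7,
    "min_child_weight": 5,
    "eta": 0.2,
    "subsample": 1.0,
    "colsample_bytree": 0.5,
    "gamma": 0.1,
    "objective": "reg:squarederror",
}
sk_kwargs = to_sklearn_kwargs(params, num_boost_round=13)
print(sk_kwargs)
```

The result could then be unpacked as `xgb.XGBRegressor(**sk_kwargs)`, which carries every tuned value across without retyping it.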

In [19]:
def create_spend_model(spend_data, demographics_data, model_demographic):
    """
    this function creates and saves a model that predicts the spend for a given demographic
    """
    # drop everything except the demographic from the prediction data
    demographics = demographics_data[['person','demographic']]

    # merge the two datasets so that we have the demographic data for each person
    input_data = spend_data.merge(demographics, on=['person'])

    # only keep the first 23 days so that the last 7 can be used for modeling
    input_data = input_data[input_data['transaction_time'] < 24]

    # sum the spend & number of offers that are influenced 
    input_data = input_data.groupby(['transaction_time','person']).sum()
    
    # keep only the demographic that this model is being built for
    data = input_data[input_data['demographic'] == model_demographic]

    # reset the index
    data.reset_index(inplace=True)
    
    # add dummies for the days of the week 
    input_data_test = create_dummy_days(data)

    # keep only the columns needed for modeling
    input_data_test = input_data_test.drop(columns=['transaction_time','person','demographic'])
    
    # split the data again into testing and training data
    X_train, X_test, y_train, y_test = train_test_split(input_data_test.iloc[:,1:],
                                                        input_data_test.iloc[:,0],
                                                        test_size=.1, 
                                                        random_state=14)
    
    
    # creates the model using the XGB regressor
    model = xgb.XGBRegressor(max_depth=7,
                min_child_weight=5,
                subsample=1.0,
                colsample_bytree=0.5,
                objective='reg:squarederror',
                n_estimators=13,
                learning_rate=0.2)
    
    # fit the model to the training data    
    model.fit(X_train, y_train)
    
    # calculate the rmse error
    test_pred = model.predict(X_test)
    mod_rmse = np.sqrt(mean_squared_error(test_pred, y_test))
    
    # calculate baseline rmse
    mean_train = np.mean(y_train)
    baseline_pred = np.ones(y_test.shape) * mean_train
    base_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
    
    # save the model ready for the predictions
    joblib.dump(model, f"xgboost_price_model_{model_demographic}.pkl")
    print(f"xgboost_price_model_{model_demographic}.pkl")
    
    # print rmse compared to the base level
    print(f"Baseline RMSE is {base_rmse}")
    print(f"Model RMSE: {mod_rmse}")

In [20]:
# create the model for each of the different demographics:
for demographic in [0,1,2,3,4,5,6,7]:
    create_spend_model(spd, predictions, demographic)



xgboost_price_model_0.pkl
Baseline RMSE is 17.7902186954807
Model RMSE: 16.96676489950423




xgboost_price_model_1.pkl
Baseline RMSE is 17.598916894328337
Model RMSE: 16.567164259949678




xgboost_price_model_2.pkl
Baseline RMSE is 5.845066103278775
Model RMSE: 5.512487260073769




xgboost_price_model_3.pkl
Baseline RMSE is 123.29573022955441
Model RMSE: 123.67728841275417




xgboost_price_model_4.pkl
Baseline RMSE is 9.810507178176158
Model RMSE: 8.851607222341732




xgboost_price_model_5.pkl
Baseline RMSE is 9.192026545360221
Model RMSE: 8.706091972699607




xgboost_price_model_6.pkl
Baseline RMSE is 12.270761067837023
Model RMSE: 11.73321456780003




xgboost_price_model_7.pkl
Baseline RMSE is 11.974472367572652
Model RMSE: 11.441671388210306


In every demographic except demographic 3 the model improves on the baseline, although not by a lot: the RMSE falls by roughly 1 or less in each case, and for demographic 3 the model is slightly worse than the baseline (123.68 vs 123.30). This suggests that in some cases the spend is largely independent of the offer features. This might be due to erratic variations of spend in some demographics, which could even be a feature of the demographic itself, such as demographic 6. These users have a high spend per day and a high average uninfluenced transaction value. This could show that they are unlikely to be influenced by any of the offers but will spend highly regardless.
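To compare demographics on a common scale, the RMSE figures printed above can be turned into a relative improvement over the baseline. The values below are copied from the output above and rounded to four decimal places; the variable names are only illustrative.

```python
# demographic -> (baseline RMSE, model RMSE), rounded from the output above
results = {
    0: (17.7902, 16.9668),
    1: (17.5989, 16.5672),
    2: (5.8451, 5.5125),
    3: (123.2957, 123.6773),
    4: (9.8105, 8.8516),
    5: (9.1920, 8.7061),
    6: (12.2708, 11.7332),
    7: (11.9745, 11.4417),
}

# relative improvement: positive means the model beats the baseline
improvement = {
    demographic: (baseline - model) / baseline
    for demographic, (baseline, model) in results.items()
}

for demographic, value in improvement.items():
    print(f"demographic {demographic}: {value:.1%}")

best_demographic = max(improvement, key=improvement.get)
```

On this scale demographic 3 is the only negative entry, so it is the one demographic where the constant baseline is the better predictor.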

In [29]:
# create a function to use each of the above models to predict spend
def predict_spend(input_data, model_demographic):
    """
    this function predicts the spend of users based on the model made for their demographic
    """    
    # load in the model needed to predict spend for this demographic
    demo_model = joblib.load(f"xgboost_price_model_{model_demographic}.pkl")
    
    # keep only the demographic data related to the model to be used in this section
    data = input_data[input_data['demographic'] == model_demographic]
    
    # reset the index
    data.reset_index(inplace=True)
    
    # add dummies for the days of the week 
    input_data_test = create_dummy_days(data)

    # keep only the columns needed
    input_data_test = input_data_test.drop(columns=['transaction_time','person','spend','demographic'])
    
    # calculate the prediction based on the input data
    prediction = demo_model.predict(input_data_test)
    
    # attach the prediction to the original filtered df
    input_data_test['prediction'] = prediction
    
    return input_data_test 

In [31]:
predictions_one = predict_spend(input_data, 1)
predictions_one.head(10)

Unnamed: 0,0b1e1539f2cc45b7b9fa7c272da2e1d7,2298d6c36e964ae4a3e7e9706d1fb8c2,2906b810c7d4411798c6938adc9daaa5,3f207df678b143eea3cee63160fa8bed,4d5c57ea9a6940dd891ad53e9dbe8da0,5a8bc65990b245e5a138643cd4eb9837,9b98b8c7a33c4b65b9aebfe6a799e6d9,ae264e3637204a6fb9bb56bc8210ddfd,f19421c1d4aa40978ebb69ca19b0e20d,fafdcd668e3743c1bb461111dcafc2a4,0_1.0,0_2.0,0_3.0,0_4.0,0_5.0,0_6.0,0_7.0,prediction
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,25.424786
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,4.540162


Now that we have built our models (with some obvious limitations), I will evaluate them and use them to look into user behaviour/offer performance in the next notebook. In that notebook I will also list any improvements or enhancements that could be performed to get a better prediction given more time, better data or a change of approach.