# Determining Consumer Spend

Now that the previous notebook has split our users into demographics, we can focus on the core problem: predicting consumer spend.

To perform the prediction I will use a supervised learning technique called XGBoost. It has been used very successfully in recent years, both in industry and by Kaggle data scientists, to solve complex problems. It has become popular due to its high accuracy and low computation cost; the downside is that it has a lot of hyperparameters to tune.

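XGBoost is a decision-tree-based algorithm: each tree splits the data on the feature thresholds that most reduce the training loss (the gain). Standard tree ensembles such as random forests use bagging, in which many trees are trained independently on resampled data and averaged in an attempt to reduce overfitting. XGBoost instead uses gradient boosting, in which trees are added one at a time and each new tree is fitted to correct the errors of the trees before it.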
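Boosting, the approach XGBoost takes, builds its trees sequentially rather than independently. The sketch below illustrates the idea for squared error using NumPy only. It is not XGBoost's implementation (which adds regularisation and second-order gradient information); the helpers `fit_stump`, `predict_stump` and `boost` and the synthetic data are invented for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    # Exclude the largest value so the right-hand side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, eta=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    prediction = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + eta * predict_stump(stump, x)
        stumps.append(stump)
    return prediction, stumps

# Synthetic one-feature regression problem (an assumption, not the notebook's data)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

baseline_mae = np.mean(np.abs(y - y.mean()))
boosted_prediction, stumps = boost(x, y)
boosted_mae = np.mean(np.abs(y - boosted_prediction))
print("Baseline MAE: {:.3f}, boosted MAE: {:.3f}".format(baseline_mae, boosted_mae))
```

The `eta` argument plays the same role as the `eta` learning rate tuned later in this notebook: smaller values shrink each tree's contribution, so more rounds are needed.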

As with the k-means modelling, I will first perform some preprocessing, this time to combine our demographics data with the spend-by-day data. Then I will split the data into training, validation and testing datasets in order to fully test our model. The model will aim to predict spend for each demographic as a whole, as this should be easier than predicting for individuals at this point. The prediction will be based on the number of consumers, the offers they received, their demographic and other factors that may become apparent throughout the analysis.
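A three-way split of this kind can be sketched with two calls to `train_test_split`. The 80/10/10 proportions and the synthetic `X` and `y` below are assumptions made for illustration, not values taken from the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the real modelling input
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# Size of each held-out set: 10% of the rows, passed as an absolute count
n_holdout = len(X) // 10

# First hold out the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, random_state=42
)
# Then carve a validation set of the same size out of the remainder
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

The validation set is the one to use for tuning and early stopping, so that the test set is only touched once, for the final evaluation.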

### Imports

In [1]:
# Install xgboost if needed using the line of code below
# ! pip install xgboost

In [2]:
# import general functions
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

# import functions for modelling
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# import the cleaning package
import sbpkg as sb

### Functions

### Global Variables

In [3]:
# read in the different datasources
portfolio_df = pd.read_json('data/portfolio.json', lines=True)
profile_df = pd.read_json('data/profile.json', lines=True)
transcript_df = pd.read_json('data/transcript.json', lines=True)

### Run Cleaning Functions

In [4]:
# run the initial cleaning on each dataset
clean_port_df = sb.clean_portfolio_data(portfolio_df)
clean_prof_df = sb.clean_profile_data(profile_df)
clean_trans_df = sb.clean_transcript_data(transcript_df)

# calculates the uninfluenced transactions for the modeling
uninflunced_trans = sb.norm_transactions(clean_trans_df, clean_port_df)

# process the user data to create the modeling input
user_data = sb.user_transactions(clean_prof_df, uninflunced_trans)

# load in the user spend by day
spd = sb.spend_per_day(clean_trans_df, clean_port_df)

# predict the demographic of all the users in the data
predictions = sb.predict_demographic(user_data)
predictions.head()

ValueError: could not convert string to float: '68be06ca386d4c31939f3a4f0e3dd783'

### Data Preparation

In [None]:
# Merge the two datasets so that we have the demographic data for each person
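
This cell is empty in the notebook. A minimal sketch of what such a merge could look like is given below, using toy DataFrames in place of `predictions` and `spd`. The column names `person`, `demographic`, `day` and `spend` are hypothetical; the real outputs of `sb.predict_demographic` and `sb.spend_per_day` may be shaped differently.

```python
import pandas as pd

# Hypothetical stand-ins for `predictions` and `spd`; real column names may differ
demographics = pd.DataFrame({
    "person": ["a", "b", "c"],
    "demographic": [0, 1, 0],
})
spend = pd.DataFrame({
    "person": ["a", "a", "b", "c"],
    "day": [1, 2, 1, 1],
    "spend": [5.0, 7.5, 3.0, 4.0],
})

# Attach each person's demographic to their daily spend
merged = spend.merge(demographics, on="person", how="left")

# Aggregate to one row per demographic per day
model_input = (
    merged.groupby(["demographic", "day"])
    .agg(total_spend=("spend", "sum"), consumers=("person", "nunique"))
    .reset_index()
)
print(model_input)
```

Aggregating to one row per demographic per day matches the stated aim of predicting spend for a demographic as a whole rather than for individuals.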


In [None]:
# split the data again into testing and training data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.1, random_state=42)

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
# "Learn" the mean from the training data
mean_train = np.mean(y_train)
# Get predictions on the test set
baseline_predictions = np.ones(y_test.shape) * mean_train
# Compute MAE
mae_baseline = mean_absolute_error(y_test, baseline_predictions)
print("Baseline MAE is {:.2f}".format(mae_baseline))
Baseline MAE is 11.31

### Modelling Spend

In [None]:
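params = {
    # Parameters that we are going to tune.
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    # Other parameters
    # ('reg:linear' is deprecated in newer XGBoost releases in favour of 'reg:squarederror')
    'objective':'reg:linear',
    # Evaluate with MAE so that early stopping and best_score report MAE
    'eval_metric': 'mae',
}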

In [None]:
num_boost_round = 999

In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

print("Best MAE: {:.2f} with {} rounds".format(
                 model.best_score,
                 model.best_iteration+1))

In [None]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    seed=42,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)
cv_results

In [None]:
cv_results['test-mae-mean'].min()

In [None]:
# You can try wider intervals with a larger step between
# each value and then narrow it down. Here, after several
# iterations, I found that the optimal values were in the
# following ranges.
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9,12)
    for min_child_weight in range(5,8)
]
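
The coarse-to-fine strategy mentioned in the comment above can be sketched as follows. The `score` function is a made-up stand-in for the cross-validated MAE, with its minimum placed arbitrarily at `max_depth=10`, `min_child_weight=6`, so only the search pattern carries over to the real problem.

```python
from itertools import product

def score(max_depth, min_child_weight):
    # Toy stand-in for cross-validated MAE, minimised at (10, 6)
    return (max_depth - 10) ** 2 + (min_child_weight - 6) ** 2

def best_of(grid):
    """Return the (max_depth, min_child_weight) pair with the lowest score."""
    return min(grid, key=lambda pair: score(*pair))

# Coarse pass: wide ranges with a step of 3
coarse_grid = list(product(range(3, 16, 3), range(1, 11, 3)))
coarse_best = best_of(coarse_grid)

# Fine pass: step of 1 around the coarse winner
depth, weight = coarse_best
fine_grid = list(product(range(depth - 1, depth + 2), range(weight - 1, weight + 2)))
fine_best = best_of(fine_grid)
print("coarse best:", coarse_best, "fine best:", fine_best)
```

In the real search each call to `score` would be a full `xgb.cv` run, which is why a coarse first pass is cheaper than scanning the whole range at a step of 1.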

In [None]:
# Define initial best params and MAE
min_mae = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))
    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )
    # Update best MAE
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth,min_child_weight)
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

In [None]:
params['max_depth'] = 10
params['min_child_weight'] = 6

In [None]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(7,11)]
]

In [None]:
min_mae = float("Inf")
best_params = None
# We start with the largest values and go down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))
    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample
    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )
    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (subsample,colsample)
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

In [None]:
params['subsample'] = .8
params['colsample_bytree'] = 1.

In [None]:
%%time
# This can take some time…
min_mae = float("Inf")
best_params = None
for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))
    # We update our parameters
    params['eta'] = eta
    # Run and time CV
    %time cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            seed=42,
            nfold=5,
            metrics=['mae'],
            early_stopping_rounds=10
          )
    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds\n".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = eta
print("Best params: {}, MAE: {}".format(best_params, min_mae))

In [None]:
params['eta'] = .01

In [None]:
params
{'colsample_bytree': 1.0,
 'eta': 0.01,
 'eval_metric': 'mae',
 'max_depth': 10,
 'min_child_weight': 6,
 'objective': 'reg:linear',
 'subsample': 0.8}

In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

In [None]:
num_boost_round = model.best_iteration + 1
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

In [None]:
best_model.save_model("my_model.model")

In [None]:
loaded_model = xgb.Booster()
loaded_model.load_model("my_model.model")

# And use it for predictions.
loaded_model.predict(dtest)

### Modelling Analysis

In [None]:
demographics_output.groupby('cluster').mean()

In [None]:
demographics_output.groupby('cluster')['age'].describe()