## Modeling Betcounts per Wager

In this notebook I focus on finding a model to predict the total number of bets made per wager. A production model will be selected and compared against baseline metrics to understand how well it performs. I will then analyze the features that contribute most to the model's predictions, which will show what drove the total number of bets made during the first five months of the 2018 horse racing season.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time
import pickle
import csv

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LinearRegression, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline

plt.style.use('fivethirtyeight')

np.random.seed(42)



### Load the Data

Loading in the merged dataframe created during EDA which includes the data for each wager made along with the customer demographic data for each individual that made the wager.

In [2]:
final_wager_df = pd.read_csv('../data/final_wager.csv')

In [3]:
final_wager_df.head()

| | race_month | userid | address | gender | age | internet_/_shop | bet_type_category | handle | revenue | betcount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 199 | NY | Male | 18 | Mobile | WPS | 200.0 | 20.0 | 5.0 |
| 1 | 2018-01-01 | 199 | NY | Male | 18 | Desktop | WPS | 25.0 | 2.5 | 1.0 |
| 2 | 2018-02-01 | 199 | NY | Male | 18 | Mobile | WPS | 42.0 | 4.2 | 1.0 |
| 3 | 2018-01-01 | 531 | NY | Female | 18 | Mobile | WPS | 205.0 | 20.5 | 1.0 |
| 4 | 2018-01-01 | 887 | NY | Male | 18 | Mobile | WPS | 205.0 | 20.5 | 2.0 |


In [4]:
final_wager_df.dtypes

race_month            object
userid                 int64
address               object
gender                object
age                    int64
internet_/_shop       object
bet_type_category     object
handle               float64
revenue              float64
betcount             float64
dtype: object

### Setup Dummy Variables and Modeling Dataframe

With the final dataset complete, every column needs to be numeric before modeling. That means encoding the categorical features as dummy variables so the regression models can use them. Calling pandas "get_dummies" on the dataframe converts each object column into a set of 0/1 indicator columns, one per category.
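As a minimal sketch of what "get_dummies" does, the toy frame below (hypothetical values, not the wager data) has one object column and one numeric column. Only the object column is expanded; the numeric column passes through unchanged and is placed first.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "age": [18, 25, 40],
})

# Object columns become one indicator column per category;
# numeric columns are left as they are
toy_dummies = pd.get_dummies(toy)

print(toy_dummies.columns.tolist())
```

Depending on the pandas version, the indicator columns may be stored as 0/1 integers or as booleans; either should be usable as model input.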

In [5]:
model_df = pd.get_dummies(final_wager_df)

In [6]:
model_df.head()

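| | userid | age | handle | revenue | betcount | race_month_2018-01-01 | race_month_2018-02-01 | race_month_2018-03-01 | race_month_2018-04-01 | race_month_2018-05-01 | address_CA | address_FL | address_KY | address_NY | gender_Female | gender_Male | internet_/_shop_Desktop | internet_/_shop_Mobile | bet_type_category_Exotic | bet_type_category_WPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 199 | 18 | 200.0 | 20.0 | 5.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 199 | 18 | 25.0 | 2.5 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 199 | 18 | 42.0 | 4.2 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 531 | 18 | 205.0 | 20.5 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 887 | 18 | 205.0 | 20.5 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |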


In [7]:
model_df.drop('userid', axis=1, inplace=True)

In [8]:
model_df.shape

(34904, 19)

#### Save the Data

Saving a final copy of the modeling dataframe.

In [9]:
model_df.to_csv('../data/model_df.csv')

### Setup X and y

With the modeling dataframe complete, we need to separate the features used to train the model (X) from the target variable we are predicting (y).
- X : all columns in the dataframe except betcount
- y : the betcount values, since this is what we are trying to predict for each wager

In [10]:
X = model_df.drop('betcount', axis=1) 
y = model_df['betcount']

### Train/Test Split

To create separate batches of data for training and testing the model, we perform a train/test split on X and y. Rows are assigned to each batch at random.
- I will use the default split of 75% training data and 25% testing data
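The default split sizes can be checked on placeholder arrays with the same number of rows as the modeling dataframe. This is only a sketch: the arrays are stand-ins, and the `random_state` argument is an assumption added here to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as model_df (34,904)
X_demo = np.arange(34904).reshape(-1, 1)
y_demo = np.arange(34904)

# No test_size is given, so 25% of the rows are held out for testing;
# random_state (an assumption here) makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)  # (26178, 1) (8726, 1)
```

Features and target are shuffled together, so each row of X stays aligned with its y value.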

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [12]:
X_train.shape, y_train.shape

((26178, 18), (26178,))

In [13]:
X_test.shape, y_test.shape

((8726, 18), (8726,))

#### Save the Data

Saving the train and test split data for future use.

In [15]:
X_train.to_csv('../data/X_train.csv', index=False)

In [16]:
X_test.to_csv('../data/X_test.csv', index=False)

In [17]:
with open('../pickles/y_train.pkl', 'wb+') as f:
    pickle.dump(y_train, f)

In [18]:
with open('../pickles/y_test.pkl', 'wb+') as f:
    pickle.dump(y_test, f)

### Scale the Data

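For the models to weigh features fairly, each feature needs to be on the same scale; otherwise features with larger numeric ranges can dominate the fit. The scaler is fit on the training data only and then applied to the test data, so no information from the test set leaks into training.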
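A small sketch of the scaling step on hypothetical numbers (not the wager data): after fitting on the training rows, each training column has mean 0 and standard deviation 1, and the test rows are transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (age in years, handle in dollars)
train_demo = np.array([[18.0, 200.0], [30.0, 25.0], [45.0, 4200.0], [60.0, 42.0]])
test_demo = np.array([[25.0, 100.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learn mean/std from the training rows
test_scaled = scaler.transform(test_demo)        # reuse the training statistics

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```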

In [19]:
ss = StandardScaler()

In [20]:
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

#### Save the Data

In [21]:
with open('../pickles/standard_scaler.pkl', 'wb+') as f:
    pickle.dump(ss, f)

In [22]:
with open('../data/X_train_sc.csv', 'w+', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(X_train_sc)

In [23]:
with open('../data/X_test_sc.csv', 'w+', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(X_test_sc)

### Modeling Function

To keep the modeling process efficient I have created a helper function.

##### Explanation of Function

1. **Modeling Function:** Every model uses grid searching to find the best hyper-parameters, so the function instantiates GridSearchCV as gs. It then calls the fit method to train on the supplied data. Once fitting completes, the fitted grid search is pickled to a file named after the pipeline's step name (the model abbreviation) plus a timestamp in seconds, so each pickle can be matched to the model that produced it.
    - The function takes the following arguments:
        1. pipe : the pipeline set up for the model being fit
        2. params : the hyper-parameter values for the model in the pipeline that the grid search will evaluate
        3. X_train : the features used to train the model during fit
        4. y_train : the target variable values (total bets per wager) used to train the model during fit
        5. cv : the number of cross-validation folds (optional, defaults to 3)

In [24]:
def modeling_func(pipe, params, X_train, y_train, cv=3):
    # Grid search over params using cv-fold cross validation
    gs = GridSearchCV(pipe, param_grid=params, cv=cv)
    gs.fit(X_train, y_train)
    # Pickle the fitted search as <step name(s)><unix timestamp>.pkl
    with open(f'../pickles/{"_".join(pipe.named_steps.keys())}{int(time.time())}.pkl', 'wb+') as f:
        pickle.dump(gs, f)
    return gs
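The pattern the function wraps can be tried on synthetic data without writing a pickle. The sketch below is an illustration under assumptions: the data, the Ridge model, and the `demo_` names are hypothetical and not part of this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

demo_pipe = Pipeline([("ridge", Ridge())])
demo_params = {"ridge__alpha": [0.01, 1.0, 100.0]}

# Same steps as the function body, minus the pickle write
demo_gs = GridSearchCV(demo_pipe, param_grid=demo_params, cv=3)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)
print(len(demo_gs.cv_results_["params"]))  # one entry per candidate setting
```

The returned object behaves like the best fitted model, so `score` and `predict` can be called on it directly.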

### Baseline Model and Metrics

#### Average number of bets made per wager

To get a baseline for the target variable (number of bets made per wager) I take the mean of y_train and of y_test, which hold the actual number of bets made with each wager. Predicting this mean for every wager is known as the naive model, and it gives a reference point for what the average of my model's predictions should be close to.

#### R2 Score and Root Mean Squared Error

Let's also establish some baseline metrics for the target variable. These give my models something to beat:
- **R2:** The proportion of the variance in the target variable that the model explains from the features provided. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values are worse than predicting the mean, so we want a value as close to one as possible.
- **RMSE (Root Mean Squared Error):** The square root of the average squared difference between the predictions and the actual values. It is in the same units as the target variable (number of total bets) and should be as low as possible.
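Both metrics can be written out by hand, which makes the behavior of the naive model easy to see. This is a sketch on hypothetical bet counts: predicting the mean gives an R2 of exactly 0 and an RMSE equal to the standard deviation of the target.

```python
import numpy as np

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical bet counts, for illustration only
y_demo = np.array([1.0, 2.0, 5.0, 1.0, 21.0])
mean_pred = np.full_like(y_demo, y_demo.mean())

print(r2(y_demo, mean_pred))    # 0.0: the mean explains no variance
print(rmse(y_demo, mean_pred))  # equals the standard deviation of y_demo
```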

##### Interpretation of Baseline Model and Metrics

The average number of bets made in the train and test batches is similar, at about 7.4 and 7.6 bets per wager, so when the production model makes its final predictions I will compare their mean against roughly 7.5 bets.

The baseline R2 on both the train and test data is 0. This is expected: predicting the mean for every wager explains none of the variance. Any model with an R2 above zero is therefore already doing better than the baseline. I will be optimizing for an R2 as close to one as possible.

The baseline RMSE is about 18.25 on the train data and 19.80 on the test data, meaning the naive predictions are typically off by around 18 to 20 bets. The goal will be to obtain a test RMSE lower than 19.80.

In [25]:
print(f'Average Bets Made Train Data: {y_train.mean()}')
print(f'Average Bets Made Test Data: {y_test.mean()}')

Average Bets Made Train Data: 7.429559171823668
Average Bets Made Test Data: 7.6410726564290625


In [26]:
print(f'R2 score Train Data: {r2_score(y_train, [y_train.mean()] * len(y_train))}')
print(f'R2 score Test Data: {r2_score(y_test, [y_test.mean()] * len(y_test))}')

R2 score Train Data: 0.0
R2 score Test Data: 0.0


In [27]:
print(f'RMSE score Train Data: {np.sqrt(mean_squared_error(y_train, [y_train.mean()] * len(y_train)))}')
print(f'RMSE score Test Data: {np.sqrt(mean_squared_error(y_test, [y_test.mean()] * len(y_test)))}')

RMSE score Train Data: 18.248556172903967
RMSE score Test Data: 19.79890837425836


### GridSearch with ElasticNet

#### GridSearchCV

GridSearchCV searches a grid of hyper-parameter values supplied when it is instantiated. Using its built-in cross validation, it scores every combination in the grid, keeps the best-performing one, and refits the model with those values before making predictions.
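To get a feel for how much work a grid implies, the candidate combinations can be enumerated with `ParameterGrid`. The grid below is a hypothetical example, and the fold count of 3 matches the default used by the modeling function.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 9 values for one parameter, 3 for another
demo_grid = {
    "model__alpha": np.logspace(-3, 3, 9),
    "model__l1_ratio": [0.1, 0.5, 0.9],
}

candidates = list(ParameterGrid(demo_grid))
n_folds = 3

print(len(candidates))            # 27 combinations
print(len(candidates) * n_folds)  # 81 cross-validation fits
```

With the default `refit=True`, one further fit on the full training data follows the search.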


#### ElasticNet

Elastic net regularization applies both the lasso (L1) and ridge (L2) penalties to the model coefficients. The alpha parameter sets the overall strength of the penalty and l1_ratio sets the mix between the two: a value of 1 is pure lasso, a value of 0 is pure ridge, and values in between blend them. Grid searching over l1_ratio lets the search settle on whichever blend performs best.
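The penalty itself is short enough to write out. The sketch below follows the form given in scikit-learn's ElasticNet documentation; the function name and coefficient values are hypothetical.

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty term in the form documented for scikit-learn's ElasticNet."""
    l1 = np.sum(np.abs(coefs))
    l2 = np.sum(coefs ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

w = np.array([3.0, -4.0])  # hypothetical coefficients

print(elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0))  # pure L1: 7.0
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0))  # pure L2: 12.5
print(elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5))  # even mix: 9.75
```

The formula also shows why l1_ratio is restricted to the range 0 to 1: above 1, the weight on the L2 term turns negative and the penalty would reward large coefficients.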

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [58]:
# gs = pd.read_pickle('../pickles/enet1542340254.pkl')

#### Pipeline and Parameters

A Pipeline is the sklearn object that takes a sequential list of named steps, ending with the estimator or model you are planning to run.

Parameters should be a dictionary whose keys name the step and the hyper-parameter to tune, joined by a double underscore (for example `enet__alpha`), and whose values are the candidate settings to search over.
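The step-and-parameter naming can be checked directly on a pipeline. A minimal sketch, using a throwaway pipeline named `demo_pipe` (a hypothetical name):

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([("enet", ElasticNet())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
print("enet__alpha" in demo_pipe.get_params())  # True

# A grid search sets each candidate value through the same naming scheme
demo_pipe.set_params(enet__alpha=0.5)
print(demo_pipe.named_steps["enet"].alpha)  # 0.5
```

A key that does not match a step name and parameter raises an error when it is set, which catches typos in the grid early.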

In [42]:
pipe = Pipeline([
    ('enet', ElasticNet())
])

In [43]:
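params = {
    'enet__alpha': np.logspace(-3,3,9),
    # l1_ratio is documented as valid only between 0 and 1; the values below fall
    # outside that range and are kept because they produced the recorded results
    'enet__l1_ratio': [2.5, 3, 3.5, 4, 4.5]
}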

#### ElasticNet Model

**NOTE:** gs.score returns the coefficient of determination R2 of the predictions.

The ElasticNet model scores as follows on the train and test batches:
- Train R2: 0.0
- Test R2: -0.0001

##### Interpretation

The score on the test data is slightly negative and the train score is zero, so this model does not beat the baseline and will not be the model I choose for this data. As we saw with the correlation heat map earlier, there weren't any strong linear relationships with the total betcount per wager, so it makes sense that a linear model struggles here. Note, however, that l1_ratio is only defined between 0 and 1 and every value in the grid above lies outside that range, so these scores should not be read as the best a properly tuned ElasticNet could do.

**NOTE:** To run a model uncomment the below line of code

In [None]:
# gs = modeling_func(pipe, params, X_train_sc, y_train)

In [59]:
gs.best_params_

{'enet__alpha': 0.1778279410038923, 'enet__l1_ratio': 3}

In [60]:
gs.score(X_test_sc, y_test)

-0.00011412837368274253

In [61]:
gs.score(X_train_sc, y_train)

0.0

### GridSearch with GradientBoostingRegressor

#### Gradient Boost

Boosting builds many simple models in sequence and combines them so that the ensemble predicts better than any single one. These simple models are referred to as weak models or weak learners.

Gradient Boosting fits the weak learners one after another, training each new one on the residuals (errors) left by the learners before it. This concentrates each step on the observations that are still predicted poorly, and the final prediction is the sum of the contributions from every learner.
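The residual-fitting loop can be sketched by hand with shallow decision trees. This is an illustration on synthetic data under squared-error loss, where the quantity each tree fits is simply the residual; it is not scikit-learn's implementation, and the learning rate, tree depth, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    residuals = y_demo - prediction                 # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                     # weak learner fits the residuals
    prediction += learning_rate * tree.predict(X_demo)

mse_end = np.mean((y_demo - prediction) ** 2)
print(mse_start, mse_end)
```

Each round shrinks the training error a little, which is also why enough rounds of deep trees can overfit the training data.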

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [None]:
# gs = pd.read_pickle('../pickles/gb1542340446.pkl')

#### Pipeline and Parameters

In [48]:
pipe = Pipeline([
    ('gb', GradientBoostingRegressor())
])

In [49]:
params_gb = {
    'gb__n_estimators':[200, 210, 220],
    'gb__max_depth':[3, 5, 7]
}

#### Gradient Boost Model

**NOTE:** gs.score returns the coefficient of determination R2 of the predictions.

The GradientBoost model scores as follows on the train and test batches:
- Train R2 Score: 0.7851
- Test R2 Score: 0.3825

##### Interpretation

The model explains around 78.51% of the variance in betcount on the train data but only around 38.25% on the test data. The training score is roughly double the testing score, so the model is clearly overfit, but it outperforms the baseline and is the best we've seen so far.

**NOTE:** To run a model uncomment the below line of code

In [50]:
# gs = modeling_func(pipe, params_gb, X_train_sc, y_train)

In [53]:
gs.best_params_

{'gb__max_depth': 5, 'gb__n_estimators': 200}

In [54]:
gs.score(X_train_sc, y_train)

0.7851044357317457

In [55]:
gs.score(X_test_sc, y_test)

0.3825093662132626

### GridSearch with AdaBoostRegressor

#### AdaBoost

Again, boosting is the process of building multiple simple models in sequence and combining them into a stronger predictor. These simple models are referred to as weak models or weak learners.

AdaBoost works similarly to Gradient Boost in that it trains weak learners one after another, except that instead of fitting residuals it increases the weights on the training observations that were predicted least accurately, so the next learner focuses on them. The learners' predictions are then combined, weighted by how well each one performed, to make the overall prediction.
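One round of the reweighting can be sketched in a few lines. This follows the AdaBoost.R2-style update with a linear loss, which is my understanding of what scikit-learn's AdaBoostRegressor uses by default; the error values are hypothetical and the sketch leaves out the learning rate.

```python
import numpy as np

# Hypothetical absolute errors of one weak learner on four training rows
abs_errors = np.array([0.1, 0.2, 2.0, 0.1])
weights = np.full(4, 0.25)                # start with equal weights

loss = abs_errors / abs_errors.max()      # linear loss scaled to [0, 1]
avg_loss = np.sum(weights * loss)         # weighted average loss
beta = avg_loss / (1 - avg_loss)          # below 1 when avg_loss < 0.5

# Well-predicted rows (small loss) are shrunk the most
new_weights = weights * beta ** (1 - loss)
new_weights /= new_weights.sum()

print(new_weights.round(3))
```

After the update, the poorly predicted third row carries the largest weight, so the next weak learner pays the most attention to it.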

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [None]:
# gs = pd.read_pickle('../pickles/ada1542340860.pkl')

#### Pipeline and Parameters

In [64]:
pipe = Pipeline([
    ('ada', AdaBoostRegressor())
])

In [65]:
params_ada = {
    'ada__n_estimators':[120, 140, 160]
}

#### AdaBoost Model

**NOTE:** gs.score returns the coefficient of determination R2 of the predictions.

The AdaBoost model scores as follows on the train and test batches:
- Train R2 Score: -0.0233
- Test R2 Score: 0.0090

##### Interpretation

The scores have dropped back to about the level we saw with ElasticNet: the train R2 is slightly negative and the test R2 is barely above zero. The model is not performing meaningfully better than the baseline, so I will not use it.

**NOTE:** To run a model uncomment the below line of code

In [66]:
# gs = modeling_func(pipe, params_ada, X_train_sc, y_train)

In [67]:
gs.best_params_

{'ada__n_estimators': 140}

In [68]:
gs.score(X_train_sc, y_train)

-0.023289600567534485

In [69]:
gs.score(X_test_sc, y_test)

0.009018999396452787

### GridSearch with RandomForestRegressor

#### RandomForest

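This ensemble technique builds many decision trees, each trained on a random bootstrap sample of the training data (and optionally a random subset of the features at each split), and averages the trees' predictions to improve accuracy and reduce overfitting compared with a single tree.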
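The averaging step can be verified on a small forest fit to synthetic data. In this sketch the data and the choice of 10 trees are arbitrary; the point is that averaging the individual trees by hand reproduces the forest's prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Average the individual trees by hand and compare with the forest's prediction
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_demo)))  # True
```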

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [None]:
# gs = pd.read_pickle('../pickles/rf1542341132.pkl')

#### Pipeline and Parameters

In [70]:
pipe = Pipeline([
    ('rf', RandomForestRegressor())
])

In [71]:
params_rf = {
    'rf__n_estimators':[140, 150, 160, 170]
}

#### Random Forest Model

**NOTE:** gs.score returns the coefficient of determination R2 of the predictions.

The Random Forest model scores as follows on the train and test batches:
- Train R2 Score: 0.9017
- Test R2 Score: 0.3189

##### Interpretation

This model has an R2 of 0.9017 on the train data, meaning the features in the model explain 90.17% of the variance in the actual values (total bets per wager). On the test data the R2 is 0.3189, so only 31.89% of the variance is explained. This is the highest train R2 we've seen so far, but the score that matters is the one on the test data, since that is the unseen data we are predicting on. The large gap between the two also shows the model is heavily overfit. The test score is close to, but does not beat, the one obtained from the Gradient Boost model.

**NOTE:** To run a model uncomment the below line of code

In [72]:
# gs = modeling_func(pipe, params_rf, X_train, y_train)

In [73]:
gs.best_params_

{'rf__n_estimators': 150}

In [74]:
gs.score(X_train, y_train)

0.9017104271150153

In [75]:
gs.score(X_test, y_test)

0.31893094456550153

### GridSearch with SupportVectorRegressor

#### SupportVectorRegressor

Rather than fitting a line that minimizes every error, the support vector regressor fits a function with a margin of width epsilon around it. Training points that fall inside this margin contribute no error, while points outside it are penalized in proportion to their distance from the margin, with the C parameter controlling how strongly. Predictions for unseen data come from this fitted function.
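Support vector regression is built around an epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors cost only the amount by which they exceed epsilon. A minimal sketch on hypothetical values (the function name is my own):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 5.0])  # hypothetical bet counts
y_pred = np.array([1.3, 2.0, 3.0])

print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5))
```

Only the third prediction, which misses by 2.0, falls outside the tube and contributes to the loss. A larger epsilon ignores more of the small errors, which is why it is one of the two values tuned in the grid for this model.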

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [None]:
# gs = pd.read_pickle('../pickles/svr1542325470.pkl')

#### Pipeline and Parameters

In [76]:
pipe = Pipeline([
    ('svr', SVR())
])

In [77]:
params_svr = {
    'svr__C':[.05, .5, 1],
    'svr__epsilon':[.01, .1, .5]
}

#### Support Vector Regressor Model

**NOTE:** gs.score returns the coefficient of determination R2 of the predictions.

The Support Vector model scores as follows on the train and test batches:
- Train R2 Score: 0.1186
- Test R2 Score: 0.1020

##### Interpretation

This model has an R2 of 0.1186 on the train data and 0.1020 on the test data, meaning the features in the model explain 11.86% and 10.20% of the variance in the actual values (total bets per wager) respectively. The two scores are close, so the model is not overfit, and it performs better than the baseline, the AdaBoost ensemble model, and the ElasticNet linear model. It does not outperform the other two ensemble models, RandomForest and GradientBoost.

**NOTE:** To run a model uncomment the below line of code

In [78]:
gs = modeling_func(pipe, params_svr, X_train_sc, y_train)

In [79]:
gs.best_params_

{'svr__C': 1, 'svr__epsilon': 0.5}

In [80]:
gs.score(X_train_sc, y_train)

0.11859259968811098

In [81]:
gs.score(X_test_sc, y_test)

0.10199837095709108
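To compare the five models side by side, the R2 scores from the cell outputs above can be collected into one frame and ranked by test score. The values are copied from this notebook and rounded to four decimals; the `gap` column is an addition of mine to show the degree of overfitting.

```python
import pandas as pd

# R2 scores copied from the cell outputs in this notebook (rounded to four decimals)
scores = pd.DataFrame(
    {
        "train_r2": [0.0, 0.7851, -0.0233, 0.9017, 0.1186],
        "test_r2": [-0.0001, 0.3825, 0.0090, 0.3189, 0.1020],
    },
    index=["ElasticNet", "GradientBoost", "AdaBoost", "RandomForest", "SVR"],
)

# Train minus test: a large gap signals overfitting
scores["gap"] = scores["train_r2"] - scores["test_r2"]

ranked = scores.sort_values("test_r2", ascending=False)
print(ranked)
```

Gradient Boost ranks first on the test data, with Random Forest second and showing the largest gap between train and test.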

##### Moving to 03-Production_Model