# Modeling

With the data cleaned, explored and pre-processed, I can now run models to predict the exit velocity (launch speed) of each home run hit in the 2015, 2016 and 2017 seasons.

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import pickle
import csv

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LinearRegression, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error 
from sklearn import preprocessing
from scipy.stats import ttest_ind

import time

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

### Load the Data

Loading in the following data and pickles:
- Scaled data as X_train_sc and X_test_sc
- Original X_train and X_test data frames
- Original y_train and y_test target variable arrays
- Pickled Standard Scaler as ss

In [9]:
X_train_sc = pd.read_csv('../data/X_train_sc.csv', header=None)
X_test_sc = pd.read_csv('../data/X_test_sc.csv', header=None)

In [10]:
X_train = pd.read_csv('../data/X_train.csv')
X_test = pd.read_csv('../data/X_test.csv')

In [11]:
y_train = pd.read_pickle('../pickles/y_train.pkl')
y_test = pd.read_pickle('../pickles/y_test.pkl')

In [12]:
ss = pd.read_pickle('../pickles/standard_scaler.pkl')

### Significant Statistical Evidence

Before modeling, I want to test whether there is statistically significant evidence that launch speeds changed from season to season across 2015, 2016 and 2017.

##### Interpretation (2015 to 2016)

Using an alpha of .05 (a 95% confidence level), p=0.027 < a=.05, so we can reject **H0** and conclude that there is significant evidence that the mean launch speeds in 2015 and 2016 are not equal.

##### Interpretation (2016 to 2017)

Using an alpha of .05 (a 95% confidence level), p=0.0027 < a=.05, so we can reject **H0** and conclude that there is significant evidence that the mean launch speeds in 2016 and 2017 are not equal.


Now that we can see statistical evidence of differences between the seasons, let's run some models to see what influences a batter's launch speed on home runs.

#### Hypothesis Testing

- **H0**: Launch Speed 2015 = Launch Speed 2016
- **H1**: Launch Speed 2015 != Launch Speed 2016

In [34]:
ttest_ind(y_train[(X_train.game_year_2016 != 1) & (X_train.game_year_2017 != 1)], y_train[X_train.game_year_2016 == 1])

Ttest_indResult(statistic=-2.2139224419171395, pvalue=0.026862949014065347)

#### Hypothesis Testing

- **H0**: Launch Speed 2016 = Launch Speed 2017
- **H1**: Launch Speed 2016 != Launch Speed 2017

In [35]:
ttest_ind(y_train[X_train.game_year_2016 == 1], y_train[X_train.game_year_2017 == 1])

Ttest_indResult(statistic=3.000670607720154, pvalue=0.0027015589253692584)
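
The two tests above call ttest_ind with its defaults, which is Student's t-test and assumes both groups have equal variance. As a hedged sketch of the decision rule, the example below runs the default test and Welch's variant (equal_var=False) on synthetic data. The arrays season_a and season_b are made-up stand-ins, not the real launch speeds.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins for two seasons of launch speeds (mph); NOT the real data
season_a = rng.normal(loc=103.0, scale=4.4, size=2000)
season_b = rng.normal(loc=103.4, scale=4.4, size=2000)

student = ttest_ind(season_a, season_b)                 # assumes equal variances
welch = ttest_ind(season_a, season_b, equal_var=False)  # drops that assumption

alpha = 0.05
for name, res in [('Student', student), ('Welch', welch)]:
    decision = 'reject H0' if res.pvalue < alpha else 'fail to reject H0'
    print(f'{name}: t={res.statistic:.3f}, p={res.pvalue:.4f} -> {decision}')
```

With equal group sizes the two variants give the same t statistic. They can differ when the seasons have different numbers of home runs and different spreads, in which case Welch's version is the more cautious choice.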

### Baseline Model and Metrics

#### Average Launch Speeds

To get a baseline for the target variable (launch speed) I will take the average of the y_train and y_test variables which contain the actual launch speeds for each home run that was hit since 2015. This will give me an idea of what the average of my model predictions should be targeting.

#### R2 Score and Root Mean Squared Error

Let's also establish some baseline metrics for the target variable. These give my models a bar to beat:
- **R2:** The proportion of the variation in the target variable that is explained by the features in the model. A value of 1 means the predictions match the actual values exactly, and a value of 0 means the model does no better than always predicting the mean, so we want a value close to one.
- **RMSE (Root Mean Squared Error):** The square root of the average squared difference between the predicted and actual values, which is roughly the typical size of a prediction error. It is in the same units as the target variable (launch speed, in mph), and lower is better.
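
To make the two metrics concrete, here is a minimal sketch that computes them straight from their definitions on a handful of made-up launch speeds. The helper r2_and_rmse and the speeds array are hypothetical and exist only for this illustration.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    return r2, rmse

# Hypothetical launch speeds in mph, for illustration only
speeds = np.array([98.0, 101.0, 103.0, 105.0, 108.0])

# Baseline: always predict the mean
baseline = np.full_like(speeds, speeds.mean())
r2_base, rmse_base = r2_and_rmse(speeds, baseline)
print(f'Baseline R2: {r2_base:.3f}, baseline RMSE: {rmse_base:.3f} mph')
```

Predicting the mean for every observation always gives an R2 of 0, and its RMSE equals the standard deviation of the target.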

##### Interpretation of Baseline Model and Metrics

The average launch speeds across the train and test batches are almost identical at around 103 mph so when we make the final predictions from the production model we'll get the mean and compare how close it is to 103 mph.

The baseline predictions (the average launch speed) give an R2 of 0 on both the train and test data. That is expected: always predicting the mean explains none of the variance. R2 can drop below zero for a model that is worse than predicting the mean, but any model with an R2 above zero is already doing better than the baseline. I will be optimizing for an R2 as close to one as possible.

The baseline RMSE is similar on the train and test data at about 4.44 and 4.37 mph, meaning the baseline predictions are off by roughly 4.4 mph on average. I will attempt to optimize my model to obtain an RMSE lower than this baseline.

In [11]:
print(f'Average Launch Speed Train Data: {y_train.mean()}')
print(f'Average Launch Speed Test Data: {y_test.mean()}')

Average Launch Speed Train Data: 103.21479039479041
Average Launch Speed Test Data: 103.2904052734375


In [15]:
print(f'R2 score Train Data: {r2_score(y_train, [y_train.mean()] * len(y_train))}')
print(f'R2 score Test Data: {r2_score(y_test, [y_test.mean()] * len(y_test))}')

R2 score Train Data: 0.0
R2 score Test Data: 0.0


In [16]:
print(f'RMSE score Train Data: {np.sqrt(mean_squared_error(y_train, [y_train.mean()] * len(y_train)))}')
print(f'RMSE score Test Data: {np.sqrt(mean_squared_error(y_test, [y_test.mean()] * len(y_test)))}')

RMSE score Train Data: 4.439441224645179
RMSE score Test Data: 4.365981019821426


### Modeling Functions

To keep the modeling process efficient, I have created a function to run each model.

##### Explanation of Function

1. **Modeling Function:** All of my models use grid searching to find the optimal parameters, so the function instantiates GridSearchCV as gs. Next, it calls the fit method on the training data to train the model. Once fitting completes, the fitted grid search is saved as a pickle whose file name combines the step name from the corresponding pipeline (the model abbreviation) with the time stamp (in seconds), so each pickle can be matched to its model.
    - The following arguments are required to run this function:
        1. pipe : the pipeline set up for the model to run
        2. params : the parameter grid for the model in the pipeline, which the grid search uses to find the best parameters
        3. X_train : the training batch of data used to train the model during fit
        4. y_train : the training target variable values used to train the model during fit
    - The optional cv argument sets the number of cross-validation folds (default 3).

In [1]:
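def modeling_func(pipe, params, X_train, y_train, cv=3):
    """Grid search pipe over params, pickle the fitted search and return it."""
    gs = GridSearchCV(pipe, param_grid=params, cv=cv)
    gs.fit(X_train, y_train)
    # File name: pipeline step name(s) joined by "_" followed by the Unix time in seconds
    with open(f'../pickles/{"_".join(pipe.named_steps.keys())}{int(time.time())}.pkl', 'wb') as f:
        pickle.dump(gs, f)
    return gs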

## Linear Regression Models

To start the modeling process I want to try models based on multiple linear regression to see whether any of them are well suited to making predictions on my data. (NOTE: From EDA I saw that less than half of my features had strong linear correlations with the target variable (launch speed), so these models may not be the right fit for this data, but I will run them to confirm.)

### GridSearch with Ridge (L2)

#### GridSearchCV

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.
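
As a small illustration of what the grid search records, the sketch below tunes a Ridge alpha on synthetic data and prints the mean cross-validated R2 for every candidate. The data from make_regression and the name gs_demo are assumptions for this example only and are unrelated to the home run data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to demonstrate the search
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

alphas = np.logspace(-1, 3, 9)
gs_demo = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=3)
gs_demo.fit(X, y)

# cv_results_ holds one mean cross-validated score per candidate
results = gs_demo.cv_results_
for alpha, score in zip(results['param_alpha'], results['mean_test_score']):
    print(f'alpha={float(alpha):9.3f}  mean CV R2={score:.4f}')

print('Best parameters:', gs_demo.best_params_)
print('Best mean CV R2:', round(gs_demo.best_score_, 4))
```

The best_score_ attribute is the mean cross-validated score of the winning candidate, so it is a different number from the train and test scores reported by calling score on the fitted search.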

#### Ridge (L2)

Ridge regression is a regularization technique that accepts a small amount of bias in exchange for lower variance. It is useful when there is strong multicollinearity among the selected feature variables. Ridge adds a penalty proportional to the sum of the squared coefficients, which shrinks the coefficients toward zero. With this technique a coefficient is never set exactly to zero; it is only brought closer and closer to zero to help the model predictions.
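
The shrinking effect can be seen in a short sketch on synthetic data (an assumption for illustration, not the home run features): as alpha grows, the overall size of the coefficient vector falls.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data in which every feature is informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=8,
                       noise=5.0, random_state=0)

norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coef)  # overall size of the coefficient vector
    n_zero = int(np.sum(coef == 0))
    print(f'alpha={alpha:7.1f}  coefficient norm={norms[alpha]:9.3f}  exact zeros={n_zero}')
```

The count of exact zeros is printed so it can be compared with a Lasso fit, where coefficients can be removed entirely.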

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [24]:
gs_rd = pd.read_pickle('../pickles/rid1539795639.pkl')

#### Pipeline and Parameter

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example rid__alpha), and whose values are the candidate values to search over.
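
The naming convention for the parameter keys can be checked directly on a pipeline. In this sketch the StandardScaler step and the name pipe_demo are assumptions added for illustration; the pipelines in this notebook contain only the model step because the data was scaled in an earlier notebook.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline: scale, then fit Ridge
pipe_demo = Pipeline([
    ('ss', StandardScaler()),
    ('rid', Ridge()),
])

# Every tunable name follows '<step name>__<parameter name>'
ridge_keys = sorted(k for k in pipe_demo.get_params() if k.startswith('rid__'))
print(ridge_keys)

# set_params uses the same names, which is what GridSearchCV relies on
pipe_demo.set_params(rid__alpha=10.0)
print(pipe_demo.named_steps['rid'].alpha)
```

A key that does not match a step name and parameter raises an error when the search is fit, so printing the available keys first is a quick way to catch typos.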

In [22]:
pipe_rd = Pipeline([
    ('rid', Ridge())
])

In [23]:
params_rd = {
    'rid__alpha':np.logspace(-1, 3, 9)
}

#### Ridge Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the Ridge model:
- Train R2: .516
- Test R2: .504

##### Interpretation

My train and test scores indicate that about 52% and 50% of the variation in the target variable (launch speed) is explained by the features in my data. This beats the baseline R2 of 0, but the explained variance is still fairly low. The train score is also slightly higher than the test score, which suggests mild overfitting, although the gap is only about one percentage point. I want to find a model that scores higher in explained variance, so I will move on to the next model.

**NOTE:** To run a model uncomment the below line of code

In [61]:
# gs_rd = modeling_func(pipe_rd, params_rd, X_train_sc, y_train)

In [25]:
gs_rd.best_params_

{'rid__alpha': 31.622776601683793}

In [26]:
gs_rd.best_estimator_

Pipeline(memory=None,
     steps=[('rid', Ridge(alpha=31.622776601683793, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001))])

In [27]:
train = gs_rd.score(X_train_sc, y_train)
test = gs_rd.score(X_test_sc, y_test)

In [28]:
print(f'Model Train Score: {train}')
print(f'Model Test Score: {test}')

Model Train Score: 0.5158479079381048
Model Test Score: 0.5041979420168723


### GridSearch with Lasso (L1)

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.

#### Lasso (L1)

Lasso regression is a regularization technique that accepts a small amount of bias in exchange for lower variance. It is useful when there are many features, some of which are redundant or strongly correlated with each other. Lasso adds a penalty proportional to the sum of the absolute values of the coefficients. With this technique the coefficients of the least useful features can be set exactly to zero, so those features are dropped from the model.
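
A short sketch on synthetic data (an assumption for illustration, not the home run features) shows the zeroing behaviour: only 5 of the 20 generated features carry signal, and a larger alpha removes more coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

zero_counts = {}
for alpha in [0.01, 1.0, 10.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    zero_counts[alpha] = int(np.sum(coef == 0))
    print(f'alpha={alpha:6.2f}  coefficients set to zero: {zero_counts[alpha]} of {len(coef)}')
```

This is why Lasso is sometimes used as a feature selection step as well as a model.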

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [29]:
gs_la = pd.read_pickle('../pickles/la1539796474.pkl')

#### Pipeline and Parameter

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example la__alpha), and whose values are the candidate values to search over.

In [30]:
pipe_la = Pipeline([
    ('la', Lasso())
])

In [72]:
params_la = {
    'la__alpha':np.logspace(-3, 3, 7)
}

#### Lasso Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the Lasso model:
- Train R2: .514
- Test R2: .504

##### Interpretation

My train and test scores indicate that about 51% and 50% of the variation in the target variable (launch speed) is explained by the features in my data. This beats the baseline R2 of 0, but the explained variance is still fairly low. The train score is also slightly higher than the test score, which suggests mild overfitting. This model scored marginally lower than the Ridge model (about .002 lower on train and .0001 lower on test), so in my search for a model that scores higher in explained variance I will move on to the next model.

**NOTE:** To run a model uncomment the below line of code

In [None]:
# gs_la = modeling_func(pipe_la, params_la, X_train_sc, y_train)

In [25]:
gs_la.best_params_

{'la__alpha': 0.01}

In [26]:
gs_la.best_estimator_

Pipeline(memory=None,
     steps=[('la', Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False))])

In [27]:
gs_la.score(X_train_sc, y_train)

0.5141935832528688

In [28]:
gs_la.score(X_test_sc, y_test)

0.5040682614801717

### GridSearch with ElasticNet

#### GridSearchCV

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.


#### ElasticNet

The elastic net regularization technique applies both the Lasso and Ridge penalties (L1 and L2) to the coefficients. The l1_ratio parameter controls the mix between the two: a value of 1 is a pure Lasso penalty, and values near 0 approach a pure Ridge penalty. By grid searching over l1_ratio, the model can settle on Lasso, Ridge or a blend of the two, whichever performs best.
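
The relationship between ElasticNet and Lasso can be verified with a minimal sketch on synthetic data (an assumption for illustration): with l1_ratio set to 1 and the same alpha, the two estimators fit the same coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

lasso = Lasso(alpha=0.01).fit(X, y)
enet_l1 = ElasticNet(alpha=0.01, l1_ratio=1).fit(X, y)     # pure L1 penalty
enet_mix = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)  # half L1, half L2

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print('l1_ratio=1 matches Lasso:', same_as_lasso)
print('l1_ratio=0.5 matches Lasso:', np.allclose(lasso.coef_, enet_mix.coef_))
```

So when a grid search over l1_ratio returns 1, the tuned ElasticNet is effectively a Lasso model.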

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [29]:
gs_enet = pd.read_pickle('../pickles/enet1539795204.pkl')

#### Pipeline and Parameters

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example enet__alpha), and whose values are the candidate values to search over.

In [39]:
pipe_enet = Pipeline([
    ('enet', ElasticNet())
])

In [40]:
params_enet = {
    'enet__alpha': np.logspace(-3, 3, 7),
    'enet__l1_ratio': [.0001, .3, .5, .7, .9, 1]
}

#### ElasticNet Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the ElasticNet model:
- Train R2: .514
- Test R2: .504

##### Interpretation

My train and test scores indicate that about 51% and 50% of the variation in the target variable (launch speed) is explained by the features in my data. This beats the baseline R2 of 0, but the explained variance is still fairly low. The train score is also slightly higher than the test score, which suggests mild overfitting.

The interesting thing about the ElasticNet is that, from the parameters I gave the GridSearch to tune over, it chose an l1_ratio of 1 and an alpha of 0.01. An l1_ratio of 1 is a pure Lasso penalty, and the alpha matches the one chosen for the Lasso model, so my ElasticNet model yields the same scores as the Lasso model that I ran earlier. Again, I want to do better, so I am moving on to the next model.

**NOTE:** To run a model uncomment the below line of code

In [None]:
# gs_enet = modeling_func(pipe_enet, params_enet, X_train_sc, y_train)

In [30]:
gs_enet.best_params_

{'enet__alpha': 0.01, 'enet__l1_ratio': 1}

In [31]:
gs_enet.best_estimator_

Pipeline(memory=None,
     steps=[('enet', ElasticNet(alpha=0.01, copy_X=True, fit_intercept=True, l1_ratio=1,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))])

In [32]:
gs_enet.score(X_train_sc, y_train)

0.5141935832528688

In [33]:
gs_enet.score(X_test_sc, y_test)

0.5040682614801717

## Ensemble Models

Now that we've confirmed that the multiple linear regression models are not scoring as well as I'd like, let's move into ensemble modeling. An ensemble combines the predictions of two or more models to produce a single prediction of the target variable from the features provided.

### GridSearch with RandomForestRegressor

#### GridSearchCV

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.

#### RandomForest

This ensemble modeling technique builds many decision trees, each trained on a bootstrap sample of the data and, optionally, a random subset of the features at each split, and averages their predictions to improve the accuracy of the model.
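
The averaging step can be checked by hand. In this sketch on synthetic data (an assumption for illustration), the prediction of a scikit-learn RandomForestRegressor is reproduced by averaging the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=0)
rf_demo.fit(X, y)

# One row of predictions per fitted tree
tree_preds = np.array([tree.predict(X) for tree in rf_demo.estimators_])
manual_average = tree_preds.mean(axis=0)

matches = np.allclose(manual_average, rf_demo.predict(X))
print('Forest prediction equals the mean of its trees:', matches)
```

Because each tree sees a different bootstrap sample, their individual errors partly cancel when averaged, which is where the reduction in variance comes from.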

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [13]:
gs_rf = pd.read_pickle('../pickles/rf1539714652.pkl')

#### Pipeline and Parameters

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example rf__n_estimators), and whose values are the candidate values to search over.

In [62]:
pipe_rf = Pipeline([
    ('rf', RandomForestRegressor())
])

In [75]:
params_rf = {
    'rf__n_estimators':[130, 140, 150],
    'rf__max_depth':[20, 25, 30]
}

#### Random Forest Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the Random Forest model:
- Train R2: .942
- Test R2: .603

##### Interpretation

As expected with the Random Forest model, we can see an extreme case of overfitting. The train score looks great, explaining 94% of the variance, but the test score explains only 60%. That gap of about 34 percentage points is evidence that this model is far too overfit and is not the production model I want. Let's move on to the next ensemble model.

**NOTE:** To run a model uncomment the below line of code

In [14]:
# gs_rf = modeling_func(pipe_rf, params_rf, X_train, y_train)

In [15]:
gs_rf.best_params_

{'rf__max_depth': 25, 'rf__n_estimators': 140}

In [16]:
gs_rf.best_estimator_

Pipeline(memory=None,
     steps=[('rf', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=25,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=140, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))])

In [17]:
train = gs_rf.score(X_train, y_train)
test = gs_rf.score(X_test, y_test)

In [18]:
print(f'Model Train Score {train}')
print(f'Model Test Score {test}')

Model Train Score 0.941821592737002
Model Test Score 0.603107782648939


### GridSearch with GradientBoostingRegressor

#### GridSearchCV

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.

#### Gradient Boost

Boosting builds many simple models in sequence, with each one learning from the mistakes of the ones before it, so that the combined prediction is more accurate. These simple models are referred to as weak models or weak learners.

Gradient Boosting fits each new weak learner to the residuals (errors) of the ensemble built so far, which focuses it on the observations that are currently predicted worst. Each learner's output is then added to the running prediction, so the overall approximation improves step by step.
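
The idea of training on residuals can be sketched by hand with shallow decision trees. This is a simplified illustration for squared error on synthetic data, not the exact algorithm inside GradientBoostingRegressor; the learning rate, tree depth and number of stages are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())            # stage 0: predict the mean
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                    # what the ensemble still gets wrong
    weak = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction = prediction + learning_rate * weak.predict(X)  # small corrective step
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f'Training MSE at stage 0:  {mse_per_stage[0]:.1f}')
print(f'Training MSE at stage 50: {mse_per_stage[-1]:.1f}')
```

The training error falls at every stage, which is also why boosting can overfit if too many stages or overly deep trees are used.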

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [150]:
gs_gb = pd.read_pickle('../pickles/gbr1539715514.pkl')

#### Pipeline and Parameters

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example gbr__n_estimators), and whose values are the candidate values to search over.

In [22]:
pipe_gb = Pipeline([
    ('gbr', GradientBoostingRegressor())
])

In [103]:
params_gb = {
    'gbr__n_estimators':[200, 210, 220],
    'gbr__max_depth':[3, 5, 7]
}

#### Gradient Boost Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the Gradient Boost model:
- Train R2: .680
- Test R2: .617

##### Interpretation

Now we are beginning to see a better model. The train score explains around 68% of the variance in launch speed and the test score around 62%. The model is still somewhat overfit, as the train score is greater than the test score, but the gap is not too large, and in my opinion this is the best model up to this point. Let's review one more.

**NOTE:** To run a model uncomment the below line of code

In [104]:
# gs_gb = modeling_func(pipe_gb, params_gb, X_train_sc, y_train)

In [151]:
gs_gb.best_params_

{'gbr__max_depth': 3, 'gbr__n_estimators': 210}

In [152]:
gs_gb.best_estimator_

Pipeline(memory=None,
     steps=[('gbr', GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=210, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False))])

In [153]:
test = gs_gb.score(X_test_sc, y_test)
train = gs_gb.score(X_train_sc, y_train)

In [154]:
print(f'Model Train {train}')
print(f'Model Test {test}')

Model Train 0.680384206966165
Model Test 0.6165711841636452


### GridSearch with AdaBoostRegressor

#### GridSearchCV

GridSearchCV searches for the best hyperparameters from the grid of candidate values passed in when it is instantiated. Using its built-in cross-validation, it evaluates every combination in the grid, then refits the model with the best-scoring combination and uses that model when making predictions.

#### AdaBoost

Again, boosting is the process of building multiple simple models in sequence and learning from them to make more accurate predictions. These simple models are referred to as weak models or weak learners.

AdaBoost works similarly to Gradient Boost in that it trains weak learners one after another, except that instead of fitting to residuals it increases the weights on the observations that were predicted least accurately, so the next learner concentrates on them. The weak learners are then combined, weighted by how well each performed, to make a better overall approximation.
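
The per-learner weighting can be inspected on a fitted model. This sketch uses synthetic data and the hypothetical name ada_demo. It reads the estimator_weights_ and estimator_errors_ attributes that scikit-learn's AdaBoostRegressor exposes after fitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

ada_demo = AdaBoostRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting can stop early, so only look at the learners that were actually kept
n_fitted = len(ada_demo.estimators_)
weights = ada_demo.estimator_weights_[:n_fitted]
errors = ada_demo.estimator_errors_[:n_fitted]

for i in range(min(5, n_fitted)):
    print(f'learner {i}: weighted error={errors[i]:.3f}, weight in ensemble={weights[i]:.3f}')
```

Learners with a lower weighted error receive a larger weight when the ensemble combines their predictions.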

**NOTE:** The below line of code can be run to use the same pickled model that I used during my modeling process

In [164]:
gs_ada = pd.read_pickle('../pickles/ada1539716634.pkl')

#### Pipeline and Parameters

A Pipeline is an sklearn class that takes a sequential list of named steps, ending with the estimator (model) that you are planning to run.

The parameters should be a dictionary whose keys name the step and the parameter to tune, in the form step__parameter (for example ada__n_estimators), and whose values are the candidate values to search over.

In [120]:
pipe_ada = Pipeline([
    ('ada', AdaBoostRegressor())
])

In [121]:
params_ada = {
    'ada__n_estimators':[40, 50, 60]
}

#### AdaBoost Model

**NOTE:** The score method returns the coefficient of determination (R2) of the predictions.

The train and test batches score as follows with the AdaBoost model:
- Train R2: .484
- Test R2: .453

##### Interpretation

Interesting. This model is the least overfit of the ensemble models that I've run, but it scores very low. The train score explains only about 48% of the variance in launch speed from the features in the dataset, and the test score only about 45%.

At this point I believe the Gradient Boosting Model to be the best of the models I've run and this will be my production level model.

**NOTE:** To run a model uncomment the below line of code

In [122]:
# gs_ada = modeling_func(pipe_ada, params_ada, X_train_sc, y_train)

In [165]:
gs_ada.best_params_

{'ada__n_estimators': 50}

In [166]:
gs_ada.best_estimator_

Pipeline(memory=None,
     steps=[('ada', AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
         n_estimators=50, random_state=None))])

In [167]:
train = gs_ada.score(X_train_sc, y_train)
test = gs_ada.score(X_test_sc, y_test)

In [168]:
print(f'Model Train Score: {train}')
print(f'Model Test Score: {test}')

Model Train Score: 0.4836915544103218
Model Test Score: 0.452611043715773


##### On to the Production Level Model: 05-Production_Model