# Pre-Processing and Training Data

# 1.1 Contents<a id='1.1_Contents'></a>
* [1 Pre-Processing and Training Data](#1_Pre-Processing_and_Training_Data)
  * [1.1 Imports](#1.1_Imports)
  * [1.2 Load The Data](#1.2_Load_The_Data)
  * [1.3 Train Test Split](#1.3_Train_Test_Split)
  * [1.4 Metrics](#1.4_Metrics)
      * [1.4.1 R-Squared](#1.4.1_R-Squared)
      * [1.4.2 Mean Absolute Error](#1.4.2_Mean_Absolute_Error)
      * [1.4.3 Mean Squared Error](#1.4.3_Mean_Squared_Error)
  * [1.5 Initial Models](#1.5_Initial_Models)
      * [1.5.1 Initial Mean Model](#1.5.1_Initial_Mean_Model) 
      * [1.5.2 Pipelines](#1.5.2_Pipelines) 
  * [1.6 Refining the Linear Model](#1.6_Refining_the_Linear_Model)
      * [1.6.1 Define the Pipeline](#1.6.1_Define_the_Pipeline) 
      * [1.6.2 Fit the Pipeline](#1.6.2_Fit_the_Pipeline) 
      * [1.6.3 Assess Performance on Train and Test Set](#1.6.3_Assess_Performance_on_Train_and_Test_Set) 
      * [1.6.4 Assessing Performance Using Cross-Validation](#1.6.4_Assessing_Performance_Using_Cross-Validation) 
      * [1.6.5 Hyperparameter Search Using GridSearchCV](#1.6.5_Hyperparameter_Search_Using_GridSearchCV) 
  * [1.7 Random Forest Model](#1.7_Random_Forest_Model)
      * [1.7.1 Define the Pipeline](#1.7.1_Define_the_Pipeline) 
      * [1.7.2 Fit and Assess Performance Using Cross-Validation](#1.7.2_Fit_and_Assess_Performance_Using_Cross-Validation)
      * [1.7.3 Hyperparameter Search Using GridSearchCV](#1.7.3_Hyperparameter_Search_Using_GridSearchCV) 
  * [1.8 Final Model Selection](#1.8_Final_Model_Selection)
      * [1.8.1 Linear Regression Model Performance](#1.8.1_Linear_Regression_Model_Performance)
      * [1.8.2 Random Forest Regression Model Performance](#1.8.2_Random_Forest_Regression_Model_Performance)
      * [1.8.3 Conclusion](#1.8.3_Conclusion)
  * [1.9 Save Best Model Object From Pipeline](#1.9_Save_Best_Model_Object_From_Pipeline)
  * [1.10 Summary](#1.10_Summary)

## 1.1 Imports

In [167]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

from sb_utils import save_file

## 1.2 Load Data

In [14]:
df = pd.read_excel(r'C:\Users\asiu200\OneDrive - Comcast\Python\Springboard\Data Wrangling.xlsx')
df.head().T

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Traffic_Date | 2014-12-22 00:00:00 | 2014-12-22 00:00:00 | 2014-12-22 00:00:00 | 2014-12-22 00:00:00 | 2014-12-22 00:00:00 |
| STORE_NAME | 3351 - Albuquerque, NM (XF) | 3352 - Lakewood, CO (XF) | 3353 - Colorado Springs, CO (XF) | 3354 - Thornton, CO (XF) | 3356 - Boulder, CO (XF) |
| STORE_CITY_NAME | Albuquerque | Lakewood | Colorado Springs | Thornton | Boulder |
| STORE_STATE_CODE | NM | CO | CO | CO | CO |
| Door_Swings | 656 | 452 | 562 | 594 | 369 |


In [18]:
df.rename(columns={'STORE_NAME' : 'store_name', 'Traffic_Date' : 'date', 'STORE_CITY_NAME' : 'city', 'STORE_STATE_CODE' : 'state','Door_Swings' : 'door_swings'}, inplace=True)

## 1.3 Train/Test Split

In [23]:
len(df) *.7, len(df) * .3

(50363.6, 21584.399999999998)

In [37]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='city'),df.door_swings, test_size=0.3, 
                                                    random_state=23)

In [38]:
X_train.shape, X_test.shape

((50363, 4), (21585, 4))
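
The split sizes follow from the row count. Below is a minimal sketch of the arithmetic, assuming scikit-learn rounds the test share up and assigns the remaining rows to the training set. The total of 71,948 rows is inferred from the two shapes above, not read from the file.

```python
import math

# Row count implied by the shapes above: 50363 + 21585
n_samples = 71948
test_size = 0.3

# Assumption: the test share is rounded up and the rest goes to train
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

This is why the test set has 21,585 rows even though `len(df) * .3` is 21,584.4.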

In [39]:
y_train.shape, y_test.shape

((50363,), (21585,))

In [41]:
names_list = ['store_name', 'date', 'state']
names_train = X_train[names_list]
names_test = X_test[names_list]
X_train.drop(columns=names_list, inplace=True)
X_test.drop(columns=names_list, inplace=True)
X_train.shape, X_test.shape

((50363, 1), (21585, 1))

In [42]:
X_train.dtypes

door_swings    int64
dtype: object

In [43]:
X_test.dtypes

door_swings    int64
dtype: object

In [45]:
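dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
# Baseline predictions, used by the metrics in section 1.4
y_tr_pred = dumb_reg.predict(X_train)
y_te_pred = dumb_reg.predict(X_test)
dumb_reg.constant_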

array([[239.13577428]])

## 1.4 Metrics

### 1.4.1 R-Squared

In [61]:
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(0.0, -0.00016786945988411794)

### 1.4.2 Mean Absolute Error

In [62]:
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(120.1363249523311, 121.13820984737669)

On average, I would be off by around 121 door swings if I guessed the door swings using the mean of the known values.
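
The baseline MAE can be reproduced by hand. Here is a small sketch with made-up counts (not the store data), where `y_known` and `y_new` are hypothetical names for the training and held-out values:

```python
# Hypothetical door-swing counts, for illustration only
y_known = [100, 200, 300, 400]
y_new = [150, 250, 500]

# The baseline guess is the mean of the known values
baseline = sum(y_known) / len(y_known)

# MAE: average absolute gap between each actual value and the baseline guess
mae = sum(abs(y - baseline) for y in y_new) / len(y_new)
print(baseline, mae)
```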

### 1.4.3 Mean Squared Error

In [63]:
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)

(25617.051239510794, 26162.273041019143)

I got a slightly worse (higher) MSE on my test set than on my train set.
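
MSE is in squared door swings, which is hard to read next to the MAE. As a rough sketch, taking the square root of the two values reported above gives the RMSE in door-swing units:

```python
import math

# MSE values reported above for the mean baseline
mse_train = 25617.051239510794
mse_test = 26162.273041019143

# The square root puts the error back in door-swing units
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)
print(round(rmse_train, 2), round(rmse_test, 2))
```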

## 1.5 Initial Models

### 1.5.1 Initial Mean Model

In [173]:
train_mean = y_train.mean()
train_mean

239.13577427873636
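
This is the same value as `dumb_reg.constant_` in section 1.3. A minimal sketch on made-up numbers (the arrays `X_toy` and `y_toy` are hypothetical) shows that a mean-strategy `DummyRegressor` stores the training mean and predicts it for any input:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up target values; the dummy model ignores the feature column
y_toy = np.array([120.0, 240.0, 360.0, 480.0])
X_toy = np.zeros((len(y_toy), 1))

dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)

# Every prediction equals the training mean, whatever the input
preds = dummy.predict(np.array([[5.0], [-3.0]]))
print(y_toy.mean(), dummy.constant_, preds)
```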

### 1.5.2 Pipelines

In [81]:
pipe = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(), 
    LinearRegression()
)

In [82]:
type(pipe)

sklearn.pipeline.Pipeline

In [83]:
hasattr(pipe, 'fit'), hasattr(pipe, 'predict')

(True, True)

In [84]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [85]:
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)

In [86]:
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(1.0, 1.0)

In [87]:
median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
median_r2

(1.0, 1.0)

In [88]:
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(4.0120135370827726e-14, 4.063639497736452e-14)

In [89]:
median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
median_mae

(4.0120135370827726e-14, 4.063639497736452e-14)

## 1.6 Refining the Linear Model

### 1.6.1 Define the Pipeline

In [111]:
pipe = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression,  k='all'),
    LinearRegression()
)

I set `k='all'` so that every feature is kept. Leaving `k` at its default raised an error, since the default of 10 exceeds the single column in `X_train`.
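
As a sketch of what `k='all'` does with a single feature, the example below uses randomly generated data (`X_one` and `y_one` are made up for illustration). The selector keeps the only column and passes it through unchanged in shape:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_one = rng.normal(size=(50, 1))            # a single made-up feature
y_one = 3.0 * X_one[:, 0] + rng.normal(scale=0.1, size=50)

# k='all' keeps every column, so no feature count has to be guessed
selector = SelectKBest(f_regression, k='all').fit(X_one, y_one)
X_selected = selector.transform(X_one)

print(selector.get_support(), X_selected.shape)
```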

### 1.6.2 Fit the Pipeline

In [113]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('selectkbest',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x0000024BBF3F58B0>)),
                ('linearregression', LinearRegression())])

### 1.6.3 Assess Performance on Train and Test Set

In [114]:
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)

In [115]:
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)

(1.0, 1.0)

In [116]:
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)

(4.0120135370827726e-14, 4.063639497736452e-14)

### 1.6.4 Assessing Performance Using Cross-Validation

In [117]:
cv_results = cross_validate(pipe, X_train, y_train, cv=5)

In [118]:
cv_scores = cv_results['test_score']
cv_scores

array([1., 1., 1., 1., 1.])

In [119]:
np.mean(cv_scores), np.std(cv_scores)

(1.0, 0.0)

In [120]:
np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)


array([1., 1.])

### 1.6.5 Hyperparameter Search Using GridSearchCV

In [122]:
k = [k+1 for k in range(len(X_train.columns))]
grid_params = {'selectkbest__k': k}

In [123]:
lr_grid_cv = GridSearchCV(pipe, param_grid=grid_params, cv=5, n_jobs=-1)

In [124]:
lr_grid_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('simpleimputer',
                                        SimpleImputer(strategy='median')),
                                       ('standardscaler', StandardScaler()),
                                       ('selectkbest',
                                        SelectKBest(k='all',
                                                    score_func=<function f_regression at 0x0000024BBF3F58B0>)),
                                       ('linearregression',
                                        LinearRegression())]),
             n_jobs=-1, param_grid={'selectkbest__k': [1]})

In [125]:
score_mean = lr_grid_cv.cv_results_['mean_test_score']
score_std = lr_grid_cv.cv_results_['std_test_score']
cv_k = [k for k in lr_grid_cv.cv_results_['param_selectkbest__k']]

In [126]:
lr_grid_cv.best_params_

{'selectkbest__k': 1}

In [129]:
selected = lr_grid_cv.best_estimator_.named_steps.selectkbest.get_support()

In [130]:
coefs = lr_grid_cv.best_estimator_.named_steps.linearregression.coef_
features = X_train.columns[selected]
pd.Series(coefs, index=features).sort_values(ascending=False)

door_swings    160.053276
dtype: float64
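
The coefficient of about 160.05 is close to the square root of the baseline training MSE from section 1.4.3 (the square root of 25617.05 is roughly 160.05), which is the standard deviation of `y_train`. That is what one would expect here, because the only column in `X_train` is `door_swings`, the same column used as the target. A minimal sketch with made-up counts, assuming that setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up counts; the only feature is a copy of the target, as in X_train here
y_toy = np.array([100.0, 250.0, 300.0, 550.0])
X_toy = y_toy.reshape(-1, 1)

toy_pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)
coef = toy_pipe.named_steps['linearregression'].coef_[0]

# After standardization, the slope equals the population std of the target
print(coef, np.std(y_toy))
```

The same setup also accounts for the R-squared of 1.0 and the near-zero MAE in sections 1.5.2 and 1.6.3.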

## 1.7 Random Forest Model 

### 1.7.1 Define the Pipeline

In [148]:
RF_pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    RandomForestRegressor(random_state=47)
)

### 1.7.2 Fit and Assess Performance Using Cross-Validation

In [149]:
rf_default_cv_results = cross_validate(RF_pipe, X_train, y_train, cv=5)

In [150]:
rf_cv_scores = rf_default_cv_results['test_score']
rf_cv_scores

array([0.99920761, 0.99998361, 0.99999986, 0.99999987, 0.99999831])

In [151]:
np.mean(rf_cv_scores), np.std(rf_cv_scores)

(0.9998378499475165, 0.0003151777859367049)
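
The mean plus or minus two standard deviations gives a rough range for the score, as was done for the linear model in section 1.6.4. Here is a sketch using the fold scores printed above. The upper bound is capped at 1 because R-squared cannot exceed it.

```python
import statistics

# Fold scores as printed above (rounded to 8 decimals)
rf_scores = [0.99920761, 0.99998361, 0.99999986, 0.99999987, 0.99999831]

mean_score = statistics.fmean(rf_scores)
std_score = statistics.pstdev(rf_scores)    # population std, like np.std's default

lower = mean_score - 2 * std_score
upper = min(mean_score + 2 * std_score, 1.0)   # R-squared cannot exceed 1
print(round(lower, 4), round(upper, 4))
```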

### 1.7.3 Hyperparameter Search Using GridSearchCV

In [135]:
n_est = [int(n) for n in np.logspace(start=1, stop=3, num=20)]
grid_params = {
        'randomforestregressor__n_estimators': n_est,
        'standardscaler': [StandardScaler(), None],
        'simpleimputer__strategy': ['mean', 'median']
}
grid_params

{'randomforestregressor__n_estimators': [10,
  12,
  16,
  20,
  26,
  33,
  42,
  54,
  69,
  88,
  112,
  143,
  183,
  233,
  297,
  379,
  483,
  615,
  784,
  1000],
 'standardscaler': [StandardScaler(), None],
 'simpleimputer__strategy': ['mean', 'median']}

In [136]:
rf_grid_cv = GridSearchCV(RF_pipe, param_grid=grid_params, cv=5, n_jobs=-1)

In [137]:
rf_grid_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('simpleimputer',
                                        SimpleImputer(strategy='median')),
                                       ('standardscaler', StandardScaler()),
                                       ('randomforestregressor',
                                        RandomForestRegressor(random_state=47))]),
             n_jobs=-1,
             param_grid={'randomforestregressor__n_estimators': [10, 12, 16, 20,
                                                                 26, 33, 42, 54,
                                                                 69, 88, 112,
                                                                 143, 183, 233,
                                                                 297, 379, 483,
                                                                 615, 784,
                                                                 1000],
                         'simpleimputer__strategy': [

In [138]:
rf_grid_cv.best_params_

{'randomforestregressor__n_estimators': 12,
 'simpleimputer__strategy': 'mean',
 'standardscaler': StandardScaler()}

In [139]:
rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=5)
rf_best_scores = rf_best_cv_results['test_score']
rf_best_scores

array([0.99922992, 0.99998347, 0.99999964, 0.9999997 , 0.9999979 ])

In [140]:
np.mean(rf_best_scores), np.std(rf_best_scores)

(0.9998421266613965, 0.00030616478820763685)

## 1.8 Final Model Selection

### 1.8.1 Linear Regression Model Performance

In [153]:
lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

In [154]:
lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])
lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])
lr_mae_mean, lr_mae_std

(3.919733222165685e-13, 1.832713876948819e-13)

In [155]:
mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))

4.063639497736452e-14

### 1.8.2 Random Forest Regression Model Performance

In [156]:
rf_neg_mae = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

In [157]:
rf_mae_mean = np.mean(-1 * rf_neg_mae['test_score'])
rf_mae_std = np.std(-1 * rf_neg_mae['test_score'])
rf_mae_mean, rf_mae_std


(0.015995148803987815, 0.016814414170379963)

In [158]:
mean_absolute_error(y_test, rf_grid_cv.best_estimator_.predict(X_test))

0.010049417033433788

### 1.8.3 Conclusion

I will use the linear regression model because it has the lower cross-validation mean absolute error (about 3.9e-13, versus about 0.016 for the random forest).

## 1.9 Save Best Model Object From Pipeline

In [168]:
best_model = lr_grid_cv.best_estimator_
best_model.version = '1.0'
best_model.pandas_version = (pd.__version__)
best_model.numpy_version = (np.__version__)
best_model.sklearn_version = (sklearn_version)
best_model.X_columns = [col for col in X_train.columns]
best_model.build_datetime = (datetime.datetime.now())

In [172]:
modelpath = '../data'
save_file(best_model, 'door_swings_model.pkl', modelpath)

Directory ../data was created.
Writing file.  "../data\door_swings_model.pkl"
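
The implementation of `save_file` in `sb_utils` is not shown here. Assuming it writes the object with `pickle`, the saved model and its metadata attributes could be read back as in this sketch. A `SimpleNamespace` stands in for the fitted pipeline and a temporary directory stands in for `../data`.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-in for the fitted pipeline, carrying the same kind of metadata attributes
model = SimpleNamespace(version='1.0', X_columns=['door_swings'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'door_swings_model.pkl')
    # Write the object, then read it back from the same file
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded.version, loaded.X_columns)
```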


## 1.10 Summary

I ran a 70/30 train/test split on the data to estimate how well the models perform on unseen data. Before fitting any model, I calculated the mean door swings in the training set, about 239, as a baseline. There were no missing values, so nothing needed to be filled in. The best number of features was one. I cannot collect more data because this is all of my company's data on door swings.