# Modeling for App

In this section, I simplify the model and "pickle" it for a simple application demo. The model needs to be simpler because the application requires the user to enter every feature by hand. The more features the user has to enter, the more frustrating and less useful the application becomes.

**Please note: restarting this notebook will cause errors because the data is read from S3 rather than stored locally.**

### Contents

[Simplifying the Model](#Simplifying-the-Model)<br>
[Pickling Model](#Pickling-Model)

In [1]:
import pandas as pd
import s3fs

from sklearn.model_selection import train_test_split
import xgboost as xgb

import pickle

import warnings
warnings.filterwarnings("ignore")

In [2]:
fs = s3fs.S3FileSystem(anon=False,key='AWS KEY',secret='AWS SECRET KEY')

key = 'nfl_play_by_play_with_weather_model.csv'
bucket = 'nfl-play-by-play-capstone'

df = pd.read_csv(fs.open('{}/{}'.format(bucket, key),
                         mode='rb')).drop(columns=['Unnamed: 0'])

### Simplifying the Model

In this section, I select the few features that have the most predictive power. The original model has a little over 100 features (about 60 of which just indicate who is on offense and who is on defense). Reducing the feature set to only 8 achieves almost exactly the same accuracy.

In [3]:
X = df[['down','qtr','game_seconds_remaining','half_seconds_remaining','quarter_seconds_remaining','ydstogo',
        'score_differential','yardline_100']].values
y = df['effective_run']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
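
Because `X` is converted to a plain NumPy array with `.values`, the model never sees column names, so anything that calls it later has to supply the eight features in exactly this order. The following is a minimal sketch of a helper the application could use for that. `FEATURE_ORDER` and `build_feature_row` are hypothetical names that do not appear in the original notebook, and the example values are made up.

```python
# Hypothetical helper: the names below are illustrative, not from the notebook.
# The order mirrors the column list used to build X above.
FEATURE_ORDER = [
    'down', 'qtr', 'game_seconds_remaining', 'half_seconds_remaining',
    'quarter_seconds_remaining', 'ydstogo', 'score_differential',
    'yardline_100',
]


def build_feature_row(inputs):
    """Return the user's inputs as a list ordered like the training columns."""
    missing = [name for name in FEATURE_ORDER if name not in inputs]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(inputs[name]) for name in FEATURE_ORDER]


# Made-up game situation; the dictionary order deliberately differs
# from the training order to show that the helper reorders the values.
example = {
    'yardline_100': 35, 'score_differential': -3, 'ydstogo': 4,
    'down': 3, 'qtr': 4, 'game_seconds_remaining': 840,
    'half_seconds_remaining': 840, 'quarter_seconds_remaining': 840,
}
row = build_feature_row(example)
print(row)
```

Keeping the order in one place means the form fields in the application can be laid out in any order without silently feeding the model the wrong column.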

In [4]:
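# Early stopping halts training once the validation error has not improved for 5 rounds.
# Newer xgboost releases take early_stopping_rounds in the constructor instead of fit().
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42,
                              eval_metric=["error"], max_depth=8, subsample=0.7)
xgb_model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)])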

[0]	validation_0-error:0.308283
Will train until validation_0-error hasn't improved in 5 rounds.
[1]	validation_0-error:0.307261
[2]	validation_0-error:0.304791
[3]	validation_0-error:0.304541
[4]	validation_0-error:0.302945
[5]	validation_0-error:0.302608
[6]	validation_0-error:0.302733
[7]	validation_0-error:0.302496
[8]	validation_0-error:0.302346
[9]	validation_0-error:0.302022
[10]	validation_0-error:0.302359
[11]	validation_0-error:0.302359
[12]	validation_0-error:0.301797
[13]	validation_0-error:0.30191
[14]	validation_0-error:0.302072
[15]	validation_0-error:0.302034
[16]	validation_0-error:0.30176
[17]	validation_0-error:0.301523
[18]	validation_0-error:0.301386
[19]	validation_0-error:0.301024
[20]	validation_0-error:0.300874
[21]	validation_0-error:0.300912
[22]	validation_0-error:0.300762
[23]	validation_0-error:0.300475
[24]	validation_0-error:0.300587
[25]	validation_0-error:0.300787
[26]	validation_0-error:0.300637
[27]	validation_0-error:0.300563
[28]	validation_0-error

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, eval_metric=['error'],
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=8,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.7, verbosity=1)

In [5]:
print(f"Test Accuracy Score: {xgb_model.score(X_test,y_test)}")
print(f"Train Accuracy Score: {xgb_model.score(X_train,y_train)}")

Test Accuracy Score: 0.7011961632968705
Train Accuracy Score: 0.7125109659608261


As you can see, the simple model achieves almost the same accuracy with much less information (a test accuracy of about 0.70). The fewer inputs the application needs, the easier it is to use.

### Pickling Model

Pickling serializes the trained model to a file so that the application can load it elsewhere and make predictions. It takes only a single call to `pickle.dump`.

In [6]:
with open('../app/model.p', 'wb') as f:
    pickle.dump(xgb_model, f)
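
On the application side, the stored model can be read back with `pickle.load`. The following is a minimal sketch, assuming the application reads the same `model.p` file. `load_model` is a hypothetical helper name, and the demo round-trips a stand-in dictionary rather than the real classifier so that it runs without the training data.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Read a pickled object from disk. Only unpickle files you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Demo with a stand-in object (not the real XGBoost model).
stand_in = {'name': 'stand-in for xgb_model', 'n_features': 8}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'model.p')
    with open(demo_path, 'wb') as f:
        pickle.dump(stand_in, f)
    restored = load_model(demo_path)

print(restored == stand_in)
```

With the real file, the loaded object should expose the same `predict` method as `xgb_model`, which expects a 2-D array whose eight columns follow the training order. The environment that loads the file also needs `xgboost` installed, ideally in the same version that was used for training.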