## Model Evaluation - Random Forest Models on Diff Data
USA World Series Results,
Run on "Diff" data

# @To Do

- [ ] Randomize data and rebuild model
    * Limit to very simple tuning, so as not to overfit
    * n_estimators = 100 up to 300-400
    * 5-fold or 6-fold CV
    * max_features = 5 or 6
- [ ] Merge new data from validation set into full data set
- [ ] Explore relationship between Possession Time + Attacking Rucks + Passes

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

In [57]:
#Import Data - USA's differential data
df = pd.read_csv('../data/output/new_features_diffdata.csv')
df.head()

#Import validation data
#valdf = pd.read_csv('../data/output/new_features_diffdata_validate.csv')
#valdf.head()

| | Opp | Tournament | Poss_Time_Diff | Score_Diff | Conv_Diff | Tries_Diff | Passes_Diff | Contestable_KO_Win_pct_Diff | PenFK_Against_Diff | RuckMaul_Diff | ... | -99 : -75 | -74 : -25 | -24 : -1 | 0 : 25 | 26 : 50 | 51 : 75 | 76 : 100 | 101 : 125 | 126 : 150 | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AUSTRALIA | 2015_Cape_Town | 13.96648 | -10.638298 | -14.285714 | 0.25 | 25.925926 | -50.0 | 0.0 | 0.0 | ... | 0.0 | -12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | WALES | 2015_Cape_Town | 7.471264 | 15.555556 | 14.285714 | 0.083333 | 27.868852 | 25.0 | -20.0 | -100.0 | ... | 0.0 | 0.0 | 0.0 | 12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | KENYA | 2015_Cape_Town | -33.136095 | -44.444444 | -33.333333 | -0.75 | -10.638298 | -16.666667 | 66.666667 | 60.0 | ... | 0.0 | 0.0 | -5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | NEW ZEALAND | 2015_Cape_Town | 51.758794 | 33.333333 | 33.333333 | 0.0 | 76.119403 | -75.0 | -50.0 | -100.0 | ... | -37.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | FIJI | 2015_Cape_Town | 12.880562 | -20.833333 | -25.0 | 0.266667 | 38.461538 | -66.666667 | -33.333333 | -33.333333 | ... | 0.0 | -12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |


## Randomize Data

In [58]:
#Shuffle rows before modeling so the row order of the file does not influence the cross-validation folds
from sklearn.utils import shuffle
df = shuffle(df)
#shuffle validation set
#valdf = shuffle(valdf)

In [59]:
#Diagnostic
#df.info()
list(df.columns)
#df.head()

['Opp',
 'Tournament',
 'Poss_Time_Diff',
 'Score_Diff',
 'Conv_Diff',
 'Tries_Diff',
 'Passes_Diff',
 'Contestable_KO_Win_pct_Diff',
 'PenFK_Against_Diff',
 'RuckMaul_Diff',
 'Ruck_Win_pct_Diff',
 'Cards_diff',
 'Lineout_Win_Pct_Diff',
 'Scrum_Win_Pct_Diff',
 '-175 : -150',
 '-149 : -125',
 '-124 : -100',
 '-99 : -75',
 '-74 : -25',
 '-24 : -1',
 '0 : 25',
 '26 : 50',
 '51 : 75',
 '76 : 100',
 '101 : 125',
 '126 : 150',
 'Result']

### Pre-processing data

In [60]:
#Create a list of features to drop that are unnecessary or would leak the outcome into the prediction
droplist = ['Opp', 'Score_Diff', 'Tries_Diff', 'Tournament', 'Conv_Diff',
            '-175 : -150', '-149 : -125', '-124 : -100', '-99 : -75', '-74 : -25', '-24 : -1',
            '0 : 25', '26 : 50', '51 : 75', '76 : 100', '101 : 125', '126 : 150']

rf_data = df.drop(columns=droplist)

#Drop rows with Result == 2 (ties); the model is a binary win/loss classifier
rf_data = rf_data[rf_data.Result != 2]

In [61]:
#rf_data.head()
#Check to ensure 'Result' only contains 2 values (W, L)
#rf_data['Result'].describe()
#rf_data.describe()

In [62]:
#list(rf_data.columns) 

In [63]:
#Pull out the variable we're trying to predict: 'Result'
X = rf_data.drop('Result',axis=1)
y = rf_data['Result']
#X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30)

### Train/Test Split

*Commented out: when the pipeline is tuned with GridSearchCV, the train/test splitting happens internally during cross-validation. See the note below.*

In [64]:
#Split into train/test/validate sets
#OR, keep as is and use new data for validate
#156 rows in original dataframe
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)

<span style="color:red">## NOTE:</span>  
https://stackoverflow.com/questions/51459406/apply-standardscaler-in-pipeline-in-scikit-learn-sklearn

When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.

What happens can be described as follows:  

* Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
* Step 1: the scaler is fitted on the TRAINING data
* Step 2: the scaler transforms TRAINING data
* Step 3: the models are fitted/trained using the transformed TRAINING data
* Step 4: the scaler is used to transform the TEST data
* Step 5: the trained models predict using the transformed TEST data
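
The steps above can be sketched by hand for a single parameter setting. This is only an illustration of the idea, not scikit-learn's internal code: the data are synthetic (`make_classification`), and the names `X_demo`, `y_demo` and `fold_scores` are made up for the example. For a classifier and an integer `cv`, GridSearchCV uses stratified folds, which is why `StratifiedKFold` is used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); sizes are illustrative only
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

fold_scores = []
cv = StratifiedKFold(n_splits=3)                              # Step 0: split according to cv
for train_idx, test_idx in cv.split(X_demo, y_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])          # Step 1: fit scaler on TRAINING fold
    X_tr = scaler.transform(X_demo[train_idx])                # Step 2: transform TRAINING fold
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_tr, y_demo[train_idx])                        # Step 3: train on transformed data
    X_te = scaler.transform(X_demo[test_idx])                 # Step 4: transform TEST fold
    fold_scores.append(model.score(X_te, y_demo[test_idx]))   # Step 5: score on TEST fold

# Mean accuracy over the folds, comparable to one entry of cv_results_['mean_test_score']
print(np.mean(fold_scores))
```

Because the scaler is fitted inside the loop, the test fold never influences the scaling statistics.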

***Note:*** You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train) because the GridSearchCV will automatically split the data into training and testing data (this happens internally).

Your code should look like this:

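<code>import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel = 'linear', C = 1))])

param_grid = dict(reduce_dims__n_components=[4,6,8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf','linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)</code>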

Once **grid.fit(X, y)** has run, the fitted grid object holds the outcome of the search. **best_score_** is the mean cross-validated score of the best parameter combination, and **best_params_** is the combination of parameters that achieved it.

**IMPORTANT:** if you want to keep a validation dataset of the original dataset use this:

<code>X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)</code>
    
Then use:

<code>grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)</code>
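
A minimal end-to-end sketch of this hold-out idea, using a random forest pipeline like the one built below. The data are synthetic, the grid is deliberately tiny so it runs quickly, and all `demo_*` names are hypothetical. The `stratify` argument is an addition not present in the quoted advice; it keeps the win/loss ratio similar in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

# Hold back 15% that the grid search never sees
X_gs, X_val, y_gs, y_val = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1, stratify=y_demo)

demo_pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={'classifier__n_estimators': [10, 20],
                'classifier__max_features': [4, 5]},
    cv=3, scoring='accuracy')
demo_grid.fit(X_gs, y_gs)

cv_acc = demo_grid.best_score_           # mean CV accuracy on the search data
val_acc = demo_grid.score(X_val, y_val)  # accuracy of the refit best model on held-out data
print(cv_acc, val_acc)
```

Comparing `cv_acc` with `val_acc` gives a rough check on whether the tuned settings generalize beyond the data used to select them.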

### Create a transformation pipeline
Pipeline with Scaling and Random Forest Classifier

In [65]:
from sklearn.pipeline import Pipeline
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

#create the pipeline
scale_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# fit the pipeline
#scale_pipeline.fit(X_train, y_train)

### Check the Pipeline test accuracy

In [66]:
# Pipeline test accuracy
#print('Test accuracy: %.3f' % scale_pipeline.score(X_test, y_test))

# Pipeline estimator params; estimator is stored as step 2 ([1]), second item ([1])
#print('\nModel hyperparameters:\n', scale_pipeline.steps[1][1].get_params())

### Grid Search with Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define.  

### Hyperparameters
* n_estimators = number of trees in the forest
* max_features = max number of features considered when splitting a node
* max_depth = max number of levels in each decision tree
* min_samples_split = min number of samples a node must hold before it can be split
* min_samples_leaf = min number of samples allowed in a leaf node
* bootstrap = whether each tree is trained on a bootstrap sample (drawn with replacement) or on the full data set

In [67]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'classifier__bootstrap': [True, False],
    'classifier__max_depth': [60, 80, 100],
    'classifier__max_features': ['sqrt', 4, 5, 6], # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
    'classifier__min_samples_leaf': [1, 2, 3, 4, 5],
    'classifier__min_samples_split': [2, 5, 8, 10, 12],
    'classifier__n_estimators': [10, 20, 40, 60, 100], # [100, 200, 300, 400]
    'classifier__criterion': ['gini', 'entropy']
}
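
Since a grid search evaluates every combination, it can help to count the combinations before launching it. The sketch below copies the value lists from the grid above (with `'sqrt'` standing in for `'auto'`) and uses `ParameterGrid` to count them; `demo_param_grid`, `n_combinations` and `n_fits` are names made up for the example.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above
demo_param_grid = {
    'classifier__bootstrap': [True, False],
    'classifier__max_depth': [60, 80, 100],
    'classifier__max_features': ['sqrt', 4, 5, 6],
    'classifier__min_samples_leaf': [1, 2, 3, 4, 5],
    'classifier__min_samples_split': [2, 5, 8, 10, 12],
    'classifier__n_estimators': [10, 20, 40, 60, 100],
    'classifier__criterion': ['gini', 'entropy'],
}

n_combinations = len(ParameterGrid(demo_param_grid))  # product of the list lengths
n_fits = n_combinations * 3                           # one fit per combination per fold (cv=3)
print(n_combinations, n_fits)
```

With roughly 150 rows of data, a grid of this size is likely larger than the "very simple tuning" the to-do list calls for.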

## Random Forest
If ***not*** using pipelines

In [68]:
#from sklearn.ensemble import RandomForestClassifier

#Fit RF Classifier model
#rf = RandomForestClassifier(random_state=101)

#from pprint import pprint
# Look at parameters used by our current forest
#print('Default Parameters currently in use:\n')
#pprint(rf.get_params())

### Execute GridSearch

In [None]:
# execute gridsearch and get best score
rf_grid = GridSearchCV(scale_pipeline, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring= 'accuracy')

# Fit on the full data with rf_grid.fit(X, y), NOT on (X_train, y_train): GridSearchCV
# splits the data into training and test folds internally.
rf_grid.fit(X, y)

print(rf_grid.best_score_)
print(rf_grid.cv_results_)

In [71]:
print(rf_grid.best_params_)

{'classifier__bootstrap': True, 'classifier__criterion': 'gini', 'classifier__max_depth': 60, 'classifier__max_features': 4, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 8, 'classifier__n_estimators': 10}


In [81]:
# Print pipeline estimator
# Pipeline estimator params; estimator is stored as step 2 ([1]), second item ([1])
# print('\nModel hyperparameters:\n', scale_pipeline.steps[1][1].get_params())
print('\nModel hyperparameters:\n', scale_pipeline.steps)


Model hyperparameters:
 [('std_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]
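
The output above lists default hyperparameters (for example `max_depth=None`) rather than the tuned ones. That is expected: GridSearchCV fits clones of the estimator it is given, so the pipeline passed in stays unchanged, and the tuned, refit copy is exposed as `best_estimator_`. A small sketch on synthetic data, with hypothetical `demo_*` names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_classification(n_samples=120, n_features=9, random_state=0)

demo_pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
demo_grid = GridSearchCV(demo_pipe, {'classifier__max_depth': [2, 3]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# The pipeline that was passed in is untouched: max_depth is still the default
original_depth = demo_pipe.named_steps['classifier'].max_depth
# The refit copy carries the winning hyperparameters
tuned_depth = demo_grid.best_estimator_.named_steps['classifier'].max_depth
print(original_depth, tuned_depth)
```

In this notebook the equivalent inspection would be on `rf_grid.best_estimator_.steps` rather than `scale_pipeline.steps`.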


In [None]:
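import pickle  # cPickle exists only in Python 2; Python 3's pickle uses the C implementation automatically
# save the classifier "rfc" (fitted in the cells below)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(rfc, fid)

# load it again
#with open('my_dumped_classifier.pkl', 'rb') as fid:
#    rfc_loaded = pickle.load(fid)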

In [None]:
#Get parameters for pipeline object
#scale_pipeline.get_params().keys()

In [None]:
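from sklearn.metrics import accuracy_score

#base_train_acc / base_test_acc are the accuracies of the untuned baseline model;
#they are not computed anywhere in this notebook and must be defined before running this cell
print("Base Model")
print(base_train_acc)
print(base_test_acc)

#get predictions with best parameters
#best_grid is the tuned estimator (not defined in this notebook; presumably the search's best_estimator_)
#X_train/X_test/y_train/y_test require the commented-out train_test_split cell above
grid_predict = best_grid.predict(X_test)

grid_train_acc = accuracy_score(y_train, best_grid.predict(X_train))
grid_test_acc = accuracy_score(y_test, grid_predict)
print("\n")
print("Grid Search Model")
print(grid_train_acc)
print(grid_test_acc)

#print('Improvement of {:0.2f}%.'.format( 100 * (grid_test_acc - base_test_acc) / base_test_acc))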

### Output
**Base Model**  
1.0  
0.45652173913

**Grid Search Model**  
0.895238095238  
0.565217391304

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_test_acc - base_test_acc) / base_test_acc))

### Grid Search Accuracy Results
**Base Model**
```
1.0
0.45652173913
```
**Grid Search Model**
```
1.0
0.50
```

***Improvement of 9.52% (test accuracy from 45.65% to 50%)***
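
The 9.52% figure is a relative change, not a gain in percentage points. The short check below reproduces it from the two test accuracies reported in this section; the function name `relative_improvement` and the `*_demo` variables are made up for the example.

```python
def relative_improvement(base_acc, new_acc):
    """Percent change of new_acc relative to base_acc."""
    return 100 * (new_acc - base_acc) / base_acc

# Test accuracies reported above
base_test_acc_demo = 0.45652173913
grid_test_acc_demo = 0.50

improvement = relative_improvement(base_test_acc_demo, grid_test_acc_demo)
absolute_gain = 100 * (grid_test_acc_demo - base_test_acc_demo)  # in percentage points

print('Improvement of {:0.2f}%.'.format(improvement))
print('Absolute gain of {:0.2f} percentage points.'.format(absolute_gain))
```

Reporting both numbers avoids reading the relative figure as a larger gain than it is: the test accuracy moved by about 4.35 percentage points.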

In [None]:
grid_search.best_params_

In [None]:
# examine the best model
print('Best Estimator:')
print(grid_search.best_estimator_)
print()
print('Best Score:')
print(grid_search.best_score_)
print()
print('Best Parameters:')
print(grid_search.best_params_)

**Best Estimator**
```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=60, max_features=4, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=101, verbose=0, warm_start=False)
```

**Best Score**
```
0.704761904762
```
**Best Parameters**
```
{'bootstrap': True, 'max_depth': 60, 'max_features': 4, 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 100}
```

### Use new parameters from gridsearch to create and fit model

In [None]:
#Fit classifier with new model parameters from gridsearch
#(only non-default arguments are listed; min_impurity_split was removed in scikit-learn 1.0)
rfc = RandomForestClassifier(bootstrap=True, criterion='gini',
                             max_depth=60, max_features=4,
                             min_samples_leaf=5, min_samples_split=2,
                             n_estimators=100, n_jobs=1, random_state=101)

#Fit model (X_train/y_train come from the train_test_split cell above, which must be uncommented)
rfc.fit(X_train, y_train)

#Predict Classifier
rfc_pred = rfc.predict(X_test)

## Random Forest Model Eval

In [None]:
#Accuracy
from sklearn.metrics import accuracy_score
rfc_acc = accuracy_score(y_test, rfc_pred)
print(rfc_acc)


In [None]:
#Find Feature Importances
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

print("Feature Importance")
print(feature_importances)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

#Output confusion matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,rfc_pred))

#ignore UndefinedMetricWarning (raised when a class receives no predictions)
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
print("\n")
print("Classification Report")
print(classification_report(y_test,rfc_pred))

#print accuracy score
print("\n")
print("Accuracy Score")
print(rfc.score(X_test, y_test))

## Predict on Validation Set

In [None]:
#Run Prediction Classifier on validation data (val_X, val_y)
rfc_val_pred = rfc.predict(val_X)
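
The cell above assumes `val_X` and `val_y` already exist, but the validation import is commented out earlier in the notebook. One way to build them is to apply the same column drops and tie removal used for the training data. The helper below is a hypothetical sketch (the name `split_features_target` and the tiny made-up frame are not from the notebook):

```python
import pandas as pd

def split_features_target(frame, drop_cols, target='Result', tie_label=2):
    """Apply the training-time column drops and tie removal, then split X and y."""
    data = frame.drop(columns=drop_cols)
    data = data[data[target] != tie_label]
    return data.drop(columns=target), data[target]

# Tiny made-up frame standing in for valdf
demo_valdf = pd.DataFrame({
    'Opp': ['A', 'B', 'C'],
    'Score_Diff': [10.0, -5.0, 0.0],
    'Passes_Diff': [1.5, -2.0, 0.3],
    'Result': [1, 0, 2],
})

val_X_demo, val_y_demo = split_features_target(demo_valdf, ['Opp', 'Score_Diff'])
print(list(val_X_demo.columns), val_y_demo.tolist())
```

With the real data this would presumably be called as `split_features_target(valdf, droplist)`, so the validation features match the columns the classifier was trained on.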

In [None]:
#Accuracy
rfc_val_acc = accuracy_score(val_y, rfc_val_pred)
print(rfc_val_acc)

In [None]:
#Output confusion matrix
print("Confusion Matrix")
print(confusion_matrix(val_y, rfc_val_pred))

#ignore UndefinedMetricWarning (raised when a class receives no predictions)
import warnings
import sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)
print("\n")
print("Classification Report")
print(classification_report(val_y, rfc_val_pred))

#print accuracy score
print("\n")
print("Accuracy Score")
print(rfc.score(val_X, val_y))