# Data Mining - Mini Lab
### Team 2 - Patricia Goresen, Jeffrey Lancon, Brychan Manry, George Sturrock
#### June 17, 2018
------

## Introduction
#### Data Description
The sujbect matter for this Lab assignment is team level data from the Lahman baseball database.  The database was described in detail during Lab 1.  The team summarized team level statistics generated in Lab 1 is used as the input data set for this study.  The team level data was summarized from approximately 30,000 rows of data from team level statistics dating back to 1970 and payroll data.  
#### Objective
The ultimate objective is to find the best model to predict if a team will make the Playoffs given the available statistical data.  To meet this objective, this section will examine three different models to determine which produce the best accuracy, recall and precision scores.  The three models are:  
    - GridSearchCV Logistic Regression with manual variable reduction
    - GridSearchCV Logistic Regression with recursive feature elimination
    - Support Vector Machine
#### Approach
First, the input data set will have categorical features with little value (ball park name and disparate database identifiers) removed.  Features with near zero variance (such as games played) will also be removed.  Features which introduce leakage (such as Wins and how far a team progressed in the playoffs) will be removed as well.  The data will then be split into a explanatory ("X") and response ("Y") dataframes to feed into the different models.   

The "GridSearchCV Logistic Regression with manual feature reduction" model will start begin remaining explanatory variables and use correlation scores, variable inflation factors and significance scores to manually reduce the number of features input into the regression function.

The "GridSearchCV Logistic Regression with recursive feature elimination" and "Support Vector Machine" models will also have the remaining explanatory variables input into it's pipeline for analysis.  The team will allow the recursive feature elimination function and Support Vector Machine to determine which features to include in the end model on with no intervention.  

Model accuracy, recall and precision will be used to determine which model yields the best results to predict whether or not a team will make the Major League Baseball Playoffs.  

------

## Create Models
### Data Preparation

In [341]:
import pandas as pd
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.linear_model import LogisticRegression

team = pd.read_csv('~/7331_MiniLab/data/teams2Plus.csv')

#Convert Y/N playoff flag to 1/0 indicator
team['Playoff'] = team['Playoff'].map({'Y':1, 'N':0})

#Drop records with missing values in the Playoff column
team = team[np.isfinite(team['Playoff'])]
team.Playoff = team.Playoff.astype(int)

#Store all franchise IDs per row for future references
allfranchID = team['franchID']

#Create Y Response Variable DF
teamY = team['Playoff']

#Drop Categorial Columns with no predictive ability
team = team.drop(['teamIDBR', 'teamIDlahman45', 'teamIDretro', 'G', 'teamID', 'Ghome', 'name', 'park', 'lgID', 'divID', 'salary', 'attendance', 'Playoff'], axis=1)

#Drop Columns which introduce leakage
team = team.drop(['LgWin', 'DivWin', 'WCWin', 'WSWin', 'W', 'L', 'Rank'], axis=1)

#Create Cross Validation Object with 10 folds
## Not necessary for this data set, but will code for practice
cv = ShuffleSplit(n_splits = 10, test_size=0.80, random_state=0)

#Also create Test set for 2017
team2017 = team.loc[team['yearID'] == 2017]
franchid2017 = team2017['franchID']

#Drop last categorial column now that it has been preserved
team = team.drop(['franchID'], axis=1)
team2017 = team2017.drop(['franchID'], axis=1)

#Create X Explanatory Variables DF
teamX = team
teamXRfecv = team
teamXSVM = team

print("Team DF")
team.info()
#teamX_colNames = list(teamX)

print("Team 2017")
team2017.info()

Team DF
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1296 entries, 0 to 1323
Data columns (total 37 columns):
Unnamed: 0    1296 non-null int64
yearID        1296 non-null int64
R             1296 non-null int64
AB            1296 non-null int64
H             1296 non-null int64
2B            1296 non-null int64
3B            1296 non-null int64
HR            1296 non-null int64
BB            1296 non-null float64
SO            1296 non-null float64
SB            1296 non-null float64
CS            1296 non-null float64
HBP           1296 non-null float64
SF            1296 non-null float64
RA            1296 non-null int64
ER            1296 non-null int64
ERA           1296 non-null float64
CG            1296 non-null int64
SHO           1296 non-null int64
SV            1296 non-null int64
IPouts        1296 non-null int64
HA            1296 non-null int64
HRA           1296 non-null int64
BBA           1296 non-null int64
SOA           1296 non-null int64
E             1296 no

In [342]:
#Last check for NA values
team.isnull().sum()

Unnamed: 0    0
yearID        0
R             0
AB            0
H             0
2B            0
3B            0
HR            0
BB            0
SO            0
SB            0
CS            0
HBP           0
SF            0
RA            0
ER            0
ERA           0
CG            0
SHO           0
SV            0
IPouts        0
HA            0
HRA           0
BBA           0
SOA           0
E             0
DP            0
FP            0
BPF           0
PPF           0
WHIP          0
KBB           0
KAB           0
Bavg          0
Slug          0
OBP           0
OPS           0
dtype: int64

#### Colinearity

In [446]:
#Drop highly correlated, insignificant and high VIF columns.
teamX = team.drop(['2B', '3B', 'BBA', 'DP', 'HR', 'yearID', 'WHIP', 'HA', 'HBP', 'Slug', 'SF', 'OPS', 'Bavg', 'SOA', 'KAB', 'SHO', 'FP', 'E', 'ER', 'IPouts', 'SO', 'BPF', 'PPF', 'Unnamed: 0', 'ERA', 'H'], axis=1)

#Create correlation matrix
teamCorrMat = teamX.corr()

# Highest Correlation Pairs
corrPairs = teamCorrMat.unstack().sort_values(kind="quicksort")
#- REMOVE DUPLICATES
corrPairs = corrPairs[::2]
corrPairs = corrPairs[corrPairs.index.get_level_values(0) != corrPairs.index.get_level_values(1)]
with pd.option_context('display.max_rows',10):
    print(corrPairs)

CG   SV    -0.521343
HRA  CG    -0.517123
KBB  CG    -0.450180
CS   KBB   -0.382289
RA   CG    -0.330862
              ...   
BB   R      0.591727
SB   CS     0.655180
BB   OBP    0.668055
HRA  RA     0.743288
OBP  R      0.814649
Length: 55, dtype: float64


#### Scale Data

In [447]:
from sklearn.preprocessing import StandardScaler

#Scale data
scaler = StandardScaler()
teamX_scaled = scaler.fit_transform(teamX)
teamXRfecv_scaled = scaler.fit_transform(teamXRfecv)
teamXSVM_scaled = scaler.fit_transform(teamXSVM)

#Save as data frames
df_teamX_scaled = pd.DataFrame(teamX_scaled)
df_teamXRfecv_scaled = pd.DataFrame(teamXRfecv_scaled)
df_teamXSVM_scaled = pd.DataFrame(teamXSVM_scaled)


#### Variance Inflation Factors (VIF)

In [448]:
#Credit to:
###https://stats.stackexchange.com/questions/155028/how-to-systematically-remove-collinear-variables-in-python
###https://etav.github.io/python/vif_factor_python.html

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

df2_vif = pd.DataFrame()
df2_vif["VIF Factor"] = [vif(df_teamX_scaled.values, i) for i in range(df_teamX_scaled.shape[1])]
#df2_vif["features"] = df_teamX_scaled.columns
df2_vif["features"] = teamX.columns
df2_vif

Unnamed: 0,VIF Factor,features
0,8.763863,R
1,3.429119,AB
2,2.103875,BB
3,2.018539,SB
4,2.42929,CS
5,5.6843,RA
6,3.163924,CG
7,2.226288,SV
8,3.865377,HRA
9,2.573488,KBB


#### Check Feature Significance

In [449]:
#Logistic Regression Summary table with full model fit prior to scaling, cross validation or recursive 
#feature elimination.
#Cursory check to verify feature significance

import statsmodels.api as sm
logit_model = sm.Logit(teamY, teamX)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.244775
         Iterations 8
                        Results: Logit
Model:              Logit            No. Iterations:   8.0000  
Dependent Variable: Playoff          Pseudo R-squared: 0.487   
Date:               2018-06-16 19:41 AIC:              656.4562
No. Observations:   1296             BIC:              713.2937
Df Model:           10               Log-Likelihood:   -317.23 
Df Residuals:       1285             LL-Null:          -618.03 
Converged:          1.0000           Scale:            1.0000  
-----------------------------------------------------------------
         Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
-----------------------------------------------------------------
R        0.0278     0.0024   11.6525   0.0000    0.0231    0.0324
AB      -0.0027     0.0005   -5.6880   0.0000   -0.0036   -0.0018
BB      -0.0015     0.0018   -0.8383   0.4019   -0.0050    0.0020
SB       0.0049

### Logistic Regresssion

#### Classifier Evaluation

In [450]:
#Credit To:  https://github.com/jakemdrew/EducationDataNC/blob/master/2017/Models/2017ComparingSegregatedHighSchoolCampuses.ipynb

from sklearn.model_selection import cross_validate
#from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

results = []

def EvaluateClassifierEstimator(classifierEstimator, X, y, cv, model):
   
    #Perform cross validation 
    scores = cross_validate(classifierEstimator, teamX, teamY, scoring=['accuracy','precision','recall']
                            , cv=cv, return_train_score=True)

    Accavg = scores['test_accuracy'].mean()
    Preavg = scores['test_precision'].mean()
    Recavg = scores['test_recall'].mean()

    print_str = "The average accuracy for all cv folds is: \t\t\t {Accavg:.5}"
    print_str2 = "The average precision for all cv folds is: \t\t\t {Preavg:.5}"
    print_str3 = "The average recall for all cv folds is: \t\t\t {Recavg:.5}"

    print(print_str.format(Accavg=Accavg))
    print(print_str2.format(Preavg=Preavg))
    print(print_str3.format(Recavg=Recavg))
    print('*********************************************************')

    print('Cross Validation Fold Mean Error Scores')
    scoresResults = pd.DataFrame()
    scoresResults['Accuracy'] = scores['test_accuracy']
    scoresResults['Precision'] = scores['test_precision']
    scoresResults['Recall'] = scores['test_recall']
    
    results.append({'Model': model, 'Accuracy': Accavg, 'Precision': Preavg, 'Recall': Recavg})

    return scoresResults

def EvaluateClassifierEstimator2(classifierEstimator, X, y, cv):
    
    #Perform cross validation 
    from sklearn.model_selection import cross_val_predict
    predictions = cross_val_predict(classifierEstimator, teamX, teamY, cv=cv)
    
    #model evaluation 
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    
    #pass true test set values and predictions to classification_report
    classReport = classification_report(Y,predictions)
    confMat = confusion_matrix(Y,predictions)
    acc = accuracy_score(Y,predictions)
    
    print (classReport)
    print (confMat)
    print (acc)

#### GridSearchCV Logistic Regression with Manual Feature Reduction

In [451]:
#Logisitic regression 10-fold cross-validation 
from sklearn.linear_model import LogisticRegression
regEstimator = LogisticRegression()

parameters = { 'penalty':['l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced', 'none']
              ,'random_state': [0]
              ,'solver': ['lbfgs']
              ,'max_iter':[100,500]
             }

#Create a grid search object using the  
from sklearn.model_selection import GridSearchCV
regGridSearch = GridSearchCV(estimator=regEstimator
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
#regGridSearch.fit(teamX, teamY)
regGridSearch.fit(df_teamX_scaled, teamY)

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=8)]: Done  88 tasks      | elapsed:    0.8s
[Parallel(n_jobs=8)]: Done 280 out of 280 | elapsed:    2.3s finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.8, train_size=None),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=8,
       param_grid={'penalty': ['l2'], 'random_state': [0], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'solver': ['lbfgs'], 'max_iter': [100, 500], 'class_weight': ['balanced', 'none']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [452]:
#Diplay the top model parameters
regGridSearch.best_estimator_.coef_

array([[ 2.50459148, -0.70137689, -0.14700291,  0.20347241, -0.21723472,
        -1.94741016,  0.60250008,  1.04109258, -0.50570959,  0.46057074,
         0.23396854]])

In [453]:
#Use the best parameters for our Linear Regression object
classifierEst = regGridSearch.best_estimator_

#Evaluate the regression estimator above using our pre-defined cross validation and scoring metrics. 
EvaluateClassifierEstimator(classifierEst, df_teamX_scaled, teamY, cv, "manual")

The average accuracy for all cv folds is: 			 0.88187
The average precision for all cv folds is: 			 0.73837
The average recall for all cv folds is: 			 0.55804
*********************************************************
Cross Validation Fold Mean Error Scores


Unnamed: 0,Accuracy,Precision,Recall
0,0.887175,0.721854,0.592391
1,0.888139,0.722892,0.631579
2,0.88621,0.741259,0.566845
3,0.87946,0.776,0.5
4,0.877531,0.698795,0.601036
5,0.874638,0.698113,0.57513
6,0.868852,0.723077,0.484536
7,0.888139,0.804878,0.518325
8,0.885246,0.76259,0.552083
9,0.883317,0.734266,0.558511


In [454]:
#Predictions using Grid Search CV
#print("Plain GridSearch Prediction")
#print(regGridSearch.predict(teamX))
#print(regGridSearch.predict_proba(teamX))
#print(regGridSearch.predict(df_teamX_scaled))
#print(regGridSearch.predict_proba(df_teamX_scaled))

#Is there a difference between .predict and .best_estimator_.predict?  Nope.
print("Best Estimator GridSearch Prediction")
#print(regGridSearch.best_estimator_.predict(teamX))
#print(regGridSearch.best_estimator_.predict_proba(teamX))
print(regGridSearch.best_estimator_.predict(df_teamX_scaled))
print(regGridSearch.best_estimator_.predict_proba(df_teamX_scaled))

Best Estimator GridSearch Prediction
[0 1 0 ... 0 0 1]
[[0.99360415 0.00639585]
 [0.10024467 0.89975533]
 [0.72467299 0.27532701]
 ...
 [0.99641667 0.00358333]
 [0.99623721 0.00376279]
 [0.24545058 0.75454942]]


#### GridSearchCV Logistic Regression with Recursive Feature Elimination

In [455]:
#Credit to:  Jake Drew NC Education Data Set Analysis

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit


print("RFECV Logistic Regression 1st Pass")
rfecvEstimator = LogisticRegression()

parameters = { 'penalty':['l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced', 'none']
              ,'random_state': [0]
              ,'solver': ['lbfgs']
              ,'max_iter':[100,500]
             }

#Create a grid search object using the  
from sklearn.model_selection import GridSearchCV
rfecvGridSearch = GridSearchCV(estimator=rfecvEstimator
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data using RFECV
rfecvGridSearch.fit(df_teamXRfecv_scaled, teamY)

#Use the best parameters for our RFECV Linear Regression object
rfecvClassifierEst = rfecvGridSearch.best_estimator_

print("Logistic Regression Second Pass")
#Recursive Feature Elimination
rfecv = RFECV(estimator=rfecvClassifierEst, step=1, cv=cv, scoring='accuracy', verbose=1)
#X_BestFeatures = rfecv.fit_transform(teamX, teamY)
X_BestFeatures = rfecv.fit_transform(df_teamXRfecv_scaled, teamY)

print("Ranking", rfecv.ranking_)
print("Support", rfecv.support_)
print("Number of Features:", rfecv.n_features_)

#create a pipeline to scale all of the data and perform logistic regression during each grid search step.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

#Define a range of hyper parameters for grid search
parameters = { 'logisticregression__penalty':['l2']
              ,'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'logisticregression__class_weight': ['balanced','none']
              ,'logisticregression__random_state': [0]
              ,'logisticregression__solver': ['lbfgs']
              ,'logisticregression__max_iter':[100,500]
             }

#Perform the grid search using accuracy as a metric during cross validation.
grid = GridSearchCV(pipe, parameters, cv=cv, scoring='accuracy')

#Use the best features from recursive feature elimination during the grid search
#grid.fit(teamX, teamY)
grid.fit(df_teamXRfecv_scaled, teamY)

RFECV Logistic Regression 1st Pass
Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=8)]: Done  56 tasks      | elapsed:    0.7s
[Parallel(n_jobs=8)]: Done 280 out of 280 | elapsed:    7.2s finished


Logistic Regression Second Pass
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 featur

Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.8, train_size=None),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'logisticregression__max_iter': [100, 500], 'logisticregression__random_state': [0], 'logisticregression__class_weight': ['balanced', 'none'], 'logisticregression__penalty': ['l2'], 'logisticregression__solver': ['lbfgs']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [456]:
#Use the best parameters from RFECV for our Linear Regression object
rfecvClassifierEst = grid.best_estimator_

#Evaluate the regression estimator above using our pre-defined cross validation and scoring metrics. 
#EvaluateClassifierEstimator(classifierEst, teamX, teamY, cv)
EvaluateClassifierEstimator(rfecvClassifierEst, df_teamXRfecv_scaled, teamY, cv, 'Rfecv')

The average accuracy for all cv folds is: 			 0.86635
The average precision for all cv folds is: 			 0.8098
The average recall for all cv folds is: 			 0.36002
*********************************************************
Cross Validation Fold Mean Error Scores


Unnamed: 0,Accuracy,Precision,Recall
0,0.867888,0.805195,0.336957
1,0.876567,0.81,0.426316
2,0.868852,0.786517,0.374332
3,0.855352,0.823529,0.28866
4,0.867888,0.769231,0.414508
5,0.869817,0.778846,0.419689
6,0.864031,0.907692,0.304124
7,0.864031,0.8125,0.340314
8,0.862102,0.826667,0.322917
9,0.866924,0.777778,0.37234


In [457]:
#print(grid.best_estimator_.predict(teamX))
#print(grid.best_estimator_.predict_proba(teamX))
print(grid.best_estimator_.predict(df_teamXRfecv_scaled))
print(grid.best_estimator_.predict_proba(df_teamXRfecv_scaled))

[0 1 0 ... 0 0 1]
[[0.98580662 0.01419338]
 [0.16887826 0.83112174]
 [0.75980895 0.24019105]
 ...
 [0.99476135 0.00523865]
 [0.99295833 0.00704167]
 [0.42323329 0.57676671]]


### Support Vector Machine

In [458]:
#SVM for consolidated team level baseball data created in Lab 1.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import metrics as mt

#scaler = StandardScaler()

#teamX_scaled = scaler.fit_transform(teamX)

#train the model just as before
svm_clf = SVC(C=0.5, kernel='rbf', degree=3, gamma='auto') # get object
svm_clf.fit(df_teamXSVM_scaled, teamY)  # train object

y_hat = svm_clf.predict(df_teamXSVM_scaled)

acc = mt.accuracy_score(teamY,y_hat)
conf = mt.confusion_matrix(teamY,y_hat)
prec = mt.precision_score(teamY, y_hat)
recall = mt.recall_score(teamY, y_hat)
print('accuracy:', acc )
print('precision:', prec)
print('recall:', recall)
print(conf)

results.append({'Model': 'SVM', 'Accuracy': acc, 'Precision': prec, 'Recall': recall})

accuracy: 0.8858024691358025
precision: 0.835820895522388
recall: 0.47058823529411764
[[1036   22]
 [ 126  112]]


In [459]:
#look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )


(485, 37)
(485,)
[258 227]


In [460]:
# SVM based Prediction
print(y_hat)

[0 1 0 ... 0 0 1]


### Create Model Summary
All three planned models were sucessfully implemented as planned.  All models utilized cross validation to control results.  Stochastic Gradient Descent was not utilized for the support vector machine model as the size of the data set did not warrant it use.  Good results were achieved by all models.  The Support Vector Machine model ultimately produced the best results.  The results are summarized in the table below.

In [461]:
df_results = pd.DataFrame(results)
df_results = df_results[['Model', 'Accuracy', 'Precision', 'Recall']]
df_results

Unnamed: 0,Model,Accuracy,Precision,Recall
0,manual,0.881871,0.738372,0.558044
1,Rfecv,0.866345,0.809795,0.360016
2,SVM,0.885802,0.835821,0.470588


#### Key References

https://github.com/eclarson/DataMiningNotebooks/blob/master/04.%20Logits%20and%20SVM.ipynb
https://github.com/jakemdrew/EducationDataNC/blob/master/2017/Models/2017ComparingSegregatedHighSchoolCampuses.ipynb (Logit)

