## Modeling Results ##

***Description of notebook:***

Following the benchmarking notebook, I fit four models to the sample: Logistic Regression, Decision Tree, K Nearest Neighbors, and Support Vector Classifier. (For this workflow, benchmarking was done on the full data set rather than the sample used for modeling.)

The full steps of modeling:
1. Train Test Split
2. Min Max Scaler
3. Deskewing (Boxcox)
4. PCA (5 components)
5. Standard Scaler
6. Model

Steps 5 and 6 were built into a pipeline, and a 5-fold cross-validated grid search over that pipeline tuned the hyperparameters.

The best-performing model was K Nearest Neighbors, with an ROC AUC score of 0.740 and a log loss of 8.635.

### Results ###

**Logistic Regression:**

*ROC AUC Score:* 0.501

*Log Loss:* 16.605

**Decision Tree:**

*ROC AUC Score:* 0.618

*Log Loss:* 12.620

**K Nearest Neighbors:**

*ROC AUC Score:* 0.740

*Log Loss:* 8.635

**Support Vector Classifier:**

*ROC AUC Score:* 0.458

*Log Loss:* 17.934

In [1]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel, RFE, RFECV, SelectKBest, chi2, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from scipy.stats import boxcox
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, log_loss
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_pickle('data/df_elite_1.p')
# df = pd.read_pickle('data/df_elite_2.p')
# df = pd.read_pickle('data/df_elite_3.p')

In [3]:
df.head()

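|     | 28 | 48 | 64 | 105 | 128 | 153 | 241 | 281 | 318 | 336 | ... | 378 | 433 | 442 | 451 | 453 | 455 | 472 | 475 | 493 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 574 | 468 | 444 | 444 | 558 | 484 | 610 | 477 | 550 | 442 | 418 | ... | 439 | 578 | 620 | 469 | 685 | 358 | 533 | 455 | 713 | -1 |
| 243 | 490 | 458 | 602 | 331 | 465 | 460 | 455 | 458 | 514 | 604 | ... | 456 | 470 | 469 | 482 | 379 | 580 | 467 | 430 | 379 | -1 |
| 487 | 493 | 475 | 571 | 430 | 473 | 558 | 449 | 505 | 540 | 644 | ... | 482 | 543 | 429 | 487 | 466 | 583 | 447 | 417 | 480 | -1 |
| 468 | 471 | 526 | 404 | 700 | 495 | 558 | 542 | 506 | 452 | 372 | ... | 543 | 530 | 539 | 471 | 609 | 401 | 500 | 540 | 631 | 1 |
| 708 | 486 | 544 | 423 | 782 | 502 | 444 | 510 | 447 | 503 | 368 | ... | 581 | 459 | 521 | 482 | 547 | 374 | 492 | 502 | 575 | 1 |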


In [4]:
df.shape

(260, 21)

In [5]:
predictors = df[df.columns[0:20]]
target = df[df.columns[20]]

In [6]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = .2, random_state = 42)

Min-max scaling to the range (0.0001, 1), as a contingency against zeros and negatives, since Box-Cox requires strictly positive inputs

In [7]:
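min_max = MinMaxScaler(feature_range=(0.0001, 1))
X_train_sc = pd.DataFrame(min_max.fit_transform(X_train))
# Note: fit_transform refits the scaler on the test set, so test features are
# scaled by the test set's own min/max rather than the training set's.
X_test_sc = pd.DataFrame(min_max.fit_transform(X_test))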
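
The cell above calls `fit_transform` on both splits, so the test set is rescaled with its own minimum and maximum. A minimal sketch of the alternative, fitting on the training split only, follows. The synthetic arrays and the `_demo` names are assumptions for illustration, not the notebook's data. The clipping step is also an assumption: test values below the training minimum would otherwise map to non-positive numbers, which Box-Cox rejects.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for X_train / X_test (assumption: not the notebook's data)
rng = np.random.RandomState(42)
X_train_demo = rng.normal(500, 80, size=(208, 20))
X_test_demo = rng.normal(500, 80, size=(52, 20))

floor = 0.0001  # same lower bound as the notebook's feature_range
min_max_demo = MinMaxScaler(feature_range=(floor, 1))

# Learn each column's min/max from the training split only
X_train_sc_demo = min_max_demo.fit_transform(X_train_demo)

# Reuse the training min/max on the test split, then clip so that values
# below the training minimum stay strictly positive for Box-Cox
X_test_sc_demo = np.clip(min_max_demo.transform(X_test_demo), floor, None)

print(X_train_sc_demo.min(), X_test_sc_demo.min())
```

Scaling this way would change the downstream numbers, so the scores recorded in this notebook apply only to the original cell.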

Deskewing

In [8]:
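def box_cox(train_df, test_df):
    '''Deskew each column with Box-Cox.

    Lambda is estimated on the training column and reused on the test column.
    Inputs must be strictly positive. Returns (X_train_bc, X_test_bc).
    '''
    X_train_bc = pd.DataFrame()
    X_test_bc = pd.DataFrame()
    for col in train_df.columns:
        # With no lmbda argument, boxcox returns the transformed data and the fitted lambda
        box_cox_train, lmbda = boxcox(train_df[col])
        # With lmbda given, boxcox returns only the transformed data
        box_cox_test = boxcox(test_df[col], lmbda)
        X_train_bc[col] = pd.Series(box_cox_train)
        X_test_bc[col] = pd.Series(box_cox_test)

    return X_train_bc, X_test_bc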

In [9]:
X_train_bc, X_test_bc = box_cox(X_train_sc, X_test_sc)
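
As a possible alternative to the hand-written helper, scikit-learn's `PowerTransformer` with `method='box-cox'` also estimates one lambda per column on the training data and reuses it on the test data. The sketch below runs on synthetic, strictly positive data (an assumption). `standardize=False` is chosen here because the notebook standardizes later in the pipeline.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Strictly positive, right-skewed synthetic data (assumption: illustration only)
X_train_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(208, 5))
X_test_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(52, 5))

# standardize=False: only deskew here, leave scaling to a later step
pt = PowerTransformer(method='box-cox', standardize=False)

# Lambdas are estimated per column on the training split...
X_train_bc_demo = pt.fit_transform(X_train_demo)
# ...and reused unchanged on the test split
X_test_bc_demo = pt.transform(X_test_demo)

print(pt.lambdas_)
```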

PCA

In [10]:
pca = PCA(n_components = 5)
X_train_comp = pca.fit_transform(X_train_bc)
X_test_comp = pca.transform(X_test_bc)

I could have applied the standard scaler here once instead of repeating it inside each model's pipeline.
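
Going the other way is also an option: every preprocessing step can live inside the pipeline, so that each cross-validation fold fits the scalers, the Box-Cox lambdas, and the PCA on its own training portion. The sketch below rests on several assumptions: synthetic data from `make_classification`, `PowerTransformer` standing in for the custom `box_cox` function, and `MinMaxScaler(clip=True)`, which needs scikit-learn 0.24 or later, to keep held-out values strictly positive.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Synthetic data with the same shape as the sample (assumption)
X_demo, y_demo = make_classification(n_samples=260, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

full_pipe = Pipeline([
    # clip=True keeps unseen values inside (0.0001, 1), so Box-Cox stays valid
    ('min_max', MinMaxScaler(feature_range=(0.0001, 1), clip=True)),
    ('box_cox', PowerTransformer(method='box-cox', standardize=False)),
    ('pca', PCA(n_components=5)),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Every step is refit inside each of the 5 folds
grid_demo = GridSearchCV(full_pipe, {'knn__n_neighbors': range(1, 11)}, cv=5)
grid_demo.fit(X_tr, y_tr)

print(grid_demo.best_params_, grid_demo.score(X_te, y_te))
```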

### Logistic Regression ###

In [11]:
scaler = StandardScaler()
log_reg = LogisticRegression()
pipe_log_reg = Pipeline([
    ('scaler', scaler), 
    ('log_reg', log_reg)
])

In [12]:
log_reg_params = {
    # 'l1' needs a solver that supports it (liblinear, the default in the version used here)
    'log_reg__penalty' : ['l1', 'l2'],
    'log_reg__C' : np.logspace(-10,-1,10)
}

In [13]:
grd_log_reg = GridSearchCV(pipe_log_reg, log_reg_params, cv = 5)

In [14]:
grd_log_reg.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('log_reg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'log_reg__penalty': ['l1', 'l2'], 'log_reg__C': array([  1.00000e-10,   1.00000e-09,   1.00000e-08,   1.00000e-07,
         1.00000e-06,   1.00000e-05,   1.00000e-04,   1.00000e-03,
         1.00000e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [15]:
grd_log_reg.best_params_

{'log_reg__C': 1e-10, 'log_reg__penalty': 'l2'}
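
Because the search picked the smallest `C` on the grid, it can help to check whether the mean cross-validated score changes across the grid at all. The sketch below shows one way to inspect `cv_results_`. It uses synthetic data (an assumption) and sets `solver='liblinear'` explicitly, since that solver supports both penalties.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five PCA components (assumption)
X_demo, y_demo = make_classification(n_samples=208, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('log_reg', LogisticRegression(solver='liblinear')),
])
params_demo = {
    'log_reg__penalty': ['l1', 'l2'],
    'log_reg__C': np.logspace(-10, -1, 10),
}
grid_demo = GridSearchCV(pipe_demo, params_demo, cv=5).fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid_demo.cv_results_)[
    ['param_log_reg__penalty', 'param_log_reg__C', 'mean_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score').head())
```

If many rows share the same `mean_test_score`, the reported "best" parameters may be little more than the first of several tied candidates.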

In [16]:
grd_log_reg.score(X_train_comp, y_train)

0.71634615384615385

In [17]:
grd_log_reg.score(X_test_comp, y_test)

0.51923076923076927

In [18]:
print("Accuracy Score:", accuracy_score(y_test, grd_log_reg.predict(X_test_comp)))

Accuracy Score: 0.519230769231


In [19]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_log_reg.predict(X_test_comp)))

ROC AUC Score: 0.501499250375


In [20]:
print("Log Loss:", log_loss(y_test, grd_log_reg.predict(X_test_comp)))

Log Loss: 16.6054116122


*Accuracy Score:* 0.519

*ROC AUC Score:* 0.501

*Log Loss:* 16.605

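**Results:** Test accuracy (0.519) and ROC AUC (0.501) are at chance level, well below the training accuracy (0.716), so a linear decision boundary does not appear to fit this data. The grid search also selected the smallest `C` on the grid (1e-10), i.e. the strongest regularization available.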
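
The ROC AUC and log loss cells above pass hard class labels from `predict`. Both metrics are designed for continuous scores. With hard labels, every wrong prediction is charged the maximum clipped penalty, so log loss reduces to the error rate times a large constant. A sketch of the probability-based computation follows, on synthetic data (an assumption) with labels coded -1/1 as in this data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, relabelled to -1/1 to mirror the target column (assumption)
X_demo, y01 = make_classification(n_samples=260, n_features=5, n_informative=3,
                                  n_redundant=0, random_state=1)
y_demo = np.where(y01 == 1, 1, -1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

clf_demo = LogisticRegression().fit(X_tr, y_tr)

# Column order of predict_proba follows clf_demo.classes_ (here [-1, 1])
proba = clf_demo.predict_proba(X_te)
pos_col = list(clf_demo.classes_).index(1)

# ROC AUC ranks samples by the positive-class probability
auc_demo = roc_auc_score(y_te, proba[:, pos_col])
# Log loss takes the full probability matrix; labels fixes the column order
ll_demo = log_loss(y_te, proba, labels=clf_demo.classes_)

print("ROC AUC Score:", auc_demo)
print("Log Loss:", ll_demo)
```

Because the grid-search object exposes `predict_proba` whenever its best estimator does, the same pattern should carry over to `grd_log_reg`, `grd_dt_clf`, and `grd_knn`.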

### Decision Tree ###

In [21]:
scaler = StandardScaler()
dt_clf = DecisionTreeClassifier()
pipe_dt_clf = Pipeline([
    ('scaler', scaler), 
    ('dt_clf', dt_clf)
])

In [22]:
dt_clf_params = {
    'dt_clf__criterion' : ['gini', 'entropy'],
    'dt_clf__min_samples_split' : range(2,11)
}

In [23]:
grd_dt_clf = GridSearchCV(pipe_dt_clf, dt_clf_params, cv = 5)

In [24]:
grd_dt_clf.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('dt_clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'dt_clf__criterion': ['gini', 'entropy'], 'dt_clf__min_samples_split': range(2, 11)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [25]:
grd_dt_clf.best_params_

{'dt_clf__criterion': 'entropy', 'dt_clf__min_samples_split': 6}

In [26]:
grd_dt_clf.score(X_train_comp, y_train)

0.95673076923076927

In [27]:
grd_dt_clf.score(X_test_comp, y_test)

0.63461538461538458

In [28]:
print("Accuracy Score:", accuracy_score(y_test, grd_dt_clf.predict(X_test_comp)))

Accuracy Score: 0.634615384615


In [29]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_dt_clf.predict(X_test_comp)))

ROC AUC Score: 0.61844077961


In [30]:
print("Log Loss:", log_loss(y_test, grd_dt_clf.predict(X_test_comp)))

Log Loss: 12.6201220514


*Accuracy Score:* 0.635

*ROC AUC Score:* 0.618

*Log Loss:* 12.620

### K Nearest Neighbors ###

In [31]:
scaler = StandardScaler()
knn = KNeighborsClassifier()
pipe_knn = Pipeline([
    ('scaler', scaler), 
    ('knn', knn)
])

In [32]:
knn_params = {
    'knn__n_neighbors' : range(1,11),
    'knn__weights' : ['uniform', 'distance'],
    'knn__leaf_size' : [2, 5, 10, 15, 20, 25, 30, 35]
}

In [33]:
grd_knn = GridSearchCV(pipe_knn, knn_params, cv = 5)

In [34]:
grd_knn.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'knn__n_neighbors': range(1, 11), 'knn__weights': ['uniform', 'distance'], 'knn__leaf_size': [2, 5, 10, 15, 20, 25, 30, 35]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [36]:
grd_knn.best_params_

{'knn__leaf_size': 2, 'knn__n_neighbors': 3, 'knn__weights': 'distance'}
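
`leaf_size` controls how the neighbor-search tree is built, which affects speed and memory but not which neighbors are found, so it adds grid points without changing the predictions. The sketch below checks this on synthetic data (an assumption), using the `n_neighbors` and `weights` values reported above and forcing `algorithm='kd_tree'` so that a tree is actually built.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the five PCA components (assumption)
X_demo, y_demo = make_classification(n_samples=260, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=2)
X_tr, y_tr, X_te = X_demo[:208], y_demo[:208], X_demo[208:]

preds = []
for leaf_size in [2, 35]:
    knn_demo = KNeighborsClassifier(n_neighbors=3, weights='distance',
                                    leaf_size=leaf_size, algorithm='kd_tree')
    preds.append(knn_demo.fit(X_tr, y_tr).predict(X_te))

# Same neighbors are found either way, so the predictions should match
print(np.array_equal(preds[0], preds[1]))
```

A related point: with `weights='distance'`, each training point is its own nearest neighbor at distance zero, so a training score of 1.0 is expected and says little about generalization.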

In [37]:
grd_knn.score(X_train_comp, y_train)

1.0

In [38]:
grd_knn.score(X_test_comp, y_test)

0.75

In [39]:
print("Accuracy Score:", accuracy_score(y_test, grd_knn.predict(X_test_comp)))

Accuracy Score: 0.75


In [40]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_knn.predict(X_test_comp)))

ROC AUC Score: 0.73988005997


In [41]:
print("Log Loss:", log_loss(y_test, grd_knn.predict(X_test_comp)))

Log Loss: 8.63481711372


*Accuracy Score:* 0.750

*ROC AUC Score:* 0.740

*Log Loss:* 8.635

### Support Vector Classifier ###

In [42]:
scaler = StandardScaler()
svc = SVC()
pipe_svc = Pipeline([
    ('scaler', scaler), 
    ('svc', svc)
])

In [43]:
svc_params = {
    'svc__C' : np.logspace(-10,-1,10),
    'svc__kernel' : ['rbf', 'linear', 'poly']
}

In [44]:
grd_svc = GridSearchCV(pipe_svc, svc_params, cv = 5)

In [45]:
grd_svc.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'svc__C': array([  1.00000e-10,   1.00000e-09,   1.00000e-08,   1.00000e-07,
         1.00000e-06,   1.00000e-05,   1.00000e-04,   1.00000e-03,
         1.00000e-02,   1.00000e-01]), 'svc__kernel': ['rbf', 'linear', 'poly']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [46]:
grd_svc.best_params_

{'svc__C': 0.01, 'svc__kernel': 'linear'}

In [47]:
grd_svc.score(X_train_comp, y_train)

0.72115384615384615

In [48]:
grd_svc.score(X_test_comp, y_test)

0.48076923076923078

In [49]:
print("Accuracy Score:", accuracy_score(y_test, grd_svc.predict(X_test_comp)))

Accuracy Score: 0.480769230769


In [50]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_svc.predict(X_test_comp)))

ROC AUC Score: 0.458020989505


In [51]:
print("Log Loss:", log_loss(y_test, grd_svc.predict(X_test_comp)))

Log Loss: 17.9338568427


*Accuracy Score:* 0.481

*ROC AUC Score:* 0.458

*Log Loss:* 17.934
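
`SVC` is fit here with `probability=False`, so `predict_proba` is unavailable. `decision_function` returns signed scores relative to the separating boundary, and `roc_auc_score` accepts those as ranking scores. The sketch below compares that with the hard-label computation used above. The data is synthetic (an assumption), and `C=0.01` with a linear kernel mirrors the best parameters reported for this model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, relabelled to -1/1 to mirror the target column (assumption)
X_demo, y01 = make_classification(n_samples=260, n_features=5, n_informative=3,
                                  n_redundant=0, random_state=3)
y_demo = np.where(y01 == 1, 1, -1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

pipe_svc_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(C=0.01, kernel='linear')),
])
pipe_svc_demo.fit(X_tr, y_tr)

# Continuous scores: larger values mean "more likely class 1"
scores = pipe_svc_demo.decision_function(X_te)
auc_from_scores = roc_auc_score(y_te, scores)

# Hard labels collapse the ROC curve to a single operating point
auc_from_labels = roc_auc_score(y_te, pipe_svc_demo.predict(X_te))

print("ROC AUC (decision_function):", auc_from_scores)
print("ROC AUC (hard labels):", auc_from_labels)
```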