## Modeling Results ##

***Description of notebook:***

In alignment with the benchmarking notebook, I modeled the sample on these four models: Logistic Regression, Decision Tree, K Nearest Neighbors, and Support Vector Classifier.

The full steps of modeling:
1. Train Test Split
2. Min Max Scaler
3. Deskewing (Boxcox)
4. PCA (5 components)
5. Standard Scaler
6. Model

Steps 5 and 6 were built into a pipeline and gridsearched on to tune hyperparameters.

The best performing model was K Nearest Neighbors with an ROC AUC Score of .744 and Log Loss of 8.844.

### Results ###

**Logistic Regression:**

*ROC AUC Score:* 0.588

*Log Loss:* 14.208

**Decision Tree:**

*ROC AUC Score:* 0.658

*Log Loss:* 11.800

**K Nearest Neighbors:**

*ROC AUC Score:* 0.744

*Log Loss:* 8.844

**Support Vector Classifier:**

*ROC AUC Score:* 0.711

*Log Loss:* 9.969

In [1]:
% run __init__.py

In [2]:
df = pd.read_pickle('data/elite_cook_df_1.p')
# df = pd.read_pickle('data/elite_cook_df_2.p')
# df = pd.read_pickle('data/elite_cook_df_3.p')

In [3]:
df.head()

Unnamed: 0_level_0,feat_257,feat_269,feat_308,feat_315,feat_336,feat_341,feat_395,feat_504,feat_526,feat_639,...,feat_701,feat_724,feat_736,feat_769,feat_808,feat_829,feat_867,feat_920,feat_956,target
_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,2.281363,-7.723766,2.714832,-5.48451,-0.132036,-1.595268,-2.47107,3.052163,-2.941691,4.063693,...,7.306688,2.522409,-3.659442,1.333602,1.103701,0.58646,-2.226438,1.503807,4.029951,0
1,2.121323,-1.699388,1.057814,-1.591032,0.134624,-0.391734,-2.183157,0.747105,0.304999,1.371453,...,0.776931,2.126583,-1.507735,1.199454,-0.620077,2.250227,-0.015265,-0.975844,1.588398,1
2,1.415074,4.546333,2.662465,1.619146,-1.696918,0.740744,-2.675854,-1.896792,2.461117,-1.756148,...,-5.63871,0.676838,-1.709226,0.178925,-0.924365,3.118753,3.521908,-4.303822,-0.800297,1
3,-1.381832,2.080511,-0.362144,2.247193,-2.073514,-1.33743,-0.293574,-1.079409,1.860341,-2.603941,...,-3.284469,-0.542681,-0.039577,0.869894,0.508969,-1.037677,2.104805,-0.94114,-2.426835,0
4,0.382663,-0.370281,-1.425611,-0.347839,0.252554,-2.26602,-1.37955,-2.961905,1.344314,-1.465974,...,0.118513,2.685094,0.376503,0.385132,-1.534524,-1.938277,-0.788077,1.947159,-1.075181,0


In [4]:
df.shape

(6600, 21)

In [5]:
predictors = df[df.columns[0:20]]
target = df[df.columns[20]]

In [6]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = .2, random_state = 42)

Min Max Scaling (as contingency against 0's and negatives)

In [7]:
min_max = MinMaxScaler(feature_range=(0.0001, 1))
X_train_sc = pd.DataFrame(min_max.fit_transform(X_train))
X_test_sc = pd.DataFrame(min_max.fit_transform(X_test))

Deskewing

In [8]:
def box_cox(train_df, test_df):
    '''Input X_train and X_test to get those dataframes deskewed'''
    X_train_bc = pd.DataFrame()
    X_test_bc = pd.DataFrame()
    for col in train_df.columns:
        box_cox_train, lmbda = boxcox(train_df[col])
        box_cox_test = boxcox(test_df[col], lmbda)
        X_train_bc[col] = pd.Series(box_cox_train)
        X_test_bc[col] = pd.Series(box_cox_test)
    
    return X_train_bc, X_test_bc

In [9]:
X_train_bc, X_test_bc = box_cox(X_train_sc, X_test_sc)

PCA

In [10]:
pca = PCA(n_components = 5)
X_train_comp = pca.fit_transform(X_train_bc)
X_test_comp = pca.transform(X_test_bc)

I could have put the standard scaler here and taken it out of pipelines

### Logistic Regression ###

In [11]:
scaler = StandardScaler()
log_reg = LogisticRegression()
pipe_log_reg = Pipeline([
    ('scaler', scaler), 
    ('log_reg', log_reg)
])

In [12]:
log_reg_params = {
    'log_reg__penalty' : ['l1', 'l2'],
    'log_reg__C' : np.logspace(-10,-1,10)
}

In [13]:
grd_log_reg = GridSearchCV(pipe_log_reg, log_reg_params, cv = 5)

In [14]:
grd_log_reg.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('log_reg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'log_reg__penalty': ['l1', 'l2'], 'log_reg__C': array([  1.00000e-10,   1.00000e-09,   1.00000e-08,   1.00000e-07,
         1.00000e-06,   1.00000e-05,   1.00000e-04,   1.00000e-03,
         1.00000e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [15]:
grd_log_reg.best_params_

{'log_reg__C': 0.10000000000000001, 'log_reg__penalty': 'l2'}

In [16]:
grd_log_reg.score(X_train_comp, y_train)

0.60643939393939394

In [17]:
grd_log_reg.score(X_test_comp, y_test)

0.58863636363636362

In [18]:
print("Accuracy Score:", accuracy_score(y_test, grd_log_reg.predict(X_test_comp)))

Accuracy Score: 0.588636363636


In [19]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_log_reg.predict(X_test_comp)))

ROC AUC Score: 0.58848336532


In [20]:
print("Log Loss:", log_loss(y_test, grd_log_reg.predict(X_test_comp)))

Log Loss: 14.2081808031


*Accuracy Score:* 0.589

*ROC AUC Score:* 0.588

*Log Loss:* 14.208

**Results:** The data isn't linear, so this model doesn't perform well.

### Decision Tree 

In [21]:
scaler = StandardScaler()
dt_clf = DecisionTreeClassifier()
pipe_dt_clf = Pipeline([
    ('scaler', scaler), 
    ('dt_clf', dt_clf)
])

In [22]:
dt_clf_params = {
    'dt_clf__criterion' : ['gini', 'entropy'],
    'dt_clf__min_samples_split' : range(2,11)
}

In [23]:
grd_dt_clf = GridSearchCV(pipe_dt_clf, dt_clf_params, cv = 5)

In [24]:
grd_dt_clf.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('dt_clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'dt_clf__criterion': ['gini', 'entropy'], 'dt_clf__min_samples_split': range(2, 11)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [25]:
grd_dt_clf.best_params_

{'dt_clf__criterion': 'gini', 'dt_clf__min_samples_split': 4}

In [26]:
grd_dt_clf.score(X_train_comp, y_train)

0.98030303030303034

In [27]:
grd_dt_clf.score(X_test_comp, y_test)

0.65833333333333333

In [28]:
print("Accuracy Score:", accuracy_score(y_test, grd_dt_clf.predict(X_test_comp)))

Accuracy Score: 0.658333333333


In [29]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_dt_clf.predict(X_test_comp)))

ROC AUC Score: 0.658314125933


In [30]:
print("Log Loss:", log_loss(y_test, grd_dt_clf.predict(X_test_comp)))

Log Loss: 11.8008873196


*Accuracy Score:* 0.658

*ROC AUC Score:* 0.658

*Log Loss:* 11.800

### K Nearest Neighbors 

In [31]:
scaler = StandardScaler()
knn = KNeighborsClassifier()
pipe_knn = Pipeline([
    ('scaler', scaler), 
    ('knn', knn)
])

In [32]:
knn_params = {
    'knn__n_neighbors' : range(1,11),
    'knn__weights' : ['uniform', 'distance'],
    'knn__leaf_size' : [2, 5, 10, 15, 20, 25, 30, 35]
}

In [33]:
grd_knn = GridSearchCV(pipe_knn, knn_params, cv = 5)

In [34]:
grd_knn.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'knn__n_neighbors': range(1, 11), 'knn__weights': ['uniform', 'distance'], 'knn__leaf_size': [2, 5, 10, 15, 20, 25, 30, 35]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [35]:
grd_knn.best_params_

{'knn__leaf_size': 2, 'knn__n_neighbors': 10, 'knn__weights': 'distance'}

In [36]:
grd_knn.score(X_train_comp, y_train)

1.0

In [37]:
grd_knn.score(X_test_comp, y_test)

0.7439393939393939

In [38]:
print("Accuracy Score:", accuracy_score(y_test, grd_knn.predict(X_test_comp)))

Accuracy Score: 0.743939393939


In [39]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_knn.predict(X_test_comp)))

ROC AUC Score: 0.743914085529


In [40]:
print("Log Loss:", log_loss(y_test, grd_knn.predict(X_test_comp)))

Log Loss: 8.84412541775


*Accuracy Score:* 0.744

*ROC AUC Score:* 0.744

*Log Loss:* 8.844

### Support Vector Classifier 

In [41]:
scaler = StandardScaler()
svc = SVC()
pipe_svc = Pipeline([
    ('scaler', scaler), 
    ('svc', svc)
])

In [42]:
svc_params = {
    'svc__C' : np.logspace(-10,-1,10),
    'svc__kernel' : ['rbf', 'linear', 'poly']
}

In [43]:
grd_svc = GridSearchCV(pipe_svc, svc_params, cv = 5)

In [44]:
grd_svc.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'svc__C': array([  1.00000e-10,   1.00000e-09,   1.00000e-08,   1.00000e-07,
         1.00000e-06,   1.00000e-05,   1.00000e-04,   1.00000e-03,
         1.00000e-02,   1.00000e-01]), 'svc__kernel': ['rbf', 'linear', 'poly']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [45]:
grd_svc.best_params_

{'svc__C': 0.10000000000000001, 'svc__kernel': 'rbf'}

In [46]:
grd_svc.score(X_train_comp, y_train)

0.72840909090909089

In [47]:
grd_svc.score(X_test_comp, y_test)

0.71136363636363631

In [48]:
print("Accuracy Score:", accuracy_score(y_test, grd_svc.predict(X_test_comp)))

Accuracy Score: 0.711363636364


In [49]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_svc.predict(X_test_comp)))

ROC AUC Score: 0.711129119643


In [50]:
print("Log Loss:", log_loss(y_test, grd_svc.predict(X_test_comp)))

Log Loss: 9.96929281018


*Accuracy Score:* 0.711

*ROC AUC Score:* 0.711

*Log Loss:* 9.969