## Modeling Results ##

***Description of notebook:***

I ran my best model fit and scored from the sample of the data from Josh Cook's database on the full dataset.

The full steps of modeling:
1. Train Test Split
2. Min Max Scaler
3. Deskewing (Boxcox)
4. PCA (5 components)
5. Standard Scaler
6. Model

Steps 5 and 6 were built into a pipeline and gridsearched on to tune hyperparameters.

As mentioned, K Nearest Neighbors performed best. Results below:

### Results ###

**K Nearest Neighbors:**

*ROC AUC Score:* 0.757

*Log Loss:* 8.389

## Modeling Code ##

In [1]:
% run __init__.py

In [2]:
cd ..

/home/jovyan/Documents/GA_DSI/Projects/project_3


In [3]:
feats = pd.read_pickle('data/twenty_feats.p')

In [4]:
feats = list(feats[0])
print(feats)

['feat_257', 'feat_269', 'feat_308', 'feat_315', 'feat_336', 'feat_341', 'feat_395', 'feat_504', 'feat_526', 'feat_639', 'feat_681', 'feat_701', 'feat_724', 'feat_736', 'feat_769', 'feat_808', 'feat_829', 'feat_867', 'feat_920', 'feat_956']


Use feats to pull full josh data for only those columns

In [5]:
con = pg2.connect(host='34.211.227.227',
                  dbname='postgres',
                  user='postgres')
cur = con.cursor(cursor_factory=RealDictCursor)
cur.execute('SELECT feat_257, feat_269, feat_308, feat_315, feat_336, feat_341, feat_395, feat_504, feat_526, \
feat_639, feat_681, feat_701, feat_724, feat_736, feat_769, feat_808, feat_829, feat_867, feat_920, \
feat_956, target FROM madelon;')
results = cur.fetchall()
con.close()

In [6]:
df = pd.DataFrame(results)

In [7]:
df.shape

(200000, 21)

In [8]:
predictors = df[df.columns[0:20]]
target = df[df.columns[20]]

In [9]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = .2, random_state = 42)

Min Max Scaling (as contingency against 0's and negatives)

In [10]:
min_max = MinMaxScaler(feature_range=(0.0001, 1))
X_train_sc = pd.DataFrame(min_max.fit_transform(X_train))
X_test_sc = pd.DataFrame(min_max.fit_transform(X_test))

Deskewing

In [11]:
def box_cox(train_df, test_df):
    '''Input X_train and X_test to get those dataframes deskewed'''
    X_train_bc = pd.DataFrame()
    X_test_bc = pd.DataFrame()
    for col in train_df.columns:
        box_cox_train, lmbda = boxcox(train_df[col])
        box_cox_test = boxcox(test_df[col], lmbda)
        X_train_bc[col] = pd.Series(box_cox_train)
        X_test_bc[col] = pd.Series(box_cox_test)
    
    return X_train_bc, X_test_bc

In [12]:
X_train_bc, X_test_bc = box_cox(X_train_sc, X_test_sc)

PCA

In [13]:
pca = PCA(n_components = 5)
X_train_comp = pca.fit_transform(X_train_bc)
X_test_comp = pca.transform(X_test_bc)

I could have put the standard scaler here and taken it out of pipelines

### K Nearest Neighbors 

In [34]:
scaler = StandardScaler()
knn = KNeighborsClassifier()
pipe_knn = Pipeline([
    ('scaler', scaler), 
    ('knn', knn)
])

In [35]:
knn_params = {
    'knn__n_neighbors' : range(1,11, 2),
    'knn__weights' : ['uniform', 'distance'],
    'knn__leaf_size' : [2, 10, 30]
}

In [36]:
grd_knn = GridSearchCV(pipe_knn, knn_params, cv = 5)

In [37]:
grd_knn.fit(X_train_comp, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('knn', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'knn__n_neighbors': range(1, 11, 2), 'knn__weights': ['uniform', 'distance'], 'knn__leaf_size': [2, 10, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [38]:
grd_knn.best_params_

{'knn__leaf_size': 2, 'knn__n_neighbors': 9, 'knn__weights': 'distance'}

In [39]:
grd_knn.score(X_train_comp, y_train)

1.0

In [40]:
grd_knn.score(X_test_comp, y_test)

0.75714999999999999

In [41]:
print("Accuracy Score:", accuracy_score(y_test, grd_knn.predict(X_test_comp)))

Accuracy Score: 0.75715


In [42]:
print("ROC AUC Score:", roc_auc_score(y_test, grd_knn.predict(X_test_comp)))

ROC AUC Score: 0.757149956893


In [43]:
print("Log Loss:", log_loss(y_test, grd_knn.predict(X_test_comp)))

Log Loss: 8.38783859879


*Accuracy Score:* 0.757

*ROC AUC Score:* 0.757

*Log Loss:* 8.389