## Benchmarking Results ##

***Description of Notebook:***

This notebook builds naive models to get a prediction benchmark for our classification problem. The models used are:

1. Logistic Regression
2. Decision Tree Classifier
3. K Nearest Neighbors
4. Support Vector Classifier

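The best performance came from the Decision Tree Classifier, with the Support Vector Classifier close behind.

Decision Tree Classifier had an ROC AUC Score of 0.629 and Log Loss of 12.821.

Support Vector Classifier had an ROC AUC Score of 0.624 and Log Loss of 12.978.

Note that both metrics were computed from hard class predictions (`predict`) rather than predicted probabilities, so the ROC AUC scores track accuracy almost exactly and the Log Loss values reflect the penalty for confidently wrong 0/1 predictions.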

### Results

**Logistic Regression:**

*ROC AUC Score:* 0.539

*Log Loss:* 15.909

**Decision Tree Classifier:**

*ROC AUC Score:* 0.629

*Log Loss:* 12.821

**K Nearest Neighbors:**

*ROC AUC Score:* 0.600

*Log Loss:* 13.816

**Support Vector Classifier:**

*ROC AUC Score:* 0.624

*Log Loss:* 12.978
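
The log-loss figures above are unusually large for a binary problem. One likely explanation, given that the notebook passes hard class labels from `predict` into `log_loss`, is that each misclassified row is scored as a probability clipped to roughly `1e-15`, which contributes about `-ln(1e-15) ≈ 34.54` to the average. The sketch below checks that reading against the accuracies printed later in the notebook. The clipping constant is an assumption about the scikit-learn version in use, and `approx_hard_label_log_loss` is a hypothetical helper name.

```python
import math

# Assumed clipping constant applied to 0/1 "probabilities" by older scikit-learn.
EPS = 1e-15

def approx_hard_label_log_loss(accuracy, eps=EPS):
    """Approximate log loss when hard 0/1 labels are scored as probabilities."""
    error_rate = 1.0 - accuracy
    # Correct rows contribute about 0; each wrong row contributes about -ln(eps).
    return error_rate * -math.log(eps)

# Accuracy scores printed by the notebook for each benchmark model.
reported_accuracy = {
    "Logistic Regression": 0.539393939394,
    "Decision Tree Classifier": 0.628787878788,
    "K Nearest Neighbors": 0.6,
    "Support Vector Classifier": 0.624242424242,
}

approx_losses = {name: approx_hard_label_log_loss(acc)
                 for name, acc in reported_accuracy.items()}

for name, loss in approx_losses.items():
    print(f"{name}: approx log loss {loss:.2f}")
```

Under that assumption the approximations land very close to the reported values, which suggests the Log Loss column here is essentially a rescaled error rate rather than a measure of probability calibration.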

## Benchmarking Code ##

**Package Installation and Imports**

In [1]:
!conda install psycopg2 --yes

Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda:

The following packages will be UPDATED:

    psycopg2: 2.7.3.1-py36_0 conda-forge --> 2.7.3.2-py36_0 conda-forge

psycopg2-2.7.3 100% |################################| Time: 0:00:00 972.03 kB/s


In [2]:
import psycopg2 as pg2
from psycopg2.extras import RealDictCursor
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, log_loss

**Data Import**

Connect to the database and read in Josh's data.

In [14]:
con = pg2.connect(host='34.211.227.227',
                  dbname='postgres',
                  user='postgres')
cur = con.cursor(cursor_factory=RealDictCursor)
cur.execute('SELECT * FROM madelon WHERE _id BETWEEN 0 AND 6599;')
results = cur.fetchall()
con.close()

In [None]:
# con = pg2.connect(host='34.211.227.227',
#                   dbname='postgres',
#                   user='postgres')
# cur = con.cursor(cursor_factory=RealDictCursor)
# cur.execute('SELECT * FROM madelon WHERE _id BETWEEN 6600 AND 13199;')
# results = cur.fetchall()
# con.close()

In [None]:
# con = pg2.connect(host='34.211.227.227',
#                   dbname='postgres',
#                   user='postgres')
# cur = con.cursor(cursor_factory=RealDictCursor)
# cur.execute('SELECT * FROM madelon WHERE _id BETWEEN 13200 AND 19799;')
# results = cur.fetchall()
# con.close()
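
The three connection cells above differ only in the `_id` range. A minimal sketch of generating those ranges and queries from a chunk size, rather than copying the cell, might look like the following. `chunk_bounds` and `build_query` are hypothetical helper names; the chunk size of 6600 and the three chunks are taken from the queries above.

```python
def chunk_bounds(chunk_size, n_chunks):
    """Return inclusive (start, end) _id ranges for consecutive chunks."""
    return [(i * chunk_size, (i + 1) * chunk_size - 1) for i in range(n_chunks)]

def build_query(start, end, table="madelon"):
    """Build the SELECT used in the notebook for one inclusive _id range."""
    # The bounds are integers generated locally, so string formatting is
    # acceptable here; user-supplied values should use query parameters.
    return f"SELECT * FROM {table} WHERE _id BETWEEN {int(start)} AND {int(end)};"

bounds = chunk_bounds(chunk_size=6600, n_chunks=3)
queries = [build_query(start, end) for start, end in bounds]

for q in queries:
    print(q)
```

Each generated string could then be passed to `cur.execute` in a single loop, keeping one connection block instead of three.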

In [18]:
cook_df_1 = pd.DataFrame(results)
# cook_df_2 = pd.DataFrame(results)
# cook_df_3 = pd.DataFrame(results)

In [19]:
cook_df_1.head()
# cook_df_2.head()
# cook_df_3.head()

| | _id | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | ... | feat_991 | feat_992 | feat_993 | feat_994 | feat_995 | feat_996 | feat_997 | feat_998 | feat_999 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.679428 | -0.184313 | 1.841026 | -1.212077 | -1.472139 | 0.010865 | 0.7147 | -0.987905 | 1.206416 | ... | -0.491755 | -0.868924 | -1.41834 | 1.058468 | -0.234507 | -0.315312 | 0.411613 | 1.287231 | 1.194678 | 0 |
| 1 | 1 | 0.150712 | 0.698544 | -0.994987 | 0.389926 | 0.333144 | -0.396281 | -2.188077 | 0.427875 | 0.802195 | ... | -1.10967 | 1.519362 | -1.157226 | -0.025068 | 1.38971 | 0.546423 | -0.499608 | 1.213644 | -0.220163 | 1 |
| 2 | 2 | -1.357123 | -2.08206 | 1.903423 | 0.223738 | -0.389684 | 1.29108 | -0.523464 | 0.305309 | 1.36803 | ... | 1.155164 | 0.081404 | -0.824899 | -0.856366 | 1.27139 | -1.263599 | 1.437224 | 0.636419 | 0.139863 | 1 |
| 3 | 3 | 0.717731 | -0.12549 | 0.366056 | -1.624306 | -0.71049 | -1.141389 | -0.034528 | -1.023395 | 0.587004 | ... | -1.788001 | -0.862344 | -1.9383 | 1.528354 | -2.054189 | -0.050716 | -1.112139 | -1.14479 | 1.462363 | 0 |
| 4 | 4 | -2.056765 | -1.188354 | 0.147163 | -0.007732 | -0.057824 | 0.480209 | -0.067577 | 0.680607 | 0.39083 | ... | 0.057895 | 1.581505 | 0.976392 | -0.220416 | 1.659178 | 1.686492 | -2.091279 | -1.469088 | -0.890519 | 0 |


In [36]:
cook_df_1.set_index('_id', inplace = True)
# cook_df_2.set_index('_id', inplace = True)
# cook_df_3.set_index('_id', inplace = True)

**Select the data set to use in the workflow here.**

In [37]:
# Only cook_df_1 is loaded above; uncomment the matching line when using another chunk.
df = cook_df_1.copy()
# df = cook_df_2.copy()
# df = cook_df_3.copy()

In [39]:
df.shape

(6600, 1001)

# Step 1 - Benchmarking #

In [43]:
predictors = df[df.columns[0:1000]]
target = df[df.columns[1000]]
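
The positional slice above relies on `target` being the last of 1001 columns. As a hedged alternative, the same split can be expressed by column name so it does not depend on column order. The sketch below uses a small stand-in frame (`demo_df`, an assumption for illustration) rather than the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the notebook's `df`: three feature columns plus `target`.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(5, 3)),
                       columns=["feat_000", "feat_001", "feat_002"])
demo_df["target"] = [0, 1, 1, 0, 0]

# Select by column name so the split does not depend on column position.
demo_predictors = demo_df.drop(columns="target")
demo_target = demo_df["target"]

print(demo_predictors.shape, demo_target.shape)
```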

In [45]:
predictors.shape

(6600, 1000)

In [46]:
target.shape

(6600,)

In [47]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = .2, random_state = 42)

### Logistic Regression

In [48]:
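# The L1 penalty needs a solver that supports it; liblinear is stated explicitly.
log_reg = LogisticRegression(penalty = 'l1', C = 10000, solver = 'liblinear')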

In [49]:
# Use fit() to train the model; fit_transform() on a classifier is deprecated
# in scikit-learn and only returned a reduced feature matrix.
log_reg.fit(X_train, y_train)

In [50]:
print("Accuracy Score:", accuracy_score(y_test, log_reg.predict(X_test)))

Accuracy Score: 0.539393939394


In [51]:
print("ROC AUC Score:", roc_auc_score(y_test, log_reg.predict(X_test)))

ROC AUC Score: 0.539330021396


In [52]:
print("Log Loss:", log_loss(y_test, log_reg.predict(X_test)))

Log Loss: 15.9089617579
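
The two cells above pass the hard labels from `predict` to `roc_auc_score` and `log_loss`. Both metrics are normally computed from scores or probabilities, so a hedged sketch of the probability-based version follows. It runs on synthetic stand-in data from `make_classification` (an assumption, not the Madelon sample), so the numbers it prints are not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's Madelon sample is not used here.
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

demo_log_reg = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
demo_log_reg.fit(Xd_train, yd_train)

# Probability of the positive class (column 1), not the hard 0/1 label.
proba_pos = demo_log_reg.predict_proba(Xd_test)[:, 1]

demo_auc = roc_auc_score(yd_test, proba_pos)
demo_log_loss = log_loss(yd_test, proba_pos)
print("ROC AUC (probabilities):", demo_auc)
print("Log loss (probabilities):", demo_log_loss)
```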


*Accuracy Score:* 0.539

*ROC AUC Score:* 0.539

*Log Loss:* 15.909

In [54]:
print(classification_report(y_test, log_reg.predict(X_test)))

             precision    recall  f1-score   support

          0       0.54      0.52      0.53       658
          1       0.54      0.56      0.55       662

avg / total       0.54      0.54      0.54      1320
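
The f1-score column in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values shown above; because the inputs are already rounded, agreement to two decimals is the most that can be expected. `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the logistic regression report above.
f1_class_0 = f1_from(0.54, 0.52)
f1_class_1 = f1_from(0.54, 0.56)

print(round(f1_class_0, 2), round(f1_class_1, 2))
```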



### Decision Tree Classifier

In [55]:
dt_clf = DecisionTreeClassifier()

In [56]:
dt_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
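
The fitted tree above reports `random_state=None`, so repeated runs may produce slightly different trees and scores. A minimal sketch of pinning the seed follows, assuming reproducibility is wanted. It uses synthetic stand-in data, and the seed value 42 is an arbitrary choice that mirrors the one used for `train_test_split`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's Madelon sample is not used here.
X_tree, y_tree = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)

# Two trees built with the same seed should be identical.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_tree, y_tree)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_tree, y_tree)

same_importances = bool(
    (tree_a.feature_importances_ == tree_b.feature_importances_).all())
print("Identical feature importances:", same_importances)
```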

In [57]:
print("Accuracy Score:", accuracy_score(y_test, dt_clf.predict(X_test)))

Accuracy Score: 0.628787878788


In [58]:
print("ROC AUC Score:", roc_auc_score(y_test, dt_clf.predict(X_test)))

ROC AUC Score: 0.628715598858


In [59]:
print("Log Loss:", log_loss(y_test, dt_clf.predict(X_test)))

Log Loss: 12.8213699461


*Accuracy Score:* 0.629

*ROC AUC Score:* 0.629

*Log Loss:* 12.821

In [60]:
print(classification_report(y_test, dt_clf.predict(X_test)))

             precision    recall  f1-score   support

          0       0.63      0.60      0.62       658
          1       0.62      0.65      0.64       662

avg / total       0.63      0.63      0.63      1320



### K Nearest Neighbors

In [61]:
knn = KNeighborsClassifier(n_neighbors = 5)

In [62]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [63]:
print("Accuracy Score:", accuracy_score(y_test, knn.predict(X_test)))

Accuracy Score: 0.6


In [64]:
print("ROC AUC Score:", roc_auc_score(y_test, knn.predict(X_test)))

ROC AUC Score: 0.599950412768


In [65]:
print("Log Loss:", log_loss(y_test, knn.predict(X_test)))

Log Loss: 13.815676535


*Accuracy Score:* 0.600

*ROC AUC Score:* 0.600

*Log Loss:* 13.816

In [66]:
print(classification_report(y_test, knn.predict(X_test)))

             precision    recall  f1-score   support

          0       0.60      0.58      0.59       658
          1       0.60      0.62      0.61       662

avg / total       0.60      0.60      0.60      1320



### Support Vector Classifier

In [67]:
svc = SVC(C = 10000)
svc.fit(X_train, y_train)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
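
The classifier above is fitted with `probability=False`, so `predict_proba` is not available. For a ranking metric such as ROC AUC, the signed distances from `decision_function` can serve as scores instead. The sketch below illustrates this on synthetic stand-in data, with `C=1.0` chosen only for the example rather than taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook's Madelon sample is not used here.
X_svc, y_svc = make_classification(n_samples=400, n_features=20,
                                   n_informative=5, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_svc, y_svc, test_size=0.2, random_state=42)

demo_svc = SVC(C=1.0, gamma="auto")
demo_svc.fit(Xs_train, ys_train)

# Signed distance to the separating surface: a continuous score per row.
svc_scores = demo_svc.decision_function(Xs_test)

svc_auc = roc_auc_score(ys_test, svc_scores)
print("ROC AUC (decision_function):", svc_auc)
```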

In [68]:
print("Accuracy Score:", accuracy_score(y_test, svc.predict(X_test)))

Accuracy Score: 0.624242424242


In [69]:
print("ROC AUC Score:", roc_auc_score(y_test, svc.predict(X_test)))

ROC AUC Score: 0.624160919751


In [70]:
print("Log Loss:", log_loss(y_test, svc.predict(X_test)))

Log Loss: 12.978367413


*Accuracy Score:* 0.624

*ROC AUC Score:* 0.624

*Log Loss:* 12.978

In [71]:
df.to_pickle('data/cook_df_1.p')
# df.to_pickle('data/cook_df_2.p')
# df.to_pickle('data/cook_df_3.p')
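
The `to_pickle` call above assumes that a `data/` folder already exists. A small sketch of creating the folder first and reading the file back with `pd.read_pickle` is shown below. It writes a stand-in frame to a temporary directory, and the file name `cook_df_demo.p` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame; the notebook would pickle `df` instead.
demo_frame = pd.DataFrame({"feat_000": [0.1, -0.2], "target": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, "data")
    os.makedirs(data_dir, exist_ok=True)  # to_pickle does not create folders
    path = os.path.join(data_dir, "cook_df_demo.p")
    demo_frame.to_pickle(path)
    restored = pd.read_pickle(path)

round_trip_ok = restored.equals(demo_frame)
print("Round trip preserved the frame:", round_trip_ok)
```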