# Exploratory Analysis

### Instructions
Step 0:  
A few words of caution:   
1) Read all the way through the instructions.    
2) Models must be built using Python.  
3) No additional data may be added or used.   
4) Not all data must be used to build an adequate model, but making use of complex variables will help us identify high-performance candidates.  
5) The predictions returned should be the class probabilities for belonging to the positive class, not the class itself (i.e. a decimal value, not just 1 or 0). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset.    

Step 1:  
Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers).

Step 2:  
Build your models: Please build two distinctly different machine learning/statistical models to predict the value for y. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.

Step 3:  
Create predictions on the test dataset using both of your trained models. The predictions should be the class probabilities for belonging to the positive class (labeled '1'). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset. Save the results of the two models in separate CSV files titled "results1.csv" and "results2.csv". Each result file should have a single column representing the output from one model.

Step 4:  
Submit your work: In addition to the two result files (CSV format), please submit all of your code for cleaning, prepping, and modeling your data (text, html, or PDF preferred), and a brief write-up comparing the pros and cons of the two modeling techniques you used (PDF preferred).
Please do not submit the original data back to us. Your work will be scored on techniques used (appropriateness and complexity), model performance (measured by AUC on the hold-out data), an understanding of the two techniques you compared in your write-up, and your overall code.
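Steps 2 and 3 ask for a save step, a load-and-apply step, and a one-column CSV of positive-class probabilities. The sketch below shows one possible shape for that workflow. It is not the submitted solution: the synthetic data, the `LogisticRegression` model, the `model1.joblib` file name, and the `y` column name are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model1.joblib")   # hypothetical file name
results_path = os.path.join(workdir, "results1.csv")

# Part 1: produce and save the model
joblib.dump(LogisticRegression().fit(X_demo, y_demo), model_path)

# Part 2: load the model and apply it
loaded = joblib.load(model_path)
proba = loaded.predict_proba(X_demo)[:, 1]  # column 1 = probability of class 1

# One column, one row per input row
pd.DataFrame({"y": proba}).to_csv(results_path, index=False)
results = pd.read_csv(results_path)
```

`predict_proba(...)[:, 1]` returns decimals rather than 0/1 labels, which is what the AUC-based scoring described above needs.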

In [227]:
import numpy as np
import pandas as pd

from sklearn import model_selection

# models
from sklearn import manifold
from sklearn import naive_bayes
from sklearn import svm
from sklearn import ensemble

# plots
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True)

# to make this notebook's output identical at every run
np.random.seed(42)

## Preprocess Data

In [299]:
X = pd.read_csv('exercise_01_train.csv')
X.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x91,x92,x93,x94,x95,x96,x97,x98,x99,y
0,10.142889,-15.67562,3.583176,-22.397489,27.221894,-34.110924,-0.072829,-0.544444,0.997601,-2.691778,...,1.916575,5.24082,euorpe,2.43117,0.454074,-18.572032,-14.291524,0.178579,18.11017,0
1,-52.21463,5.847135,-10.902843,-14.132351,20.588574,36.107322,0.115023,0.276093,-0.699168,-0.972708,...,0.370941,-3.794542,asia,2.592326,31.921833,3.317139,10.037003,-1.93087,-3.486898,0
2,67.7185,2.064334,12.394186,-18.667102,47.465504,-50.373658,0.253707,1.068968,2.939713,2.691218,...,1.449817,12.470532,asia,7.143821,9.40149,-10.604968,7.643215,-0.842198,-79.358236,0
3,-28.003111,8.565128,-8.592092,5.91896,-3.224154,78.315783,-0.879845,1.176889,-2.414752,0.589646,...,-3.274733,3.48445,asia,-4.998195,-20.31281,14.818524,-9.180674,1.356972,14.475681,0
4,80.703016,30.736353,-30.101857,-21.20114,-91.946233,-47.469246,-0.646831,-0.578398,0.980849,-1.426112,...,-0.644261,4.082783,asia,-0.012556,-29.334324,1.734433,-12.262072,-0.043228,-19.003881,0


In [300]:
"There are {} rows and {} columns in the train set".format(X.shape[0], X.shape[1])

'There are 40000 rows and 101 columns in the train set'

### Class Label

In [301]:
y_counts = X.y.value_counts()
data = [go.Bar(
        x=y_counts.index,
        y=y_counts.values  )]

fig = go.Figure(data=data, layout={'title':'Y Label: Value Counts'})

iplot(fig)  # iplot is imported from plotly.offline above; `py` is never defined

### Data

In [169]:
X.dtypes.value_counts()

float64    94
object      6
int64       1
dtype: int64

#### Categorical, Object Columns

Let's investigate what the object columns are

In [302]:
objects = X.select_dtypes(include='object')
objects = objects.assign(y=X.y)
objects.head()

Unnamed: 0,x34,x35,x41,x45,x68,x93,y
0,bmw,wed,$-54.1,0.0%,Jun,euorpe,0
1,nissan,thur,$-229.32,0.01%,July,asia,0
2,Honda,wed,$243.68,-0.01%,July,asia,0
3,Toyota,thur,$126.15,0.02%,May,asia,0
4,bmw,thurday,$877.39,-0.02%,July,asia,0


Columns x41 and x45 are actually numerical (a dollar amount and a percentage stored as strings) -- let's clean these and add them back to the other numerical columns

In [303]:
X.x41.value_counts()[:5]

$-511.36    4
$-369.55    4
$-370.55    4
$156.29     4
$-100.95    4
Name: x41, dtype: int64

In [304]:
# regex=False strips the literal '$'; as a regex, '$' is an end-of-string anchor
X['x41'] = pd.to_numeric(X.x41.str.replace('$', '', regex=False))

In [305]:
X.x45.value_counts()

0.01%     9546
-0.01%    9545
0.0%      7958
-0.0%     7600
0.02%     2389
-0.02%    2388
-0.03%     284
0.03%      258
-0.04%      15
0.04%       12
Name: x45, dtype: int64

In [306]:
X['x45'] = pd.to_numeric(X.x45.str.replace('%', '', regex=False))
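The two conversions above follow the same pattern: strip one literal symbol, then parse the remainder as a number. A small helper makes that explicit. This is a sketch; `strip_symbol` is a hypothetical name, and the sample values are copied from the rows shown earlier.

```python
import pandas as pd

def strip_symbol(series, symbol):
    """Remove a literal symbol from a string column and convert to numbers."""
    # regex=False treats `symbol` as plain text; '$' is otherwise a regex anchor
    return pd.to_numeric(series.str.replace(symbol, "", regex=False))

prices = pd.Series(["$-54.1", "$-229.32", "$243.68"])
rates = pd.Series(["0.0%", "0.01%", "-0.01%"])

prices_num = strip_symbol(prices, "$")
rates_num = strip_symbol(rates, "%")
```

Passing `regex=False` keeps the behavior the same across pandas versions, whose default handling of single-character patterns has changed over time.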

In [307]:
objects = objects.drop(['x41', 'x45'], axis=1)
objects.head()

Unnamed: 0,x34,x35,x68,x93,y
0,bmw,wed,Jun,euorpe,0
1,nissan,thur,July,asia,0
2,Honda,wed,July,asia,0
3,Toyota,thur,May,asia,0
4,bmw,thurday,July,asia,0


For the remaining categorical columns, let's relabel these with more appropriate names

In [308]:
old_labels = ['x34', 'x35', 'x68', 'x93', 'y']
labels = ['car_manufacturer', 'day', 'month', 'market', 'y']

In [309]:
objects.columns = labels
objects.head()

Unnamed: 0,car_manufacturer,day,month,market,y
0,bmw,wed,Jun,euorpe,0
1,nissan,thur,July,asia,0
2,Honda,wed,July,asia,0
3,Toyota,thur,May,asia,0
4,bmw,thurday,July,asia,0
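The rows above show inconsistent spellings in the categorical columns: mixed case (`Honda` vs `bmw`), `thurday` next to `thur`, and `euorpe`. One way to normalize them is sketched below. The correction dictionaries are assumptions built only from the values visible here; a real pass would extend them after inspecting `value_counts()` for each column.

```python
import pandas as pd

# Values copied from the rows shown above
cats = pd.DataFrame({
    "car_manufacturer": ["bmw", "nissan", "Honda", "Toyota", "bmw"],
    "day": ["wed", "thur", "wed", "thur", "thurday"],
    "market": ["euorpe", "asia", "asia", "asia", "asia"],
})

# Hypothetical correction tables: extend after inspecting value_counts()
day_fixes = {"thurday": "thur"}
market_fixes = {"euorpe": "europe"}

# Lower-case and trim every column, then apply the spelling fixes
cleaned = cats.apply(lambda col: col.str.strip().str.lower())
cleaned["day"] = cleaned["day"].replace(day_fixes)
cleaned["market"] = cleaned["market"].replace(market_fixes)
```

Collapsing variants like these keeps a later encoding step from treating `thur` and `thurday` as two different categories.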


Let's plot the categorical counts against the label to predict

In [310]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].car_manufacturer.value_counts()
y_label_1 = y_index.loc[1].car_manufacturer.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")
# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]
layout = go.Layout(barmode="group", title="Car Manufacturer Counts by Label")
fig = go.Figure(data=data, layout = layout)
iplot(fig)

Now let's repeat the same plot for the remaining categorical columns: day, month, and market

In [212]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].day.value_counts()
y_label_1 = y_index.loc[1].day.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]
layout = go.Layout(barmode="group", title="Day of the Week Counts by Label")
fig = go.Figure(data=data, layout = layout)
iplot(fig)

In [211]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].month.value_counts()
y_label_1 = y_index.loc[1].month.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]
layout = go.Layout(barmode="group", title="Month Counts by Label")
fig = go.Figure(data=data, layout = layout)
iplot(fig)

In [210]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].market.value_counts()
y_label_1 = y_index.loc[1].market.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]
layout = go.Layout(barmode="group", title="Market Counts by Label")
fig = go.Figure(data=data, layout = layout)
iplot(fig)
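The four plotting cells above differ only in the column name. The per-label counts they compute can be produced in one call with `pd.crosstab`, sketched here on toy data. `counts_by_label` is a hypothetical helper, and the `america` value is made up for the example.

```python
import pandas as pd

def counts_by_label(df, column, label="y"):
    """Return one row per category value and one column per class label."""
    return pd.crosstab(df[column], df[label])

demo = pd.DataFrame({
    "market": ["asia", "asia", "america", "asia", "america"],
    "y": [0, 1, 0, 0, 1],
})
table = counts_by_label(demo, "market")
```

The columns `table[0]` and `table[1]` hold the same numbers as `y_label_0` and `y_label_1`, so they could feed the two `go.Bar` traces directly.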

#### Numerical Data

In [223]:
X.dtypes.value_counts()

float64    96
object      4
int64       1
dtype: int64

In [224]:
nums = X.select_dtypes(exclude='object')
nums.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x90,x91,x92,x94,x95,x96,x97,x98,x99,y
0,10.142889,-15.67562,3.583176,-22.397489,27.221894,-34.110924,-0.072829,-0.544444,0.997601,-2.691778,...,-151.134483,1.916575,5.24082,2.43117,0.454074,-18.572032,-14.291524,0.178579,18.11017,0
1,-52.21463,5.847135,-10.902843,-14.132351,20.588574,36.107322,0.115023,0.276093,-0.699168,-0.972708,...,-58.955871,0.370941,-3.794542,2.592326,31.921833,3.317139,10.037003,-1.93087,-3.486898,0
2,67.7185,2.064334,12.394186,-18.667102,47.465504,-50.373658,0.253707,1.068968,2.939713,2.691218,...,-74.014931,1.449817,12.470532,7.143821,9.40149,-10.604968,7.643215,-0.842198,-79.358236,0
3,-28.003111,8.565128,-8.592092,5.91896,-3.224154,78.315783,-0.879845,1.176889,-2.414752,0.589646,...,165.859181,-3.274733,3.48445,-4.998195,-20.31281,14.818524,-9.180674,1.356972,14.475681,0
4,80.703016,30.736353,-30.101857,-21.20114,-91.946233,-47.469246,-0.646831,-0.578398,0.980849,-1.426112,...,-174.486251,-0.644261,4.082783,-0.012556,-29.334324,1.734433,-12.262072,-0.043228,-19.003881,0


In [237]:
nums.describe()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x90,x91,x92,x94,x95,x96,x97,x98,x99,y
count,39986.0,39990.0,39994.0,39990.0,39994.0,39990.0,39993.0,39987.0,39994.0,39993.0,...,39993.0,39998.0,39994.0,39989.0,39993.0,39985.0,39993.0,39995.0,39987.0,40000.0
mean,8.259955,-3.249786,1.030666,-0.747566,0.28382,-1.77351,-0.000232,-0.016107,-0.651093,-0.014688,...,-14.274809,0.01139,0.003948,-0.05051,-0.007572,-0.629241,-1.986671,0.036482,1.486887,0.201175
std,38.374182,15.171131,24.732185,15.22573,42.240018,42.1241,1.065955,3.382644,2.947472,1.906496,...,154.038206,3.311041,8.763944,4.97969,19.23821,16.915222,14.375663,5.633052,36.926796,0.400884
min,-140.780478,-64.493908,-105.388182,-63.804916,-158.195975,-169.237259,-4.13349,-12.96697,-12.037625,-7.4462,...,-674.004008,-12.807938,-38.121111,-21.578977,-87.669573,-77.010252,-57.709983,-23.588876,-154.559512,0.0
25%,-17.800204,-13.45858,-15.565461,-11.078276,-28.246509,-30.391354,-0.723098,-2.299081,-2.628856,-1.299759,...,-116.645845,-2.218739,-5.925508,-3.43518,-12.895717,-11.948902,-11.686033,-3.770599,-23.559519,0.0
50%,8.354662,-3.386601,1.132995,-0.714888,0.292788,-1.753365,0.001105,-0.003556,-0.659223,-0.02817,...,-11.471306,-0.006726,0.009306,-0.037111,0.124945,-0.481374,-2.026059,0.041838,1.465346,0.0
75%,33.82978,6.881661,17.677615,9.552404,28.719663,26.844781,0.715844,2.259972,1.322101,1.263469,...,90.101751,2.238996,5.909011,3.299108,12.988509,10.793171,7.61166,3.8401,26.548474,0.0
max,177.399176,62.906822,99.394915,59.338352,179.342581,170.894497,5.311653,16.619445,14.994937,7.300186,...,603.911528,14.982369,35.785334,20.983463,78.785164,70.182932,60.481075,22.759016,143.126382,1.0


In [276]:
nums_subset_plus_y = nums.iloc[:500,:5]
nums_subset_plus_y = nums_subset_plus_y.assign(y=nums.y)
nums_subset_plus_y.dropna(inplace=True)
nums_subset_plus_y.head(1)

Unnamed: 0,x0,x1,x2,x3,x4,y
0,10.142889,-15.67562,3.583176,-22.397489,27.221894,0


In [286]:
nums_subset_plus_y.iplot(title='Line Chart of First 5 Columns')

In [285]:
nums_subset_plus_y.iplot(kind='box', title='Box Plot of First 5 Columns')

In [287]:
nums['x0'].iplot(kind='hist', title='Distribution Chart of First Column, x0')

In [289]:
nums_subset_plus_y.head(100).scatter_matrix() 

In [290]:
import plotly.figure_factory as ff  # provides create_distplot

# Create distplot with custom bin_size
fig = ff.create_distplot([nums_subset_plus_y[c] for c in nums_subset_plus_y.columns[:3]], nums_subset_plus_y.columns[:3], bin_size=.25)

fig['layout'].update(title='Example Distplot w/First 3 Columns')

# Plot!
iplot(fig)

#### Clean Data

Reload a fresh copy of the data and re-clean it for modeling

In [335]:
X = pd.read_csv('exercise_01_train.csv')
X.head(1)

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x91,x92,x93,x94,x95,x96,x97,x98,x99,y
0,10.142889,-15.67562,3.583176,-22.397489,27.221894,-34.110924,-0.072829,-0.544444,0.997601,-2.691778,...,1.916575,5.24082,euorpe,2.43117,0.454074,-18.572032,-14.291524,0.178579,18.11017,0


##### Missing Data

In [324]:
(X.isnull().sum()/X.shape[0]).sort_values(ascending=False)[:10] # fraction of missing values per column (not a percentage)

x96    0.000375
x0     0.000350
x55    0.000350
x18    0.000350
x62    0.000325
x99    0.000325
x13    0.000325
x21    0.000325
x69    0.000325
x7     0.000325
dtype: float64

In [325]:
X.isnull().sum().sort_values(ascending=False)[:10]

x96    15
x0     14
x55    14
x18    14
x62    13
x99    13
x13    13
x21    13
x69    13
x7     13
dtype: int64

In [336]:
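def handle_missing_data(df, drop=True, impute=False):
    """Drop rows with missing values, or fill numeric gaps with column medians."""
    if drop:
        return df.dropna()
    if impute:
        # assumption: median imputation as a simple placeholder strategy
        return df.fillna(df.median(numeric_only=True))
    return df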
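Dropping rows throws away an entire row for a single missing value. A common alternative is to impute, learning the fill values from the training split only so that nothing leaks from the test split. The sketch below uses `SimpleImputer`, which assumes scikit-learn 0.20 or later; the toy arrays and the median strategy are illustrative choices, not what this notebook ran.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # medians learned from train only
test_filled = imputer.transform(test)        # the same medians reused on test
```

Here the column medians are 3.0 and 20.0, so the all-missing test row becomes `[3.0, 20.0]`.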

##### Relabel Data

In [337]:
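def relabel_data(df):
    # work on a copy so the caller's frame (possibly a dropna() result) is not mutated
    df = df.copy()
    # regex=False strips the literal symbols; '$' is otherwise a regex anchor
    df['x41'] = pd.to_numeric(df.x41.str.replace('$', '', regex=False))
    df['x45'] = pd.to_numeric(df.x45.str.replace('%', '', regex=False))
    return df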

In [338]:
X = handle_missing_data(X, drop=True)
X = relabel_data(X)

## Model

Let's baseline with some scikit-learn models to gauge how much tuning we need to do

In [339]:
y = X.pop('y')
X = X.select_dtypes(exclude='object')
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=42)
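The cell above discards the remaining object columns (`x34`, `x35`, `x68`, `x93`). If they are to be used as features, one option is one-hot encoding, sketched here with `pd.get_dummies` on a toy frame. The values are illustrative and assume the spelling variants have already been normalized.

```python
import pandas as pd

frame = pd.DataFrame({
    "x0": [10.1, -52.2, 67.7],
    "market": ["europe", "asia", "asia"],
})

# Each category value becomes its own indicator column
encoded = pd.get_dummies(frame, columns=["market"])
```

The same encoding has to be applied to the test set, for example by reindexing its columns to match the training frame, so that both have identical feature columns.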

In [350]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score # to compute AUC performance metric

from sklearn.model_selection import cross_val_predict

# models
from sklearn.linear_model import SGDClassifier

#### Baseline
Let's get the performance of a generic machine learning model

In [340]:
sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)  # np.infty was removed in NumPy 2.0
sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=-inf, verbose=0, warm_start=False)

In [343]:
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

In [354]:
"AUC Score: {}".format( roc_auc_score(y_train, y_train_pred) )

'AUC Score: 0.7186713922143503'
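The score above is computed from hard 0/1 predictions, which gives `roc_auc_score` only a single operating point. AUC is a ranking metric, so it is normally computed from continuous scores. The sketch below shows both variants on synthetic data, using `cross_val_predict(..., method="decision_function")` for the scores. The data and the `max_iter`/`tol` settings are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption)
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)

# Hard class labels (what the cell above uses)
labels = cross_val_predict(clf, X_demo, y_demo, cv=3)
# Continuous scores: signed distance to the decision boundary
scores = cross_val_predict(clf, X_demo, y_demo, cv=3, method="decision_function")

auc_from_labels = roc_auc_score(y_demo, labels)
auc_from_scores = roc_auc_score(y_demo, scores)
```

For models that expose `predict_proba`, `method="predict_proba"` followed by `[:, 1]` serves the same purpose and matches the probability output the instructions request.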

In [347]:
confusion_matrix(y_train, y_train_pred)

array([[19988,  3481],
       [ 2457,  3473]])

In [353]:
print(classification_report(y_train, y_train_pred))

             precision    recall  f1-score   support

          0       0.89      0.85      0.87     23469
          1       0.50      0.59      0.54      5930

avg / total       0.81      0.80      0.80     29399



#### Multiclassifier Performance
Reference: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [356]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [358]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

In [357]:
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

In [None]:
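for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_test_pred = clf.predict(X_test)
    # AUC needs a continuous score: use class-1 probabilities when available,
    # otherwise fall back to the decision function (e.g. SVC)
    if hasattr(clf, "predict_proba"):
        y_test_score = clf.predict_proba(X_test)[:, 1]
    else:
        y_test_score = clf.decision_function(X_test)
    print('Classifier: ', name)
    print('AUC Score: ', roc_auc_score(y_test, y_test_score))
    print('Confusion Matrix: ', confusion_matrix(y_test, y_test_pred))
    print('Classification Report: ', classification_report(y_test, y_test_pred))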

Classifier:  Nearest Neighbors


#### Grid Search
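As a starting point for this section, here is a minimal sketch of a grid search scored by AUC, the metric named in the instructions. The synthetic data, the choice of `RandomForestClassifier`, and the parameter grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

# Hypothetical grid: 2 x 2 = 4 candidate settings
param_grid = {"max_depth": [3, 5], "n_estimators": [10, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_auc = search.best_score_
```

`search.best_estimator_` is refit on all the data passed to `fit`, so it can be saved and reused for the final predictions.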