# Homework 5: Training a classifier

## Introduction

In this homework, you will have the opportunity to train your best classifier on a data set provided on Kaggle for a current competition: 
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction 

The data has slightly been transformed from its original form and down-sampled but you can find a description of the data here:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data

You can also find inspiration on how to improve your model here:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/kernels

The "id" column is a unique identifier and has to be used as an index. The columns you want to predict is the column "target". All the other columns can be used in training.

## The files

You will 3 files with this homework:
- train_data.csv
- test_data.csv
- submission_example.csv

The first file will be used for training and contains all the attributes, the target and a unique identifier. The second file will be used to assess your classifier and contains only the attributes along with a unique identifier. The last file is an example of a submission.

## Machine Learning Libraries 

There are many Machine Learning (ML) Libraries available for python. The most known one is obviously [scikit-learn](http://scikit-learn.org/stable/supervised_learning.html) where many classifier algorithms are available for you to try. Currently [XGBoost](http://xgboost.readthedocs.io/en/latest/python/python_intro.html), [LightGBM](https://github.com/Microsoft/LightGBM) and [H2O](http://h2o-release.s3.amazonaws.com/h2o/rel-weierstrass/7/index.html) are amount the best performing ML libraries for supervised learning tasks. You are invited and advised to try all packages to find what works best for the data in this homework. 

Very popular, are also all the Artificial Neural Network libraries that can be more difficult to tame:
- [TensorFlow](https://www.tensorflow.org/)
- [Pytorch](http://pytorch.org/)
- [Keras](https://keras.io/)
- [MXNet](http://mxnet.incubator.apache.org/)
- ...

## Training a Classifier

When it comes training a classifier there are different techniques to consider to improving performance:
- Tuning the hyperparameters: find the right algorithm parameters. 
- Features selection: filter away the features that may be detrimental to the performance.
- Data transformation: transform the data to make it easier for a learner to learn from. 
- Features augmentation: create new features from the old ones.
- ...

It will be important for you to cross-validate your results to find the optimal parameters and features. There are many functions in scikit-learn that can help you achieve that:
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) are two examples of functions that can help you tune your models
- Scikit-learn has a variety of methods for [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)
- There are many ways to [preprocess a data set](http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features). [Box-cox transformation](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.boxcox.html) is also a classic method.
- It can be interesting to use unsupervised learning techniques to [create new features](http://scikit-learn.org/stable/unsupervised_learning.html)

Fundamental problems you will need to solve when it comes to preprocessing the data is to find a way to deal with missing values and categorical variables.

## The grading

The grading will be two parts:
- In the first part, you will explain in details what you have tried and what got you your best performance. You will describe the difficulties you encountered and how you solved them. The format of this small report does not need to be fancy. (10 points) 
- The second part, you will submit your predictions of a test set. You will be able to submit as many times as you want and only your best score will be retained. The performance will be assessed with AUC. (10 points)

To establish your grade of the second part, rules of the game are simple. This following code

```
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

train_df = pd.read_csv("./train_data.csv", index_col=["id"])
test_X = pd.read_csv("./test_data.csv", index_col=["id"])
test_Y = pd.read_csv("./test_target.csv", index_col=["id"])

col_cat = train_df.select_dtypes(include=[object]).columns
for col in col_cat:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col].fillna("-1"))
    test_X[col] = le.transform(test_X[col].fillna("-1"))
    
train_df.fillna(-1, inplace=True)
test_X.fillna(-1, inplace=True)

clf = LogisticRegression(n_jobs=-1, C=1e6)
y_train = train_df["target"]
X_train =  train_df.drop("target", 1)
clf.fit(X_train, y_train)

metrics.roc_auc_score(test_Y, clf.predict_proba(test_X)[:, 1])
```

yields a AUC = 0.619522. If you get this AUC or lower you get 0 points. After a quick but more in-depth train, I was able to obtain AUC = 0.64486707. If you get this AUC or higher, you get the full points. Every intermediary performance will give you an intermediary grade according to the following scheme:

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import rankdata
bins = np.linspace(0.619522, 0.64486707, 10)
mean_bins = (bins[1:] +  bins[:-1]) / 2
pd.DataFrame({"Intervals": pd.cut(mean_bins, bins, precision=6), "Grade": rankdata(mean_bins)})

Unnamed: 0,Grade,Intervals
0,1.0,"(0.619522, 0.622338]"
1,2.0,"(0.622338, 0.625154]"
2,3.0,"(0.625154, 0.62797]"
3,4.0,"(0.62797, 0.630786]"
4,5.0,"(0.630786, 0.633603]"
5,6.0,"(0.633603, 0.636419]"
6,7.0,"(0.636419, 0.639235]"
7,8.0,"(0.639235, 0.642051]"
8,9.0,"(0.642051, 0.644867]"


Everyday, I will publish on Piazza your best performance to see where you stand.

## Submission

You will need to submit a .csv file with exactly the following format (you can see the example in the current folder):

# Start/Load

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
import xgboost as xgb
import pandas as pd
import numpy as np

train_df = pd.read_csv("./train_data.csv", index_col=["id"])
test_X = pd.read_csv("./test_data.csv", index_col=["id"])

col_cat = train_df.select_dtypes(include=[object]).columns
for col in col_cat:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col].fillna("-1"))
    test_X[col] = le.transform(test_X[col].fillna("-1"))

train_df.fillna(-1, inplace=True)
test_X.fillna(-1, inplace=True)

clf = LogisticRegression(n_jobs=-1, C=1e6)
y_train = train_df["target"]
X_train =  train_df.drop("target", 1)
clf.fit(X_train, y_train)

submission_example = pd.DataFrame({"prediction": clf.predict_proba(test_X)[:, 1]}, index=test_X.index) 
submission_example.to_csv("./submission_example.csv")

  " = {}.".format(self.n_jobs))


# Create Y

In [3]:
y=train_df["target"]

In [4]:
train_df = pd.read_csv("./train_data.csv", index_col=["id"])
test_X = pd.read_csv("./test_data.csv", index_col=["id"])

col_cat = train_df.select_dtypes(include=[object]).columns
for col in col_cat:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col].fillna("-1"))
    test_X[col] = le.transform(test_X[col].fillna("-1"))

train_df.fillna(-1, inplace=True)
test_X.fillna(-1, inplace=True)

clf = LogisticRegression(n_jobs=-1, C=1e6)
y_train = train_df["target"]
X_train =  train_df.drop("target", 1)
clf.fit(X_train, y_train)

submission_example = pd.DataFrame({"prediction": clf.predict_proba(test_X)[:, 1]}, index=test_X.index) 
submission_example.to_csv("./submission_example.csv")


train_df
test_X

  " = {}.".format(self.n_jobs))


Unnamed: 0_level_0,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
40371,4.0,1,10.0,1,4,0.0,1.0,0.0,0.0,0.0,...,8.0,4.0,3.0,11.0,0.0,1.0,1.0,1.0,0.0,1.0
16433,5.0,1,8.0,2,4,0.0,0.0,0.0,1.0,0.0,...,7.0,2.0,2.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0
57666,0.0,3,4.0,1,4,0.0,0.0,0.0,1.0,0.0,...,8.0,0.0,3.0,8.0,0.0,0.0,1.0,0.0,0.0,0.0
35804,1.0,1,5.0,1,4,0.0,1.0,0.0,0.0,0.0,...,7.0,2.0,4.0,10.0,0.0,1.0,1.0,1.0,0.0,0.0
13058,0.0,1,4.0,2,4,1.0,0.0,0.0,0.0,0.0,...,7.0,1.0,1.0,2.0,0.0,0.0,0.0,1.0,0.0,1.0
36236,2.0,1,5.0,1,4,1.0,0.0,0.0,0.0,0.0,...,6.0,2.0,2.0,7.0,1.0,0.0,1.0,0.0,0.0,0.0
19000,1.0,3,1.0,1,4,0.0,0.0,0.0,1.0,0.0,...,8.0,1.0,3.0,5.0,0.0,1.0,1.0,0.0,0.0,0.0
21203,0.0,1,7.0,2,4,1.0,0.0,0.0,0.0,0.0,...,10.0,0.0,1.0,6.0,0.0,1.0,1.0,0.0,1.0,0.0
22959,3.0,1,4.0,2,4,0.0,0.0,0.0,1.0,0.0,...,5.0,1.0,2.0,7.0,0.0,1.0,0.0,0.0,0.0,0.0
28411,5.0,3,5.0,1,4,0.0,1.0,0.0,0.0,0.0,...,8.0,2.0,2.0,5.0,0.0,1.0,1.0,1.0,0.0,0.0


## One Hot Encode

In [5]:
col_cat = train_df.select_dtypes(include=[object]).columns
train_df_enc = train_df.copy()
test_df_enc= test_X.copy
for col in col_cat:
    le = LabelEncoder()
    
    train_df_enc[col] = le.fit_transform(train_df_enc[col])
    test_df_enc[col] = le.fit_transform(test_df_enc[col])
        
train_df_enc
test_df_enc

<bound method NDFrame.copy of        ps_ind_01  ps_ind_02_cat  ps_ind_03  ps_ind_04_cat  ps_ind_05_cat  \
id                                                                         
40371        4.0              1       10.0              1              4   
16433        5.0              1        8.0              2              4   
57666        0.0              3        4.0              1              4   
35804        1.0              1        5.0              1              4   
13058        0.0              1        4.0              2              4   
36236        2.0              1        5.0              1              4   
19000        1.0              3        1.0              1              4   
21203        0.0              1        7.0              2              4   
22959        3.0              1        4.0              2              4   
28411        5.0              3        5.0              1              4   
58790        1.0              1        8.0              2 

In [6]:
train_df_oh = pd.get_dummies(train_df, drop_first=True)
train_df_oh

Unnamed: 0_level_0,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
63851,0.0,4.0,1,4.0,1,4,0.0,1.0,0.0,0.0,...,5.0,2.0,4.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0
31961,1.0,1.0,3,1.0,2,4,0.0,0.0,1.0,0.0,...,3.0,1.0,3.0,5.0,0.0,0.0,1.0,0.0,0.0,0.0
61724,0.0,0.0,1,2.0,2,4,1.0,0.0,0.0,0.0,...,6.0,3.0,3.0,7.0,0.0,1.0,1.0,0.0,1.0,0.0
48001,1.0,2.0,1,7.0,1,4,0.0,0.0,1.0,0.0,...,9.0,0.0,2.0,14.0,0.0,1.0,1.0,0.0,1.0,0.0
45624,1.0,0.0,1,2.0,2,4,0.0,1.0,0.0,0.0,...,4.0,2.0,2.0,3.0,0.0,1.0,0.0,0.0,1.0,0.0
45111,0.0,1.0,3,1.0,2,4,1.0,0.0,0.0,0.0,...,3.0,0.0,3.0,6.0,0.0,1.0,0.0,0.0,1.0,0.0
34413,0.0,2.0,1,2.0,2,4,1.0,0.0,0.0,0.0,...,7.0,1.0,1.0,11.0,0.0,1.0,0.0,1.0,1.0,0.0
29451,0.0,3.0,1,5.0,2,4,0.0,1.0,0.0,0.0,...,9.0,0.0,3.0,11.0,0.0,1.0,1.0,1.0,0.0,1.0
52857,1.0,1.0,1,5.0,2,4,1.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,0.0,1.0,0.0,0.0,0.0
38219,0.0,1.0,3,1.0,2,4,1.0,0.0,0.0,0.0,...,7.0,1.0,2.0,7.0,0.0,0.0,1.0,0.0,1.0,0.0


In [7]:
train_df_oh.fillna(train_df_oh.median(), inplace=True)
train_df_enc.fillna(train_df_enc.median(), inplace=True)
train_df_enc.isnull().any()
train_df_enc.drop("target",axis=1, inplace=True)

# Processing

In [8]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.grid_search import GridSearchCV
from xgboost import XGBClassifier



In [9]:
# param_test1 = {'n_estimators':np.arange(20,81,10)}
# gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1,min_samples_split=500,
#                                                                min_samples_leaf=50,max_depth=8,
#                                                                max_features='sqrt',subsample=0.8,
#                                                                random_state=10), 
#                         param_grid = param_test1, scoring='roc_auc',
#                         n_jobs=4,iid=False, cv=5)
# gsearch1.fit(train_df_enc,y)
# gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [10]:
# param_test1 = {'max_depth':np.arange(1,10,1)}
# gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1,
#                                                                n_estimators=50,
#                                                                min_samples_split=500,
#                                                                min_samples_leaf=50,
#                                                                max_features='sqrt',subsample=0.8,
#                                                                random_state=10), 
#                         param_grid = param_test1, scoring='roc_auc',
#                         n_jobs=4,iid=False, cv=10)
# gsearch1.fit(train_df_enc,y)
# gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

In [11]:
#for testing
xgb_clf = XGBClassifier(learning_rate=0.1,
                        nthread=-1,
                        subsample=1,
                        max_depth=6,
                        gamma=0.1,
                        min_child_weight=1,
                        seed=1)

param_grid = {"n_estimators": np.arange(1,100,10)}

gridsearch = GridSearchCV(estimator=xgb_clf, 
                          param_grid=param_grid, 
                          scoring="roc_auc",
                          n_jobs=-1,
                          iid=False, 
                          cv=3)
gridsearch.fit(train_df_enc,y_train)
gridsearch.grid_scores_, gridsearch.best_params_, gridsearch.best_score_

([mean: 0.60094, std: 0.00919, params: {'n_estimators': 1},
  mean: 0.62033, std: 0.01039, params: {'n_estimators': 11},
  mean: 0.62432, std: 0.01052, params: {'n_estimators': 21},
  mean: 0.62622, std: 0.01076, params: {'n_estimators': 31},
  mean: 0.62779, std: 0.01000, params: {'n_estimators': 41},
  mean: 0.62790, std: 0.00938, params: {'n_estimators': 51},
  mean: 0.62750, std: 0.00874, params: {'n_estimators': 61},
  mean: 0.62714, std: 0.00884, params: {'n_estimators': 71},
  mean: 0.62649, std: 0.00853, params: {'n_estimators': 81},
  mean: 0.62546, std: 0.00842, params: {'n_estimators': 91}],
 {'n_estimators': 51},
 0.627902235479878)

# Fit/Write

In [12]:
#Inputing optimized values (GBM) from tuning above
# clf= GradientBoostingClassifier(learning_rate=0.1,
#                                 n_estimators=50, 
#                                 min_samples_split=500,
#                                 min_samples_leaf=50,
#                                 max_depth=5,
#                                 max_features='sqrt',
#                                 subsample=0.8,
#                                 random_state=10)
# clf.fit(train_df_enc,y)
# submission_example = pd.DataFrame({"prediction": clf.predict_proba(test_X)[:, 1]}, index=test_X.index) 
# submission_example.to_csv("./submission_example.csv")

In [45]:
#Inputing optimized values (XGB) from tuning above
xgb_clf = XGBClassifier(max_depth=3,
                        learning_rate=0.222222222222,
                        n_estimators=51,
                        nthread=-1,
                        gamma=0.16,
                        min_child_weight=4,
                        subsample=.6,
                        seed=1)
xgb_clf.fit(train_df_enc,y_train)
submission_example = pd.DataFrame({"prediction": xgb_clf.predict_proba(test_X)[:,1]}, index=test_X.index) 
submission_example.to_csv("./xgb_submission.csv")

In [52]:
YOUR_FILE = "xgb_submission.csv"
TEST_FILE = "./test_data.csv"

submission_df = pd.read_csv(YOUR_FILE, index_col=["id"])
test_df = pd.read_csv(TEST_FILE, index_col=["id"])

fake_test_target = pd.DataFrame({"target": np.random.choice([0,1], size=test_df.shape[0])}, index=test_df.index)
submission_df["Truth"] = fake_test_target["target"]

metrics.roc_auc_score(submission_df["Truth"], submission_df["prediction"])

0.50414390086602223