## Manufacturing Performance Improvement

### Objective: Prediction of manufacturing Failure to improve production performance

### Dataset Introduction

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.

In [1]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import *
from sklearn.ensemble import *
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [2]:
train_numeric = pd.read_csv('train_numeric.csv',nrows=10000)

#train_cate = pd.read_csv(DataPath+'train_categorical.csv',nrows=10000)

In [10]:
train_numeric.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,0
1,6,,,,,,,,,,...,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,0


In [11]:
train_numeric.describe()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
count,10000.0,5733.0,5733.0,5733.0,5733.0,5733.0,5733.0,5733.0,5733.0,5733.0,...,249.0,249.0,249.0,249.0,249.0,501.0,501.0,501.0,501.0,10000.0
mean,9959.5985,-0.001898,-0.002599,0.001011,0.000744,-0.001164,0.004127,0.000347,0.002321,-0.000786,...,-8e-06,8e-06,4.8e-05,0.000189,0.001004,-8e-06,6e-06,6.2e-05,9.2e-05,0.0053
std,5722.930873,0.079458,0.091942,0.21364,0.213748,0.094814,0.164772,0.019482,0.104789,0.115022,...,8.9e-05,8.9e-05,0.000566,0.001168,0.250502,8.9e-05,0.0001,0.000967,0.00106,0.072612
min,4.0,-0.31,-0.399,-0.397,-0.397,-0.404,-0.566,-0.044,-0.232,-0.393,...,-0.001,0.0,0.0,0.0,-0.25,-0.001,0.0,0.0,0.0,0.0
25%,5035.5,-0.055,-0.064,-0.179,-0.179,-0.056,-0.066,-0.015,-0.072,-0.082,...,0.0,0.0,0.0,0.0,-0.25,0.0,0.0,0.0,0.0,0.0
50%,9974.5,0.003,0.004,-0.033,-0.034,0.031,0.07,0.0,-0.032,0.0,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
75%,14896.25,0.056,0.063,0.294,0.294,0.074,0.116,0.015,0.088,0.076,...,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0
max,19923.0,0.278,0.28,0.567,0.566,0.292,0.206,0.089,0.488,0.393,...,0.0,0.001,0.008,0.014,0.25,0.0,0.002,0.018,0.017,1.0


In [12]:
train_date = pd.read_csv('train_date.csv', nrows=10000)

In [13]:
train_date.head()

Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,...,,,,,,,,,,
3,9,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,...,,,,,,,,,,
4,11,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,...,,,,,,,,,,


In [14]:
data_merge = pd.merge(train_numeric, train_date,on = 'Id')
data_merge.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,


In [15]:
dataclean = data_merge.dropna(axis=1, thresh = int(len(data_merge)*0.5)) # drop columns (axis = 1) with NA
dataclean = dataclean.fillna(0)                                         # replace NA (in each row) with 0

In [16]:
dataclean.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S34_D3877,L3_S34_D3879,L3_S34_D3881,L3_S34_D3883,L3_S37_D3942,L3_S37_D3943,L3_S37_D3945,L3_S37_D3947,L3_S37_D3949,L3_S37_D3951
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,87.28,87.28,87.28,87.28,87.29,87.29,87.29,87.29,87.29,87.29
1,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1315.75,1315.75,1315.75,1315.75,1315.75,1315.75,1315.75,1315.75,1315.75,1315.75
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,1624.42,1624.42,1624.42,1624.42,1624.42,1624.42,1624.42,1624.42,1624.42,1624.42
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,1154.15,1154.15,1154.15,1154.15,1154.16,1154.16,1154.16,1154.16,1154.16,1154.16
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,606.01,606.01,606.01,606.01,606.02,606.02,606.02,606.02,606.02,606.02


In [17]:
# encoding the labels (aligning the labels in order)
 
le = preprocessing.LabelEncoder()
dataclean['Id'] = le.fit_transform(dataclean.Id)

In [21]:
# Splitting data into Training and testing by ignoring ID column as its Identical column
from sklearn.model_selection import train_test_split

featurelist =  list(dataclean.columns.values)
featurelist.remove('Id')        
featurelist.remove('Response')   
features_train,features_test,labels_train,labels_test = train_test_split(dataclean[featurelist],
                                                              dataclean['Response'], test_size=0.1, random_state=42)


Training data
features_train  # ind columns
labels_train  # dependent columns
Testing Data
features_test # ind columns
labels_test# dependent columns

### Naive Bayes classifier #########################

In [22]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB

In [23]:
naive_bayes = BernoulliNB()
naive_bayes.fit(features_test,labels_test)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [24]:
p_station = naive_bayes.predict_proba(features_test)
p_station

array([[2.17550337e-61, 1.00000000e+00],
       [1.00000000e+00, 8.75323780e-16],
       [9.95446638e-01, 4.55336196e-03],
       ...,
       [1.00000000e+00, 1.09617096e-13],
       [1.00000000e+00, 6.21435377e-25],
       [1.00000000e+00, 3.36844174e-11]])

In [25]:
# 0 = Not failure, 1  = Failure
pred = naive_bayes.predict(features_test)


array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,

In [26]:
labels_test.shape

(1000,)

In [27]:
pred.shape

(1000,)

In [28]:
accuracy = accuracy_score(labels_test ,pred)
accuracy

0.943

## Random Forest Classifier################

In [29]:
from sklearn.ensemble import RandomForestClassifier

In [30]:
clf = RandomForestClassifier(100, max_depth = 20, n_jobs = -1)

In [31]:
clf

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [32]:
clf.fit(features_train,labels_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [33]:
accuracy = accuracy_score(labels_test, pred)
accuracy

0.943

In [34]:
pred = clf.predict(features_test)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

### Optimizing classifier by Randomised Grid Search #######

In [35]:
# parameters dict for randomised grid search cv

from scipy.stats import randint

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "min_samples_leaf": randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

In [36]:
from sklearn.model_selection import RandomizedSearchCV

optimized_forest = RandomizedSearchCV(RandomForestClassifier(), param_distributions = param_dist, n_iter = 10)

optimized_forest

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_depth': [3, None], 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0f28>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0e48>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0828>, 'bootstrap': [True, False], 'criterion': ['gini', 'entropy']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,


In [37]:
import warnings
warnings.filterwarnings('once')

optimized_forest.fit(features_train, labels_train)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'max_depth': [3, None], 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0f28>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0e48>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a0ccb0828>, 'bootstrap': [True, False], 'criterion': ['gini', 'entropy']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,


In [38]:
# describe the model with best accuracy
clf = optimized_forest.best_estimator_
clf

RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=3, max_features=7,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=3,
            min_samples_split=6, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [39]:
pred = clf.predict(features_test)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [40]:
accuracy = accuracy_score(labels_test, pred)
accuracy

0.996

### Extra Tree Classifier################

In [41]:
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

In [42]:
clf = ExtraTreesClassifier(n_estimators = 50, n_jobs = -1, min_samples_leaf= 10, verbose = 1)

In [43]:
clf.fit(features_train,labels_train)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    0.3s finished


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
           oob_score=False, random_state=None, verbose=1, warm_start=False)

In [44]:
pred = clf.predict(features_test)
pred

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    0.0s finished


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [45]:
accuracy = accuracy_score(labels_test, pred)
accuracy

0.996

## xgboost with grid search ###########

In [46]:
import xgboost as xgb
from sklearn.grid_search import GridSearchCV

In [47]:
# Find the best values of 'max_depth' and 'min_child_weight'

cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}

parameters = {'nthread':[4],                    
              'objective':['binary:logistic'],
              'learning_rate': [0.05], 
              'max_depth': [7],
              'min_child_weight': [11],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [1000], 
              'missing':[-999],
              'seed': [1337]}

xgb_model = xgb.XGBClassifier()

optimized_GBM = GridSearchCV(xgb.XGBClassifier(parameters), 
                             cv_params, 
                             scoring = 'accuracy', 
                             cv = 5, 
                             n_jobs = 4,
                             verbose = 5
                            ) 

In [48]:
import warnings

warnings.filterwarnings(action='ignore', category=DeprecationWarning)

optimized_GBM.fit(features_train, labels_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] ........ max_depth=3, min_child_weight=1, score=0.994444 -  22.4s
[CV] max_depth=3, min_child_weight=1 .................................
[CV] ........ max_depth=3, min_child_weight=1, score=0.995003 -  22.7s
[CV] max_depth=3, min_child_weight=3 .................................
[CV] ........ max_depth=3, min_child_weight=1, score=0.993889 -  22.6s
[CV] max_depth=3, min_child_weight=3 .................................
[CV] ........ max_depth=3, min_child_weight=1, score=0.994444 -  22.9s
[CV] max_depth=3, min_child_weight=3 .................................
[CV] ........ max_depth=3, min_child_weight=3, score=0.994444 -  23.6s
[CV] max_depth=3,

[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  1.2min


[CV] ........ max_depth=3, min_child_weight=3, score=0.994441 -  26.1s
[CV] max_depth=3, min_child_weight=5 .................................
[CV] ........ max_depth=3, min_child_weight=5, score=0.994444 -  24.1s
[CV] max_depth=5, min_child_weight=1 .................................
[CV] ........ max_depth=3, min_child_weight=5, score=0.994444 -  23.9s
[CV] max_depth=5, min_child_weight=1 .................................
[CV] ........ max_depth=3, min_child_weight=5, score=0.994444 -  23.8s
[CV] max_depth=5, min_child_weight=1 .................................
[CV] ........ max_depth=3, min_child_weight=5, score=0.994441 -  23.9s
[CV] max_depth=5, min_child_weight=1 .................................
[CV] ........ max_depth=5, min_child_weight=1, score=0.993337 -  39.1s
[CV] max_depth=5, min_child_weight=1 .................................
[CV] ........ max_depth=5, min_child_weight=1, score=0.993889 -  35.4s
[CV] max_depth=5, min_child_weight=3 .................................
[CV] .

[Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed:  6.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth={'nthread': [4], 'objective': ['binary:logistic'], 'learning_rate': [0.05], 'max_depth': [7], 'min_child_weight': [11], 'silent': [...0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=5)

In [49]:
optimized_GBM.grid_scores_

[mean: 0.99433, std: 0.00042, params: {'max_depth': 3, 'min_child_weight': 1},
 mean: 0.99456, std: 0.00022, params: {'max_depth': 3, 'min_child_weight': 3},
 mean: 0.99444, std: 0.00000, params: {'max_depth': 3, 'min_child_weight': 5},
 mean: 0.99333, std: 0.00093, params: {'max_depth': 5, 'min_child_weight': 1},
 mean: 0.99422, std: 0.00044, params: {'max_depth': 5, 'min_child_weight': 3},
 mean: 0.99456, std: 0.00022, params: {'max_depth': 5, 'min_child_weight': 5},
 mean: 0.99322, std: 0.00113, params: {'max_depth': 7, 'min_child_weight': 1},
 mean: 0.99400, std: 0.00042, params: {'max_depth': 7, 'min_child_weight': 3},
 mean: 0.99456, std: 0.00022, params: {'max_depth': 7, 'min_child_weight': 5}]

##### Exploring the best values of parameters 

In [57]:
# Find the best values of 'learning_rate' and 'subsample'

cv_params = {'max_depth': [7], 
             'min_child_weight': [5], 
             'learning_rate': [0.1, 0.01], 
             'subsample': [0.7, 0.8, 0.9]
            }

parameters = {'nthread': [4],                    
              'objective':['binary:logistic'],
              'learning_rate': [0.05], 
              'max_depth': [6],
              'min_child_weight': [11],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [1000], 
              'missing':[-999],
              'random_state': [1337]
             }

xgb_model = xgb.XGBClassifier()

optimized_GBM = GridSearchCV(xgb.XGBClassifier(parameters), 
                             cv_params, 
                             scoring = 'accuracy', 
                             cv = 5, 
                             n_jobs = 4,
                             verbose = 5
                            ) 

In [51]:
optimized_GBM.fit(features_train, labels_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7 
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7 
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7 
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994448 -  22.9s
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994444 -  23.5s
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.8 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994444 -  24.0s
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.8 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994444 -  24.3s
[CV] learning_rate=0.1, max_depth=7, min_child_w

[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  1.3min


[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.9, score=0.995003 -  29.3s
[CV] learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.9 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.9, score=0.994444 -  30.2s
[CV] learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994448 -   7.9s
[CV] learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994444 -   7.7s
[CV] learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.1, max_depth=7, min_child_weight=5, subsample=0.9, score=0.994444 -  29.1s
[CV] learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7 
[CV]  learning_rate=0.01, max_depth=7, min_child_weight=5, subsample=0.7, score=0.994444 -   8.3s
[CV] learning_rate=0.01, max_depth=7, min_child_wei

[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:  2.3min remaining:    0.0s
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:  2.3min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth={'nthread': [4], 'objective': ['binary:logistic'], 'learning_rate': [0.05], 'max_depth': [6], 'min_child_weight': [11], 'silent': [...0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'max_depth': [7], 'min_child_weight': [5], 'learning_rate': [0.1, 0.01], 'subsample': [0.7, 0.8, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=5)

In [52]:
optimized_GBM.grid_scores_

[mean: 0.99444, std: 0.00000, params: {'learning_rate': 0.1, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.7},
 mean: 0.99456, std: 0.00022, params: {'learning_rate': 0.1, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.8},
 mean: 0.99456, std: 0.00022, params: {'learning_rate': 0.1, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.9},
 mean: 0.99456, std: 0.00022, params: {'learning_rate': 0.01, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.7},
 mean: 0.99456, std: 0.00022, params: {'learning_rate': 0.01, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.8},
 mean: 0.99456, std: 0.00022, params: {'learning_rate': 0.01, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.9}]

In [53]:
xgdmat = xgb.DMatrix(features_train, labels_train) # Create our DMatrix to make XGBoost more efficient

In [54]:
params = {'eta': 0.1, 'random_state': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':3, 'min_child_weight':1} 
# Grid Search CV optimized settings

cv_xgb = xgb.cv(params = params, dtrain = xgdmat, num_boost_round = 3000, nfold = 5,
                metrics = ['error'],           # Make sure you enter metrics inside a list or you may encounter issues!
                early_stopping_rounds = 100)   # Look for early stopping that minimizes error

In [55]:
cv_xgb.tail(5)

Unnamed: 0,test-error-mean,test-error-std,train-error-mean,train-error-std
0,0.005445,0.001238,0.005445,0.000309


Now that we have our best settings, let’s create this as an XGBoost object model that we can reference later.

In [58]:
best_params = {'eta': 0.1, 
               'random_state': 0, 
               'subsample': 0.9, 
               'colsample_bytree': 0.8, 
               'objective': 'binary:logistic', 
               'max_depth': 7, 
               'min_child_weight':5
              } 

final_gb = xgb.train(best_params, xgdmat, num_boost_round = 432)

In [61]:
final_gb.predict(xgdmat)

array([2.5035692e-03, 8.5630280e-05, 9.2141535e-03, ..., 1.3326209e-04,
       2.9597999e-04, 1.8479968e-03], dtype=float32)

In [62]:
%matplotlib inline
import seaborn as sns
sns.set(font_scale = 1.5)

Analyzing Performance on Test Data

The model has now been tuned using cross-validation grid search through the sklearn API and early stopping through the built-in XGBoost API. Now, we can see how it finally performs on the test set. Does it match our CV performance? First, create another DMatrix (this time for the test data).

In [66]:
testdmat = xgb.DMatrix(features_test, labels_test)

In [67]:
from sklearn.metrics import accuracy_score
y_pred = final_gb.predict(testdmat) # Predict using testdmat
y_pred

array([5.32376766e-03, 2.59926601e-04, 2.15576423e-04, 2.27064538e-05,
       5.31940022e-05, 8.26115429e-05, 1.32427696e-04, 2.16945584e-04,
       2.38324152e-04, 3.42542116e-05, 1.22178899e-04, 1.38507559e-04,
       3.49947310e-04, 1.15617993e-03, 8.94222292e-04, 1.30898416e-05,
       1.61435557e-04, 5.32376766e-03, 2.10244780e-05, 3.21217522e-04,
       1.57114072e-03, 2.59167919e-05, 2.96985963e-04, 1.43660815e-04,
       1.09937240e-03, 1.09267006e-04, 6.52321323e-05, 6.53717507e-05,
       4.59718940e-05, 3.31310206e-04, 2.45478121e-03, 3.93428170e-04,
       1.02856115e-03, 4.68228245e-03, 1.06104097e-04, 7.66891462e-05,
       9.94330840e-05, 6.31879084e-04, 2.11622086e-04, 1.28609172e-04,
       1.10779001e-04, 1.80154678e-03, 2.00286228e-03, 5.32376766e-03,
       1.32281290e-04, 2.86765484e-04, 2.09269393e-03, 1.81548923e-04,
       6.92603767e-01, 5.72598306e-03, 1.29658278e-04, 2.54044950e-04,
       4.16901102e-03, 5.59615437e-04, 2.14734755e-05, 8.04985233e-04,
      

In [77]:
final_gb.eval(testdmat, name='eval', iteration=0)

'[0]\teval-error:0.006000'

In [None]:
def makesubmit(clf,testdf,featurelist,output="submit.csv"):
    testdf = testdf.fillna(0)
    feature_test = testdf[featurelist]
    
    pred = clf.predict(feature_test)
    
    ids = list(testdf['Id'])
    
    fout = open(output,'w')
    fout.write("Id,Response\n")
    for i,id in enumerate(ids):
        fout.write('%s,%s\n' % (str(id),str(pred[i])))
    fout.close()
