<H2>Code for the 11th-place solution in the McKinsey Hiring Hack by Analytics Vidhya</H2><p>(http://datahack.analyticsvidhya.com/contest/mckinsey-hiring-hack)</p>

<H4>Set working directory</H4>

In [1]:
import os
# Raw string so the backslash is not read as an escape sequence
os.chdir(r'...\Mckinsey')

In [2]:
%matplotlib inline

import numpy as np 
import scipy as sp 
import matplotlib as mpl 
import matplotlib.cm as cm 
import matplotlib.pyplot as plt 
import pandas as pd 

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns 

<H4>Read training and test data</H4>

In [3]:
train = pd.read_csv('Train_psolI3n.csv')
test = pd.read_csv('Test_09JmpYa.csv')

<H3>Basic Data Cleansing</H3><H4>Replacing missing values in the column 'Total_Past_Communications' with the mode of that column</H4>

In [4]:
train['Total_Past_Communications'].mode()

0    30
dtype: float64

In [5]:
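# Assign the result back; inplace=True on a selected column may not update the frame
train['Total_Past_Communications'] = train['Total_Past_Communications'].fillna(30)
test['Total_Past_Communications'] = test['Total_Past_Communications'].fillna(30)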

<H4>Replacing missing values in the column 'Total_Links' with the mode of that column</H4>

In [6]:
train['Total_Links'].mode()

0    11
dtype: float64

In [7]:
# Assign the result back; inplace=True on a selected column may not update the frame
train['Total_Links'] = train['Total_Links'].fillna(11)
test['Total_Links'] = test['Total_Links'].fillna(11)

<H4>Replacing missing values in the column 'Total_Images' with the mode of that column</H4>

In [8]:
train['Total_Images'].mode()

0    0
dtype: float64

In [9]:
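# Assign the result back; inplace=True on a selected column may not update the frame
train['Total_Images'] = train['Total_Images'].fillna(0)
test['Total_Images'] = test['Total_Images'].fillna(0)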
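
<p>The three imputation steps above hard-code each mode after reading it off the training data. A minimal sketch of the same idea with the mode computed in code is shown here. The toy frames and the helper name <code>fill_with_train_mode</code> are assumptions for illustration and are not part of the original notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
toy_train = pd.DataFrame({
    "Total_Links": [11.0, 11.0, 4.0, np.nan],
    "Total_Images": [0.0, np.nan, 0.0, 3.0],
})
toy_test = pd.DataFrame({
    "Total_Links": [np.nan, 7.0],
    "Total_Images": [2.0, np.nan],
})


def fill_with_train_mode(train_df, test_df, columns):
    """Fill NaNs in both frames with the mode computed on the training frame."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in columns:
        # mode() returns a Series; take the first entry if there are ties
        mode_value = train_df[col].mode().iloc[0]
        train_df[col] = train_df[col].fillna(mode_value)
        test_df[col] = test_df[col].fillna(mode_value)
    return train_df, test_df


toy_train_filled, toy_test_filled = fill_with_train_mode(
    toy_train, toy_test, ["Total_Links", "Total_Images"]
)
```

<p>Taking the mode from the training frame only, and reusing it for the test frame, mirrors what the cells above do by hand.</p>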

<H4>Converting column 'Customer_Location' to numeric</H4>

In [10]:
# Default code 0 is kept for rows whose Customer_Location is missing
train['Cust_Loc'] = 0
test['Cust_Loc'] = 0

In [11]:
# .ix was removed from pandas; .loc does the same label-based assignment
train.loc[train.Customer_Location=='A', 'Cust_Loc'] = 1
train.loc[train.Customer_Location=='B', 'Cust_Loc'] = 2
train.loc[train.Customer_Location=='C', 'Cust_Loc'] = 3
train.loc[train.Customer_Location=='D', 'Cust_Loc'] = 4
train.loc[train.Customer_Location=='E', 'Cust_Loc'] = 5
train.loc[train.Customer_Location=='F', 'Cust_Loc'] = 6
train.loc[train.Customer_Location=='G', 'Cust_Loc'] = 7

test.loc[test.Customer_Location=='A', 'Cust_Loc'] = 1
test.loc[test.Customer_Location=='B', 'Cust_Loc'] = 2
test.loc[test.Customer_Location=='C', 'Cust_Loc'] = 3
test.loc[test.Customer_Location=='D', 'Cust_Loc'] = 4
test.loc[test.Customer_Location=='E', 'Cust_Loc'] = 5
test.loc[test.Customer_Location=='F', 'Cust_Loc'] = 6
test.loc[test.Customer_Location=='G', 'Cust_Loc'] = 7
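
<p>The per-letter assignments above can also be written as one dictionary lookup. The sketch below runs on a made-up frame; the names <code>toy</code> and <code>location_codes</code> are assumptions for illustration. Missing locations fall back to 0, matching the default set before the assignments.</p>

```python
import pandas as pd

# Made-up rows; None stands for a missing Customer_Location.
toy = pd.DataFrame({"Customer_Location": ["A", "G", None, "C"]})

# 'A' -> 1, 'B' -> 2, ..., 'G' -> 7
location_codes = {letter: code for code, letter in enumerate("ABCDEFG", start=1)}

# map() yields NaN for values not in the dictionary; fill those with 0
toy["Cust_Loc"] = (
    toy["Customer_Location"].map(location_codes).fillna(0).astype(int)
)
```

<p>The same expression could be applied to both the training and test frames, which keeps the two encodings from drifting apart.</p>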

<H4>Separating x and y variables</H4>

In [12]:
y = train['Email_Status'].values
X = train.drop(['Email_ID','Email_Status','Customer_Location'], axis = 1).values

<H4>Holding out 20% of the training data for validation</H4>

In [13]:
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [15]:
from sklearn.ensemble import GradientBoostingClassifier
# sklearn.grid_search was removed; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV

<H4>Using Grid Search to find the best parameters for Gradient Boosting Classifier</H4>

In [16]:
est = GradientBoostingClassifier(n_estimators = 2000, learning_rate=0.01, max_depth = 3, min_samples_leaf = 10, max_features = 0.25)

In [17]:
est.fit(X_train, y_train)

GradientBoostingClassifier(init=None, learning_rate=0.01, loss='deviance',
              max_depth=3, max_features=0.25, max_leaf_nodes=None,
              min_samples_leaf=10, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=2000,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [18]:
param_grid = {'learning_rate': [0.1,0.033,0.01], 'max_depth': [1,2,3], 'min_samples_leaf': [5,10,15], 'max_features': [1.0,0.5]}

In [19]:
est = GradientBoostingClassifier(n_estimators = 2000)

In [20]:
gs_cv = GridSearchCV(est, param_grid).fit(X_train,y_train)

<H4>Best parameters</H4>

In [21]:
gs_cv.best_params_

{'learning_rate': 0.033,
 'max_depth': 2,
 'max_features': 1.0,
 'min_samples_leaf': 5}
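
<p>As a rough illustration of the grid-search step above, here is a small self-contained sketch on synthetic data. The data set, the reduced grid, <code>n_estimators=20</code>, <code>cv=3</code> and <code>scoring="accuracy"</code> are all assumptions chosen so the example runs quickly; they are not the settings used in the notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for the email data (assumption).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Much smaller grid and far fewer trees than the notebook, purely for speed.
small_grid = {"learning_rate": [0.1, 0.033], "max_depth": [1, 2]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    small_grid,
    cv=3,                # 3-fold cross-validation for each parameter combination
    scoring="accuracy",  # mean fold accuracy decides the winner
)
search.fit(X_toy, y_toy)

best_params = search.best_params_  # dict with one value per grid key
best_score = search.best_score_    # mean cross-validated accuracy of the best combination
```

<p>Setting <code>cv</code> and <code>scoring</code> explicitly makes the selection criterion visible instead of relying on library defaults, which have changed between scikit-learn versions.</p>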

<H4>Estimating the optimal number of trees with grid search, keeping the other parameters at their tuned values</H4>

In [22]:
param_grid = {'n_estimators': [1000, 2000, 3000, 4000, 5000, 6000]}

In [23]:
est = GradientBoostingClassifier(learning_rate = 0.033, max_depth = 2, min_samples_leaf = 5, max_features = 1.0)

In [24]:
gs_cv = GridSearchCV(est, param_grid).fit(X_train,y_train)

In [25]:
gs_cv.best_params_

{'n_estimators': 2000}

<H4>Training using the tuned hyper-parameters</H4>

In [26]:
est = GradientBoostingClassifier(n_estimators = 2000, learning_rate = 0.033, min_samples_leaf = 5, max_features = 1.0, max_depth = 2)

In [27]:
est.fit(X_train, y_train)

GradientBoostingClassifier(init=None, learning_rate=0.033, loss='deviance',
              max_depth=2, max_features=1.0, max_leaf_nodes=None,
              min_samples_leaf=5, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=2000,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

<H4>Predicting on the held-out validation set</H4>

In [28]:
val_pred = est.predict(X_test)

In [29]:
validation = pd.DataFrame({"y":y_test, "pred":val_pred})

In [30]:
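# Use bracket access: attribute assignment would shadow DataFrame.diff() instead of adding a column
validation['diff'] = validation['y'] - validation['pred']

In [31]:
accu_validation = validation[validation['diff'] == 0]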

In [32]:
accu_validation.shape[0]

11188

In [33]:
validation.shape[0]

13671

<H4>Accuracy on the validation set</H4>

In [34]:
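# Correct predictions divided by validation rows (11188 / 13671)
accu_validation.shape[0] / float(validation.shape[0])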

0.8183746616926341
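
<p>The accuracy above is the share of validation rows where the prediction equals the label. A minimal sketch of that calculation on made-up arrays follows; <code>y_true_toy</code> and <code>y_pred_toy</code> are hypothetical stand-ins for <code>y_test</code> and <code>val_pred</code>.</p>

```python
import numpy as np

# Made-up labels and predictions (three of five match).
y_true_toy = np.array([0, 1, 2, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 2])

# Element-wise comparison gives booleans; their mean is the fraction correct.
accuracy = float(np.mean(y_true_toy == y_pred_toy))
```

<p>Comparing labels directly avoids the intermediate difference column and works for any class encoding, not only numeric ones.</p>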

<H4>Predicting for test data</H4>

In [35]:
test1 = test.drop(['Email_ID','Customer_Location'], axis = 1).values

In [36]:
pred1 = est.predict(test1)

<H4>Creating submission file</H4>

In [37]:
submission = pd.DataFrame({"Email_ID":test.Email_ID, "Email_Status":pred1})

In [38]:
submission.to_csv("submission_0424_1.csv", index=False)