This is my submission for the McKinsey Analytics Online Hackathon, held 20-22 July 2018.

More details can be found at:
https://datahack.analyticsvidhya.com/contest/mckinsey-analytics-online-hackathon-4/

Public Leaderboard highest score: 0.7296413363, My score: 0.7020503051

Private Leaderboard highest score: 0.7941335047, My score: 0.7548701932

#### Problem Statement

Your client is an insurance company that needs your help building a model to predict the propensity to pay the renewal premium, and an incentive plan for its agents that maximises the net revenue (i.e. renewals minus the incentives given to collect them) collected from policies after their issuance.
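
As a rough illustration of the objective, net revenue can be read as the expected renewal premium minus the incentive paid out, summed over policies. The sketch below assumes each policy already has a renewal probability that reflects its incentive; how an incentive changes that probability is not described in this section, so `expected_net_revenue` and the sample numbers are hypothetical.

```python
def expected_net_revenue(policies):
    """Sum of (renewal probability * premium - incentive) over policies.

    Each policy is a (probability, premium, incentive) tuple; the probability
    is assumed to already reflect the effect of the incentive.
    """
    return sum(p * premium - incentive for p, premium, incentive in policies)

# Hypothetical policies: (renewal probability, premium, incentive)
sample = [(0.9, 3300, 100), (0.5, 18000, 400)]
total = expected_net_revenue(sample)
print(total)
```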

#### Evaluation Criteria

A.	The base probability of receiving a premium on a policy without considering any incentive <br>
B.	The monthly incentives you will provide on each policy to maximize the net revenue 

#### The Data

Variable : Definition
<br>id	: Unique ID of the policy
<br>perc_premium_paid_by_cash_credit	: Percentage of premium amount paid by cash or credit card
<br>age_in_days	: Age in days of policy holder
<br>Income	: Monthly Income of policy holder
<br>Count_3-6_months_late	: No of premiums late by 3 to 6 months
<br>Count_6-12_months_late	: No of premiums late by 6 to 12 months
<br>Count_more_than_12_months_late	: No of premiums late by more than 12 months
<br>application_underwriting_score	: Underwriting Score of the applicant at the time of application (No applications under the score of 90 are insured)
<br>no_of_premiums_paid	: Total premiums paid on time till now
<br>sourcing_channel	: Sourcing channel for application
<br>residence_area_type	: Area type of Residence (Urban/Rural)
<br>premium	: Monthly premium amount
<br>renewal	: Policy Renewed? (0 - not renewed, 1 - renewed)



### Contents
[1. EDA](#eda)

[2. Data preprocessing](#preprocessing)

[3. Random Forest Classifier](#rf)

[4. Random Forest with Randomised Search CV](#rf_cv)

[5. Gradient Boosting Classifier](#gbc)

[6. Gradient Boosting with Randomised Search CV](#gbc_cv)

[7. Optimisation problem](#optimisation)

[8. Submission for test set given](#given_test)

### Import packages

In [4]:
import pandas as pd
import numpy as np
pd.options.display.max_rows=999

### EDA <a id=eda></a>

In [8]:
#train dataset (id column kept here; it is excluded later via iloc)
train = pd.read_csv('train.csv')
# train.drop('id', inplace=True, axis=1)
#test dataset (id column kept)
test = pd.read_csv('test.csv')
# test.drop('id', inplace=True, axis=1)
train.head()

Unnamed: 0,id,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,renewal
0,110936,0.429,12058,355060,0.0,0.0,0.0,99.02,13,C,Urban,3300,1
1,41492,0.01,21546,315150,0.0,0.0,0.0,99.89,21,A,Urban,18000,1
2,31300,0.917,17531,84140,2.0,3.0,1.0,98.69,7,C,Rural,3300,0
3,19415,0.049,15341,250510,0.0,0.0,0.0,99.57,9,A,Urban,9600,1
4,99379,0.052,31400,198680,0.0,0.0,0.0,99.87,12,B,Urban,9600,1


In [6]:
#training set with 79,853 policies and 13 columns (id, 11 features and the renewal target)
train.shape

(79853, 13)

In [7]:
train.describe()

Unnamed: 0,id,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,premium,renewal
count,79853.0,79853.0,79853.0,79853.0,79756.0,79756.0,79756.0,76879.0,79853.0,79853.0,79853.0
mean,57167.166368,0.314288,18846.696906,208847.2,0.248671,0.078188,0.060008,99.067291,10.863887,10924.507533,0.93741
std,32928.97016,0.334915,5208.719136,496582.6,0.691468,0.436507,0.312023,0.739799,5.170687,9401.676542,0.242226
min,2.0,0.0,7670.0,24030.0,0.0,0.0,0.0,91.9,2.0,1200.0,0.0
25%,28640.0,0.034,14974.0,108010.0,0.0,0.0,0.0,98.81,7.0,5400.0,1.0
50%,57262.0,0.167,18625.0,166560.0,0.0,0.0,0.0,99.21,10.0,7500.0,1.0
75%,85632.0,0.538,22636.0,252090.0,0.0,0.0,0.0,99.54,14.0,13800.0,1.0
max,114076.0,1.0,37602.0,90262600.0,13.0,17.0,11.0,99.89,60.0,60000.0,1.0


### Target variable (Renewal)
- highly imbalanced classes: only about 6% of policies (4,998 of 79,853) were not renewed

In [7]:
train['renewal'].value_counts()

1    74855
0     4998
Name: renewal, dtype: int64
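
The counts above translate into class shares as follows. The sketch also shows why plain accuracy is a weak yardstick here: a model that predicts renewal for every policy would already be right most of the time. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {1: 74855, 0: 4998}
total = sum(counts.values())

share_not_renewed = counts[0] / total
share_renewed = counts[1] / total

# Accuracy of a trivial model that predicts "renewed" for every policy
majority_baseline_accuracy = share_renewed

print(f"not renewed: {share_not_renewed:.2%}")
print(f"always-renew accuracy: {majority_baseline_accuracy:.2%}")
```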

### Data Preprocessing <a id=preprocessing></a>
Rationale: the Gradient Boosting Classifier used later only accepts numeric values, so the categorical columns are encoded and the missing values are imputed.

In [15]:
#on scikit-learn >= 0.20 use instead: from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import Imputer

In [16]:
train.isnull().sum()

id                                     0
perc_premium_paid_by_cash_credit       0
age_in_days                            0
Income                                 0
Count_3-6_months_late                 97
Count_6-12_months_late                97
Count_more_than_12_months_late        97
application_underwriting_score      2974
no_of_premiums_paid                    0
sourcing_channel                       0
residence_area_type                    0
premium                                0
renewal                                0
dtype: int64

In [17]:
#Numerical Encoding for Categorical Variables
#Replace Urban with 1, Rural with 0 (restricted to the column it applies to)
train['residence_area_type'] = train['residence_area_type'].map({'Urban': 1, 'Rural': 0})

#replace the 5 sourcing channels A-E with 1-5
#note: this imposes an arbitrary order on a nominal variable
train['sourcing_channel'] = train['sourcing_channel'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5})

In [18]:
#checking for null values
train[train.isnull().any(axis=1)]

Unnamed: 0,id,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,renewal
18,68260,0.012,16433,87530,0.0,0.0,0.0,,5,1,0,3300,1
26,27507,0.000,8406,24690,0.0,0.0,0.0,,4,1,1,1200,1
39,60644,0.000,16799,99230,0.0,0.0,0.0,,6,1,0,5700,1
60,48944,0.000,20810,150120,0.0,0.0,0.0,,5,1,1,7500,1
61,91494,0.000,13875,125060,0.0,0.0,0.0,,10,2,1,11700,1
76,65014,0.266,18259,187550,1.0,0.0,0.0,,3,1,1,18000,1
77,47377,0.000,8044,24660,0.0,0.0,0.0,,4,1,1,1200,1
84,49811,0.327,21550,55360,1.0,0.0,0.0,,22,1,1,5400,0
140,28283,0.000,26656,315080,0.0,0.0,0.0,,10,1,1,13800,1
145,85482,0.714,18994,60060,1.0,0.0,0.0,,14,4,1,5700,1


In [26]:
#Imputer
#fill null values with the median of each column
#remove id column and target variable column from train set
X = Imputer(strategy='median').fit_transform(train.iloc[:,1:-1])
X

array([[  4.29000000e-01,   1.20580000e+04,   3.55060000e+05, ...,
          3.00000000e+00,   1.00000000e+00,   3.30000000e+03],
       [  1.00000000e-02,   2.15460000e+04,   3.15150000e+05, ...,
          1.00000000e+00,   1.00000000e+00,   1.80000000e+04],
       [  9.17000000e-01,   1.75310000e+04,   8.41400000e+04, ...,
          3.00000000e+00,   0.00000000e+00,   3.30000000e+03],
       ..., 
       [  0.00000000e+00,   2.33720000e+04,   3.05020000e+05, ...,
          1.00000000e+00,   0.00000000e+00,   9.60000000e+03],
       [  3.98000000e-01,   2.26410000e+04,   3.93300000e+04, ...,
          1.00000000e+00,   0.00000000e+00,   5.70000000e+03],
       [  5.50000000e-01,   1.57090000e+04,   2.80140000e+05, ...,
          1.00000000e+00,   1.00000000e+00,   9.60000000e+03]])

### Simple Random Forest Model <a id=rf></a>
- My first submission

In [62]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, accuracy_score, recall_score, confusion_matrix, roc_auc_score

In [95]:
#target is the last column (renewal); X is the imputed feature array from above
y = train.iloc[:,-1]

In [96]:
#use train test split to validate model results later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
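
For reference, the sizes of the two splits follow from the row count and `test_size`. The arithmetic below assumes scikit-learn's rounding for a fractional `test_size` (the test share is rounded up and the remainder goes to training); the variable names are illustrative.

```python
import math

n_samples = 79853   # rows in train.csv, from train.shape
test_size = 0.33

n_test = math.ceil(test_size * n_samples)   # test share, rounded up
n_train = n_samples - n_test                # remainder used for training

print(n_train, n_test)
```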

In [66]:
#simple rf model without hyperparameter tuning
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [67]:
y_pred = rf_clf.predict(X_test)

In [68]:
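#get a score of 0.69
#note: roc_auc_score expects (y_true, y_score); here the arguments are swapped and
#y_pred holds hard 0/1 labels, so this is not the usual probability-based AUC
roc_auc_score(y_pred,y_test)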

0.69306698219046026
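
ROC AUC is a ranking metric, so it is normally computed from predicted probabilities, with the true labels as the first argument. The plain-Python sketch below uses a pairwise definition on made-up numbers (it is not scikit-learn's implementation, and `pairwise_auc` is a hypothetical helper) to show how much ranking information is lost when the scores are first thresholded into 0/1 labels.

```python
def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 1, 0, 0]
proba = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]          # made-up class-1 probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard predictions at 0.5

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

With the notebook's objects, the probability-based call would be `roc_auc_score(y_test, rf_clf.predict_proba(X_test)[:, 1])`.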

In [None]:
best_pred_prob = rf_clf.predict_proba(X_test)

### RF with Randomised Search CV <a id=rf_cv></a>
- My second submission
- Using randomised search cv to tune hyperparameters
- Took about 3 hours to run this

In [78]:
from sklearn.model_selection import RandomizedSearchCV
import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 1000, num = 10)]
# Criterion to measure quality of split
criterion = ['gini','entropy']
# Number of features to consider at every split
max_features = ['auto', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pp = pprint.PrettyPrinter()
pp.pprint(random_grid)

{'bootstrap': [True, False],
 'criterion': ['gini', 'entropy'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'log2'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000]}
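
To put `n_iter = 100` in context, the grid printed above can be counted directly. The sketch below rebuilds the same dictionary in plain Python, counts the combinations, and draws one random candidate the way a randomised search conceptually does; the sampling code is an illustration, not scikit-learn's internals.

```python
import math
import random

random_grid = {
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
    'max_features': ['auto', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000],
}

# Total number of distinct hyperparameter combinations in the grid
n_combinations = math.prod(len(values) for values in random_grid.values())

# One randomly drawn candidate (seeded only to make the draw repeatable)
rng = random.Random(42)
candidate = {name: rng.choice(values) for name, values in random_grid.items()}

print(n_combinations)
print(candidate)
```

With 100 draws, the search covers a little over 1% of the 8,640 combinations.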


In [79]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
# (scoring is not set, so candidates are ranked by accuracy rather than ROC AUC)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
...
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 10.4min
...
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 48.9min
...

[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 94.5min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'criterion': ['gini', 'entropy'], 'max_features': ['auto', 'log2'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring=None, verbose=2)

In [81]:
#hyperparameters of best RF after tuning
best_rf = rf_random.best_estimator_
rf_random.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 890}

In [97]:
#refit the tuned model on the training split and predict on the held-out split
best_rf.fit(X_train,y_train)
best_y_pred = best_rf.predict(X_test)

In [98]:
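#get a score of about 0.76
#note: same caveat as before; the arguments are swapped relative to (y_true, y_score)
#and best_y_pred holds hard 0/1 labels rather than probabilities
roc_auc_score(best_y_pred, y_test)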

0.764803996752075

In [154]:
best_pred_prob = best_rf.predict_proba(X_test)

In [158]:
#probability of renewal (class 1) for each held-out policy
best_pred_prob[:,1].shape

(26352,)

### Simple GBC Model <a id=gbc></a>
- My third submission attempt
- Using gradient boosting classifier with no hyperparameter tuning

In [20]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, accuracy_score, recall_score, confusion_matrix, roc_auc_score

In [21]:
#target is the last column (renewal); X is the imputed feature array from above
y = train.iloc[:,-1]

In [22]:
#using train test split to validate the model later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [23]:
gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(X_train,y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)

In [24]:
y_pred = gbc_clf.predict(X_test)

In [25]:
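# roc_auc score of 0.749 using gradient boosting classifier
# roc_auc was used as it is one of the evaluation criteria
# note: roc_auc_score expects (y_true, y_score); the arguments are swapped here and
# y_pred holds hard 0/1 labels, so this is not the usual probability-based AUC
roc_auc_score(y_pred,y_test)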

0.74939341421143857

In [None]:
best_pred_prob = gbc_clf.predict_proba(X_test)


### GBC with Randomised Search CV <a id=gbc_cv></a>
- I tried tuning the hyperparameters of the gradient boosting classifier, but the search took about 8 hours to run and scored worse in the solution checker than the simple GBC model

In [21]:
from sklearn.model_selection import RandomizedSearchCV
import pprint

# Number of boosting stages (trees)
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 10)]
# Loss function to optimise
loss = ['deviance','exponential']
# Number of features to consider at every split
max_features = ['auto', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3, 110, num = 12)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'loss' : loss,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

pp = pprint.PrettyPrinter()
pp.pprint(random_grid)

{'loss': ['deviance', 'exponential'],
 'max_depth': [3, 12, 22, 32, 41, 51, 61, 71, 80, 90, 100, 110],
 'max_features': ['auto', 'log2'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [100, 311, 522, 733, 944, 1155, 1366, 1577, 1788, 2000]}


In [23]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
gbc = GradientBoostingClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
# (scoring is not set, so candidates are ranked by accuracy rather than ROC AUC)
gbc_random = RandomizedSearchCV(estimator = gbc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
gbc_random.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
...
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 36.8min
...
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 218.1min
...
[CV]  n_estimators=944, min_samples_split=10, min_samples_leaf=4, max_features=auto, max_depth=71, loss=exponential, total=10.3min
[CV] n_estimators=1577, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=12, loss=deviance 
[CV]  n_estimators=1577, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=12, loss=deviance, total= 3.2min
[CV] n_estimators=1577, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=12, loss=deviance 
[CV]  n_estimators=1577, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=12, loss=deviance, total= 3.1min
[CV] n_estimators=1577, min_samples_split=10, min_samples_leaf=2, max_features=auto, max_depth=110, loss=deviance 
[CV]  n_estimators=15

[CV] n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=log2, max_depth=80, loss=exponential 
[CV]  n_estimators=1577, min_samples_split=5, min_samples_leaf=1, max_features=auto, max_depth=80, loss=deviance, total= 5.7min
[CV] n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=log2, max_depth=80, loss=exponential 
[CV]  n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=log2, max_depth=80, loss=exponential, total= 3.8min
[CV] n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=log2, max_depth=80, loss=exponential 
[CV]  n_estimators=1577, min_samples_split=5, min_samples_leaf=1, max_features=auto, max_depth=80, loss=deviance, total= 5.6min
[CV] n_estimators=1577, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=110, loss=deviance 
[CV]  n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=log2, max_depth=80, loss=exponential, total= 3.8min
[CV] n_estimators=1

[CV] n_estimators=311, min_samples_split=5, min_samples_leaf=1, max_features=log2, max_depth=90, loss=deviance 
[CV]  n_estimators=522, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=32, loss=exponential, total= 3.5min
[CV] n_estimators=311, min_samples_split=5, min_samples_leaf=1, max_features=log2, max_depth=90, loss=deviance 
[CV]  n_estimators=522, min_samples_split=10, min_samples_leaf=1, max_features=log2, max_depth=32, loss=exponential, total= 3.5min
[CV] n_estimators=1577, min_samples_split=2, min_samples_leaf=4, max_features=log2, max_depth=71, loss=deviance 
[CV]  n_estimators=311, min_samples_split=5, min_samples_leaf=1, max_features=log2, max_depth=90, loss=deviance, total= 2.9min
[CV] n_estimators=1577, min_samples_split=2, min_samples_leaf=4, max_features=log2, max_depth=71, loss=deviance 
[CV]  n_estimators=311, min_samples_split=5, min_samples_leaf=1, max_features=log2, max_depth=90, loss=deviance, total= 3.0min
[CV] n_estimators=1577, min_sample

[CV] n_estimators=100, min_samples_split=5, min_samples_leaf=4, max_features=auto, max_depth=90, loss=exponential 
[CV]  n_estimators=311, min_samples_split=10, min_samples_leaf=2, max_features=auto, max_depth=32, loss=exponential, total= 7.7min
[CV] n_estimators=2000, min_samples_split=2, min_samples_leaf=2, max_features=log2, max_depth=12, loss=deviance 
[CV]  n_estimators=2000, min_samples_split=2, min_samples_leaf=2, max_features=log2, max_depth=12, loss=deviance, total= 3.0min
[CV] n_estimators=2000, min_samples_split=2, min_samples_leaf=2, max_features=log2, max_depth=12, loss=deviance 
[CV]  n_estimators=100, min_samples_split=5, min_samples_leaf=4, max_features=auto, max_depth=90, loss=exponential, total= 8.4min
[CV] n_estimators=2000, min_samples_split=2, min_samples_leaf=2, max_features=log2, max_depth=12, loss=deviance 
[CV]  n_estimators=2000, min_samples_split=2, min_samples_leaf=2, max_features=log2, max_depth=12, loss=deviance, total= 3.1min
[CV] n_estimators=2000, min_s

[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 401.4min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [100, 311, 522, 733, 944, 1155, 1366, 1577, 1788, 2000], 'loss': ['deviance', 'exponential'], 'max_features': ['auto', 'log2'], 'max_depth': [3, 12, 22, 32, 41, 51, 61, 71, 80, 90, 100, 110], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score=True, scoring=None, verbose=2)

In [24]:
best_gbc = gbc_random.best_estimator_
gbc_random.best_params_

{'loss': 'deviance',
 'max_depth': 3,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 5,
 'n_estimators': 100}

In [25]:
best_gbc.fit(X_train,y_train)
best_y_pred = best_gbc.predict(X_test)

In [26]:
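#score of only 0.75
#note: roc_auc_score expects (y_true, y_score); here the arguments are swapped and
#hard class labels are passed instead of probabilities, so this is not the usual ROC AUC
roc_auc_score(best_y_pred, y_test)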

0.75024549445471356
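
For reference, `roc_auc_score` takes the true labels first and the scores second, and it is usually given predicted probabilities rather than hard labels. A minimal sketch on made-up data (the arrays below are illustrative, not from the competition) shows that the three call patterns can give different values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels and predicted probabilities (not the hackathon data)
y_true = np.array([0, 0, 1, 1, 1, 1])
proba = np.array([0.2, 0.6, 0.55, 0.7, 0.8, 0.9])

# Hard labels at a 0.5 threshold, similar to what predict() returns
hard = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)   # true labels first, probabilities second
auc_hard = roc_auc_score(y_true, hard)     # hard labels discard the ranking information
auc_swapped = roc_auc_score(hard, y_true)  # swapped arguments score a different problem

print(auc_proba, auc_hard, auc_swapped)
```

In this notebook the probability-based call would presumably be `roc_auc_score(y_test, best_gbc.predict_proba(X_test)[:,1])`.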

In [27]:
best_pred_prob = best_gbc.predict_proba(X_test)

In [28]:
best_pred_prob[:,1]

array([ 0.97650362,  0.98708603,  0.9850488 , ...,  0.6387139 ,
        0.97603632,  0.98364302])

### Incentive optimisation function <a id=optimisation></a>

Additional info given during the hackathon: 

#### Objective function
<br> Total Net Revenue = sum[(benchmark_p + delta_p)*premium - incentives]

where

benchmark_p : renewal probability predicted using a benchmark model by the insurance company
<br> delta_p : (% improvement in renewal probability * benchmark_p), the improvement in renewal probability resulting from the agent's effort in hours

#### Constraints:
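<br> 1. Effort-incentives curve: Y = 10*(1-exp(-X/400)), where X is the incentive and Y is the agent effort in hours
<br> 2. % improvement in renewal probability vs effort curve: Y = 20*(1-exp(-X/5)), where X is the agent effort in hours and Y is the % improvement in renewal probability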
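
The two curves chain together: an incentive buys agent effort, and effort buys a percentage improvement in renewal probability. A minimal sketch of that composition follows. The function names are hypothetical, and dividing the percentage by 100 is my assumption about how "% improvement" should enter delta_p:

```python
from math import exp

def effort_hours(incentive):
    """Constraint 1: agent effort (hours) obtained for a given incentive."""
    return 10 * (1 - exp(-incentive / 400))

def pct_improvement(effort):
    """Constraint 2: % improvement in renewal probability for a given effort."""
    return 20 * (1 - exp(-effort / 5))

def net_revenue(incentive, benchmark_p, premium):
    """Net revenue for one policy; the % improvement is read as a percentage (assumed /100)."""
    delta_p = benchmark_p * pct_improvement(effort_hours(incentive)) / 100
    return (benchmark_p + delta_p) * premium - incentive

# Illustrative policy: benchmark probability 0.9, premium 10000, incentive 500
print(effort_hours(500), pct_improvement(effort_hours(500)), net_revenue(500, 0.9, 10000))
```

Both curves saturate: effort approaches 10 hours and the improvement approaches 20% however large the incentive, so incentives have diminishing returns.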

In [29]:
#build a dataframe of the held-out test split (taken from the training data) with its predictions and renewal probabilities
pred_df = pd.DataFrame(np.concatenate([X_test, best_y_pred.reshape(-1,1), best_pred_prob[:,1].reshape(-1,1)], axis=1))
#feature columns (excluding id and renewal) plus the two prediction columns
col = list(train.columns[1:-1]) + ['pred', 'pred_proba']
pred_df.columns = col
pred_df.head()

Unnamed: 0,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,pred,pred_proba
0,0.125,16797.0,180610.0,0.0,0.0,0.0,98.86,15.0,4.0,1.0,13800.0,1.0,0.976504
1,0.021,23375.0,168110.0,0.0,0.0,0.0,99.19,20.0,2.0,1.0,7500.0,1.0,0.987086
2,0.033,21908.0,102550.0,0.0,0.0,0.0,98.38,7.0,1.0,1.0,7500.0,1.0,0.985049
3,0.147,13510.0,270440.0,0.0,0.0,0.0,99.28,7.0,3.0,1.0,24300.0,1.0,0.983955
4,1.0,10965.0,51130.0,0.0,0.0,0.0,99.79,3.0,1.0,1.0,3300.0,1.0,0.917183


In [30]:
from scipy.optimize import differential_evolution, minimize
from math import exp

In [31]:
#we want to maximise net revenue, but scipy.optimize only minimises,
#so the objective function is multiplied by -1
#x is the incentive; 20*(1-exp(-2+2*exp(-x/400))) composes the two curves above
#(% improvement as a function of incentive)
def obj_ftn(x, bench_prob, premium):
    return -1*(premium*(bench_prob+(20*(1-exp(-2+2*exp(-x/400))))) - x)
    

In [32]:
def incentives(x):
    sol = minimize(obj_ftn, 0, args=(x[0], x[1]))
    return sol.x[0]

In [33]:
pred_df['incentives'] = pred_df[['pred_proba','premium']].apply(incentives, axis=1)

In [34]:
pred_df.head()

Unnamed: 0,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,pred,pred_proba,incentives
0,0.125,16797.0,180610.0,0.0,0.0,0.0,98.86,15.0,4.0,1.0,13800.0,1.0,0.976504,2095.98529
1,0.021,23375.0,168110.0,0.0,0.0,0.0,99.19,20.0,2.0,1.0,7500.0,1.0,0.987086,1855.477522
2,0.033,21908.0,102550.0,0.0,0.0,0.0,98.38,7.0,1.0,1.0,7500.0,1.0,0.985049,1855.396941
3,0.147,13510.0,270440.0,0.0,0.0,0.0,99.28,7.0,3.0,1.0,24300.0,1.0,0.983955,2319.383125
4,1.0,10965.0,51130.0,0.0,0.0,0.0,99.79,3.0,1.0,1.0,3300.0,1.0,0.917183,1536.892031


In [35]:
#base revenue
(pred_df['premium']*pred_df['pred_proba']).sum()

271571595.2638097

In [36]:
#revenue with incentives
(pred_df['premium']*(pred_df['pred_proba']+(20*(1-np.exp(-2+2*np.exp(-pred_df['incentives']/400))))) - pred_df['incentives']).sum()

5187674172.690677

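The incentives appear to increase the company's revenue. However, this was my first attempt at optimising an objective function, and I was not able to implement it completely: `obj_ftn` adds the % improvement (up to 20) directly to the renewal probability instead of using delta_p = (% improvement * benchmark_p), so the revenue with incentives shown above is overstated.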
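
The objective function section defines delta_p as the % improvement multiplied by benchmark_p. A minimal sketch of a per-policy optimiser under that reading is below. It assumes the percentage is divided by 100, bounds the incentive between 0 and the premium, and does not cap the resulting probability at 1; `neg_net_revenue` and `best_incentive` are hypothetical names, and this is not the code used for the submission:

```python
from math import exp
from scipy.optimize import minimize_scalar

def neg_net_revenue(incentive, benchmark_p, premium):
    """Negative net revenue for one policy, so that a minimiser maximises revenue."""
    effort = 10 * (1 - exp(-incentive / 400))   # effort-incentives curve
    pct = 20 * (1 - exp(-effort / 5))           # % improvement vs effort curve
    delta_p = benchmark_p * pct / 100           # assumption: pct is a percentage
    return -((benchmark_p + delta_p) * premium - incentive)

def best_incentive(benchmark_p, premium):
    """Search incentives in [0, premium] for the one that maximises net revenue."""
    res = minimize_scalar(neg_net_revenue, bounds=(0, premium),
                          args=(benchmark_p, premium), method='bounded')
    return res.x

# Illustrative policy taken from the first row of pred_df above (rounded)
x_opt = best_incentive(0.9765, 13800)
print(x_opt, -neg_net_revenue(x_opt, 0.9765, 13800))
```

Because both curves saturate, the revenue gain from an extra unit of incentive shrinks as the incentive grows, which is why a bounded one-dimensional search is enough here.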

### On given test set for submission <a id = given_test></a>
We repeat the data preprocessing steps on the given test set and fit the selected model to the data. The code below uses the simple GBC model.

In [37]:
#Replace Urban with 1, Rural with 0
test.replace('Urban',1, inplace=True)
test.replace('Rural',0 ,inplace=True)

#replace for the 5 sourcing channels
test.replace('A',1,inplace=True)
test.replace('B',2,inplace=True)
test.replace('C',3,inplace=True)
test.replace('D',4,inplace=True)
test.replace('E',5,inplace=True)
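
The replacements above act on every column of the frame. An alternative sketch, assuming the same encoding, restricts each mapping to its own column with `Series.map`; the small frame and the `area_map` / `channel_map` names are made up for illustration:

```python
import pandas as pd

# Hypothetical two-row frame with the same categorical columns as the test set
df = pd.DataFrame({
    'sourcing_channel': ['A', 'C'],
    'residence_area_type': ['Urban', 'Rural'],
    'premium': [3300, 11700],
})

area_map = {'Urban': 1, 'Rural': 0}
channel_map = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}

# map() only touches the named column; values missing from the dict become NaN
df['residence_area_type'] = df['residence_area_type'].map(area_map)
df['sourcing_channel'] = df['sourcing_channel'].map(channel_map)
print(df)
```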

In [38]:
#fill null values with the median of each column
real_X = Imputer(strategy='median').fit_transform(test.iloc[:,1:])
real_X

array([[  1.00000000e-03,   2.73840000e+04,   5.11500000e+04, ...,
          1.00000000e+00,   0.00000000e+00,   3.30000000e+03],
       [  1.24000000e-01,   2.37350000e+04,   2.85140000e+05, ...,
          1.00000000e+00,   1.00000000e+00,   1.17000000e+04],
       [  1.00000000e+00,   1.71700000e+04,   1.86030000e+05, ...,
          2.00000000e+00,   1.00000000e+00,   1.17000000e+04],
       ..., 
       [  3.25000000e-01,   1.82640000e+04,   1.96570000e+05, ...,
          2.00000000e+00,   0.00000000e+00,   1.20000000e+03],
       [  9.83000000e-01,   1.67980000e+04,   2.12150000e+05, ...,
          1.00000000e+00,   1.00000000e+00,   1.38000000e+04],
       [  5.00000000e-03,   2.44670000e+04,   8.91900000e+04, ...,
          1.00000000e+00,   1.00000000e+00,   5.40000000e+03]])
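
The cell above computes the medians on the test set itself. One alternative, which is not what this notebook does, is to learn the medians on the training features and reuse them for the test set. A minimal sketch on made-up arrays, assuming a scikit-learn version that provides `SimpleImputer` (the successor to `Imputer`):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative arrays standing in for the training and test features
X_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_demo = np.array([[np.nan, 50.0], [7.0, np.nan]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_demo)                        # medians learned from training data only
X_test_filled = imputer.transform(X_test_demo)   # the same medians fill the test gaps
print(X_test_filled)
```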

In [39]:
#fit on entire training set
best_gbc.fit(X,y)

#predict on actual test set
real_y = best_gbc.predict(real_X)

In [40]:
#get renewal probability
real_proba_y = best_gbc.predict_proba(real_X)

In [41]:
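#combine the test set with the predicted class ('pred') and the predicted
#renewal probability (named 'renewal' for the submission file)
real_pred_df = pd.DataFrame(np.concatenate([test, real_y.reshape(-1,1), real_proba_y[:,1].reshape(-1,1)], axis=1))
col = list(test.columns) + ['pred', 'renewal']
real_pred_df.columns = col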
real_pred_df.head()

Unnamed: 0,id,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,pred,renewal
0,649.0,0.001,27384.0,51150.0,0.0,0.0,0.0,99.89,7.0,1.0,0.0,3300.0,1.0,0.989114
1,81136.0,0.124,23735.0,285140.0,0.0,0.0,0.0,98.93,19.0,1.0,1.0,11700.0,1.0,0.985131
2,70762.0,1.0,17170.0,186030.0,0.0,0.0,0.0,,2.0,2.0,1.0,11700.0,1.0,0.815035
3,53935.0,0.198,16068.0,123540.0,0.0,0.0,0.0,99.0,11.0,2.0,0.0,5400.0,1.0,0.977374
4,15476.0,0.041,10591.0,200020.0,1.0,0.0,0.0,99.17,14.0,1.0,0.0,9600.0,1.0,0.968635


In [42]:
real_pred_df['incentives'] = real_pred_df[['renewal','premium']].apply(incentives, axis=1)

In [43]:
real_pred_df.head(20)

Unnamed: 0,id,perc_premium_paid_by_cash_credit,age_in_days,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,application_underwriting_score,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,pred,renewal,incentives
0,649.0,0.001,27384.0,51150.0,0.0,0.0,0.0,99.89,7.0,1.0,0.0,3300.0,1.0,0.989114,1536.982905
1,81136.0,0.124,23735.0,285140.0,0.0,0.0,0.0,98.93,19.0,1.0,1.0,11700.0,1.0,0.985131,2030.897168
2,70762.0,1.0,17170.0,186030.0,0.0,0.0,0.0,,2.0,2.0,1.0,11700.0,1.0,0.815035,2030.718862
3,53935.0,0.198,16068.0,123540.0,0.0,0.0,0.0,99.0,11.0,2.0,0.0,5400.0,1.0,0.977374,1726.925139
4,15476.0,0.041,10591.0,200020.0,1.0,0.0,0.0,99.17,14.0,1.0,0.0,9600.0,1.0,0.968635,1952.477005
5,64797.0,0.112,16065.0,245130.0,0.0,0.0,0.0,98.88,7.0,2.0,0.0,11700.0,1.0,0.985623,2030.397351
6,67412.0,0.325,17167.0,388050.0,6.0,0.0,0.0,99.0,27.0,3.0,1.0,3300.0,1.0,0.841443,1536.708842
7,44241.0,0.362,20085.0,99880.0,3.0,0.0,0.0,98.4,14.0,1.0,1.0,5400.0,1.0,0.760548,1727.268036
8,5069.0,0.115,23008.0,360040.0,0.0,0.0,0.0,99.88,6.0,4.0,0.0,13800.0,1.0,0.989843,2096.304463
9,16615.0,0.021,21906.0,510040.0,0.0,0.0,0.0,99.89,11.0,2.0,1.0,28500.0,1.0,0.989289,2385.386671


In [44]:
real_pred_df.shape

(34224, 15)

In [45]:
#save to csv for submission
submit = real_pred_df[['id','renewal','incentives']]
submit.to_csv('submission_3_23072018.csv', index=False)