## Homesite Quote Conversion
------------------------------------------

### Which customers will purchase a quoted insurance plan? [Kaggle - Homesite Quote Conversion](https://www.kaggle.com/c/homesite-quote-conversion)

![alt text](homesite.png "homesite") Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. [Homesite](https://homesite.com/), a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. 

Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. 

### Benchmarking XGBoost with Raw Data - No Feature Extraction

In [21]:
import pandas as pd
import numpy as np
import xgboost as xgb

In [22]:
# Load training and test data
train_df = pd.read_csv('data/raw_train.csv')
test_df = pd.read_csv('data/raw_test.csv')

In [23]:
# Drop QuoteNumber: it is a row identifier, so it is not useful for analysis or prediction
test_df.drop(['QuoteNumber'], axis=1, inplace=True)
train_df.drop(['QuoteNumber'], axis=1, inplace=True)

In [24]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260753 entries, 0 to 260752
Columns: 298 entries, Original_Quote_Date to GeographicField64
dtypes: float64(6), int64(264), object(28)
memory usage: 592.8+ MB


In [25]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173836 entries, 0 to 173835
Columns: 297 entries, Original_Quote_Date to GeographicField64
dtypes: float64(6), int64(263), object(28)
memory usage: 393.9+ MB


**Basic Date Manipulations**

In [26]:
# Parse the quote date so that Year, Month, and Weekday can be derived from it
train_df['Date'] = pd.to_datetime(train_df['Original_Quote_Date'])

In [27]:
train_df[['Date']].head(1)

| | Date |
|---|---|
| 0 | 2013-08-16 |


In [28]:
train_df['Year'] = train_df['Date'].dt.year
train_df['Month'] = train_df['Date'].dt.month
train_df['Weekday'] = train_df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6

In [29]:
train_df[['Date','Year', 'Month', 'Weekday']].head(1)

| | Date | Year | Month | Weekday |
|---|---|---|---|---|
| 0 | 2013-08-16 | 2013 | 8 | 4 |


In [30]:
# Apply the same date features to the test set
test_df['Date'] = pd.to_datetime(test_df['Original_Quote_Date'])
test_df['Year'] = test_df['Date'].dt.year
test_df['Month'] = test_df['Date'].dt.month
test_df['Weekday'] = test_df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6

In [31]:
train_df.drop(['Original_Quote_Date', 'Date'], axis=1, inplace=True)
test_df.drop(['Original_Quote_Date', 'Date'], axis=1, inplace=True)
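
The derived date features can be checked on a toy frame. This is a minimal sketch assuming only pandas; the frame `toy` is hypothetical, and its single date is the first training row shown in the output above.

```python
import pandas as pd

# Hypothetical one-row frame; the date matches the first training row shown above.
toy = pd.DataFrame({'Original_Quote_Date': ['2013-08-16']})
toy['Date'] = pd.to_datetime(toy['Original_Quote_Date'])

# The .dt accessor exposes calendar parts of a datetime column.
toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
toy['Weekday'] = toy['Date'].dt.dayofweek  # Monday=0, so a Friday is 4

print(toy[['Year', 'Month', 'Weekday']].iloc[0].tolist())
```

The `Weekday` value of 4 corresponds to a Friday, because `dayofweek` counts from Monday as 0.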

**Convert Categorical Values to Numerical Values for XGBoost to work**

In [32]:
# Some columns hold non-numerical values (i.e. dtype='object'); collect their names.
categorical = [f for f in train_df.columns if train_df[f].dtype == 'object']

In [33]:
', '.join(categorical)

'Field6, Field10, Field12, CoverageField8, CoverageField9, SalesField7, PersonalField7, PersonalField16, PersonalField17, PersonalField18, PersonalField19, PropertyField3, PropertyField4, PropertyField5, PropertyField7, PropertyField14, PropertyField28, PropertyField30, PropertyField31, PropertyField32, PropertyField33, PropertyField34, PropertyField36, PropertyField37, PropertyField38, GeographicField63, GeographicField64'

In [34]:
# Map each distinct non-numerical value in a column to a unique integer.
# The encoder is fit on the union of training and test values so both sets share one mapping.
from sklearn import preprocessing

for f in train_df.columns:
    if train_df[f].dtype=='object':
        lbl_encoder = preprocessing.LabelEncoder()
        lbl_encoder.fit(np.unique(list(train_df[f].values) + list(test_df[f].values)))
        train_df[f] = lbl_encoder.transform(list(train_df[f].values))
        test_df[f] = lbl_encoder.transform(list(test_df[f].values))
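
To illustrate why the encoder is fit on the union of both sets, here is a small sketch on made-up data. The frames `toy_train` and `toy_test`, their values, and the `'MISSING'` placeholder are all assumptions for illustration; the source does not show whether the real object columns contain missing values, and filling them explicitly is just one way to make that case visible.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns: 'C' occurs only in the test set and one value is missing.
toy_train = pd.DataFrame({'Field6': ['A', 'B', None, 'A']})
toy_test = pd.DataFrame({'Field6': ['B', 'C', 'A']})

# Replace missing values with an explicit placeholder label (an assumption, not the notebook's method).
train_vals = toy_train['Field6'].fillna('MISSING').tolist()
test_vals = toy_test['Field6'].fillna('MISSING').tolist()

# Fit on the union so a value seen only in the test set still receives a code.
encoder = LabelEncoder()
encoder.fit(train_vals + test_vals)

toy_train['Field6'] = encoder.transform(train_vals)
toy_test['Field6'] = encoder.transform(test_vals)

print(list(encoder.classes_))
print(toy_train['Field6'].tolist(), toy_test['Field6'].tolist())
```

Had the encoder been fit on `toy_train` alone, transforming `toy_test` would raise an error on the unseen value `'C'`. Note that the integer codes follow the sorted order of the labels and carry no meaning of their own.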

In [35]:
# define training and testing sets
y_train = train_df['QuoteConversion_Flag']
X_train = train_df.drop('QuoteConversion_Flag', axis=1)
X_test  = test_df.copy()
X_test = X_test[X_train.columns.tolist()] # maintain same column order between train and test data

**Hyperparameter Search and Optimization using scikit-learn's GridSearchCV**

In [36]:
# Xgboost 
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', nthread=4, silent=True)

In [37]:
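# sklearn.grid_search was deprecated in scikit-learn 0.18 and removed in 0.20;
# GridSearchCV is available from sklearn.model_selection instead.
from sklearn.model_selection import GridSearchCV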
#param_grid = {'max_depth': [2,4,6,8,10],
#              'n_estimators': [50,100,200,500,1000],
#              'learning_rate': [0.1, 0.05, 0.02, 0.01],
#              'subsample': [0.9, 1.0],
#              'colsample_bytree': [0.8, 1.0]}

param_grid = {'max_depth': [4,6],
              'n_estimators': [200,500],
              'learning_rate': [0.1, 0.01],
              'subsample': [0.9],
              'colsample_bytree': [0.8]}

In [38]:
gs = GridSearchCV(xgb_clf,
                  param_grid,
                  scoring='roc_auc',
                  cv=5,
                  n_jobs=4,
                  verbose=1)

In [39]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=4)]: Done  40 out of  40 | elapsed: 179.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=4,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=4,
       param_grid={'n_estimators': [200, 500], 'subsample': [0.9], 'learning_rate': [0.1, 0.01], 'colsample_bytree': [0.8], 'max_depth': [4, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=1)
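
The "8 candidates, 40 fits" figure in the log follows directly from the grid: an exhaustive search tries every combination of the listed values, once per fold. The sketch below counts the combinations for the grid that was run and for the larger commented-out grid; the names `reduced_grid`, `full_grid`, and `n_candidates` are introduced here for illustration only.

```python
from itertools import product

# The grid that was actually searched.
reduced_grid = {'max_depth': [4, 6],
                'n_estimators': [200, 500],
                'learning_rate': [0.1, 0.01],
                'subsample': [0.9],
                'colsample_bytree': [0.8]}

# The larger grid that is commented out in the notebook.
full_grid = {'max_depth': [2, 4, 6, 8, 10],
             'n_estimators': [50, 100, 200, 500, 1000],
             'learning_rate': [0.1, 0.05, 0.02, 0.01],
             'subsample': [0.9, 1.0],
             'colsample_bytree': [0.8, 1.0]}

def n_candidates(grid):
    """Count the parameter combinations an exhaustive grid search would evaluate."""
    return len(list(product(*grid.values())))

cv_folds = 5
for name, grid in [('reduced', reduced_grid), ('full', full_grid)]:
    candidates = n_candidates(grid)
    print(name, candidates, 'candidates,', candidates * cv_folds, 'fits')
```

The full grid would need 400 candidates and 2000 fits, fifty times as many as the reduced grid, which is a plausible reason for trimming it given that the 40 fits above took about three hours.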

**Selecting best classifier and predictions**

In [40]:
gs.best_score_, gs.best_params_

(0.96629241179497771,
 {'colsample_bytree': 0.8,
  'learning_rate': 0.1,
  'max_depth': 4,
  'n_estimators': 500,
  'subsample': 0.9})
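
Beyond the single best score, it can help to compare all candidates. The sketch below is an illustration on synthetic data, not the notebook's run: it assumes a scikit-learn version that provides `sklearn.model_selection` and the `cv_results_` attribute, and it uses `LogisticRegression` as a fast stand-in for the XGBoost classifier. The names `X_toy`, `y_toy`, `toy_gs`, and `ranked` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the real training set.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# A small search scored with ROC AUC over 5 folds, mirroring the setup above.
toy_gs = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1.0, 10.0]},
                      scoring='roc_auc',
                      cv=5)
toy_gs.fit(X_toy, y_toy)

# cv_results_ holds one row per candidate; sort by rank to see how close the runners-up are.
results = pd.DataFrame(toy_gs.cv_results_)
ranked = results.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score']]
print(ranked)
```

Looking at `std_test_score` next to `mean_test_score` indicates whether the gap between the top candidates is larger than the fold-to-fold variation.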

In [41]:
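# With refit=True (see the GridSearchCV output above), best_estimator_ has already been
# refit on the full training set with the best parameters, so no further fit is needed.
clf = gs.best_estimator_
y_pred_proba = clf.predict_proba(X_test)[:,1]  # probability of the positive class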

In [42]:
# Create submission
# This relies on sample_submission.csv listing quotes in the same row order as raw_test.csv
preds_out = pd.read_csv('data/sample_submission.csv')
preds_out['QuoteConversion_Flag'] = y_pred_proba
preds_out.head(10)

| | QuoteNumber | QuoteConversion_Flag |
|---|---|---|
| 0 | 3 | 0.000487 |
| 1 | 5 | 0.035534 |
| 2 | 7 | 0.030583 |
| 3 | 9 | 0.006038 |
| 4 | 10 | 0.237182 |
| 5 | 11 | 0.01973 |
| 6 | 15 | 5.8e-05 |
| 7 | 16 | 0.029652 |
| 8 | 17 | 2.9e-05 |
| 9 | 21 | 3e-06 |


In [43]:
preds_out.to_csv('homesite_xgb_benchmark.csv', index=False)
print('Done')
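
The submission cell pairs predictions with the ids in `sample_submission.csv` purely by row position. A sketch of an alternative is to carry the ids alongside the predictions and run a few sanity checks before writing the file. In the notebook this would mean keeping `test_df['QuoteNumber']` before it is dropped; here `quote_numbers` and `proba` are hypothetical stand-ins filled with the first three rows of the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the saved test ids and the predicted probabilities.
quote_numbers = pd.Series([3, 5, 7], name='QuoteNumber')
proba = np.array([0.000487, 0.035534, 0.030583])

# Build the submission from the ids themselves rather than from row position in another file.
submission = pd.DataFrame({'QuoteNumber': quote_numbers,
                           'QuoteConversion_Flag': proba})

# Basic sanity checks before writing the CSV.
assert len(submission) == len(proba)
assert submission['QuoteConversion_Flag'].between(0, 1).all()
assert submission['QuoteNumber'].is_unique

print(submission)
```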