## Homesite Quote Conversion

### Which customers will purchase a quoted insurance plan? [Kaggle - Homesite Quote Conversion](https://www.kaggle.com/c/homesite-quote-conversion)

Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. [Homesite](https://homesite.com/), a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. 

Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. 

## Stacking & Blending Different Classifiers - Creating a Meta Classifier with Logistic Regression

I learnt this technique from the [Kaggle Ensembling Guide](http://mlwave.com/kaggle-ensembling-guide/) and from this [Kaggle post](http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950).

**Explanation:** In each cross-validation fold, train every level 0 classifier on the training part of the fold and predict on the held-out validation part. Because each training row falls in the validation part of exactly one fold, these out-of-fold predictions fill one 'meta-variable' column per classifier, and together the columns form the training data for the next-stage LogisticRegression meta-classifier. Within the same fold, also use each fitted sub-model to predict on the entire test set. After all folds are done, average each classifier's test predictions across the folds and stack the averages side by side, column-wise; this is the test data for the next-stage LogisticRegression meta-classifier.
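
The procedure described above can be sketched end to end on toy data. This is a minimal illustration rather than the notebook's actual pipeline: the synthetic dataset, the two small level 0 models, and all variable names (`meta_train`, `meta_test`, `folds`) are assumptions made for the example, and it uses the `sklearn.model_selection` API from newer scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the Homesite features)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, y_tr, X_te = X[:200], y[:200], X[200:]

# Two small level 0 models, chosen only to keep the sketch fast
models = [DecisionTreeClassifier(max_depth=3, random_state=0),
          LogisticRegression(max_iter=1000)]
folds = list(StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(X_tr, y_tr))

meta_train = np.zeros((X_tr.shape[0], len(models)))  # out-of-fold predictions
meta_test = np.zeros((X_te.shape[0], len(models)))   # fold-averaged test predictions

for j, model in enumerate(models):
    test_preds = np.zeros((X_te.shape[0], len(folds)))
    for k, (fit_idx, val_idx) in enumerate(folds):
        model.fit(X_tr[fit_idx], y_tr[fit_idx])
        # Each training row is predicted by a model that never saw it
        meta_train[val_idx, j] = model.predict_proba(X_tr[val_idx])[:, 1]
        test_preds[:, k] = model.predict_proba(X_te)[:, 1]
    # Average this model's test predictions across the folds
    meta_test[:, j] = test_preds.mean(axis=1)

# Level 1: logistic regression trained on the meta-features
meta_clf = LogisticRegression().fit(meta_train, y_tr)
final_proba = meta_clf.predict_proba(meta_test)[:, 1]
print(meta_train.shape, meta_test.shape, final_proba.shape)
```

The key property is that `meta_train` never contains a prediction made by a model that was fitted on the same row, which keeps the meta-classifier from learning on leaked labels.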

In [21]:
import pandas as pd
import numpy as np

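# Note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions
# provide StratifiedKFold in sklearn.model_selection with a different signature
from sklearn.cross_validation import StratifiedKFold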
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

**Loading Feature Extracted Cleaned Data**

In [22]:
train_df = pd.read_csv('data/feature_extracted_train.csv')
test_df = pd.read_csv('data/feature_extracted_test.csv')

In [23]:
# define training and testing sets
train_label = train_df['QuoteConversion_Flag']
trainset = train_df.drop('QuoteConversion_Flag', axis=1)
testset = test_df.copy()
testset = testset[trainset.columns.tolist()] # maintain same column order between train and test data
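
Aligning the column order matters because estimators fed plain arrays match features by position, not by name. A small sketch of the same idea with an explicit check for missing columns; the toy frames and their column names are hypothetical:

```python
import pandas as pd

# Toy frames with hypothetical column names, deliberately in different orders
train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'target': [0, 1]})
test = pd.DataFrame({'b': [5, 6], 'a': [7, 8]})

features = train.drop('target', axis=1)

# Fail early if the test data lacks a training feature
missing = set(features.columns) - set(test.columns)
assert not missing, 'test data is missing columns: %s' % sorted(missing)

# Reorder the test columns to match the training features
test_aligned = test[features.columns.tolist()]
print(test_aligned.columns.tolist())
```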

In [24]:
n_folds = 4
n_threads = 4

**Create a Blend of Classifiers**

In [25]:
# Level 0 classifiers
clfs = [RandomForestClassifier(n_estimators=100, n_jobs=n_threads, criterion='gini'),
        RandomForestClassifier(n_estimators=100, n_jobs=n_threads, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=n_threads, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=n_threads, criterion='entropy'),
        GradientBoostingClassifier(n_estimators=50, learning_rate=0.05, subsample=0.5, max_depth=6)]

In [26]:
# Stratified, shuffled cross-validation folds, materialised as a list so they can be reused for every classifier
skf = list(StratifiedKFold(train_label, n_folds, shuffle=True))
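
The `sklearn.cross_validation` module used in this notebook was deprecated in scikit-learn 0.18 and removed in 0.20. On newer releases, the same list of `(train_index, valid_index)` pairs can be built with `sklearn.model_selection.StratifiedKFold`, which takes the fold count in its constructor and the labels in `split()`. A minimal sketch on toy labels (the `labels` and `features` arrays are stand-ins, not the Homesite data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels standing in for train_label (hypothetical data)
labels = np.array([0] * 8 + [1] * 4)
features = np.zeros((labels.shape[0], 1))  # split() uses X only for its length

n_folds = 4
splitter = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
skf = list(splitter.split(features, labels))

for train_index, valid_index in skf:
    # Stratification keeps the class ratio roughly equal in every fold
    print(len(train_index), len(valid_index), labels[valid_index].sum())
```

Passing `random_state` makes the shuffled folds reproducible, which is useful when comparing blends across runs.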

In [27]:
# Pre-allocate the blended train and test sets: one row per sample, one column per classifier
blend_train = np.zeros((trainset.shape[0], len(clfs)))
blend_test = np.zeros((testset.shape[0], len(clfs)))

**Stack Predictions of Classifiers**

In [28]:
# Fit each classifier once per fold (len(skf) times)
for clf_index, clf in enumerate(clfs):
    print('Training classifier [%s]' % (clf_index + 1))
    print(clf)

    # One column of test-set predictions per fold; the columns are averaged after the fold loop
    blend_test_j = np.zeros((testset.shape[0], len(skf)))

    for fold_index, (train_index, valid_index) in enumerate(skf):
        print('Fold [%s]' % (fold_index + 1))

        # Cross validation training and validation set
        X_train = trainset.iloc[train_index]
        y_train = train_label.iloc[train_index]
        X_valid = trainset.iloc[valid_index]
        y_valid = train_label.iloc[valid_index]

        clf.fit(X_train, y_train)

        # The out-of-fold predictions fill this classifier's meta-feature column,
        # which the blending classifier is later trained on
        blend_train[valid_index, clf_index] = clf.predict_proba(X_valid)[:, 1]
        blend_test_j[:, fold_index] = clf.predict_proba(testset)[:, 1]

    # Average the test-set predictions across the folds. Each column of blend_test is now a meta-feature
    blend_test[:, clf_index] = blend_test_j.mean(axis=1)

    # Alternative: instead of averaging the per-fold test predictions, refit the
    # level 0 estimator on the full training data and predict the test set once
    #clf.fit(trainset, train_label)
    #blend_test[:, clf_index] = clf.predict_proba(testset)[:, 1]

print('')

**Blend Predictions of Classifiers using Logistic Regression**

In [29]:
print('Blending using LogisticRegression')
bclf = LogisticRegression()
bclf.fit(blend_train, train_label)
y_pred_proba = bclf.predict_proba(blend_test)[:, 1]
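
Before trusting the blend, it can help to score each meta-feature column on its own and to look at the weights the blender assigns. The sketch below assumes ROC AUC is the metric of interest and uses synthetic out-of-fold predictions; `meta`, `y`, and `blender` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=500)

# Hypothetical out-of-fold predictions: two noisy views of the label
meta = np.column_stack([
    np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(0, 0.25, 500), 0, 1),
    np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(0, 0.25, 500), 0, 1),
])

# Out-of-fold AUC of each level 0 column taken on its own
aucs = [roc_auc_score(y, meta[:, j]) for j in range(meta.shape[1])]
for j, auc in enumerate(aucs):
    print('classifier %d OOF AUC: %.3f' % (j + 1, auc))

# The blender's coefficients show how much weight each column receives
blender = LogisticRegression().fit(meta, y)
print('blender weights:', blender.coef_.ravel())
```

With the notebook's own objects, the same check would read `roc_auc_score(train_label, blend_train[:, j])` for each column and `bclf.coef_` for the weights.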

In [30]:
print('Linear stretch of predictions to [0,1]')
y_pred_proba = (y_pred_proba - y_pred_proba.min()) / (y_pred_proba.max() - y_pred_proba.min())
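
The linear stretch is a min-max rescaling. It is strictly increasing, so it leaves the ranking of the predictions unchanged, and rank-based metrics such as ROC AUC are therefore unaffected; the stretched values should no longer be read as calibrated probabilities, though. A small check on made-up numbers:

```python
import numpy as np

# Made-up predictions for illustration
p = np.array([0.12, 0.40, 0.35, 0.81])

# Same min-max stretch as above; it would divide by zero if all predictions were equal
stretched = (p - p.min()) / (p.max() - p.min())

# The ordering of the samples is preserved
same_order = (np.argsort(p) == np.argsort(stretched)).all()
print(same_order, stretched.min(), stretched.max())
```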

In [31]:
print('Writing Final Submission File')
preds_out = pd.read_csv('data/sample_submission.csv')
preds_out['QuoteConversion_Flag'] = y_pred_proba

In [32]:
# preds_out = preds_out.set_index('QuoteNumber')
preds_out.to_csv('homesite_blended_RF_ET_GBM_with_FE_nan.csv', index=False, float_format='%0.9f')
print('Done')