# Kaggle Competition: [Homesite Quote Conversion](https://www.kaggle.com/quantify/homesite-quote-conversion)

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/4657/logos/front_page.png)

## Which customers will purchase a quoted insurance plan?

Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. Homesite, a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. 

Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. 

## Data

This dataset represents the activity of a large number of customers who are interested in buying policies from Homesite. Each QuoteNumber corresponds to a potential customer and the QuoteConversion_Flag indicates whether the customer purchased a policy.

The provided features are anonymized and provide a rich representation of the prospective customer and policy. They include specific coverage information, sales information, personal information, property information, and geographic information. Your task is to predict QuoteConversion_Flag for each QuoteNumber in the test set.

### File descriptions

- train.csv - the training set, contains QuoteConversion_Flag
- test.csv - the test set, does not contain QuoteConversion_Flag
- sample_submission.csv - a sample submission file in the correct format

In [1]:
# Deadline: 1 February 2016

from datetime import date
print("Number of days left: " + str(abs((date.today() - date(2016, 2, 1)).days)))

Number of days left: 34


## Import data

In [None]:
import pandas as pd
train_data = pd.read_csv('data/input/train.csv', sep=',', header=0, quoting=2, skip_blank_lines=True)
test_data = pd.read_csv('data/input/test.csv', sep=',')

train_data.head()

In [None]:
# Class balance: percentage of quotes per QuoteConversion_Flag value
print(train_data.QuoteConversion_Flag.value_counts(normalize=True) * 100)

## Data Cleaning

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

labels = train_data.QuoteConversion_Flag
test_ind = test_data.QuoteNumber

def pre_processing(data):
    # Parse the quote date and replace it with year, month, day and weekday features
    dates = pd.to_datetime(data['Original_Quote_Date'])
    data = data.drop('Original_Quote_Date', axis=1)
    data['year'] = dates.dt.year
    data['month'] = dates.dt.month
    data['day'] = dates.dt.day
    data['weekday'] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
        
    # Disabled: label-encode the string (object) columns.
    # The DictVectorizer cell handles string columns instead.
    # for f in data.columns:
    #     if data[f].dtype == 'object':
    #         lbl = preprocessing.LabelEncoder()
    #         data[f] = lbl.fit_transform(list(data[f].values))
    return data


train_data = pre_processing(train_data)
test_data = pre_processing(test_data)


train_data.drop(['QuoteNumber','QuoteConversion_Flag'], axis=1, inplace=True)
test_data.drop('QuoteNumber', axis=1, inplace=True)
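
A small check of what the date features look like, run on a toy frame rather than the competition data. Only the column name `Original_Quote_Date` comes from the notebook; the three dates are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for train.csv; the dates are invented examples.
toy = pd.DataFrame({"Original_Quote_Date": ["2013-08-16", "2014-04-22", "2015-01-01"]})

dates = pd.to_datetime(toy["Original_Quote_Date"])
date_features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(date_features)
```

The weekday column is the only feature that cannot be read directly off the date string, which is why it goes through the `.dt` accessor.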

In [None]:
from sklearn.feature_extraction import DictVectorizer
# One dictionary per row: {column name: value}
train = train_data.to_dict(orient='records')
test = test_data.to_dict(orient='records')

# Transform the list of dictionaries into a sparse matrix
vec = DictVectorizer()
train = vec.fit_transform(train)
test = vec.transform(test)
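
To make the vectorizing step concrete, here is a minimal sketch of what `DictVectorizer` does with mixed string and numeric values. The column names `channel` and `n_rooms` are hypothetical and do not come from the Homesite data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: one string-valued and one numeric column.
rows = [
    {"channel": "A", "n_rooms": 5},
    {"channel": "B", "n_rooms": 3},
    {"channel": "A", "n_rooms": 4},
]

vec_demo = DictVectorizer()
matrix = vec_demo.fit_transform(rows)  # scipy sparse matrix

# String values become one indicator column per (name, value) pair;
# numeric values are passed through unchanged.
print(vec_demo.feature_names_)
print(matrix.toarray())

# A category that was not seen during fit is silently ignored at transform time.
unseen = vec_demo.transform([{"channel": "Z", "n_rooms": 2}]).toarray()
print(unseen)
```

Because unseen categories are dropped rather than raising an error, fitting on the training set and only transforming the test set (as the cell above does) keeps the two matrices column-aligned.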

## XGBoost model

In [None]:
params = {}
params["silent"] = 1
params["objective"] = "binary:logistic"
params["eval_metric"] = "auc"
params["booster"] = "gbtree"
params["eta"] = 0.01
params["min_child_weight"] = 3
params["max_depth"] = 10
params["subsample"] = 0.8
params["colsample_bytree"] = 0.8
params["nthread"] = 4
#params["scale_pos_weight"] = 1

plst = list(params.items())
offset = 10000     # number of rows held out for validation

num_rounds = 1000  # maximum number of boosting rounds
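
The `eval_metric` above is set to `auc`. As a reminder of what that number means, AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is an illustrative NumPy implementation of that definition, not the routine XGBoost uses internally; `auc_score` is a hypothetical helper, and the pairwise matrix makes it suitable for small samples only.

```python
import numpy as np

def auc_score(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / float(pos.size * neg.size)

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y, p))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Since AUC depends only on the ordering of the scores, it is unaffected by the class imbalance printed earlier in the notebook.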

### Training

In [None]:
import xgboost as xgb

xgtest = xgb.DMatrix(test)
# Hold out the first `offset` rows for validation and train on the rest
xgtrain = xgb.DMatrix(train[offset:, :], label=labels[offset:])
xgval = xgb.DMatrix(train[:offset, :], label=labels[:offset])

evallist = [(xgtrain, 'train'), (xgval, 'val')]
model = xgb.train(plst, xgtrain, num_rounds, evallist, early_stopping_rounds=500)

In [None]:
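# Reverse the row order so that the hold-out becomes the last `offset` rows of the original data
r_train = train[::-1, :]
r_labels = labels[::-1]
# Create training and validation DMatrix for the reversed data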
r_xgtrain = xgb.DMatrix(r_train[offset:, :], label=r_labels[offset:])
r_xgval = xgb.DMatrix(r_train[:offset, :], label=r_labels[:offset])

evallist = [(r_xgtrain, 'r_train'), (r_xgval, 'r_val')]
r_model = xgb.train(plst, r_xgtrain, num_rounds, evallist, early_stopping_rounds=500)
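
The two training cells differ only in which rows are held out. A toy sketch with 10 rows and a hold-out of 3 (instead of the notebook's `offset = 10000`) shows the effect of reversing the data; the sizes here are illustrative assumptions.

```python
import numpy as np

n_rows, holdout = 10, 3  # toy sizes; the notebook uses offset = 10000
idx = np.arange(n_rows)

# First model: validate on the first `holdout` rows.
val_1, train_1 = idx[:holdout], idx[holdout:]

# Second model: reverse the rows, then apply the same slicing.
r_idx = idx[::-1]
val_2, train_2 = r_idx[:holdout], r_idx[holdout:]

print("model 1 validates on rows", val_1)
print("model 2 validates on rows", val_2)
```

Each model therefore sees the rows the other one held out, while the rows in the middle are used for training by both.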

### Prediction

In [None]:
preds = model.predict(xgtest, ntree_limit=model.best_iteration)
# Use each model's own best iteration
r_preds = r_model.predict(xgtest, ntree_limit=r_model.best_iteration)
# Weighted blend of the two models
avg_preds = 0.6 * preds + 0.4 * r_preds

preds = pd.DataFrame({"QuoteNumber": test_ind, "QuoteConversion_Flag": preds})
preds = preds.set_index('QuoteNumber')

r_preds = pd.DataFrame({"QuoteNumber": test_ind, "QuoteConversion_Flag": r_preds})
r_preds = r_preds.set_index('QuoteNumber')

avg_preds = pd.DataFrame({"QuoteNumber": test_ind, "QuoteConversion_Flag": avg_preds})
avg_preds = avg_preds.set_index('QuoteNumber')
avg_preds.to_csv('data/output/xgboost_withDate.csv')
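
The submission is a weighted average of the two models' probabilities. A minimal sketch of that blend follows; `blend` is a hypothetical helper, and the 0.6 default only mirrors the weight used in the cell above.

```python
import numpy as np

def blend(pred_a, pred_b, weight_a=0.6):
    """Convex combination of two prediction vectors."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

blended = blend([0.2, 0.9], [0.4, 0.7])
print(blended)
```

Because the weights sum to 1, the blended values stay within the range spanned by the two inputs, so they remain valid probabilities.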

## Done

1. XGBoost model
2. Multi-threaded process

## What's next

1. Cross validation
2. Handling missing values
3. Grid search over hyper-parameters
4. Stacked XGBoost models
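
For the cross-validation item, one possible starting point is a plain k-fold split of row indices. This is a sketch under assumptions, not part of the notebook: `kfold_indices` is a hypothetical helper, and the fold count and seed are arbitrary.

```python
import numpy as np

def kfold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition shuffled row indices."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(10, n_folds=5))
for train_idx, val_idx in splits:
    print(sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```

Each row lands in exactly one validation fold, so averaging the validation score over the folds uses all of the training data rather than a single 10000-row hold-out.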