
# Basic Overview
The objective is to see whether ensembling can squeeze some more predictive power out of logistic regression, random forest and
xgboost models.

Comments, criticisms and appreciation are all welcome. Do not be shy: send me an email at babinu@gmail.com!

Source of data : https://www.kaggle.com/c/titanic/data

In [105]:
import pandas as pd
import numpy as np
import os
import re

In [144]:
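# Raw string keeps '\.' as a regex escape; sorting makes the file order deterministic.
# Note: the ensemble file written at the end of this notebook also matches this pattern.
files = sorted(f for f in os.listdir('.') if re.match(r'kaggle_out.*\.csv', f))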
print(files)
count = 0
for csv_file in files:
    count += 1
    data_df = pd.read_csv(csv_file)
    survived_list = data_df['Survived'].values
    if count > 1:
        print("Correlation with previous list is {0:0.3g}".format(
            np.corrcoef(survived_list, prev_survived_list)[0,1]))
    prev_survived_list = survived_list

['kaggle_out_logistic_regression.csv', 'kaggle_out_random_forest.csv', 'kaggle_out_xgboost.csv']
Correlation with previous list is 0.711
Correlation with previous list is 0.725


### Comment
The correlation between the predictions is relatively high, which limits how much ensembling can help, but let us still go with ensembling and see what we get (for more details see https://mlwave.com/kaggle-ensembling-guide/).
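
The loop above only compares each file with the one before it, so the logistic regression and xgboost predictions are never compared directly. A minimal sketch of a full pairwise comparison follows. The helper name `pairwise_prediction_correlation` and the toy predictions are made up for illustration and are not the actual Kaggle outputs.

```python
import pandas as pd


def pairwise_prediction_correlation(predictions_by_model):
    """Return the Pearson correlation matrix between the models' 0/1 predictions.

    predictions_by_model maps a model name to its list of predictions,
    all given in the same passenger order.
    """
    return pd.DataFrame(predictions_by_model).corr()


# Toy predictions (hypothetical values, only to show the shape of the result).
toy_predictions = {
    'logistic_regression': [0, 1, 1, 0, 1, 0],
    'random_forest':       [0, 1, 0, 0, 1, 1],
    'xgboost':             [0, 1, 1, 0, 1, 1],
}

corr_matrix = pairwise_prediction_correlation(toy_predictions)
print(corr_matrix.round(3))
```

With the real files, the dictionary could be built from the `Survived` column of each CSV, provided every file lists the passengers in the same order.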

In [132]:
from collections import defaultdict
pid_to_survived_0 = defaultdict(int)
pid_to_survived_1 = defaultdict(int)


In [139]:
# Record each model's predictions and count the 0 and 1 votes per passenger.
pid_to_survived_lists = list()
for csv_file in files:
    data_df = pd.read_csv(csv_file)
    pid_to_survived = dict(zip(data_df['PassengerId'], data_df['Survived']))
    for (pid, survived) in pid_to_survived.items():
        if survived == 0:
            pid_to_survived_0[pid] += 1
        elif survived == 1:
            pid_to_survived_1[pid] += 1
        else:
            print("Unexpected value {0} for survived obtained for pid {1}".format(survived, pid))
    pid_to_survived_lists.append(pid_to_survived)


In [140]:
# Take the majority vote across the models as the ensembled prediction.
# (A tie would resolve to 1, but with three models a tie cannot occur.)
pid_to_ensembled_survived = dict()
for pid in pid_to_survived_lists[0].keys():
    num_zeros = pid_to_survived_0.get(pid, 0)
    num_ones = pid_to_survived_1.get(pid, 0)
    if num_zeros > num_ones:
        pid_to_ensembled_survived[pid] = 0
    else:
        pid_to_ensembled_survived[pid] = 1
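
The same majority vote can also be written without explicit counters. The sketch below is one possible vectorized version, not the code used for the results in this notebook. The function name `majority_vote`, the column names and the passenger IDs are hypothetical.

```python
import pandas as pd


def majority_vote(votes):
    """Return the majority 0/1 vote per row.

    votes is a DataFrame with one row per passenger and one column per model.
    The row mean is the fraction of models voting 1; a tie (only possible
    with an even number of models) resolves to 1, like the loop above.
    """
    return (votes.mean(axis=1) >= 0.5).astype(int)


# Toy votes for four made-up passenger IDs.
toy_votes = pd.DataFrame(
    {
        'logistic_regression': [0, 1, 1, 0],
        'random_forest':       [0, 0, 1, 1],
        'xgboost':             [1, 0, 1, 0],
    },
    index=pd.Index([892, 893, 894, 895], name='PassengerId'),
)

ensembled = majority_vote(toy_votes)
print(ensembled)
```

Indexing the frame by `PassengerId` keeps each vote tied to its passenger, which plays the role of the dictionary keys in the loop version.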


In [164]:
# Check against the out-of-sample data (the predictions to be submitted to Kaggle).
test_data = pd.read_csv("test_data_processed_correct.csv")
test_data['Predictions'] = test_data['PassengerId'].map(pid_to_ensembled_survived)
success_rate = (test_data['Survived'] == test_data['Predictions']).mean()
print("Success rate of model on test data is {0:0.3g}".format(success_rate))

Success rate of model on test data is 0.773


In [162]:
# We notice that the success rate is the same as when we used the xgboost model alone.
# Let us check whether the ensembled predictions are in fact identical to any single model's.
print("Number of passengers whose survivorship is to be predicted is {0}".format(len(pid_to_ensembled_survived)))

for i in range(len(pid_to_survived_lists)):

    relevant_prediction_dictionary = pid_to_survived_lists[i]

    shared_items = \
        {k : pid_to_ensembled_survived[k] for k in relevant_prediction_dictionary.keys() if \
         relevant_prediction_dictionary[k] == pid_to_ensembled_survived[k]}
    print("Number of entries with common prediction as that of the ensembled model is {0}".format(len(shared_items)))

Number of passengers whose survivorship is to be predicted is 418
Number of entries with common prediction as that of the ensembled model is 406
Number of entries with common prediction as that of the ensembled model is 375
Number of entries with common prediction as that of the ensembled model is 408
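
The counts above are easier to compare as fractions of the 418 passengers. The short sketch below does that arithmetic. It assumes the three counts follow the order of the file list printed earlier (logistic regression, random forest, xgboost), which is an inference from the loop order rather than something the output states.

```python
# Agreement counts copied from the output above; the mapping of counts to
# model names is an assumption based on the order of the file list.
total_passengers = 418
agreement_counts = {
    'logistic_regression': 406,
    'random_forest': 375,
    'xgboost': 408,
}

agreement_rates = {
    model: count / total_passengers
    for model, count in agreement_counts.items()
}

for model, rate in agreement_rates.items():
    disagreements = total_passengers - agreement_counts[model]
    print("{0}: agrees on {1:.1%}, differs on {2} passengers".format(
        model, rate, disagreements))
```

Under that assumption, no single model reproduces the ensemble exactly: even the closest one differs on 10 passengers.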


### Conclusion
The ensembled model does differ from each individual model on some passengers, even though its overall accuracy is the same as that of the xgboost model. Let us generate the predictions and submit them to Kaggle.

In [163]:
predictions_to_kaggle = test_data[['PassengerId', 'Predictions']].copy()
predictions_to_kaggle.rename(columns={'Predictions' : 'Survived'}, inplace=True)
predictions_to_kaggle.to_csv('kaggle_out_ensemble_xgboost_rft_logreg.csv', index=False)