
# Basic Overview
The objective is to see whether we can squeeze some more predictive power out of logistic regression, random forest and
xgboost models via ensembling.

Comments, criticisms and appreciation are all welcome. Don't be shy: send me an email at babinu@gmail.com!

Source of data : https://www.kaggle.com/c/titanic/data

In [91]:
import pandas as pd
import numpy as np
import os
import re

In [92]:
import seaborn as sns
import matplotlib.pyplot as plt

Comments:

The final model is an ensemble of the following 3 models, each of which is itself an ensemble:

Model #1. Ensemble of 5 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare

(b) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(c) Xgboost model on Parch, SibSp, Fare --> Generated so as to get a reasonable model with low correlation with other models.

(d) SVC on tickets field.

(e) SVC on names field.


Model #2. Ensemble of the following 3 models :


(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare


(b) Xgboost model on Parch, SibSp, Embarked --> Generated so as to get a reasonable model with low correlation with other models.


(c) SVC on names field.



Model #3. Ensemble of the following 3 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare


(b) SVC on tickets field.


(c) SVC on names field.


Other models worth a look:


Model #4. Ensemble of the following 3 models :

(a) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(b) SVC on tickets field.

(c) SVC on names field.



Model #5. Ensemble of the following 3 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare

(b) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(c) SVC on names field.


In [95]:
def populate_model_files_data(files):
    # Collect each model's 'Survived' predictions as one column per file.
    master_df = pd.DataFrame()
    for count, csv_file in enumerate(files, start=1):
        data_df = pd.read_csv(csv_file)
        column_name = 'Survived_or_not_model_' + str(count)

        master_df[column_name] = data_df['Survived'].values
        # Assumes every file lists the passengers in the same row order.
        master_df['PassengerId'] = data_df['PassengerId'].values
    return master_df

In [96]:
def display_corr_info(master_df, generate_corr_heat_map):
    relevant_cols = [col for col in master_df.columns if col not in ['PassengerId']]

    # Pairwise correlation between the models' 0/1 predictions.
    corr_df = master_df[relevant_cols].corr()
    display(corr_df)
    if generate_corr_heat_map:
        fig, ax = plt.subplots(1, 1, figsize=(16, 9))
        sns.heatmap(corr_df, ax=ax)
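
The model list above favours base models with low correlation to the others. For 0/1 predictions, a correlation close to 1 means two models almost always agree, so the second one adds little to a vote. The sketch below illustrates this on made-up predictions; the column names `model_a`, `model_b`, `model_c` and all values are assumptions for illustration, not taken from the model files.

```python
import pandas as pd

# Toy 0/1 predictions for six passengers (invented values).
toy_preds = pd.DataFrame({
    'model_a': [1, 0, 1, 0, 1, 0],
    'model_b': [1, 0, 1, 0, 1, 0],   # identical to model_a
    'model_c': [1, 1, 0, 0, 1, 0],   # disagrees with model_a on two rows
})

# Same call the notebook uses: Pearson correlation between columns.
corr_matrix = toy_preds.corr()
print(corr_matrix)

# Fraction of rows on which two models give the same prediction.
agreement_a_c = (toy_preds['model_a'] == toy_preds['model_c']).mean()
print("model_a and model_c agree on {0:0.3g} of rows".format(agreement_a_c))
```

Here `model_b` can never change the outcome of a vote that already includes `model_a`, whereas `model_c` sometimes can.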

In [97]:
def get_most_frequent_entry_3(a, b, c):
    sum_vals = a + b + c
    
    if sum_vals <= 1:
        frequent_val = 0
    else:
        frequent_val = 1
    return frequent_val

In [98]:
def get_most_frequent_entry_5(a, b, c, d, e):
    sum_vals = a + b + c + d + e
    
    if sum_vals <= 2:
        frequent_val = 0
    else:
        frequent_val = 1
    return frequent_val
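
The two helpers above hard-code the vote for 3 and 5 models. A minimal sketch of the same majority rule for any odd number of 0/1 prediction columns could look like the following; the name `majority_vote` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def majority_vote(predictions_df):
    """Row-wise majority vote over 0/1 prediction columns (hypothetical helper)."""
    num_models = predictions_df.shape[1]
    if num_models % 2 == 0:
        raise ValueError("Majority vote needs an odd number of models to avoid ties.")
    # A row is predicted 1 when more than half of the models vote 1.
    votes_for_one = predictions_df.sum(axis=1)
    return (votes_for_one > num_models // 2).astype(int)


# Toy predictions (invented values) using the notebook's column naming.
toy_df = pd.DataFrame({
    'Survived_or_not_model_1': [1, 0, 1, 0],
    'Survived_or_not_model_2': [1, 0, 0, 1],
    'Survived_or_not_model_3': [0, 0, 1, 1],
})
toy_votes = majority_vote(toy_df)
print(toy_votes.tolist())
```

With 3 columns the threshold is a sum above 1, and with 5 columns a sum above 2, which matches `get_most_frequent_entry_3` and `get_most_frequent_entry_5`.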

In [99]:
def update_ensembled_cols(master_df):
    
    # Decrease by 1 to account for PassengerId column.
    num_files = len(master_df.columns) - 1
    if num_files == 5:

        master_df['Survived_or_not_ensembled'] = master_df.apply(
            lambda x : get_most_frequent_entry_5(x['Survived_or_not_model_1'], 
                                                 x['Survived_or_not_model_2'], 
                                                 x['Survived_or_not_model_3'],                                        
                                                 x['Survived_or_not_model_4'],                                                                               
                                                 x['Survived_or_not_model_5']), axis=1)
    elif num_files == 3:
        master_df['Survived_or_not_ensembled'] = master_df.apply(
            lambda x : get_most_frequent_entry_3(x['Survived_or_not_model_1'], 
                                                 x['Survived_or_not_model_2'], 
                                                 x['Survived_or_not_model_3']), axis=1)    
    else:
        # Fail early instead of silently leaving the ensembled column undefined.
        raise ValueError("Majority vote is only implemented for 3 or 5 model files, got {0}".format(num_files))
    master_df.sort_values(by=['PassengerId'], inplace=True)

In [100]:
def evaluate_correctness_percentage(master_df):
    # Compare against the labelled out-of-sample data (the passengers whose predictions are submitted to Kaggle).
    test_data = pd.read_csv("../input/test_data_processed_correct.csv")
    # Assignment aligns on the index, so test_data must list passengers in the same row order as the model files.
    test_data['Predictions'] = master_df['Survived_or_not_ensembled']
    success_rate = (np.sum(test_data['Survived'] == test_data['Predictions'])/len(test_data))
    print("Success rate of model on test data is {0:0.3g}".format(success_rate))
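
The assignment in `evaluate_correctness_percentage` aligns on the DataFrame index, so it relies on the labelled file and the prediction files listing passengers in the same row order. A sketch of an order-independent comparison that merges on `PassengerId` might look like this; `accuracy_by_passenger_id` and the toy frames are hypothetical names and data used only for illustration.

```python
import pandas as pd


def accuracy_by_passenger_id(labels_df, predictions_df):
    """Fraction of correct predictions after matching rows on PassengerId (hypothetical helper)."""
    merged = labels_df.merge(predictions_df, on='PassengerId', how='inner',
                             validate='one_to_one')
    return (merged['Survived'] == merged['Survived_or_not_ensembled']).mean()


# Toy labels (invented values).
labels = pd.DataFrame({'PassengerId': [892, 893, 894, 895],
                       'Survived': [0, 1, 0, 1]})
# Same passengers, deliberately listed in a different order.
predictions = pd.DataFrame({'PassengerId': [895, 894, 893, 892],
                            'Survived_or_not_ensembled': [1, 1, 1, 0]})

toy_accuracy = accuracy_by_passenger_id(labels, predictions)
print("Success rate on toy data is {0:0.3g}".format(toy_accuracy))
```

The `validate='one_to_one'` argument makes the merge raise if either frame contains a repeated `PassengerId`, which would otherwise inflate the row count.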

In [101]:
def display_commonalities_stats(master_df, files):
    print("Number of passengers whose survivorship is to be predicted is {0}".format(len(master_df)))

    for i in range(len(files)):
        index = i + 1
        rel_csv_file = files[i]
        print("\nRel file is {0}".format(rel_csv_file))    
        single_model_prediction_col = 'Survived_or_not_model_' + str(index)
        num_common_entries = np.sum(master_df[single_model_prediction_col] == master_df['Survived_or_not_ensembled'])
        print("Number of entries with common prediction as that of the ensembled model is {0}".format(num_common_entries))    
    

In [102]:
def dump_predictions_to_csv(master_df, csv_file):
    predictions_to_kaggle = master_df[['PassengerId', 'Survived_or_not_ensembled']].copy()
    predictions_to_kaggle.rename(columns={'Survived_or_not_ensembled' : 'Survived'}, inplace=True)
    predictions_to_kaggle.to_csv(csv_file, index=False)    
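
Before uploading, it can help to sanity-check the frame that `dump_predictions_to_csv` writes. The sketch below is one possible check, assuming the submission should have exactly the columns `PassengerId` and `Survived`, one row per test passenger, and only 0/1 predictions; `check_submission_frame` is a hypothetical helper, and the toy frames are invented.

```python
import pandas as pd


def check_submission_frame(submission_df, expected_rows):
    """Return a list of problems found in a submission-style DataFrame (hypothetical helper)."""
    problems = []
    if list(submission_df.columns) != ['PassengerId', 'Survived']:
        problems.append("columns should be ['PassengerId', 'Survived']")
    if len(submission_df) != expected_rows:
        problems.append("expected {0} rows, found {1}".format(expected_rows, len(submission_df)))
    if 'PassengerId' in submission_df and submission_df['PassengerId'].duplicated().any():
        problems.append("duplicate PassengerId values")
    if 'Survived' in submission_df and not submission_df['Survived'].isin([0, 1]).all():
        problems.append("Survived should contain only 0 or 1")
    return problems


good = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
bad = pd.DataFrame({'PassengerId': [892, 892, 894], 'Survived': [0, 2, 0]})

good_problems = check_submission_frame(good, expected_rows=3)
bad_problems = check_submission_frame(bad, expected_rows=3)
print(good_problems)
print(bad_problems)
```

For the files generated in this notebook, `expected_rows` would be 418, the number of passengers reported in the outputs.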

In [103]:
def generate_ensembled_predictions_and_verify_results(files, generate_corr_heat_map=False, 
                                                      generate_csv=False, csv_file='temp.csv'):
    print(csv_file)
    master_df = populate_model_files_data(files)
    
    # Display correlation info amongst the models' predictions as a matrix and, optionally, as a heatmap.
    display_corr_info(master_df, generate_corr_heat_map)
    
    # The core routine for selecting the majority vote as the ensembled prediction.
    update_ensembled_cols(master_df)
    
    # How correct are the predictions obtained by ensembling?
    evaluate_correctness_percentage(master_df)
    
    # How often does each individual model agree with the ensembled prediction?
    display_commonalities_stats(master_df, files)
    
    if generate_csv:
        dump_predictions_to_csv(master_df, csv_file)


In [104]:
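# NOTE: `files` was defined by an earlier cell run that is not shown; the output below lists the files used.
generate_ensembled_predictions_and_verify_results(files)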

temp.csv


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.245163 | 0.750166 |
| Survived_or_not_model_2 | 0.245163 | 1.0 | 0.255624 |
| Survived_or_not_model_3 | 0.750166 | 0.255624 | 1.0 |


Success rate of model on test data is 0.794
Number of passengers whose survivorship is to be predicted is 418

Rel file is ../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv
Number of entries with common prediction as that of the ensembled model is 402

Rel file is ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries with common prediction as that of the ensembled model is 316

Rel file is ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries with common prediction as that of the ensembled model is 385


### Generate model 1

In [105]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv',  
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [106]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_1.csv')

kaggle_out_ensemble_model_1.csv


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 | Survived_or_not_model_4 | Survived_or_not_model_5 |
|---|---|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.822808 | 0.326308 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.822808 | 1.0 | 0.33898 | 0.245163 | 0.750166 |
| Survived_or_not_model_3 | 0.326308 | 0.33898 | 1.0 | 0.462658 | 0.321721 |
| Survived_or_not_model_4 | 0.183321 | 0.245163 | 0.462658 | 1.0 | 0.255624 |
| Survived_or_not_model_5 | 0.875787 | 0.750166 | 0.321721 | 0.255624 | 1.0 |


Success rate of model on test data is 0.804
Number of passengers whose survivorship is to be predicted is 418

Rel file is ../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv
Number of entries with common prediction as that of the ensembled model is 387

Rel file is ../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv
Number of entries with common prediction as that of the ensembled model is 402

Rel file is ../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv
Number of entries with common prediction as that of the ensembled model is 319

Rel file is ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries with common prediction as that of the ensembled model is 304

Rel file is ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries with common prediction as that of the ensembled model is 385


### Generate model 2

In [107]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_name_analysis     
        '../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [108]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_2.csv')

kaggle_out_ensemble_model_2.csv


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.248443 | 0.875787 |
| Survived_or_not_model_2 | 0.248443 | 1.0 | 0.292766 |
| Survived_or_not_model_3 | 0.875787 | 0.292766 | 1.0 |


Success rate of model on test data is 0.794
Number of passengers whose survivorship is to be predicted is 418

Rel file is ../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv
Number of entries with common prediction as that of the ensembled model is 403

Rel file is ../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv
Number of entries with common prediction as that of the ensembled model is 297

Rel file is ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries with common prediction as that of the ensembled model is 409


### Generate model 3

In [109]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [110]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_3.csv')

kaggle_out_ensemble_model_3.csv


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.183321 | 1.0 | 0.255624 |
| Survived_or_not_model_3 | 0.875787 | 0.255624 | 1.0 |


Success rate of model on test data is 0.799
Number of passengers whose survivorship is to be predicted is 418

Rel file is ../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv
Number of entries with common prediction as that of the ensembled model is 401

Rel file is ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries with common prediction as that of the ensembled model is 290

Rel file is ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries with common prediction as that of the ensembled model is 411


### Generate the model that is an ensemble of ensembles

In [111]:
files = ['kaggle_out_ensemble_model_1.csv',
         'kaggle_out_ensemble_model_2.csv',
         'kaggle_out_ensemble_model_3.csv']


In [112]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_ensemble_of_ensembles.csv')

kaggle_ensemble_of_ensembles.csv


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.847311 | 0.865445 |
| Survived_or_not_model_2 | 0.847311 | 1.0 | 0.937821 |
| Survived_or_not_model_3 | 0.865445 | 0.937821 | 1.0 |


Success rate of model on test data is 0.809
Number of passengers whose survivorship is to be predicted is 418

Rel file is kaggle_out_ensemble_model_1.csv
Number of entries with common prediction as that of the ensembled model is 396

Rel file is kaggle_out_ensemble_model_2.csv
Number of entries with common prediction as that of the ensembled model is 410

Rel file is kaggle_out_ensemble_model_3.csv
Number of entries with common prediction as that of the ensembled model is 414
