
# Basic Overview
The objective is to see whether ensembling can squeeze some more predictive power out of logistic regression, random forest and
xgboost models.

Comments, criticisms and appreciation are all welcome. Do not be shy: send me an email at babinu@gmail.com !

Source of data : https://www.kaggle.com/c/titanic/data

In [4]:
import pandas as pd
import numpy as np
import os
import re

In [5]:
import seaborn as sns
import matplotlib.pyplot as plt

Comments :

The ensemble of ensembles is built from the following 3 models :

Model #1. Ensemble of 5 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare

(b) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(c) Xgboost model on Parch, SibSp, Fare --> Generated so as to get a reasonable model with low correlation with other models.

(d) SVC on tickets field.

(e) SVC on names field.


Model #2. Ensemble of the following 3 models :


(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare


(b) Xgboost model on Parch, SibSp, Embarked --> Generated so as to get a reasonable model with low correlation with other models.


(c) SVC on names field.



Model #3. Ensemble of the following 3 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare


(b) SVC on tickets field.


(c) SVC on names field.


Other models worth a look :


Model #4. Ensemble of the following 3 models :

(a) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(b) SVC on tickets field.

(c) SVC on names field.



Model #5. Ensemble of the following 3 models :

(a) Final fine tuned xgboost with Sex,Pclass,Age,Fare

(b) Oscar's idea : xgboost model with Sex,Age,Pclass,Embarked

(c) SVC on names field.


In [6]:
def populate_model_files_data(files):
    master_df = pd.DataFrame()
    for count, csv_file in enumerate(files, start=1):
        data_df = pd.read_csv(csv_file)
        column_name = 'Survived_or_not_model_' + str(count)

        # Assumes every file lists the passengers in the same order.
        master_df[column_name] = data_df['Survived'].values
        master_df['PassengerId'] = data_df['PassengerId'].values
    return master_df
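
The loop above copies values positionally, so it relies on every CSV listing the passengers in the same order. A minimal sketch of an alternative that joins on `PassengerId` instead is shown here; the helper name `combine_predictions` and the toy frames are hypothetical and not part of the original notebook:

```python
import pandas as pd


def combine_predictions(frames):
    # Hypothetical helper: join per-model predictions on PassengerId
    # instead of relying on identical row order in every file.
    master_df = None
    for count, data_df in enumerate(frames, start=1):
        column_name = 'Survived_or_not_model_' + str(count)
        renamed = data_df[['PassengerId', 'Survived']].rename(columns={'Survived': column_name})
        master_df = renamed if master_df is None else master_df.merge(renamed, on='PassengerId')
    return master_df


# Made-up predictions for three passengers.
model_a = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
# Same passengers, different row order.
model_b = pd.DataFrame({'PassengerId': [894, 892, 893], 'Survived': [1, 0, 1]})
combined = combine_predictions([model_a, model_b])
print(combined)
```

With the default inner join, a passenger missing from any one file would be dropped, so a row-count check after combining would be a sensible addition.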

In [28]:
def display_corr_info(master_df, generate_corr_heat_map):
    relevant_cols = [col for col in master_df.columns if col not in ['PassengerId']]

    print("                              CORRELATION MATRIX OF MODEL OUTPUTS")
    display(master_df[relevant_cols].corr())
    if generate_corr_heat_map:
        fig, ax = plt.subplots(1, 1, figsize=(16, 9))
        sns.heatmap(master_df[relevant_cols].corr(), ax=ax)    
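
As a quick guide to reading the correlation matrix, the toy frame below (made-up predictions, not the real model outputs) shows that `DataFrame.corr()`, which uses Pearson correlation by default, gives 1 for two models that always agree and -1 for two that always disagree. Low values indicate predictions that are largely unrelated.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'Survived_or_not_model_1': [0, 1, 1, 0, 1, 0],
    'Survived_or_not_model_2': [0, 1, 1, 0, 1, 0],   # identical to model 1
    'Survived_or_not_model_3': [1, 0, 0, 1, 0, 1],   # always the opposite of model 1
})
corr = toy_df.corr()
print(corr.round(2))
```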

In [8]:
def get_most_frequent_entry_3(a, b, c):
    sum_vals = a + b + c
    
    if sum_vals <= 1:
        frequent_val = 0
    else:
        frequent_val = 1
    return frequent_val

In [9]:
def get_most_frequent_entry_5(a, b, c, d, e):
    sum_vals = a + b + c + d + e
    
    if sum_vals <= 2:
        frequent_val = 0
    else:
        frequent_val = 1
    return frequent_val

In [10]:
def get_most_frequent_entry_7(a, b, c, d, e, f, g):
    sum_vals = a + b + c + d  + e + f + g
    
    if sum_vals <= 3:
        frequent_val = 0
    else:
        frequent_val = 1
    return frequent_val

In [11]:
def update_ensembled_cols(master_df):
    
    # Decrease by 1 to account for PassengerId column.
    num_files = len(master_df.columns) - 1
    if num_files == 7:
        master_df['Survived_or_not_ensembled'] = master_df.apply(
            lambda x : get_most_frequent_entry_7(x['Survived_or_not_model_1'], 
                                                 x['Survived_or_not_model_2'], 
                                                 x['Survived_or_not_model_3'],                                        
                                                 x['Survived_or_not_model_4'],                                                                               
                                                 x['Survived_or_not_model_5'],
                                                 x['Survived_or_not_model_6'],
                                                 x['Survived_or_not_model_7']), axis=1)

    elif num_files == 5:

        master_df['Survived_or_not_ensembled'] = master_df.apply(
            lambda x : get_most_frequent_entry_5(x['Survived_or_not_model_1'], 
                                                 x['Survived_or_not_model_2'], 
                                                 x['Survived_or_not_model_3'],                                        
                                                 x['Survived_or_not_model_4'],                                                                               
                                                 x['Survived_or_not_model_5']), axis=1)
    elif num_files == 3:
        master_df['Survived_or_not_ensembled'] = master_df.apply(
            lambda x : get_most_frequent_entry_3(x['Survived_or_not_model_1'], 
                                                 x['Survived_or_not_model_2'], 
                                                 x['Survived_or_not_model_3']), axis=1)    
    master_df.sort_values(by=['PassengerId'], inplace=True)
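
The three `get_most_frequent_entry_*` helpers differ only in the number of arguments and the threshold. A minimal sketch of one generalized version follows, assuming each `Survived_or_not_model_*` column holds 0/1 predictions and the number of models is odd; the name `add_majority_vote` is hypothetical and not part of the original notebook:

```python
import pandas as pd


def add_majority_vote(master_df, prefix='Survived_or_not_model_'):
    # Hypothetical helper: majority vote over every column starting with `prefix`.
    model_cols = [col for col in master_df.columns if col.startswith(prefix)]
    if len(model_cols) % 2 == 0:
        raise ValueError("Need an odd number of models to avoid ties.")
    # A row is predicted 1 when more than half of the models predict 1.
    votes = master_df[model_cols].sum(axis=1)
    master_df['Survived_or_not_ensembled'] = (votes > len(model_cols) // 2).astype(int)
    return master_df


# Made-up predictions for three passengers.
toy_df = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Survived_or_not_model_1': [0, 1, 1],
    'Survived_or_not_model_2': [0, 1, 0],
    'Survived_or_not_model_3': [1, 1, 0],
})
add_majority_vote(toy_df)
print(toy_df['Survived_or_not_ensembled'].tolist())
```

Unlike the `if`/`elif` chain, this version raises an error for an unsupported number of models instead of silently skipping the ensembled column.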

In [12]:
def evaluate_correctness_percentage(master_df):
    # Check against out-of-sample data (to be submitted to Kaggle).
    test_data = pd.read_csv("../input/test_data_processed_correct.csv")
    test_data['Predictions'] = master_df['Survived_or_not_ensembled']
    success_rate = (np.sum(test_data['Survived'] == test_data['Predictions'])/len(test_data))
    print("Success rate of model on test data is {0:0.3g}".format(success_rate))

In [83]:
def display_commonalities_stats(master_df, files):
    
    print("                              COMMONALITY STATS\n")
    print("Number of entries to be predicted         : {0}".format(len(master_df)))

    for i in range(len(files)):
        index = i + 1
        rel_csv_file = files[i]
        print("\nRelevant model file                       : {0}".format(rel_csv_file))    
        single_model_prediction_col = 'Survived_or_not_model_' + str(index)
        num_common_entries = np.sum(master_df[single_model_prediction_col] == master_df['Survived_or_not_ensembled'])
        print("Number of entries common with final model : {0}".format(num_common_entries))    
    

In [14]:
def dump_predictions_to_csv(master_df, csv_file):
    predictions_to_kaggle = master_df[['PassengerId', 'Survived_or_not_ensembled']].copy()
    predictions_to_kaggle.rename(columns={'Survived_or_not_ensembled' : 'Survived'}, inplace=True)
    predictions_to_kaggle.to_csv(csv_file, index=False)    

In [18]:
def generate_ensembled_predictions_and_verify_results(files, generate_corr_heat_map=False, 
                                                      generate_csv=False, csv_file='temp.csv'):
    master_df = populate_model_files_data(files)
    
    # Display correlation info amongst predictors as a matrix as well as a heatmap
    display_corr_info(master_df, generate_corr_heat_map)
    
    # The core routine for selecting the majority vote as the ensembled prediction.
    update_ensembled_cols(master_df)
    
    # How correct are the predictions obtained by ensembling ?
    #evaluate_correctness_percentage(master_df)
    
    # How many predictions does each model share with the ensembled predictions ?
    display_commonalities_stats(master_df, files)
    
    if generate_csv:
        dump_predictions_to_csv(master_df, csv_file)


### Generate model 1

Ensemble of several models. The source notebook for each csv file is listed as a comment above it, for maximum clarity.

In [61]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv',  
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [84]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_1.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 | Survived_or_not_model_4 | Survived_or_not_model_5 |
|---|---|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.822808 | 0.326308 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.822808 | 1.0 | 0.33898 | 0.245163 | 0.750166 |
| Survived_or_not_model_3 | 0.326308 | 0.33898 | 1.0 | 0.462658 | 0.321721 |
| Survived_or_not_model_4 | 0.183321 | 0.245163 | 0.462658 | 1.0 | 0.255624 |
| Survived_or_not_model_5 | 0.875787 | 0.750166 | 0.321721 | 0.255624 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 387

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv
Number of entries common with final model : 402

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv
Number of entries common with final model : 319

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 304

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 385


### Generate model 2

In [85]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_name_analysis     
        '../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [86]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_2.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.248443 | 0.875787 |
| Survived_or_not_model_2 | 0.248443 | 1.0 | 0.292766 |
| Survived_or_not_model_3 | 0.875787 | 0.292766 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 403

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv
Number of entries common with final model : 297

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 409


### Generate model 3

In [87]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']


In [88]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_out_ensemble_model_3.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.183321 | 1.0 | 0.255624 |
| Survived_or_not_model_3 | 0.875787 | 0.255624 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 401

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 290

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 411


### Generate model that is ensemble of ensembles.

In [89]:
files = ['kaggle_out_ensemble_model_1.csv',
         'kaggle_out_ensemble_model_2.csv',
         'kaggle_out_ensemble_model_3.csv']


In [90]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_ensemble_of_ensembles.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.847311 | 0.865445 |
| Survived_or_not_model_2 | 0.847311 | 1.0 | 0.937821 |
| Survived_or_not_model_3 | 0.865445 | 0.937821 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : kaggle_out_ensemble_model_1.csv
Number of entries common with final model : 396

Relevant model file                       : kaggle_out_ensemble_model_2.csv
Number of entries common with final model : 410

Relevant model file                       : kaggle_out_ensemble_model_3.csv
Number of entries common with final model : 414


Comment : The purpose of doing this was just to record the earlier way in which I had scored 82.29%. 

Honestly, it looks like this was obtained more by luck than by anything else. I had just thrown in a lot of stuff until something worked.

Let us try more 'scientific' approaches below :


Let us see if we can have an ensemble of 7 models (including our earlier logistic regression and randomForest models)

### Generate ensembler with randomForest and Logistic regression models.

In [91]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv',  
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv',
        # Obtained by running ../first_attempts/random_forest_models.ipynb     
        '../first_attempts/kaggle_out_random_forest.csv',
        # Obtained by running ../first_attempts/logistic_regression_models.ipynb
        '../first_attempts/kaggle_out_logistic_regression.csv']


In [92]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=False, 
                                                  csv_file='kaggle_ensemble_of_ensembles.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 | Survived_or_not_model_4 | Survived_or_not_model_5 | Survived_or_not_model_6 | Survived_or_not_model_7 |
|---|---|---|---|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.822808 | 0.326308 | 0.183321 | 0.875787 | 0.730146 | 0.927266 |
| Survived_or_not_model_2 | 0.822808 | 1.0 | 0.33898 | 0.245163 | 0.750166 | 0.728044 | 0.780749 |
| Survived_or_not_model_3 | 0.326308 | 0.33898 | 1.0 | 0.462658 | 0.321721 | 0.339859 | 0.307886 |
| Survived_or_not_model_4 | 0.183321 | 0.245163 | 0.462658 | 1.0 | 0.255624 | 0.222948 | 0.165471 |
| Survived_or_not_model_5 | 0.875787 | 0.750166 | 0.321721 | 0.255624 | 1.0 | 0.680408 | 0.834635 |
| Survived_or_not_model_6 | 0.730146 | 0.728044 | 0.339859 | 0.222948 | 0.680408 | 1.0 | 0.710525 |
| Survived_or_not_model_7 | 0.927266 | 0.780749 | 0.307886 | 0.165471 | 0.834635 | 0.710525 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 400

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv
Number of entries common with final model : 397

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_parch_sibsp_fare.csv
Number of entries common with final model : 306

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 291

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 390

Relevant model file                       : ../first_attempts/kaggle_out_random_forest.csv
Number of entries common with final model : 381

Relevant model 

Comment : The problem here is that we have 7 models, of which only 3 are known to be really good on out-of-sample data. Since the majority is not known to be good, we do not expect an improvement here !

### Yet another ensembler on xgboost models using Cabin column

In [93]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']
 

In [94]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_diff_factors_ensemble.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 | Survived_or_not_model_4 | Survived_or_not_model_5 |
|---|---|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.152047 | 0.248443 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.152047 | 1.0 | 0.215749 | 0.29343 | 0.140695 |
| Survived_or_not_model_3 | 0.248443 | 0.215749 | 1.0 | 0.371325 | 0.292766 |
| Survived_or_not_model_4 | 0.183321 | 0.29343 | 0.371325 | 1.0 | 0.255624 |
| Survived_or_not_model_5 | 0.875787 | 0.140695 | 0.292766 | 0.255624 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 338

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv
Number of entries common with final model : 340

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv
Number of entries common with final model : 354

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 353

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 342


Comments : We do not get great results, as the poor models appear to dominate here. Let us try another version below.

### Another version, trying to make sure that we have 3 strong models.

In [95]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']
 

In [96]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_diff_factors_ensemble_v2.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 | Survived_or_not_model_4 | Survived_or_not_model_5 |
|---|---|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.822808 | 0.152047 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.822808 | 1.0 | 0.249679 | 0.245163 | 0.750166 |
| Survived_or_not_model_3 | 0.152047 | 0.249679 | 1.0 | 0.29343 | 0.140695 |
| Survived_or_not_model_4 | 0.183321 | 0.245163 | 0.29343 | 1.0 | 0.255624 |
| Survived_or_not_model_5 | 0.875787 | 0.750166 | 0.140695 | 0.255624 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 382

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_sex_pclass_embarked.csv
Number of entries common with final model : 409

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv
Number of entries common with final model : 304

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 309

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 378


Comments : We have a similar issue here: of the 5 models, only 2 are known to be very good. Since ensembling here is just a majority vote, we are giving the bad models free rein to dominate.
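
To make the concern concrete, here is a toy illustration with made-up votes (not taken from the actual model outputs): when three weaker models agree on a wrong answer, two correct models cannot overturn it in an unweighted majority vote. The helper name `majority_vote` is illustrative only.

```python
def majority_vote(votes):
    # Unweighted majority vote over an odd number of 0/1 predictions.
    return 1 if sum(votes) > len(votes) // 2 else 0


true_label = 1
strong_votes = [1, 1]      # two models assumed to be accurate
weak_votes = [0, 0, 0]     # three weaker models that happen to agree

ensembled = majority_vote(strong_votes + weak_votes)
print(ensembled == true_label)
```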

Let us get back to the drawing board with our first ensembler here.

In [98]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']
 

In [99]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_xgboost_best_tickets_name.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.183321 | 0.875787 |
| Survived_or_not_model_2 | 0.183321 | 1.0 | 0.255624 |
| Survived_or_not_model_3 | 0.875787 | 0.255624 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_my_attempt_detailed_fine_tuning.csv
Number of entries common with final model : 401

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 290

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 411


Comments : In this case, we are essentially using the tickets model as a yardstick to decide between the (Sex, Age, Pclass, Fare) model and the names model. Though the tickets model is not very accurate in itself, its low correlation with both of the other models makes it a reasonably good model for ensembling.
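
The "yardstick" reading can be checked directly: in a 3-model majority vote, whenever the first and third models agree their shared answer wins, and whenever they disagree the second model's vote decides. A small sketch that enumerates all eight vote combinations (the helper names are illustrative only):

```python
from itertools import product


def majority_of_three(a, b, c):
    # Same rule as get_most_frequent_entry_3: 1 when at least two votes are 1.
    return 1 if a + b + c >= 2 else 0


def tie_breaker_view(xgb_vote, tickets_vote, names_vote):
    # If the two highly correlated models agree, take their answer;
    # otherwise the tickets model decides.
    if xgb_vote == names_vote:
        return xgb_vote
    return tickets_vote


all_match = all(
    majority_of_three(a, b, c) == tie_breaker_view(a, b, c)
    for a, b, c in product([0, 1], repeat=3)
)
print(all_match)
```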

Let us try to get a better indicator, that is, a model which preserves the low correlation but tries to be more accurate.

We try to do that by creating an ensemble of the following :
1. Model on tickets.
2. Model on cabin.
3. Model on Parch, SibSp, Embarked.

Since we are not repeating any of the fields used in the current models, we hope to get a more refined, weakly correlated model that can be used in an ensembler.

NOTE : One may question the use of ensembling itself to come up with a weakly correlated model, and may instead suggest creating an xgboost model with tickets, cabin, Parch, SibSp and Embarked as its predictors. That is definitely a suggestion worth exploring, but I am not quite sure how to combine the SVC on the tickets field with the other predictors in an xgboost model.

### Getting a refined model, which is to be used for ensembling.


In [100]:
files = [
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv',    
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_tickets_svc.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea1_sex_pclass_embarked.ipynb
        '../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv']
 

In [101]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_ticket_cabin_parch_sibsp_embarked.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.29343 | 0.215749 |
| Survived_or_not_model_2 | 0.29343 | 1.0 | 0.371325 |
| Survived_or_not_model_3 | 0.215749 | 0.371325 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_cabin_null.csv
Number of entries common with final model : 356

Relevant model file                       : ../learn_from_oscar/kaggle_out_tickets_svc.csv
Number of entries common with final model : 379

Relevant model file                       : ../learn_from_oscar/kaggle_out_xgboost_parch_sibsp_embarked.csv
Number of entries common with final model : 362


Comments : As per Kaggle, we look to be getting a better indicator here (this is rather expected because of the decreased correlation amongst the ensemble components). Let us see how correlated this is with our other good models.

In [102]:
files = [
        # Obtained by running ../learn_from_aashita/xgboost_my_attempt_detailed_fine_tuning.ipynb
        '../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv',    
        'kaggle_ticket_cabin_parch_sibsp_embarked.csv',  
        # Obtained by running ../learn_from_oscar/xgboost_oscar_idea2_name_analysis     
        '../learn_from_oscar/kaggle_out_names_svc.csv']
 

In [103]:
generate_ensembled_predictions_and_verify_results(files, 
                                                  generate_corr_heat_map=False, 
                                                  generate_csv=True, 
                                                  csv_file='kaggle_sex_pclass_age_fare_names_refined_ensembler.csv')

                              CORRELATION MATRIX OF MODEL OUTPUTS


| | Survived_or_not_model_1 | Survived_or_not_model_2 | Survived_or_not_model_3 |
|---|---|---|---|
| Survived_or_not_model_1 | 1.0 | 0.266983 | 0.875787 |
| Survived_or_not_model_2 | 0.266983 | 1.0 | 0.30597 |
| Survived_or_not_model_3 | 0.875787 | 0.30597 | 1.0 |


                              COMMONALITY STATS

Number of entries to be predicted         : 418

Relevant model file                       : ../learn_from_aashita/kaggle_out_xgboost_sex_pclass_age_fare.csv
Number of entries common with final model : 404

Relevant model file                       : kaggle_ticket_cabin_parch_sibsp_embarked.csv
Number of entries common with final model : 302

Relevant model file                       : ../learn_from_oscar/kaggle_out_names_svc.csv
Number of entries common with final model : 408


Comment : At first glance, we look to be getting some benefit, though we cannot say for sure if this gives us a better result than the previous one.