In [1]:
import pandas as pd
import numpy as np
%autosave 60

Autosaving every 60 seconds


In [2]:
traindata = pd.read_csv('train.csv', index_col='PassengerId', header=0)
testdata = pd.read_csv('test.csv', index_col='PassengerId', header=0)
sample_submission = pd.read_csv('gender_submission.csv', index_col='PassengerId', header=0)
my_submission = sample_submission.copy()
my_submission['Survived'] = np.nan  # the np.NaN alias was removed in NumPy 2.0

Load all data into pandas DataFrames. To make sure my submission matches the required format, start from a copy of the sample submission with every prediction set to NaN (for now).
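A quick check of the template idea, as a minimal sketch: the two small frames below are invented stand-ins for test.csv and gender_submission.csv, so only the pattern (copy, blank out, compare indexes) carries over to the real files.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for test.csv and gender_submission.csv (values invented for illustration).
passenger_ids = pd.Index([892, 893, 894], name='PassengerId')
testdata = pd.DataFrame({'Pclass': [3, 1, 2]}, index=passenger_ids)
sample_submission = pd.DataFrame({'Survived': [0, 1, 0]}, index=passenger_ids)

# Same steps as the cell above: copy the sample, then blank out the predictions.
my_submission = sample_submission.copy()
my_submission['Survived'] = np.nan

# The template should cover exactly the test passengers, in the same order,
# and should not carry over any of the sample's predictions.
index_matches = my_submission.index.equals(testdata.index)
all_blank = my_submission['Survived'].isna().all()
print(index_matches, all_blank)
```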

In [3]:
traindata.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Naive Bayes Model (very simplified proof of concept)
Implement a Naive Bayes model to predict survival.<br><br>
The goal is to see whether I can implement a Naive Bayes model, not to do a rigorous analysis. I therefore consider only 3 factors that I know (from experience with this data set) to be highly predictive and that are presented very cleanly in the data (no feature generation required).<br><br>


survival_chance_given_Pclass3_female_under12 = <br>
\[(Pclass3_given_survival <br>\* female_given_survival <br>\* under12_given_survival)
\* global_chance_of_survival\] /<br> \[chance_Pclass3 \* chance_female \* chance_under12\]

The four cells below define these variables from df traindata.

In [22]:
survival = (traindata['Survived'] == 1)
global_chance_of_survival = survival.sum() / traindata.index.size
print('global_chance_of_survival: %s' % global_chance_of_survival)

global_chance_of_survival: 0.3838383838383838


In [28]:
Pclass3 = (traindata['Pclass'] == 3)
Pclass3_given_survival = (Pclass3 & survival).sum() / survival.sum()
print("Pclass3_given_survival: %s" % Pclass3_given_survival)
chance_Pclass3 = Pclass3.sum() / traindata.index.size
print("chance_Pclass3: %s" % chance_Pclass3)

Pclass3_given_survival: 0.347953216374269
chance_Pclass3: 0.5510662177328844


In [29]:
female = (traindata['Sex'] == 'female')
female_given_survival = (female & survival).sum() / survival.sum()
print("female_given_survival %s" % female_given_survival)
chance_female = female.sum() / traindata.index.size
print("chance_female %s" % chance_female)

female_given_survival 0.6812865497076024
chance_female 0.35241301907968575


In [30]:
under12 = (traindata['Age'] < 12.0)
under12_given_survival = (under12 & survival).sum() / survival.sum()
print("under12_given_survival: %s" % under12_given_survival)
chance_under12 = under12.sum() / traindata.index.size
print("chance_under12: %s" % chance_under12)

under12_given_survival: 0.11403508771929824
chance_under12: 0.07631874298540965
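As a sanity check, the formula can be evaluated by hand for a third-class girl under 12, using the frequencies printed above. This is only a sketch: the variable names are illustrative, and the numbers are copied from the cell outputs rather than recomputed from train.csv.

```python
# Prior: the global survival frequency printed above.
p_survival = 0.3838383838383838

# Trait frequencies among survivors, copied from the printed outputs.
p_pclass3_given_survival = 0.347953216374269
p_female_given_survival = 0.6812865497076024
p_under12_given_survival = 0.11403508771929824

# Trait frequencies over all passengers, copied from the printed outputs.
p_pclass3 = 0.5510662177328844
p_female = 0.35241301907968575
p_under12 = 0.07631874298540965

numerator = (p_pclass3_given_survival * p_female_given_survival
             * p_under12_given_survival * p_survival)
denominator = p_pclass3 * p_female * p_under12

estimate = numerator / denominator
print(estimate)  # about 0.70
```

A score of about 0.70 rounds to 1, so this profile would be predicted to survive.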


In [51]:
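def NaiveBayesPredict(row):
    """Score one passenger: P(traits | survival) * P(survival) / P(traits).

    Each factor is a training-set frequency, or its complement when the
    passenger does not have the trait. A missing Age compares as not under 12.
    """
    is_Pclass3 = row['Pclass'] == 3
    is_female = row['Sex'] == 'female'
    is_under12 = row['Age'] < 12.0

    # Numerator: frequency of each trait among survivors, times the prior.
    N1 = Pclass3_given_survival if is_Pclass3 else 1 - Pclass3_given_survival
    N2 = female_given_survival if is_female else 1 - female_given_survival
    N3 = under12_given_survival if is_under12 else 1 - under12_given_survival
    N4 = global_chance_of_survival

    # Denominator: overall frequency of each trait, treated as independent.
    D1 = chance_Pclass3 if is_Pclass3 else 1 - chance_Pclass3
    D2 = chance_female if is_female else 1 - chance_female
    D3 = chance_under12 if is_under12 else 1 - chance_under12

    # Because of the independence assumption in the denominator, the result
    # is a score rather than a true probability and can exceed 1.
    return ((N1*N2*N3)*N4)/(D1*D2*D3)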
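The function above divides by the product of the overall trait frequencies, which does not guarantee a result between 0 and 1. A common alternative, sketched below, computes the same kind of score for both classes and divides by their sum. This is not what this notebook does: the helper names (fit_class_stats, class_score, normalized_predict) are hypothetical, and the eight-row frame is invented toy data.

```python
import pandas as pd

def fit_class_stats(df, mask):
    """Class prior and trait frequencies within the rows selected by mask."""
    subset = df[mask]
    return {
        'prior': mask.mean(),
        'pclass3': (subset['Pclass'] == 3).mean(),
        'female': (subset['Sex'] == 'female').mean(),
        'under12': (subset['Age'] < 12.0).mean(),  # missing Age counts as not under 12
    }

def class_score(row, stats):
    """Prior times the product of the three per-trait likelihoods."""
    p1 = stats['pclass3'] if row['Pclass'] == 3 else 1 - stats['pclass3']
    p2 = stats['female'] if row['Sex'] == 'female' else 1 - stats['female']
    p3 = stats['under12'] if row['Age'] < 12.0 else 1 - stats['under12']
    return stats['prior'] * p1 * p2 * p3

def normalized_predict(row, survived_stats, died_stats):
    """Survival score divided by the sum of both class scores, so it lies in [0, 1]."""
    s = class_score(row, survived_stats)
    d = class_score(row, died_stats)
    return s / (s + d)

# Invented toy data with the same column names as train.csv.
toy = pd.DataFrame({
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
    'Pclass':   [1, 3, 2, 3, 3, 1, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female', 'male', 'female'],
    'Age':      [30.0, 5.0, 8.0, 40.0, None, 50.0, 10.0, 25.0],
})

survived = toy['Survived'] == 1
survived_stats = fit_class_stats(toy, survived)
died_stats = fit_class_stats(toy, ~survived)

probabilities = toy.apply(normalized_predict, axis=1, args=(survived_stats, died_stats))
print(probabilities)
```

With this normalization no clipping is needed before rounding, at the cost of estimating a second set of frequencies for the passengers who died.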

# Model test
To test this model, we generate a column "PredictedSurvival" in the original training data set.
How well does our model predict survival on the training set?
The output value of the cell is the prediction accuracy.

In [53]:
traindata['PredictedSurvival'] = traindata.apply(NaiveBayesPredict, axis=1).round(0).astype(np.int64)
print(traindata[['Survived', 'PredictedSurvival']].head())
print('')
sum(traindata['Survived'] == traindata['PredictedSurvival'])/traindata.index.size

             Survived  PredictedSurvival
PassengerId                             
1                   0                  0
2                   1                  1
3                   1                  0
4                   1                  1
5                   0                  0



0.77665544332211
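The accuracy above is the fraction of rows where the rounded score equals the true label. A cross-tabulation shows where the misses fall, including any rounded score of 2, which can never match a 0/1 label. The sketch below uses a few invented labels, not the Titanic results.

```python
import pandas as pd

# Toy labels, invented for illustration only.
results = pd.DataFrame({
    'Survived':          [0, 1, 1, 1, 0, 0],
    'PredictedSurvival': [0, 1, 0, 2, 0, 1],
})

# Fraction of exact matches; equivalent to sum(matches) / number of rows.
accuracy = (results['Survived'] == results['PredictedSurvival']).mean()

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(results['Survived'], results['PredictedSurvival'])

print(accuracy)
print(confusion)
```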

# Generate predictions

We will now use this model to generate our submission to Kaggle.

In [76]:
my_submission['Survived'] = testdata.apply(NaiveBayesPredict, axis=1).round(0).astype(np.int64)
# The predictor generated several results of 2, because the score is not capped at 1. These are clipped to 1 below.
my_submission['Survived'] = my_submission['Survived'].clip(upper=1)
my_submission.to_csv('Submission DG.csv')

0.215311004784689
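To see how a result of 2 can arise, the score can be worked out for one profile: a girl under 12 who is not in third class. The sketch below plugs in the training frequencies printed earlier; the variable names are illustrative and the profile is just one example chosen for the arithmetic.

```python
import pandas as pd

# Frequencies copied from the training-set outputs printed earlier.
p_survival = 0.3838383838383838
p_pclass3_given_survival = 0.347953216374269
p_female_given_survival = 0.6812865497076024
p_under12_given_survival = 0.11403508771929824
p_pclass3 = 0.5510662177328844
p_female = 0.35241301907968575
p_under12 = 0.07631874298540965

# Not third class, so the Pclass factors use the complements.
numerator = ((1 - p_pclass3_given_survival) * p_female_given_survival
             * p_under12_given_survival * p_survival)
denominator = (1 - p_pclass3) * p_female * p_under12

score = numerator / denominator
print(score)         # about 1.61, above 1
print(round(score))  # rounds to 2

# Capping at 1 maps such values back to a valid label.
clipped = pd.Series([0, 1, 2]).clip(upper=1).tolist()
print(clipped)
```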


# Model improvements
Improvements that are not implemented but would increase the predictive power of this model:
* Predict demise rather than survival, since that class has more samples (891 - 342 = 549)
* Do a proper EDA to determine the most predictive factors in the data, or at least quantify how much of the variance the proposed factors capture.