## Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. In this notebook we apply the tools of machine learning to predict which passengers survived the tragedy.

In [1]:
#importing the required libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22


We load the csv files into dataframes using the read_csv function.

In [2]:
df_raw = pd.read_csv("titanic_data.csv")
df_test  = pd.read_csv('test.csv')

We split the data into training and validation sets. The model is fit on the training data, and the validation set is used to check how well it generalizes to new data. Passing stratify = df_raw['Survived'].values keeps the proportion of survivors the same in both sets, so the prior probability of each label is the same in the training and validation data.

In [3]:
df_trn,df_val,label_trn,label_val = train_test_split(df_raw,df_raw[['PassengerId','Survived']],
                                                stratify = df_raw['Survived'].values,test_size = 0.3,random_state = 123)
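
The effect of stratify can be checked on a small example. The sketch below uses a hypothetical toy dataframe (not the Titanic data) with 40% positive labels and compares the class balance of the two subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive labels.
toy = pd.DataFrame({
    "x": range(100),
    "Survived": [1] * 40 + [0] * 60,
})

toy_trn, toy_val = train_test_split(
    toy, stratify=toy["Survived"], test_size=0.3, random_state=123
)

# With stratification both subsets keep roughly the same class balance.
rate_trn = toy_trn["Survived"].mean()
rate_val = toy_val["Survived"].mean()
print(rate_trn, rate_val)
```

Without stratify the two rates can drift apart, especially when the dataset is small.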

In [4]:
df_trn.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,Fare,Parch,Ticket,SibSp,Cabin,Embarked,Survived
199,200,2,"Yrois, Miss. Henriette (""Mrs Harbeck"")",female,24.0,13.0,0,248747,0,,S,0
468,469,3,"Scanlan, Mr. James",male,,7.725,0,36209,0,,Q,0
198,199,3,"Madigan, Miss. Margaret ""Maggie""",female,,7.75,0,370370,0,,Q,1
574,575,3,"Rush, Mr. Alfred George John",male,16.0,8.05,0,A/4. 20589,0,,S,0
776,777,3,"Tobin, Mr. Roger",male,,7.75,0,383121,0,F38,Q,0


Here we list the columns that contain null values in each dataframe, along with their counts.

In [5]:
for df in [df_trn,df_val,df_test]:
    print(df.isnull().sum()[df.isnull().sum() != 0])

Age         122
Cabin       493
Embarked      1
dtype: int64
Age          55
Cabin       194
Embarked      1
dtype: int64
Age       86
Fare       1
Cabin    327
dtype: int64
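
The same count can be computed with a single pass over the dataframe by storing the result of isnull().sum() first. This is a minimal sketch on a hypothetical toy dataframe; the column names only mimic the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 8.05, 13.0],
    "Cabin": [np.nan, np.nan, "C85"],
})

null_counts = toy.isnull().sum()      # count nulls once per column
print(null_counts[null_counts > 0])   # keep only the columns that have gaps
```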


Since the Embarked column has very few null values in the training and validation sets, and none in the test set, we drop the rows where Embarked is null using the dropna method. Missing values in Fare are replaced with the median Fare of the training set. The median is a better fill value than the mean when the distribution has outliers, because extreme values do not shift it.

In [6]:
for df in [df_trn,df_val,df_test]:
    df.dropna(subset = ['Embarked'],inplace  = True)
    df['Fare'] = df['Fare'].fillna(df_trn['Fare'].median())
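
To see why the median is preferred here, the sketch below compares the mean and the median of a few hypothetical fare values that include one very large fare.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: four typical values, one outlier, one missing.
fares = pd.Series([7.25, 7.75, 8.05, 13.0, 512.33, np.nan])

mean_fare = fares.mean()      # pulled upward by the 512.33 outlier
median_fare = fares.median()  # stays near the typical fare

filled = fares.fillna(median_fare)
print(mean_fare, median_fare)
```

The mean lands far above what most passengers in this toy series paid, while the median stays among the typical values.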
    



Here we split the Name column, which follows the pattern "Family name, Title. Given names", and extract the family name and the title.

In [7]:
for df in [df_trn,df_val,df_test]: 
    df.loc[:,'Fmly_Name'] = df.Name.str.rsplit(pat=',',expand = True).iloc[:,0]
    df.loc[:,'Title'] = df.Name.str.rsplit(pat=',',expand = True).iloc[:,1].\
    str.rsplit(pat = '.',expand  = True).iloc[:,0].str[1:]
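
As an alternative to chaining two splits, the same two fields can be pulled out with one regular expression. This is a sketch, not the approach used in this notebook; the first two names come from the sample rows above and the third is an assumed sample in the same format.

```python
import pandas as pd

names = pd.Series([
    "Rush, Mr. Alfred George John",
    'Madigan, Miss. Margaret "Maggie"',
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Fmly_Name: everything before the first comma.
# Title: the text after the comma, up to the first period.
parts = names.str.extract(r"^(?P<Fmly_Name>[^,]+),\s*(?P<Title>[^.]+)\.")
print(parts)
```

The named groups become the column names of the returned dataframe, so the result can be joined back onto the original frame.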



In [8]:
for df in [df_trn,df_val,df_test]: 
    print(df.Title.value_counts())

Mr              360
Miss            122
Mrs              92
Master           29
Dr                5
Rev               4
Ms                1
Col               1
Don               1
Jonkheer          1
the Countess      1
Sir               1
Lady              1
Mme               1
Major             1
Mlle              1
Name: Title, dtype: int64
Mr        157
Miss       59
Mrs        32
Master     11
Rev         2
Dr          2
Major       1
Capt        1
Mlle        1
Col         1
Name: Title, dtype: int64
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: Title, dtype: int64


After observing the counts of the different values, we replace the titles that appear rarely with the closest common title.

In [9]:
for df in [df_trn,df_val,df_test]:
    # Restrict the replacement to the Title column so other columns are left untouched
    df['Title'] = df['Title'].replace(['Rev','Major','Jonkheer','Dr','Sir','Don','Col','Capt'],'Mr')
    df['Title'] = df['Title'].replace(['Ms','Lady','Mlle','Dona'],'Miss')
    df['Title'] = df['Title'].replace(['Mme','the Countess'],'Mrs')



In [10]:
for df in [df_trn,df_val,df_test]: 
    print(df.Title.unique())

['Miss' 'Mr' 'Mrs' 'Master']
['Mr' 'Mrs' 'Master' 'Miss']
['Mr' 'Mrs' 'Miss' 'Master']


The Age column has a large number of missing values. Here we group the rows by Title, compute the median age of each group, and use that value to fill the missing ages within the group.

In [None]:
def impute_median(series):
    return series.fillna(series.median())
for df in [df_trn,df_val,df_test]:
    df.Age = df.groupby('Title')['Age'].transform(impute_median)
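
The behaviour of groupby followed by transform is easier to see on a small example. The sketch below uses a hypothetical toy dataframe with one missing age per title.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing age in each Title group.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Master"],
    "Age":   [30.0, 40.0, np.nan, 4.0, 6.0, np.nan],
})

# transform returns a Series aligned to the original index,
# so each gap is filled with the median of its own Title group.
toy["Age"] = toy.groupby("Title")["Age"].transform(lambda s: s.fillna(s.median()))
print(toy)
```

The missing "Mr" age is filled from the adult group and the missing "Master" age from the children, which a single global median could not do.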
    

Here we create a new feature called Family, which is the sum of SibSp (siblings and spouses) and Parch (parents and children). Although the survival rate depends on sex, the effect is stronger in first class and comparatively weaker in third class.
This suggests an interaction effect between sex and class, so we create a new feature, Sex_class, to capture this interaction. We also standardize the Age feature.

In [11]:
Age_mean = df_raw['Age'].mean()
Age_std = df_raw['Age'].std()

for df in [df_trn,df_val,df_test]: 
    df['Family'] = df['Parch'] + df['SibSp']
    df['Sex'] = df.Sex.map({'male':0,'female':1})
    df['Sex_class'] = df['Pclass']*df['Sex']
    df['Embarked_map'] = df['Embarked'].map({'S':1,'Q':2,'C':3})
    df['Title_map'] = df['Title'].map({'Mr':1,'Miss':2,'Mrs':3,'Master':4})
    df['Pclass_map'] = df['Pclass']
    df['Age_std'] = (df['Age'] - Age_mean)/Age_std
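
It helps to look at the values the Sex_class product actually takes. The sketch below applies the same mapping and product to a hypothetical toy dataframe.

```python
import pandas as pd

# Hypothetical toy frame: one male and one female per class of interest.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female"],
    "Pclass": [1, 1, 2, 3],
})
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Sex_class"] = toy["Pclass"] * toy["Sex"]
print(toy)
```

Because male is encoded as 0, Sex_class is 0 for every male regardless of class and equals the class number for females, so the feature separates women by class in a single column.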
    

Classifiers work only with numerical values, so unordered categorical variables like Embarked and Title are passed to the get_dummies function, which creates a single indicator feature for each value in a column.

In [14]:
df_trn = pd.get_dummies(df_trn, columns=['Embarked','Title','Pclass'])
df_val = pd.get_dummies(df_val, columns=['Embarked','Title','Pclass'])
df_test = pd.get_dummies(df_test, columns=['Embarked','Title','Pclass'])
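
Calling get_dummies on each dataframe separately yields matching columns only if every category appears in every dataframe. That holds here, since all three frames contain the same four titles after the merge above. In general it is not guaranteed, so the sketch below (on hypothetical toy frames) shows one way to align the columns with reindex.

```python
import pandas as pd

trn = pd.DataFrame({"Title": ["Mr", "Miss", "Master"]})
tst = pd.DataFrame({"Title": ["Mr", "Miss"]})   # no 'Master' in this frame

trn_d = pd.get_dummies(trn, columns=["Title"])
tst_d = pd.get_dummies(tst, columns=["Title"])

# Reindex the second frame to the training columns;
# indicators that are absent are created and filled with 0.
tst_d = tst_d.reindex(columns=trn_d.columns, fill_value=0)
print(list(tst_d.columns))
```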

Here we pass different combinations of features to the random forest classifier, observe how the accuracy varies on the training and validation sets, and then select the best features.

This process is done separately for two groups: passengers travelling alone and passengers travelling with family.

In [20]:
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=100,max_features=0.8)
features_all = ['Sex','Pclass_1','Embarked_C','Age_std','Pclass_2','Title_Master','Pclass_3','Title_Mrs','Embarked_Q','SibSp','Parch','Family',
'Title_Miss','Title_Mr','Embarked_S','Sex_class']
# Passengers travelling alone (Family == 0)
trn_alone = df_trn[df_trn['Family'] == 0]
val_alone = df_val[df_val['Family'] == 0]
for i in range(len(features_all)):
    # Evaluate the first i+1 features of the list
    features = features_all[:i+1]
    clf.fit(trn_alone[features],trn_alone['Survived'])
    pred_train = clf.predict(trn_alone[features])
    pred_val = clf.predict(val_alone[features])

    print(features)
    print("Train_score = " + str(accuracy_score(trn_alone['Survived'],pred_train)) + ' , '
          +"Crossvalidation_score = " + str(accuracy_score(val_alone['Survived'],pred_val)))

['Sex']
Train_score = 0.8306878306878307 , Crossvalidation_score = 0.8280254777070064
['Sex', 'Pclass_1']
Train_score = 0.8306878306878307 , Crossvalidation_score = 0.8280254777070064
['Sex', 'Pclass_1', 'Embarked_C']
Train_score = 0.8306878306878307 , Crossvalidation_score = 0.8280254777070064
['Sex', 'Pclass_1', 'Embarked_C', 'Age_std']
Train_score = 0.8862433862433863 , Crossvalidation_score = 0.8152866242038217
['Sex', 'Pclass_1', 'Embarked_C', 'Age_std', 'Pclass_2']
Train_score = 0.9047619047619048 , Crossvalidation_score = 0.8089171974522293
['Sex', 'Pclass_1', 'Embarked_C', 'Age_std', 'Pclass_2', 'Title_Master']
Train_score = 0.9047619047619048 , Crossvalidation_score = 0.8089171974522293
['Sex', 'Pclass_1', 'Embarked_C', 'Age_std', 'Pclass_2', 'Title_Master', 'Pclass_3']
Train_score = 0.9047619047619048 , Crossvalidation_score = 0.8089171974522293
['Sex', 'Pclass_1', 'Embarked_C', 'Age_std', 'Pclass_2', 'Title_Master', 'Pclass_3', 'Title_Mrs']
Train_score = 0.9074074074074074 ,

In [16]:
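# 'Age' is used here in place of the 'Age_std' column evaluated above;
# tree-based splits are not affected by standardizing a feature.
best_single_features = ['Sex','Pclass_1','Embarked_C','Age']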


In [17]:
from sklearn.metrics import accuracy_score

features_all = ['Age','Sex','Fare','Pclass_map','Title_map','Embarked_map']
#features_all = ['Sex','Age','Pclass_1','Pclass_2','Pclass_3','SibSp','Parch','Family', 
#'Title_Master','Title_Miss','Title_Mr','Title_Mrs','Embarked_C','Embarked_Q','Embarked_S']
# Passengers travelling with family (Family > 0)
trn_grp = df_trn[df_trn['Family'] > 0]
val_grp = df_val[df_val['Family'] > 0]
for i in range(len(features_all)):
    # Evaluate the first i+1 features of the list
    features = features_all[:i+1]
    clf.fit(trn_grp[features],trn_grp['Survived'])
    pred_train = clf.predict(trn_grp[features])
    pred_val = clf.predict(val_grp[features])

    print(features)
    print("Train_score = " + str(accuracy_score(trn_grp['Survived'],pred_train)) + ' , '
          +"Crossvalidation_score = " + str(accuracy_score(val_grp['Survived'],pred_val)))

['Age']
Train_score = 0.7377049180327869 , Crossvalidation_score = 0.5454545454545454
['Age', 'Sex']
Train_score = 0.8565573770491803 , Crossvalidation_score = 0.6545454545454545
['Age', 'Sex', 'Fare']
Train_score = 1.0 , Crossvalidation_score = 0.6909090909090909
['Age', 'Sex', 'Fare', 'Pclass_map']
Train_score = 1.0 , Crossvalidation_score = 0.7909090909090909
['Age', 'Sex', 'Fare', 'Pclass_map', 'Title_map']
Train_score = 1.0 , Crossvalidation_score = 0.8272727272727273
['Age', 'Sex', 'Fare', 'Pclass_map', 'Title_map', 'Embarked_map']
Train_score = 1.0 , Crossvalidation_score = 0.8363636363636363
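
The two loops above follow the same pattern: fit a forest on a growing prefix of the feature list and record the training and validation accuracy. A minimal sketch of that pattern as a reusable helper is given below. The function name score_feature_prefixes, its parameters and the synthetic data are assumptions made for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def score_feature_prefixes(trn, val, features_all, target="Survived", seed=0):
    """Fit one forest per growing feature prefix.

    Returns a list of (features, train_accuracy, validation_accuracy) tuples.
    """
    rows = []
    for i in range(len(features_all)):
        features = features_all[:i + 1]
        clf = RandomForestClassifier(n_estimators=20, random_state=seed)
        clf.fit(trn[features], trn[target])
        rows.append((
            features,
            accuracy_score(trn[target], clf.predict(trn[features])),
            accuracy_score(val[target], clf.predict(val[features])),
        ))
    return rows

# Hypothetical synthetic data, only to exercise the helper:
# the label depends on column "a" alone, column "b" is noise.
rng = np.random.default_rng(0)
data = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
data["Survived"] = (data["a"] > 0).astype(int)

results = score_feature_prefixes(data.iloc[:150], data.iloc[150:], ["a", "b"])
for features, trn_acc, val_acc in results:
    print(features, round(trn_acc, 3), round(val_acc, 3))
```

Fixing random_state makes repeated runs comparable, which is useful when the differences between feature sets are small.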


In [18]:
best_group_features = ['Age','Sex','Fare','Pclass_map','Title_map','Embarked_map']

Here we select the best features and predict the outcomes separately for passengers with and without families. We then combine both dataframes using the concat function. The combined dataframe can finally be written to a csv file using the to_csv method.

In [19]:
# .copy() gives independent dataframes, so adding the Survived column
# below does not raise SettingWithCopyWarning
df_alone = df_test[df_test['Family'] == 0].copy()
df_grp = df_test[df_test['Family'] != 0].copy()

clf = RandomForestClassifier(n_estimators=100,max_features=0.8)
clf.fit(df_trn[df_trn['Family'] == 0][best_single_features],df_trn[df_trn['Family'] == 0]['Survived'])
df_alone['Survived'] = clf.predict(df_alone[best_single_features])

clf = RandomForestClassifier(n_estimators=100,max_features=0.8)
clf.fit(df_trn[df_trn['Family'] > 0][best_group_features],df_trn[df_trn['Family'] > 0]['Survived'])
df_grp['Survived'] = clf.predict(df_grp[best_group_features])

# Recombine both groups and restore the original row order
df_test = pd.concat([df_alone,df_grp]).sort_index()
#df_test[['PassengerId','Survived']].to_csv('Separate_best_Predn.csv',index = False)
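
The split, predict and recombine step relies on the original index surviving the filtering, so that sort_index can restore the row order. The sketch below shows this on a hypothetical toy dataframe, with constant placeholders standing in for the two models' predictions.

```python
import pandas as pd

# Hypothetical toy test frame.
toy = pd.DataFrame({"PassengerId": [892, 893, 894, 895], "Family": [0, 1, 0, 2]})

alone = toy[toy["Family"] == 0].copy()
grp = toy[toy["Family"] != 0].copy()

alone["Survived"] = 0   # placeholder for the first model's predictions
grp["Survived"] = 1     # placeholder for the second model's predictions

# concat stacks the two groups; sort_index puts rows back in their original order
combined = pd.concat([alone, grp]).sort_index()
print(combined[["PassengerId", "Survived"]])
```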

