# Titanic Survival Likelihood with Machine Learning

In this tutorial I will explain how to estimate whether a passenger on the Titanic survived, using a machine learning algorithm, XGBClassifier. The data for this tutorial comes from kaggle.com. The tutorial is adapted from [here](https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy/notebook).

### Importing Libraries

In [1]:
import pandas as pd

Pandas is a library that provides easy-to-use data structures and data analysis tools.

In [2]:
from sklearn import model_selection

Sklearn (scikit-learn) is a machine learning library. Its 'model_selection' module provides tools for splitting data and for evaluating models with cross-validation.

In [3]:
from sklearn.preprocessing import LabelEncoder

Preprocessing transforms data into a form that is easier for an algorithm to work with. 'LabelEncoder' converts text labels into integer codes.

In [4]:
from xgboost import XGBClassifier

This is the machine learning algorithm we will use. We are using classification instead of regression because we are predicting group membership (alive or dead). Regression predicts a quantity, e.g. a house price.

### Importing the Data

In [5]:
raw_train = pd.read_csv('train.csv')
raw_test  = pd.read_csv('test.csv')

In [6]:
type(raw_train)

pandas.core.frame.DataFrame

The variables 'raw_train' and 'raw_test' are both DataFrames. A DataFrame is a data type from the pandas package. It is essentially a spreadsheet that plays nicely with Python.

In [18]:
raw_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Now, the meaning of the column labels.<br>
'PassengerId': ID number of the passenger<br>
'Survived': Boolean. 0 = did not survive, 1 = survived<br>
'Pclass': Ticket class. 1 is highest<br>
'Name': Name<br>
'Sex': Sex<br>
'Age': Age in years<br>
'SibSp': Number of siblings/spouses aboard the Titanic<br>
'Parch': Number of parents/children aboard the Titanic<br>
'Ticket': Ticket number<br>
'Fare': Passenger fare<br>
'Cabin': Cabin number<br>
'Embarked': Port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton<br>

In [8]:
train = raw_train.copy(deep=True)
test  = raw_test.copy(deep=True)

'.copy()' is a pandas method. 'deep=True' (the default) copies the data as well as the index, so 'train' is fully independent of 'raw_train'. A plain assignment such as 'train = raw_train' would only copy the reference to the DataFrame. Both names would then point to the same data, so changing anything in 'train' would also change 'raw_train' and vice versa.
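To make the difference between a reference and a copy concrete, here is a minimal sketch on a made-up three-row DataFrame. The names 'raw', 'alias' and 'independent' are only for illustration and are not part of the tutorial's code.

```python
import pandas as pd

# Toy frame standing in for raw_train (values are made up).
raw = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

alias = raw               # plain assignment: a second name for the same object
independent = raw.copy()  # deep=True is the default, so the data is copied

independent.loc[0, "Age"] = 99.0  # only changes the copy
alias.loc[1, "Age"] = 0.0         # changes raw as well

print(alias is raw)                 # True
print(raw["Age"].tolist())          # [22.0, 0.0, 26.0]
print(independent["Age"].tolist())  # [99.0, 38.0, 26.0]
```

The edit made through 'alias' shows up in 'raw', while the edit made to 'independent' does not.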

In [9]:
clean_data = [train, test]

Putting both DataFrames in a list lets us apply the same change to 'train' and 'test' in a single loop instead of writing each step twice.

### Cleaning the Data

When you start working with a set of data you must clean it. This means you remove any unnecessary data. You also must replace data that is missing with a reasonable estimate.

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In our training data we have 891 entries. We have a majority of the data for each column, save 'Cabin'. This means that we will be able to make an educated guess and fill in the missing data for most points. We will have to drop the 'Cabin' column because there is not enough data to make educated guesses to fill in the rest.

In [11]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


Again, we have most of the data, save 'Cabin'.

In [12]:
for dataset in clean_data:
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())

Now we fill in our missing data. '.fillna()' is a pandas method that fills empty cells. Its argument is the value to fill the empty cells with. We are using either 'median()' or 'mode()'. These, again, are pandas methods. They get the median or the most common value, respectively, from whichever column they are called on. 'mode()' returns a Series because several values can tie, so '[0]' takes the first one. 'fillna()' returns a new column, which we assign back into the DataFrame. We assign the result back rather than calling 'fillna(..., inplace=True)' on the selected column, because recent pandas versions warn that the in-place form may not update the DataFrame.
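As a small illustration of filling gaps with the median and the mode, here is a sketch on a made-up four-row table. The values are invented, and the result of 'fillna()' is assigned back to the column.

```python
import pandas as pd

# Toy data with one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0, 30.0],
    "Embarked": ["S", "S", None, "C"],
})

# median() ignores the missing value: the median of 20, 40, 30 is 30.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# mode() returns a Series of the most common value(s); [0] takes the first.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy["Age"].tolist())       # [20.0, 30.0, 40.0, 30.0]
print(toy["Embarked"].tolist())  # ['S', 'S', 'S', 'C']
```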

In [13]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [14]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


Now we will drop the unnecessary data. As stated earlier, there are too few data points for 'Cabin' for it to be of any use, so we will drop it. 'PassengerId' and 'Ticket' are unlikely to tell us anything about the likelihood of survival, as they are essentially arbitrary identifiers.

In [15]:
drop_columns = ['PassengerId', 'Cabin', 'Ticket']
train.drop(drop_columns, axis=1, inplace=True)
test.drop(drop_columns, axis=1, inplace=True)

Now both 'train' and 'test' should have no null values and no longer contain the columns we dropped.

In [16]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 62.7+ KB


In [19]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         418 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        418 non-null float64
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 26.2+ KB


### Adding in Data Features

There are times when your data set may contain features that are of little use in their original state, but still contain useful data. For example, as it stands now, 'Name' is of little use. However, one could guess that the title of each individual (Mr., Miss., etc.) could have some sort of correlation with survival rate. Adding this derived data as an extra feature gives the algorithm more to work with, and can increase its accuracy.

In [22]:
for dataset in clean_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)

This is quite a lot in a few lines, so I will go line by line.<br>
Line 1: The for loop goes through 'clean_data' so that these changes apply to both 'train' and 'test'.
'dataset['FamilySize']' creates a new column that will contain the size of the family: siblings and spouses, plus parents and children, plus 1 for the passenger themselves.<br><br>
Line 2: 'dataset['Title']' creates the 'Title' column. We need to glean the title from the name column. Let's look at 'Braund, Mr. Owen Harris' as an example. The first split, at ", ", gives 'Braund' and 'Mr. Owen Harris'. Because we use 'expand=True', the pieces are returned as the columns of a DataFrame. With '[1]' we select the second column, then split again at ".", giving 'Mr' and ' Owen Harris'. Finally, we select 'Mr' with '[0]' and store it in the 'Title' column.<br><br>
Line 3: We are creating bins for the 'Fare' column data. This means that we are generalizing the data into groups. We do this because it is unlikely that a fare of \$5.55 will affect the outcome much differently than one of \$5.60. To create these bins we are using 'qcut()', a pandas function. 'qcut()' will separate the data given in the first argument into the number of bins given in the second argument. The bin edges are chosen at quantiles, so each bin will hold roughly the same number of records.<br><br>
Line 4: Similar to Line 3, we are splitting the 'Age' column into bins, after '.astype(int)' has truncated the ages to whole years. This time, though, we are using 'cut()' (again, a pandas function). This will choose bins of equal width across the range of values. This means that the bins will generally not contain an equal number of records.
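The title extraction and the difference between 'qcut()' and 'cut()' can be sketched on a few made-up values. The two names come from the table above; the fares are invented and deliberately skewed so that the two binning methods disagree.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# expand=True returns a DataFrame of pieces; [1] is the part after ", ",
# and [0] is the part before the first ".".
titles = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
print(titles.tolist())  # ['Mr', 'Miss']

# Made-up, skewed fares: most are small, one is very large.
fares = pd.Series([5, 7, 8, 10, 13, 26, 80, 500])

by_quantile = pd.qcut(fares, 4)  # edges at the quartiles
by_width = pd.cut(fares, 4)      # four intervals of equal width

# Each quantile bin holds 2 of the 8 fares.
print(by_quantile.value_counts().sort_index())

# The first equal-width bin holds 7 of the 8 fares.
print(by_width.value_counts().sort_index())
```

With skewed data such as fares, equal-width bins can leave most records in a single bin, which is one reason to prefer 'qcut()' there.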

In [23]:
train['Title'].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Sir               1
Lady              1
Ms                1
Don               1
the Countess      1
Jonkheer          1
Mme               1
Capt              1
Name: Title, dtype: int64

Here are the titles we got, along with their counts. We are going to replace any title with a count of under 10 with 'Misc', as there are too few examples of each for the algorithm to learn anything reliable from them.

In [24]:
train['Title'].value_counts() < 10

Mr              False
Miss            False
Mrs             False
Master          False
Dr               True
Rev              True
Mlle             True
Col              True
Major            True
Sir              True
Lady             True
Ms               True
Don              True
the Countess     True
Jonkheer         True
Mme              True
Capt             True
Name: Title, dtype: bool

In [26]:
is_rare_title = train['Title'].value_counts() < 10
train['Title'] = train['Title'].apply(lambda x: 'Misc' if is_rare_title.loc[x] else x)

In [27]:
train['Title'].value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
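The same grouping of rare titles can also be written without 'apply()'. The following is only a sketch on made-up titles. 'min_count' is a hypothetical name, and it is set to 2 here because the toy data is tiny, whereas the tutorial uses a threshold of 10.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Dr", "Rev"])
min_count = 2  # hypothetical threshold for this toy data

# For every row, look up how often that row's title occurs.
counts = titles.map(titles.value_counts())

# Keep titles that are common enough, replace the rest with 'Misc'.
grouped = titles.where(counts >= min_count, "Misc")

print(grouped.tolist())
# ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Misc', 'Misc']
```

Note that the cell above only changes 'train'. If 'test' is going to be used with the same features, the same replacement would presumably need to be applied to it as well.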

### Converting Data

Now we will convert our data into something that is a bit more readable for the machine.

In [28]:
label = LabelEncoder()

This creates a scikit-learn 'LabelEncoder', which converts each distinct label in a column into an integer code.

In [29]:
for dataset in clean_data:    
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])

Now every text or binned column has a numeric '_Code' counterpart that the algorithm can work with.
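To see what 'fit_transform()' produces, here is a minimal sketch on a made-up list of labels. 'LabelEncoder' sorts the distinct labels and numbers them from 0, which matches the 'Sex_Code' values in the table below ('female' is 0, 'male' is 1).

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

print(codes.tolist())             # [1, 0, 0, 1]
print(encoder.classes_.tolist())  # ['female', 'male']

# inverse_transform maps codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())  # ['female', 'male']
```

Because 'fit_transform()' refits the encoder on every call, two DataFrames only get matching codes for a column if that column contains the same set of labels in both.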

In [30]:
train.head()

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | Title | FareBin | AgeBin | Sex_Code | Embarked_Code | Title_Code | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 | S | 2 | Mr | (-0.001, 7.91] | (16.0, 32.0] | 1 | 2 | 3 | 1 | 0 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C | 2 | Mrs | (31.0, 512.329] | (32.0, 48.0] | 0 | 0 | 4 | 2 | 3 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 | S | 1 | Miss | (7.91, 14.454] | (16.0, 32.0] | 0 | 2 | 2 | 1 | 1 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 | S | 2 | Mrs | (31.0, 512.329] | (32.0, 48.0] | 0 | 2 | 4 | 2 | 3 |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 | S | 1 | Mr | (7.91, 14.454] | (32.0, 48.0] | 1 | 2 | 3 | 2 | 1 |


### Survey Data

Now that we have our data all cleaned up and converted we can finally see how each feature correlates with survival rate. 

In [31]:
Target = ['Survived']

'Target' is the column the algorithm should predict.

In [32]:
train_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize']

In [33]:
train_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']

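'train_x' lists the original feature columns, which we use for the survey below. 'train_x_bin' lists their encoded and binned versions, which are the features our algorithm will use when calculating.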

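Now we can print the survival rate for each value of our features. Although the printout is labelled 'Correlation', each number is the mean of 'Survived' within a group, that is, the fraction of that group who survived. Columns of type float64 ('Age' and 'Fare') are skipped because they have too many distinct values to group by.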
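As a small illustration of the grouped mean used in the next cell, here is a sketch on a made-up five-row table (the values are invented, not taken from the Titanic data). Because 'Survived' is 0 or 1, its mean within a group is the share of that group who survived.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# One row per group; the Survived column holds the group's mean.
rates = toy[["Sex", "Survived"]].groupby("Sex", as_index=False).mean()
print(rates)
# female: 2 of 2 survived -> 1.0, male: 1 of 3 survived -> about 0.33
```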

In [34]:
for x in train_x:
    if train[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(train[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')

Survival Correlation by: Sex
      Sex  Survived
0  female  0.742038
1    male  0.188908
---------- 

Survival Correlation by: Pclass
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
---------- 

Survival Correlation by: Embarked
  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
---------- 

Survival Correlation by: Title
    Title  Survived
0  Master  0.575000
1    Misc  0.444444
2    Miss  0.697802
3      Mr  0.156673
4     Mrs  0.792000
---------- 

Survival Correlation by: SibSp
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
---------- 

Survival Correlation by: Parch
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
---------- 

Survival Correlation by: FamilySize
   FamilySize  Survived
0           1  0.303538
1 

### Finally, Machine Learning

Now that we have all of that out of the way we can apply our machine learning algorithm. As mentioned earlier, we will use 'XGBClassifier()'.

In [35]:
alg = XGBClassifier()

In [36]:
split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 )

We repeatedly split the data into a train and a test set using sklearn 'ShuffleSplit()'. 'n_splits' is how many random splits are made, and therefore how many times the model is trained and scored. Each time, we train it with 60% of the data and test it with 30% of the data; the remaining 10% is left out of that split. We set a random state so that the output is the same every time and can be reproduced on others' machines.
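To check what these settings do, the following sketch runs the same 'ShuffleSplit()' on a plain array of 891 row numbers (the number of rows in 'train') and prints the size of each part. The array 'rows' is only a stand-in for the real data.

```python
import numpy as np
from sklearn import model_selection

rows = np.arange(891)  # stand-in for the 891 rows of train
split = model_selection.ShuffleSplit(n_splits=10, test_size=.3,
                                     train_size=.6, random_state=0)

for train_idx, test_idx in split.split(rows):
    unused = len(rows) - len(train_idx) - len(test_idx)
    # Every one of the 10 splits prints: 534 268 89
    print(len(train_idx), len(test_idx), unused)
```

Each split trains on 534 rows and tests on 268, and no row appears in both parts of the same split.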

In [37]:
results = model_selection.cross_validate(alg, train[train_x_bin], train[Target[0]], cv=split, return_train_score=True)

We finally run the ML algorithm using 'cross_validate()'. The first argument is the MLA, the second is the data, the third is the target (we select the single 'Survived' column so that it is 1d, because it must be that shape), the fourth is our data split, and the fifth is a boolean telling it to return the training scores as well as the test scores.

In [38]:
results['train_score'].mean()

0.85636704119850171

In [40]:
results['test_score']

array([ 0.83955224,  0.80223881,  0.83208955,  0.81343284,  0.83208955,
        0.82835821,  0.81343284,  0.82462687,  0.83955224,  0.86940299])
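One way to read these numbers is to summarise the ten test scores and compare them with the mean train score. The sketch below simply copies the values printed above into an array; 'train_mean' and 'gap' are names chosen for illustration.

```python
import numpy as np

# Values copied from the two outputs above.
test_scores = np.array([0.83955224, 0.80223881, 0.83208955, 0.81343284,
                        0.83208955, 0.82835821, 0.81343284, 0.82462687,
                        0.83955224, 0.86940299])
train_mean = 0.85636704119850171

gap = train_mean - test_scores.mean()

print(round(test_scores.mean(), 4))  # 0.8295
print(round(test_scores.std(), 4))   # spread across the 10 splits
print(round(gap, 4))                 # 0.0269
```

A test accuracy slightly below the train accuracy is expected, since the model has already seen the training rows.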