In this article, we walk through a case study that is well known among machine learning practitioners: predicting survival on the Titanic. We first introduce the case study and then show how to build a predictive model for survival with machine learning.

### Machine Learning Case Study: Titanic Survival Analysis

The sinking of the Titanic is one of the most infamous wrecks in history. On **April 15, 1912**, during her maiden voyage, the RMS Titanic, widely considered **unsinkable**, sank after hitting an iceberg.

Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of **1,502 out of 2,224** passengers and crew. While there was an element of luck in survival, it appears that certain groups of people were more likely to survive than others.

Here, our challenge is to build a predictive model that answers the question **What types of people were more likely to survive?** using passenger data (e.g. `name`, `age`, `sex`, `socio-economic class`).

### Predict Titanic Survival with Machine Learning

To predict Titanic survival, we use a **now-classic** dataset on passenger survival on the Titanic, which sank in 1912.

In [1]:
import pandas as pd
train = pd.read_csv('titanic/train.csv') 
test = pd.read_csv('titanic/test.csv')
train[:4]

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |


Scikit-learn’s algorithms generally cannot handle missing data, so we check each column for missing values:

In [2]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [3]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
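
Raw counts are easier to compare across tables of different sizes when they are shown next to the share of rows they represent. The following is a minimal sketch on a tiny made-up frame (the `toy` table and its values are assumptions for illustration, not real passenger records):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic tables (not real passenger data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Count and share of missing values per column, most incomplete column first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
}).sort_values("share", ascending=False)

print(missing)
```

Applied to `train` and `test`, the same summary would make it easy to see that `Cabin` is mostly empty while `Age` is only partly missing.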

### Data Preparation

In statistics and machine learning examples like this, a typical task is to predict whether a passenger survived based on features in the data. A model is fitted to a training dataset and then evaluated on an out-of-sample test dataset.
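
The test table shown earlier has no `Survived` column, so it cannot be used to score a model locally. One common workaround is to hold out part of the training data. The sketch below illustrates the idea on synthetic data (the arrays `X` and `y` are generated for illustration and are not the Titanic table; the 25% split size is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Titanic table): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Keep 25% of the labelled rows aside; stratify so both splits
# have a similar share of each class.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression().fit(X_fit, y_fit)

# Accuracy on rows the model never saw during fitting.
holdout_accuracy = clf.score(X_hold, y_hold)
print(holdout_accuracy)
```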

We would like to use `Age` as a predictor, but it has missing values. There are several ways to impute missing data; we’ll use a simple one and fill the null values in both tables with the **median of the training dataset**:

In [4]:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
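
The same idea can also be expressed with scikit-learn’s `SimpleImputer`, which learns the fill value from the training data and reuses it on any other table. This is an alternative sketch, not the method used in this article, and the two small frames below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages standing in for the real columns.
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 54.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# Learn the median from the training table only.
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train[["Age"]])

# Apply the learned median to both tables.
train_age = imputer.transform(toy_train[["Age"]]).ravel()
test_age = imputer.transform(toy_test[["Age"]]).ravel()

print(imputer.statistics_)  # the median learned from toy_train
print(train_age, test_age)
```

Fitting on the training table alone keeps information from the test table out of the preprocessing step, which matches the manual approach above.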

We now need to prepare the inputs for our model. We’ll add an `IsFemale` column as an encoded version of the `Sex` column:

In [5]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
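
A comparison such as `== 'female'` silently encodes any unexpected label as 0. If the column might contain inconsistent spellings, an explicit mapping can surface them as missing values instead. A small sketch on a made-up column (the stray `"Female"` label is an assumption added to show the difference):

```python
import pandas as pd

# Made-up column; the last label is deliberately inconsistent.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "Female"]})

# Comparison-based encoding: anything that is not exactly 'female' becomes 0.
is_female_cmp = (toy["Sex"] == "female").astype(int)

# Mapping-based encoding: labels missing from the dict become NaN.
is_female_map = toy["Sex"].map({"male": 0, "female": 1})

print(is_female_cmp.tolist())
print(is_female_map.isnull().sum())  # number of labels that did not match
```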

Next, we decide on some model variables and create NumPy arrays:

In [6]:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
X_train[:5]

array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

### Machine Learning Model to Predict Titanic Survival

Now we are going to use the **LogisticRegression model from scikit-learn** and create a model instance:

In [7]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Next, we fit this model to the training data using the model’s `fit` method:

In [8]:
model.fit(X_train, y_train)

LogisticRegression()

Now we can make predictions on the test dataset using `model.predict`:

In [9]:
y_predict = model.predict(X_test)
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
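
Besides hard 0/1 labels, a fitted `LogisticRegression` can also return class probabilities through `predict_proba`, which is useful when a different decision threshold is wanted. A minimal sketch on synthetic data (the arrays below are generated for illustration, not taken from the Titanic table):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the Titanic table).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# One row per sample, one column per class, in the order of clf.classes_.
proba = clf.predict_proba(X[:5])
labels = clf.predict(X[:5])

print(clf.classes_)
print(proba)
print(labels)
```

For a binary problem, the predicted label corresponds to the class with the larger probability in each row.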

In practice, training models often involves many additional layers of complexity. Many models have parameters that can be tuned, and techniques such as **cross-validation can be used for parameter tuning to avoid overfitting to the training data**. This can often improve predictive performance or robustness on new data.

### Implementing Cross-Validation

Cross-validation works by splitting training data to simulate out-of-sample prediction. Based on a model score such as accuracy for classification or the **root mean square error** for regression, one can perform a grid search on the model parameters. Some models, like **logistic regression, have classes of estimators with built-in cross-validation**.
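
A grid search of this kind can be sketched with scikit-learn’s `GridSearchCV`. The example below runs on synthetic data (not the Titanic table), and the candidate values for `C` and the choice of four folds are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the Titanic table).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate regularization strengths to compare (arbitrary choices).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored by 4-fold cross-validated accuracy.
search = GridSearchCV(LogisticRegression(), param_grid, cv=4, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the supplied data with the best-scoring setting.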

For example, the `LogisticRegressionCV` class takes a parameter, `Cs`, that indicates how fine a grid search to perform on the model regularization parameter `C`:

In [10]:
from sklearn.linear_model import LogisticRegressionCV
# Try 10 candidate values of C; recent scikit-learn releases require Cs as a keyword
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)



LogisticRegressionCV()
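
After fitting, it can be helpful to check which value of `C` the search settled on. The sketch below uses the `Cs_` and `C_` attributes of `LogisticRegressionCV` on synthetic data (not the Titanic table); `cv=4` is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (NOT the Titanic table).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

cv_clf = LogisticRegressionCV(Cs=10, cv=4).fit(X, y)

# The grid of candidate C values that was searched.
candidate_cs = cv_clf.Cs_

# The selected C; np.ravel handles both array and scalar forms.
chosen_c = np.ravel(cv_clf.C_)[0]

print(candidate_cs)
print(chosen_c)
```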

To perform cross-validation by hand, we can use the `cross_val_score` helper function, which handles the process of splitting data. For example, to validate our model with **four non-overlapping divisions** of training data, we can do:

In [11]:
from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
scores

array([0.77578475, 0.79820628, 0.77578475, 0.78828829])
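
To make the splitting explicit, the loop below reproduces by hand what `cross_val_score` does for a classifier when `cv` is an integer: it uses stratified, non-overlapping folds and scores each held-out fold with the estimator’s default metric. This is a sketch on synthetic data (not the Titanic table), so the scores differ from those above:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the Titanic table).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

clf = LogisticRegression(C=10)
folds = StratifiedKFold(n_splits=4)

manual_scores = []
for fit_idx, val_idx in folds.split(X, y):
    # Fit a fresh copy on three folds, score on the remaining one.
    fold_model = clone(clf).fit(X[fit_idx], y[fit_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The helper function performs the same splits and scoring.
helper_scores = cross_val_score(clf, X, y, cv=4)

print(manual_scores)
print(helper_scores)
```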

The default scoring metric depends on the model, but it is possible to choose an explicit scoring function. Cross-validated models take longer to train, but can often perform better.
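
An explicit metric can be requested through the `scoring` argument of `cross_val_score`. The sketch below compares the default score for a classifier (accuracy) with ROC AUC on synthetic data (not the Titanic table); the choice of `"roc_auc"` is only an example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the Titanic table).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(C=10)

# Default: the estimator's own score method (accuracy for classifiers).
scores_acc = cross_val_score(clf, X, y, cv=4)

# Explicit metric: area under the ROC curve.
scores_auc = cross_val_score(clf, X, y, cv=4, scoring="roc_auc")

print(scores_acc)
print(scores_auc)
```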