# Predicting Titanic Survival Using a Logistic Regression Model

In this notebook, we will take data from the Kaggle competition 'Titanic - Machine Learning from Disaster' and see whether a Logistic Regression model can predict which passengers survived.

We have two sets of data, train and test, containing almost the same information. Our data consists of the following information:

- PassengerId: a number assigned to each passenger.
- Survived: whether or not this passenger survived.  This is only in the training data.
- Pclass: The passenger's class: first, second or third.
- Name: The name of the passenger.
- Sex: The sex of the passenger.
- Age: The age of the passenger.
- SibSp: The number of siblings or spouses the passenger had also on board.
- Parch: The number of parents or children the passenger had also on board.
- Ticket: A ticket id.
- Fare: The amount paid for the ticket.
- Cabin: The cabin the passenger stayed in.
- Embarked: Where the passenger boarded the Titanic.

We will plug this information into a Logistic Regression model. In its basic form, logistic regression uses a logistic function to model the probability of a binary dependent variable. Sklearn's predict method uses a default decision threshold of 0.5: if the model assigns a probability greater than 0.5 to one of the two classes, it predicts that class. For example, if our model thinks that there is a 0.57 chance that someone survived on the Titanic, it will say they survived when making predictions.
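
The following is a minimal sketch of the idea described above, not part of the original analysis. It fits a model on made-up one-feature data (the names `X_toy`, `y_toy`, `toy_model` and `logistic` are hypothetical), recomputes the predicted probability by hand with the logistic function, and checks that the predicted label is 1 exactly when that probability is above 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Toy data, made up for illustration: one feature and a binary label.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Probability of class 1 for each row, as reported by scikit-learn.
proba = toy_model.predict_proba(X_toy)[:, 1]

# The same probability computed by hand from the fitted weight and intercept.
manual = logistic(X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0])

# predict() returns 1 for the rows whose probability is above 0.5.
labels = toy_model.predict(X_toy)
print(np.column_stack([proba.round(3), labels]))
```

If a different threshold were wanted, one option would be to compare `proba` against that value directly instead of calling `predict`.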

### Table of Contents

- Exploring and Cleaning the Data
- Making, Training and Using our Model
- Conclusion

## Exploring and Cleaning the Data

In [128]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [129]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
gender_submission = pd.read_csv('gender_submission.csv')

In [130]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [131]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [132]:
np.mean(train.Age)

29.69911764705882

It is clear that there are some missing values in the Age column (only 714 of the 891 entries are non-null), and age is assumed to be an important factor in whether one survived on the Titanic. We should fill those in. We will use the average age of our passengers, which is a little younger than 30 years old.

In [133]:
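# Assign the filled column back; fillna(inplace=True) on train.Age is a chained
# operation that may not update train in newer versions of pandas
train['Age'] = train['Age'].fillna(train['Age'].mean())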
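
As a small illustration of this step, here is a sketch on a made-up frame (the `toy` data is hypothetical, not taken from the Titanic files). It counts the missing ages, computes the mean over the values that are present, and fills the gaps by assigning the result back to the column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and two missing ages.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

n_missing = toy["Age"].isna().sum()  # number of missing ages
mean_age = toy["Age"].mean()         # mean() skips NaN by default

# Fill the gaps and assign the result back to the column.
toy["Age"] = toy["Age"].fillna(mean_age)
print(n_missing, mean_age)
print(toy)
```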

The next thing that we will tackle is the class system. It will be difficult to work with the class column as it is. Instead of a single column containing a number 1 through 3, we will add three columns, each containing a 1 if the passenger was in that class and a 0 otherwise.

In [134]:
train['FirstClass'] = train['Pclass'].apply(lambda x: 1 if x == 1 else 0)
train['SecondClass'] = train['Pclass'].apply(lambda x: 1 if x == 2 else 0)
train['ThirdClass'] = train['Pclass'].apply(lambda x: 1 if x == 3 else 0)
# drop() returns a new DataFrame; the result is only displayed here, so train itself keeps Pclass
train.drop(columns = ['Pclass'])

,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass,ThirdClass
0,1,0,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S,0,0,1
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C,1,0,0
2,3,1,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S,0,0,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,1,0,0
4,5,0,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,,S,0,1,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S,1,0,0
888,889,0,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,,S,0,0,1
889,890,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C,1,0,0


In [135]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass,ThirdClass
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,0,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,0,1
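
An alternative way to build indicator columns like these is pandas' `get_dummies`. The sketch below uses a made-up frame and a hypothetical `Class` prefix, so the resulting column names differ from the FirstClass, SecondClass and ThirdClass names used in this notebook. It also assigns the result of `drop` back, which is what actually removes the original column.

```python
import pandas as pd

# Toy frame with made-up class values.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# One indicator column per class value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["Pclass"], prefix="Class", dtype=int)

# Attach the indicators and drop the original column, keeping the result.
toy = toy.join(dummies).drop(columns=["Pclass"])
print(toy)
```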


Then we will make it easier to work with the sex of the passengers.  If a passenger is male, the column Sex will now contain a 0, and if the passenger is female it will contain a 1.

In [136]:
train['Sex'] = train['Sex'].apply(lambda x: 1 if x == 'female' else 0)

In [137]:
train.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass,ThirdClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,1,0,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0,1
5,6,0,3,"Moran, Mr. James",0,29.699118,0,0,330877,8.4583,,Q,0,0,1
6,7,0,1,"McCarthy, Mr. Timothy J",0,54.0,0,0,17463,51.8625,E46,S,1,0,0
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2.0,3,1,349909,21.075,,S,0,0,1
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27.0,0,2,347742,11.1333,,S,0,0,1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14.0,1,0,237736,30.0708,,C,0,1,0
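
The lambda above codes anything that is not 'female' as 0. A slightly stricter variant, sketched here on made-up data with a hypothetical `SexCode` column, uses an explicit mapping so that unexpected labels show up as missing values instead of being silently coded as male.

```python
import pandas as pd

# Toy frame with made-up values, including one unexpected label.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "unknown"]})

# Labels not present in the mapping become NaN rather than 0.
toy["SexCode"] = toy["Sex"].map({"male": 0, "female": 1})
print(toy)
```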


We want to apply the same steps to our test data as to the data we use to train our model, so let's take care of that now.

In [138]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [139]:
# Assign the filled column back rather than relying on inplace=True
test['Age'] = test['Age'].fillna(test['Age'].mean())

test['FirstClass'] = test['Pclass'].apply(lambda x: 1 if x == 1 else 0)
test['SecondClass'] = test['Pclass'].apply(lambda x: 1 if x == 2 else 0)
test['ThirdClass'] = test['Pclass'].apply(lambda x: 1 if x == 3 else 0)
# drop() returns a new DataFrame; without assignment, test keeps Pclass
test.drop(columns = ['Pclass'])

test['Sex'] = test['Sex'].apply(lambda x: 1 if x == 'female' else 0)

In [140]:
test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass,ThirdClass
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,Q,0,0,1
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,S,0,0,1
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,Q,0,1,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,S,0,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,S,0,0,1
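
One way to keep the two data sets in step is to put the cleaning steps in a single function and call it on both. The sketch below is an assumption about how that could look: `prepare` is a hypothetical helper, the data is made up, and it uses boolean comparisons instead of the lambdas above, which should give the same 0/1 values.

```python
import numpy as np
import pandas as pd


def prepare(df):
    """Return a cleaned copy of df: filled ages, class indicators, numeric sex."""
    out = df.copy()
    # Fill missing ages with the mean age of this frame.
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    # One 0/1 indicator column per passenger class.
    for name, value in [("FirstClass", 1), ("SecondClass", 2), ("ThirdClass", 3)]:
        out[name] = (out["Pclass"] == value).astype(int)
    # 1 for female, 0 otherwise.
    out["Sex"] = (out["Sex"] == "female").astype(int)
    return out


# Toy frame with made-up values.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "male"],
    "Age": [20.0, np.nan, 40.0],
})
clean = prepare(toy)
print(clean)
```

Because the function works on a copy, the original frame is left unchanged and the same call can be applied to a training frame and a test frame in turn.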


There is one entry that contains a null value for fare in our testing set.  Let's look into that more and try to fill it.

In [141]:
test[test['Fare'].isnull()]

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FirstClass,SecondClass,ThirdClass
152,1044,3,"Storey, Mr. Thomas",0,60.5,0,0,3701,,,S,0,0,1


We can see that this was a third class passenger, so let's take the average fare of third class passengers and use that to fill his fare.

In [142]:
avg_thirdclass_fare = np.mean(test.Fare[test['ThirdClass'] == 1])
avg_thirdclass_fare

12.459677880184334

In [143]:
# Assign the filled column back rather than relying on inplace=True
test['Fare'] = test['Fare'].fillna(avg_thirdclass_fare)
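
With only one missing fare, filling it by hand works well. If several classes had gaps, a more general approach would be to fill each missing fare with the mean fare of that passenger's class. The sketch below shows this on made-up data using `groupby` and `transform`; it is an illustration, not what was run for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up fares and one missing third class fare.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, np.nan, 12.0],
})

# Mean fare of each row's own class, aligned with the original index.
class_mean_fare = toy.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced by their class mean.
toy["Fare"] = toy["Fare"].fillna(class_mean_fare)
print(toy)
```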

Now that our data is clean and easier to use, let's try to make, train and test our model.

## Making, Training and Using our Model

Not all of this data will help us figure out if someone survived on the Titanic.  We will select the features that we want our model to consider when making its predictions.  Those will be age, sex, fare and passenger class.

In [144]:
train_features = train[['Sex', 'Age', 'Fare', 'FirstClass', 'SecondClass', 'ThirdClass']]
train_survival = train['Survived']

test_features = test[['Sex', 'Age', 'Fare', 'FirstClass', 'SecondClass', 'ThirdClass']]

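We know we are going to use a Logistic Regression model, which in sklearn applies regularization by default. Regularization penalizes large coefficients, and the size of a coefficient depends on the scale of its feature, so we need to put our features on a common scale. We will do that using sklearn's StandardScaler.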

In [145]:
scaler = StandardScaler()

train_features = scaler.fit_transform(train_features)
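# Note: this refits the scaler on the test set. The more common pattern is
# scaler.transform(test_features), which reuses the training mean and standard deviation.
# The results reported below come from the cell as written.
test_features = scaler.fit_transform(test_features)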
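
The cell above calls `fit_transform` on both sets, so each set is scaled with its own mean and standard deviation. The usual pattern is to learn the scaling from the training data only and reuse it for the test data. A minimal sketch on made-up numbers (`X_train` and `X_test` here are hypothetical arrays, not the Titanic features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
X_train = np.array([[0.0], [10.0], [20.0]])
X_test = np.array([[10.0], [30.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

With this pattern, a test value equal to the training mean maps to 0, whatever the rest of the test set looks like.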

Now that our data is ready, let's plug it into our model.

In [146]:
model = LogisticRegression()
model.fit(train_features, train_survival)
predictions = pd.DataFrame(model.predict(test_features), index = test['PassengerId'], columns = ['Survived']) 

In [147]:
# Accuracy on the same data the model was trained on, not on held-out data
model.score(train_features, train_survival)

0.8002244668911336
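
The score above is measured on the same rows the model was trained on, so it can be optimistic. The imports at the top of the notebook include `train_test_split`, which could be used to hold back part of the training data for a local check before submitting. The sketch below shows the pattern on synthetic data from `make_classification`; the data and the 80/20 split are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 500 rows, 6 features, binary label.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold back 20% of the rows for validation, keeping the class balance.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler and the model on the fitting portion only.
scaler = StandardScaler().fit(X_fit)
clf = LogisticRegression().fit(scaler.transform(X_fit), y_fit)

train_acc = clf.score(scaler.transform(X_fit), y_fit)
val_acc = clf.score(scaler.transform(X_val), y_val)
print(f"training accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```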

In [148]:
predictions

PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0


In [151]:
predictions.to_csv('LogisticPredictions.csv')

After submitting our predictions to Kaggle, we see that we have an accuracy of 0.76555 on the test set, somewhat below the 0.80 we measured on the training data.

Let's see what had the biggest impact on survival by looking at the coefficients.

In [153]:
print(model.coef_)
print(['Sex', 'Age', 'Fare', 'FirstClass', 'SecondClass', 'ThirdClass'])

[[ 1.23268698 -0.42195467  0.04026862  0.52593205  0.06768462 -0.50823106]]
['Sex', 'Age', 'Fare', 'FirstClass', 'SecondClass', 'ThirdClass']
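
To make the printed output easier to read, the coefficients can be paired with their feature names and sorted by magnitude. The sketch below copies the values from the output above into a pandas Series (the names `coef_table`, `ranked` and `odds_ratios` are hypothetical). Because the features were standardized, each coefficient describes the effect of a one standard deviation change in that feature.

```python
import numpy as np
import pandas as pd

feature_names = ['Sex', 'Age', 'Fare', 'FirstClass', 'SecondClass', 'ThirdClass']

# Coefficients copied from the model.coef_ output above.
coefs = np.array([1.23268698, -0.42195467, 0.04026862,
                  0.52593205, 0.06768462, -0.50823106])

coef_table = pd.Series(coefs, index=feature_names)

# Order the features by absolute coefficient size, largest first.
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one standard deviation increase in the scaled feature.
odds_ratios = np.exp(ranked)

print(pd.DataFrame({"coefficient": ranked, "odds_ratio": odds_ratios}))
```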


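We can see that the sex of the passenger had the strongest association with survival: its coefficient has the largest magnitude, and because female is coded as 1, the positive sign means that women were more likely to be predicted as survivors.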

## Conclusion

We were able to build a Logistic Regression model that predicts whether a passenger would survive the sinking of the Titanic with an accuracy of 0.76555 on Kaggle's test set, and we saw that the feature with the biggest effect on the predictions was passenger sex.