# Survival Prediction

## About the Project

Exploratory data analysis was performed on the Titanic passenger data to identify factors that affected the chances of surviving the disaster. Features were then engineered from the data, and machine learning models were trained to predict passenger survival.

## Setup

In [1]:
# Data analysis and handling
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV




## Explore Data

In [2]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
train.shape

(891, 12)

The training set contains 891 passengers and 12 columns: the target (Survived) and 11 other columns, including PassengerId.

In [4]:
train.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Berglund, Mr. Karl Ivar Sven",male,,,,1601.0,,G6,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


Brief analysis of the data:
* PassengerId and Ticket are unlikely to affect the survival rate
* Name itself is unlikely to affect the survival rate, but the title it contains could
* About 38% of passengers in the training set survived
* More than half of the passengers were in passenger class 3 (the median Pclass is 3)
* Age is missing for 177 passengers in the training data
* At least half of the passengers travelled without siblings or spouses (median SibSp is 0), and at least 75% travelled without parents or children (75th percentile of Parch is 0)
* Cabin is missing for 687 passengers in the training data (about 77%), so it is unlikely to be useful for predicting survival
* Embarked is missing for 2 passengers
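
The missing-value counts in the list above can also be computed directly rather than read off the `count` row. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are illustrative, not real rows)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Number of missing entries per column and the share of rows they represent
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(toy)).round(1),
})
print(summary)
```

Applied to `train`, the same two lines should give the 177, 687 and 2 missing entries noted above for Age, Cabin and Embarked.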

In [5]:
train['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [6]:
train[['Pclass','Survived']].groupby(['Pclass'], as_index=False).mean()

| | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.62963 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |


In [7]:
pd.crosstab(train['Pclass'], train['Survived'])

| Pclass | Survived = 0 | Survived = 1 |
|---|---|---|
| 1 | 80 | 136 |
| 2 | 97 | 87 |
| 3 | 372 | 119 |


We would expect passengers in a better class (i.e. Pclass 1) to have had priority and a higher chance of surviving. The data above agrees: the survival rate falls steadily from about 63% in Pclass 1 to 47% in Pclass 2 and 24% in Pclass 3.
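
As a cross-check, the per-class survival rates can be recovered from the crosstab counts by dividing each row by its total. This sketch rebuilds the count table by hand (the `counts` frame is typed in from the output above, not read from the data):

```python
import pandas as pd

# Counts copied from the crosstab above (rows: Pclass, columns: Survived)
counts = pd.DataFrame(
    {0: [80, 97, 372], 1: [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)
counts.columns.name = "Survived"

# Divide each row by its row total to turn counts into per-class rates
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(3))
```

The `Survived = 1` column matches the group means computed earlier. On the real data, `pd.crosstab(train['Pclass'], train['Survived'], normalize='index')` should produce the same table in one step.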

In [8]:
train[['Sex','Survived']].groupby(['Sex'], as_index=False).mean()

| | Sex | Survived |
|---|---|---|
| 0 | female | 0.742038 |
| 1 | male | 0.188908 |


We would expect women to have been given priority in boarding the lifeboats. The data above agrees: about 74% of females survived, compared with about 19% of males.

In [9]:
bins = [0, 20, 40, 60, 80]
train[['Age','Survived']].groupby(['Survived', pd.cut(train.Age, bins)]).size().unstack()

| Survived / Age bin | (0, 20] | (20, 40] | (40, 60] | (60, 80] |
|---|---|---|---|---|
| 0 | 97 | 232 | 78 | 17 |
| 1 | 82 | 153 | 50 | 5 |


In [10]:
bins_fare = [0, 8, 15, 30, 513]
train[['Fare','Survived']].groupby(['Survived', pd.cut(train.Fare, bins_fare)]).size().unstack()

| Survived / Fare bin | (0, 8] | (8, 15] | (15, 30] | (30, 513] |
|---|---|---|---|---|
| 0 | 175 | 155 | 107 | 98 |
| 1 | 51 | 62 | 92 | 136 |


In [11]:
train[['Fare','Pclass']].groupby(['Pclass', pd.cut(train.Fare, bins_fare)]).size().unstack()

| Pclass / Fare bin | (0, 8] | (8, 15] | (15, 30] | (30, 513] |
|---|---|---|---|---|
| 1 | 1.0 | NaN | 41.0 | 169.0 |
| 2 | NaN | 89.0 | 65.0 | 24.0 |
| 3 | 225.0 | 128.0 | 93.0 | 41.0 |


As expected, fare is strongly related to passenger class: most first-class fares fall in the highest bin, while third-class fares are concentrated in the lowest bins. Likewise, the survival rate is higher for passengers who paid more.
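
To make the fare effect explicit, the counts in the Survived by Fare table can be converted into a survival rate per bin. This is a small sketch using the counts typed in from that table (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the Survived-by-Fare table above
fare_bins = ["(0, 8]", "(8, 15]", "(15, 30]", "(30, 513]"]
died = pd.Series([175, 155, 107, 98], index=fare_bins)
survived = pd.Series([51, 62, 92, 136], index=fare_bins)

# Share of passengers in each fare bin who survived
survival_rate = survived / (survived + died)
print(survival_rate.round(3))
```

The rate rises from roughly 23% in the cheapest bin to roughly 58% in the most expensive one. Note that `pd.cut` intervals are open on the left, so any fare of exactly 0 falls outside the first bin and is not counted in these tables.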

In [12]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [13]:
# Alternative view (survival rate per port):
# train[['Embarked','Survived']].groupby(['Embarked'], as_index=False).mean()
pd.crosstab(train['Embarked'], train['Survived'])

| Embarked | Survived = 0 | Survived = 1 |
|---|---|---|
| C | 75 | 93 |
| Q | 47 | 30 |
| S | 427 | 217 |


In [14]:
test = pd.read_csv("/kaggle/input/titanic/test.csv")
train.drop(['PassengerId', 'Cabin', 'Ticket'], axis=1, inplace=True)
test.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
data = [train, test]

Drop the PassengerId, Cabin and Ticket columns from the training set, and the Cabin and Ticket columns from the test set. PassengerId is kept in the test set because the submission file needs it.

In [15]:
for dataset in data:
    dataset['Title'] = dataset.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
pd.crosstab(test['Title'], test['Sex'])

| Title | female | male |
|---|---|---|
| Col | 0 | 2 |
| Dona | 1 | 0 |
| Dr | 0 | 1 |
| Master | 0 | 21 |
| Miss | 78 | 0 |
| Mr | 0 | 240 |
| Mrs | 72 | 0 |
| Ms | 1 | 0 |
| Rev | 0 | 2 |


Extract Title from Name by capturing the word that is immediately followed by a period.
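
To see what the regular expression captures, here is a short sketch that applies the same pattern to three names from the first rows of the training data (the `names` Series is built by hand for illustration):

```python
import pandas as pd

# Names taken from the first rows of the training data shown earlier
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the first run of letters that is immediately followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which is why the result can be assigned straight to a new column.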

In [16]:
for dataset in data:
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Countess', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady', 'Major', 'Rev', 'Sir'], 'Uncommon')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
pd.crosstab(train['Title'], train['Sex'])

| Title | female | male |
|---|---|---|
| Master | 0 | 40 |
| Miss | 185 | 0 |
| Mr | 0 | 517 |
| Mrs | 126 | 0 |
| Uncommon | 3 | 20 |


In [17]:
train = train.drop('Name', axis=1)
test = test.drop('Name', axis=1)
train.head()

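| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr |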


Remove Name column after extracting Title. 

In [18]:
train[train['Age'].isnull()]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
5,0,3,male,,0,0,8.4583,Q,Mr
17,1,2,male,,0,0,13.0000,S,Mr
19,1,3,female,,0,0,7.2250,C,Mrs
26,0,3,male,,0,0,7.2250,C,Mr
28,1,3,female,,0,0,7.8792,Q,Miss
...,...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C,Mr
863,0,3,female,,8,2,69.5500,S,Miss
868,0,3,male,,0,0,9.5000,S,Mr
878,0,3,male,,0,0,7.8958,S,Mr


In [19]:
mean_age = train.groupby(['Title'])['Age'].mean()
mean_age

Title
Master       4.574167
Miss        21.845638
Mr          32.368090
Mrs         35.788991
Uncommon    45.545455
Name: Age, dtype: float64

In [20]:
# Select the Age column before transform so that only numeric data is averaged
train['Age'] = train['Age'].fillna(train.groupby('Title')['Age'].transform('mean'))
test['Age'] = test['Age'].fillna(test.groupby('Title')['Age'].transform('mean'))

Fill missing ages with the mean age for each Title, computed separately for the training and test sets.
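
The group-wise fill relies on `transform('mean')` returning one value per original row rather than one per group. A minimal sketch on made-up data (the `toy` frame and its ages are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform('mean') returns a Series aligned with the original rows,
# holding each row's group mean, so it can be passed straight to fillna
group_mean = toy.groupby("Title")["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

The missing "Mr" age becomes 35.0 and the missing "Miss" age becomes 22.0, while the known ages are left unchanged.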

In [21]:
train.Age.isnull().sum()

0

In [22]:
test.Age.isnull().sum()

0

In [23]:
# Fill only the Embarked column with its most frequent value
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0])
train.Embarked.isnull().sum()

0

In [24]:
# Fill the missing test fare with the mean fare of the passenger's class
test['Fare'] = test['Fare'].fillna(test.groupby('Pclass')['Fare'].transform('mean'))

In [25]:
train.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
Title        object
dtype: object

In [26]:
test.dtypes

PassengerId      int64
Pclass           int64
Sex             object
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Embarked        object
Title           object
dtype: object

In [27]:
train['Pclass'] = train['Pclass'].astype(str)
train.dtypes

Survived      int64
Pclass       object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
Title        object
dtype: object

In [28]:
test['Pclass'] = test['Pclass'].astype(str)
test.dtypes

PassengerId      int64
Pclass          object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Embarked        object
Title           object
dtype: object

In [29]:
train = pd.get_dummies(train)
train.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Uncommon
0,0,22.0,1,0,7.25,0,0,1,0,1,0,0,1,0,0,1,0,0
1,1,38.0,1,0,71.2833,1,0,0,1,0,1,0,0,0,0,0,1,0
2,1,26.0,0,0,7.925,0,0,1,1,0,0,0,1,0,1,0,0,0
3,1,35.0,1,0,53.1,1,0,0,1,0,0,0,1,0,0,0,1,0
4,0,35.0,0,0,8.05,0,0,1,0,1,0,0,1,0,0,1,0,0


In [30]:
test = pd.get_dummies(test)
test.head()

Unnamed: 0,PassengerId,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Uncommon
0,892,34.5,0,0,7.8292,0,0,1,0,1,0,1,0,0,0,1,0,0
1,893,47.0,1,0,7.0,0,0,1,1,0,0,0,1,0,0,0,1,0
2,894,62.0,0,0,9.6875,0,1,0,0,1,0,1,0,0,0,1,0,0
3,895,27.0,0,0,8.6625,0,0,1,0,1,0,0,1,0,0,1,0,0
4,896,22.0,1,1,12.2875,0,0,1,1,0,0,0,1,0,0,0,1,0
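
Encoding the two sets separately works here because both happen to contain the same category levels after the titles were grouped. If a level were missing from one set, the column lists would differ and the fitted scaler and models would reject the test data. One possible safeguard, sketched on toy frames (`toy_train` and `toy_test` are hypothetical, with the 'Q' port deliberately absent from the test frame), is to reindex the test columns to the training columns:

```python
import pandas as pd

# Toy frames: the hypothetical test frame lacks the 'Q' level
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.46]})
toy_test = pd.DataFrame({"Embarked": ["S", "C"], "Fare": [8.05, 53.10]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Reindex test to the training columns; levels absent from test become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

In this notebook the same idea would apply to the feature columns only, since `train` also holds Survived and `test` also holds PassengerId.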


## Machine Learning Modelling

In [31]:
X = train.drop(['Survived'], axis=1)
y = train['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
mms = MinMaxScaler()
X_train_scaled = mms.fit_transform(X_train)
X_val_scaled = mms.transform(X_val)
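
# The scaler is fitted on the training split only and then reused for validation.
# A minimal sketch of what that means, on toy single-feature arrays
# (X_train_toy and X_val_toy are illustrative assumptions, not the real data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative values)
X_train_toy = np.array([[0.0], [10.0], [20.0]])
X_val_toy = np.array([[5.0], [30.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns min=0 and max=20 from train only
val_scaled = scaler.transform(X_val_toy)          # reuses the training min and max

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Because the validation value 30 lies above the training maximum, it is mapped to 1.5 rather than being clipped to 1, which is expected when the scaler never sees the validation data.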

In [32]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train_scaled,y_train)
knn.score(X_val_scaled, y_val)

0.7910447761194029

In [33]:
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
logreg.score(X_val_scaled, y_val)



0.8171641791044776

In [34]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
rf.score(X_val, y_val)

0.7947761194029851

In [35]:
svm = SVC(C=100)
svm.fit(X_train_scaled, y_train)
svm.score(X_val_scaled, y_val)



0.8283582089552238

In [36]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
grid.best_params_

{'C': 1, 'gamma': 1}

In [37]:
grid.best_score_

0.826645264847512
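
Note that `best_score_` is the mean cross-validated accuracy over folds of the training split, so it is not directly comparable with the single validation scores reported for the earlier models. A hedged sketch of how both numbers could be obtained side by side, using synthetic data in place of the scaled Titanic features (the data, the smaller grid and the variable names are all assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the scaled Titanic features (an assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
search.fit(X_tr, y_tr)

cv_score = search.best_score_         # mean accuracy across the 5 folds of X_tr
val_score = search.score(X_va, y_va)  # accuracy of the refitted best model on held-out rows
print(search.best_params_, round(cv_score, 3), round(val_score, 3))
```

In the notebook, the equivalent call would be `grid.score(X_val_scaled, y_val)`, which evaluates the refitted best estimator on the held-out 30%.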

In [38]:
param_grid_rf = {'max_features': [1, 2, 3]}
grid_rf = GridSearchCV(RandomForestClassifier(n_estimators=100), param_grid_rf, cv=5)
grid_rf.fit(X_train, y_train)
grid_rf.best_params_

{'max_features': 2}

In [39]:
grid_rf.best_score_

0.7961476725521669

In [40]:
svc = SVC(C=1, gamma=1)
svc.fit(X_train_scaled,y_train)
X_test = test.drop(['PassengerId'], axis=1)
X_test_scaled = mms.transform(X_test)
y_test = svc.predict(X_test_scaled)

In [41]:
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': y_test })
submission.to_csv('submission.csv', index=False)

With the above model (SVC with C=1 and gamma=1), 80.4% accuracy was achieved.