# **Titanic Survival Prediction Using Machine Learning**

The [RMS Titanic](https://en.wikipedia.org/wiki/RMS_Titanic) was known as the unsinkable ship and was the largest, most luxurious passenger ship of its time. Sadly, the British ocean liner sank on April 15, 1912, killing over 1500 people while just 705 survived. In this article, we will analyze the Titanic data set and make predictions to see whether passengers on board the ship would survive or not.

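The project follows these steps:

1. Import the necessary packages
2. Collect and load the data
3. Feature engineering
4. Exploratory data analysis
5. Label encoding of categorical columns
6. Split the data into features and targets
7. Feature scaling
8. Create, train, and evaluate the models

Let's implement these steps one by one.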

###**Import the necessary packages**

In [None]:
#Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

###**Data Collection**

The train and test datasets are downloaded from Kaggle. The data files are available [here](https://www.kaggle.com/c/titanic/).

###**Loading the Dataset**

We can load the dataset, which is in `.csv` format, using `pd.read_csv`.

In [None]:
# Load the dataset
train_data = pd.read_csv('./dataset/train.csv')

In [None]:
# Print the first 10 rows of data
train_data.head(10)

In [None]:
# Check the size of dataset
train_data.shape

#####**Description of the Dataset**

The dataset has 891 records with 12 columns. The details of each column are provided in the table below:

| Variable | Definition | Key |
| -- | -- | :--: |
| PassengerId | Unique ID of passenger | |
| Survived | Whether the passenger survived | 0 = No, 1 = Yes |
| Pclass | Ticket class of passenger | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Name | Name of the passenger | |
| Sex | Gender of the passenger | |
| Age | Age of passenger in years | |
| SibSp | # of siblings / spouses aboard the Titanic | |
| Parch | # of parents / children aboard the Titanic | |
| Ticket | Ticket number | |
| Fare | Passenger fare | |
| Cabin | Cabin number | |
| Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

###**Feature Engineering**

#####**Dealing with null values**

In [None]:
# Count the missing values in each column
train_data.isna().sum()

There are missing values in the columns `Age`, `Cabin`, and `Embarked`:

- `Age`: 177
- `Cabin`: 687
- `Embarked`: 2

So, we have to deal with these missing values in a reasonable way. There are two options:
1. Delete the records that have missing values (this leads to loss of data)
2. Replace the missing information with relevant values
  - Replace numerical data with the mean of the column
  - Replace categorical data with the most frequent value

In [None]:
# Delete the Cabin column as it has 687 missing values
train_data.drop(['Cabin'], axis = 1, inplace = True)
# Fill missing values: column mean for float columns, most frequent value otherwise
train_data = train_data.apply(
    lambda x : x.fillna(x.mean()) if x.dtype == "float" else x.fillna(x.value_counts().index[0])
)

In [None]:
# Let's check the dataset once more
train_data.isnull().sum()

We can observe that all the missing information is filled with appropriate data from the dataframe.

The methods `.isna()` and `.isnull()` yield the same information; `.isnull()` is an alias of `.isna()`.
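
As a quick check of that equivalence, the sketch below compares the two methods on a small toy frame. The frame and its values are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0], "Embarked": ["S", "C", None]})

# Both methods return the same boolean mask
same_mask = toy.isna().equals(toy.isnull())

# Summing the mask counts the missing entries per column
missing_per_column = toy.isna().sum().to_dict()

print(same_mask)
print(missing_per_column)
```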

#####**Add/Delete the columns**

Not all of the columns in the dataset are important for the analysis. We will delete those that are not, such as `PassengerId`, `Ticket`, and `Name`.

After that, we will add two features: whether a person is a man, woman, or child, and whether a passenger is an adult male.

In [None]:
# Drop the 3 unnecessary columns
cols_drop = ['PassengerId', 'Name', 'Ticket']
train_data.drop(cols_drop, axis = 1, inplace = True)


In [None]:
# Add the 2 columns
# Conditions, values and adding column for 'Who'
conditions = [
              (train_data['Age'] < 18),
              ((train_data['Age'] >= 18) & (train_data['Sex'] == 'female')),
              ((train_data['Age'] >= 18) & (train_data['Sex'] == 'male')) 
             ]
values = ['child', 'woman', 'man']
# A string default keeps the result dtype consistent; newer NumPy versions
# may reject mixing string choices with the implicit integer default of 0
train_data['Who'] = np.select(conditions, values, default = 'unknown')

# 'Adult_male' is True only for males aged 18 or older
train_data['Adult_male'] = (train_data['Age'] >= 18) & (train_data['Sex'] == 'male')
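
To see how `np.select` assigns labels, here is a minimal sketch on a toy frame. The rows are invented, and the `default = "unknown"` label is an assumption added for illustration: it is what a row receives when no condition matches, such as a passenger whose age is missing.

```python
import numpy as np
import pandas as pd

# Toy passengers (hypothetical values); the last one has no recorded age
toy = pd.DataFrame({
    "Age": [10.0, 30.0, 40.0, np.nan],
    "Sex": ["male", "female", "male", "female"],
})

conditions = [
    toy["Age"] < 18,
    (toy["Age"] >= 18) & (toy["Sex"] == "female"),
    (toy["Age"] >= 18) & (toy["Sex"] == "male"),
]
values = ["child", "woman", "man"]

# Rows matching no condition (here, the missing age) receive the default
toy["Who"] = np.select(conditions, values, default="unknown")
who_labels = toy["Who"].tolist()
print(who_labels)
```

In the tutorial's data the ages have already been filled in, so every row matches one of the three conditions.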

In [None]:
train_data.head()

###**Exploratory Data Analysis**

Let's study the data and bring out some interesting facts about it.

In [None]:
# Check the statistical summary of the numerical data
train_data.describe()

In [None]:
# Check the statistical summary of the categorical data
train_data.describe(include = 'O')

#####**Observations:**
- **Fare:** The maximum fare a passenger paid in this data set was 512.3292 British pounds, the minimum was 0, and the average was 32.204208 British pounds.

- **Age:** The mean age of passengers is 29.699. The oldest passenger on the ship was 80 years old, while the youngest was only 0.42 years old (about 5 months).

- **Missing Data:** The `Age` column originally held only 714 values (891 passengers minus 177 missing). Because the missing ages were filled with the column mean earlier, `describe()` now reports a count of 891 for `Age`.

#####**Check for the number of survivors on board the Titanic**

In [None]:
# Getting a count of number of survivors on board the Titanic
train_data['Survived'].value_counts()

# Visualise the count of number of survivors
sns.countplot(x = 'Survived', data = train_data)

#####**Observations:**
According to the data, of the 891 passengers, only 342 survived and 549 died.

<img src = "images/1. survived.png" width = "600" height = "300">



#####**Visualize the count of survivors for some columns**

Now let's visualize the count of survivors for the columns `Who`, `Sex`, `Pclass`, `SibSp`, `Parch`, and `Embarked`.

In [None]:
# Visualize the count of survivors for the columns who, sex, pclass, sibsp, parch, and embarked.
cols = ['Who', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

n_rows = 2
n_cols = 3

fig, axs = plt.subplots(n_rows, n_cols, figsize = (n_cols * 3.2 , n_rows * 3.2))

for row in range(n_rows):
  for col in range(n_cols):
    i = row * n_cols + col  # Index into the list of column names
    ax = axs[row][col]    # Axes on which to draw this subplot
    sns.countplot(x = cols[i], hue = 'Survived', data = train_data, ax = ax)
    ax.set_title(cols[i])
    ax.legend(title = "Survived", loc = "upper right")
plt.tight_layout()

#####**Observations:**
The charts below suggest the following about a passenger's chances of survival.

<img src = "images/2. cols vs survived.png" width = "800" height = "500">

| Chart Title | Observation |
| :---------: | :---------- |
| Who | A man is not likely to survive. |
| Sex | Females are most likely to survive. |
| Pclass | Third-class passengers are the least likely to survive. |
| SibSp | If you have 0 siblings or spouses on board, you are not likely to survive. |
| Parch | If you have 0 parents or children on board, you are not likely to survive. |
| Embarked | If you embarked from Southampton (S), you are not likely to survive. |

#####**Check for Survival rate by Sex**

Now let's check the survival rate of a passenger onboard by sex.

In [None]:
# Check for Survival rate by sex
train_data.groupby('Sex')[['Survived']].mean()

# Plot the results
sns.barplot(x = 'Sex', y = 'Survived', data = train_data)
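
The `.mean()` call works as a survival rate because `Survived` only holds 0 and 1, so the mean of a group is the number of survivors divided by the group size. A minimal sketch on invented toy data:

```python
import pandas as pd

# Toy data (hypothetical): 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = (number of ones) / (number of rows) per group
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)
```

Here two of three females and one of four males survived, so the rates are 2/3 and 1/4.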

#####**Observation:**
From the results we can observe that about 74.2% of females survived and about 18.89% of males survived. 

<img src = "images/3. Survival rate by Sex.png" width = "600" height = "400">

#####**Check for Survival rate by Sex and Class as well**

Now let's check the survival rate of passengers by sex and class.

In [None]:
# Look at the survival rate by sex and class
train_data.pivot_table('Survived', index = 'Sex', columns = 'Pclass')

# Plot the results as well
sns.barplot(x = 'Pclass', y = 'Survived', data = train_data)

#####**Observation:**
From the results we can observe that:
- Females in first class have the highest survival rate, about 96.80% (the majority of them survived).
- Males in third class have the lowest survival rate, about 13.54% (the majority of them did not survive).

<img src = "images/4. Survival rate by Class.png" width = "600" height = "400">

#####**Check for Survival rate by Sex, Age, and Class**
Now let's check the survival rate of passengers by sex, age, and class.

In [None]:
# Survival rate by Sex, Age, and Class
age = pd.cut(train_data['Age'], [0, 18, 80])    # Split age into 2 groups: (0, 18] and (18, 80]
train_data.pivot_table('Survived', ['Sex', age], 'Pclass')

#####**Observation:**

We can see from the results that women in first class who were older than 18 had the highest survival rate at 97.59%, while men older than 18 in second class had the lowest survival rate at 8.62%.
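
When reading these groups, note that `pd.cut` uses right-inclusive bins by default, so with the edges `[0, 18, 80]` a passenger aged exactly 18 falls in the younger group. The sketch below uses a few made-up ages placed on and around the bin edges to show this.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 17, 18, 19, 80])

# Bins are right-inclusive by default: (0, 18] and (18, 80]
groups = pd.cut(ages, [0, 18, 80])

# Integer code of the bin each age fell into (0 = younger, 1 = older)
group_codes = groups.cat.codes.tolist()

print(groups.astype(str).tolist())
print(group_codes)
```

Passing `right = False` to `pd.cut` would instead place an 18-year-old in the older group.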

Now let's delete the columns `Who` and `Adult_male`, since they were created only to help us understand the data.

In [None]:
train_data.drop(['Who', 'Adult_male'], axis = 1, inplace = True)

###**Label Encoding of Categorical Columns**

If we observe the data, some of the columns hold categorical data, which has to be converted into numerical data. For this we will use `LabelEncoder` from `sklearn.preprocessing`.

Let's check the data type of each column and decide which columns are to be encoded.

In [None]:
train_data.dtypes

The results show that the columns `Sex` and `Embarked` have the `object` data type, so we have to encode them.

In [None]:
# Category columns
cat_cols = ['Sex', 'Embarked']
for col in cat_cols:
  print(train_data[col].unique())

These are the unique values of each column

- Sex: ['male' 'female']

- Embarked :['S' 'C' 'Q']



Now change the non-numeric data to numeric data.


In [None]:
# Creating a LabelEncoder and transforming the values to numeric data
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
for col in cat_cols:
  before = train_data[col].unique()
  train_data[col] = labelencoder.fit_transform(train_data[col])
  after = train_data[col].unique()
  print(f"{before} converted into {after}")
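
To make the resulting codes easier to interpret, the sketch below shows how `LabelEncoder` assigns them: labels are sorted and each one is numbered by its position. The short `ports` list is a stand-in for the `Embarked` column.

```python
from sklearn.preprocessing import LabelEncoder

# Port codes in the order they first appear in the data
ports = ["S", "C", "Q", "S"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(ports).tolist()

# classes_ holds the sorted labels; each label's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(encoded)
```

The integer codes imply an ordering that the ports do not really have. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when that matters for the chosen model.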

In [None]:
# Check the encoded data
train_data.head()

###**Split the data into Features (X) and Targets (y)**

We have the training dataset. Now we have to split it into features (X, the independent variables) and the target (y, the dependent variable).

In [None]:
# Split the data into Features and Labels
X = train_data.drop(['Survived'], axis = 1)
y = train_data['Survived']

Split the data into 80% training (X_train and y_train) and 20% testing (X_test and y_test) data sets.

In [None]:
# Split the dataset into 80% Training set and 20% Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
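
As a sanity check on the split sizes, the sketch below applies the same call to placeholder arrays with 891 rows. The arrays `X_demo` and `y_demo` are invented stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the training file
X_demo = np.arange(891).reshape(-1, 1)
y_demo = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

n_train, n_test = len(X_tr), len(X_te)
print(n_train, n_test)
```

With 891 rows and `test_size = 0.2`, the test set gets 179 rows and the training set the remaining 712.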

###**Feature Scaling**

Now we have to scale the data so that all features are on a comparable scale. `StandardScaler` standardizes each feature to zero mean and unit variance, using statistics computed from the training set only.

In [None]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
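
The effect of `fit_transform` on the training set versus `transform` on the test set can be seen in a small sketch. The `[Age, Fare]` rows below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [Age, Fare] rows
train = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 60.0]])
test = np.array([[30.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The test row equals the training means (30 and 30), so it maps to zeros. Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.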

###**Creating the Model**

Let's create a function that has within it many different machine learning models that we can use to make our predictions.

In [None]:
#Create a function that trains several Machine Learning models
def models(X_train,Y_train):
  
  #Using Logistic Regression Algorithm to the Training Set
  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state = 0)
  log.fit(X_train, Y_train)
  
  #Using KNeighborsClassifier Method of neighbors class to use Nearest Neighbor algorithm
  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
  knn.fit(X_train, Y_train)

  #Using SVC method of svm class to use Support Vector Machine Algorithm
  from sklearn.svm import SVC
  svc_lin = SVC(kernel = 'linear', random_state = 0)
  svc_lin.fit(X_train, Y_train)

  #Using SVC method of svm class to use Kernel SVM Algorithm (SVC is already imported above)
  svc_rbf = SVC(kernel = 'rbf', random_state = 0)
  svc_rbf.fit(X_train, Y_train)

  #Using GaussianNB method of naïve_bayes class to use Naïve Bayes Algorithm
  from sklearn.naive_bayes import GaussianNB
  gauss = GaussianNB()
  gauss.fit(X_train, Y_train)

  #Using DecisionTreeClassifier of tree class to use Decision Tree Algorithm
  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
  tree.fit(X_train, Y_train)

  #Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm
  from sklearn.ensemble import RandomForestClassifier
  forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
  forest.fit(X_train, Y_train)
  
  #print model accuracy on the training data.
  print('[0] Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
  print('[1] K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))
  print('[2] Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))
  print('[3] Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))
  print('[4] Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))
  print('[5] Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
  print('[6] Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))
  
  return log, knn, svc_lin, svc_rbf, gauss, tree, forest

###**Train the Model**

In [None]:
# Train all the model on Training Data
model = models(X_train, y_train)

The output of the Training phase is as follows:
    
    [0] Logistic Regression Training Accuracy: 0.7963483146067416
    [1] K Nearest Neighbor Training Accuracy: 0.8707865168539326
    [2] Support Vector Machine (Linear Classifier) Training Accuracy: 0.7865168539325843
    [3] Support Vector Machine (RBF Classifier) Training Accuracy: 0.8426966292134831
    [4] Gaussian Naive Bayes Training Accuracy: 0.7893258426966292
    [5] Decision Tree Classifier Training Accuracy: 0.9817415730337079
    [6] Random Forest Classifier Training Accuracy: 0.9676966292134831
        
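From the above results, we can observe that the Decision Tree Classifier attained the highest training accuracy of all the models, at 98.17%. Because this is measured on the same data the models were fitted on, it may reflect overfitting rather than how well a model generalizes.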
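
Training accuracy alone can flatter flexible models such as decision trees. One common way to get a less optimistic estimate is k-fold cross-validation. This is an additional technique, not something the tutorial itself uses, and the sketch below runs on synthetic data from `make_classification` rather than the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the Titanic set
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_demo, y_demo)

# Accuracy on the same rows the tree was fitted on
train_accuracy = tree.score(X_demo, y_demo)

# 5-fold cross-validation scores the model on rows it was not fitted on
cv_accuracy = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(train_accuracy, cv_accuracy)
```

A large gap between the two numbers is a hint that the model has memorized the training rows.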


###**Evaluate the Model**

Show the confusion matrix and accuracy for all the models on the test data.

In [None]:
# Print the confusion matrix and test accuracy
from sklearn.metrics import confusion_matrix 
for i in range(len(model)):
   cm = confusion_matrix(y_test, model[i].predict(X_test)) 
   # Extracting TN, FP, FN, TP
   TN, FP, FN, TP = cm.ravel()
   print(cm)
   print(f'Model[{i}] Testing Accuracy = "{(TP + TN) / (TP + TN + FN + FP)}"')
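
The order in which `ravel()` unpacks a binary confusion matrix is easy to mix up, so here is a small worked example with hand-made labels. It also checks the manual accuracy formula against `accuracy_score`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for labels 0 and 1
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

manual_accuracy = (tp + tn) / (tp + tn + fn + fp)
library_accuracy = accuracy_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(manual_accuracy, library_accuracy)
```

Four of the six predictions are correct, so both accuracy values equal 4/6.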

The most accurate model on the test data is the one at index 6, the Random Forest Classifier, with an accuracy of 82.68%. So, we will choose the Random Forest Classifier as our final model.

Now, let's find out which features of the dataset are most important for predicting whether a passenger survives.

In [None]:
#Get the importance of the features
final_model = model[6]
importances = pd.DataFrame({'feature':X.columns, 'importance':np.round(final_model.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
print(importances)

# Plot the importances
importances.plot.bar()

We can observe the importance of each feature:

| Feature | Importance |
| ------- | ---------- |
| Age | 0.300 |
| Fare | 0.265 |
| Sex | 0.237 |
| Pclass | 0.078 |
| SibSp | 0.054 |
| Parch | 0.036 |
| Embarked | 0.031 |
    
<img src = "images/5. Imp features.png" width = "600" height = "400">

In [None]:
# Save the final model
import pickle
with open("Titanic_Survival_Predicion_RFC.pkl", "wb") as f:
    pickle.dump(final_model, f)
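
A saved model can later be restored with `pickle.load`. The sketch below shows the round trip on a small stand-in model trained on made-up data. The file name `demo_model.pkl` and the temporary directory are hypothetical, used only to keep the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model on made-up data; not the tutorial's final model
X_demo = np.array([[0, 1], [1, 0], [0, 0], [1, 1]] * 5)
y_demo = np.array([0, 1, 0, 1] * 5)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same_predictions)
```

Since the final model was trained on scaled features, new inputs need the same encoding and the same fitted scaler (`sc`) before prediction, so it may be worth saving the scaler alongside the model.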