# Background

The sinking of the Titanic in the Atlantic Ocean is one of the most famous accidents in the world, and it was even made into the film <i>Titanic</i> in 1997. Many of the people on board died, but some survived the sinking. In this notebook, we prepare the dataset for predicting which passengers survived.

The target of the dataset is the Survived column. We will predict this target variable after preprocessing the data.

## Import the required libraries and dataset

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

In [49]:
training_set = pd.read_csv('train.csv')

## Peeking at the dataset

In [50]:
training_set.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [51]:
training_set.shape

(891, 12)

In [52]:
training_set.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
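
The raw counts are easier to compare as a share of all rows. The following is a minimal sketch on a small made-up frame called `demo` (an assumption for illustration, not the Titanic data); the same expression should work on training_set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for training_set
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans; the mean of each boolean column
# is the fraction of missing values in that column
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

With the counts shown above, Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 of 891 rows (about 20%).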

In [53]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Data Preprocessing

We don't need Ticket and PassengerId to predict survival, so we drop both columns.

In [54]:
# Name the axis explicitly; passing it positionally is rejected by recent pandas versions
training_set.drop(columns=['Ticket', 'PassengerId'], inplace=True)

In [55]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Cabin     204 non-null    object 
 9   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB


We create a Deck column from the first character of the Cabin column. Passengers with no recorded cabin get the value Unknown.

In [56]:
def assignDeckValue(CabinCode):
    if pd.isnull(CabinCode):
        category = 'Unknown'
    else:
        category = CabinCode[0]
    return category

Deck = np.array([assignDeckValue(cabin) for cabin in training_set['Cabin'].values])
print(Deck)

['Unknown' 'C' 'Unknown' 'C' 'Unknown' 'Unknown' 'E' 'Unknown' 'Unknown'
 'Unknown' 'G' 'C' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'D' 'Unknown' 'A' 'Unknown'
 'Unknown' 'Unknown' 'C' 'Unknown' 'Unknown' 'Unknown' 'B' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'D' 'Unknown' 'B' 'C'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'B' 'C' 'Unknown'
 'Unknown' 'Unknown' 'F' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'F' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown'
 'Unknown' 'C' 'Unknown' 'Unknown' 'Unknown' 'E' 'Unknown' 'Unknown'
 'Unknown' 'A' 'D' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'D' 'Unknown'
 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'Unknown' 'C' 'Unknown
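
The same Deck values can probably be obtained without a Python-level loop. This is a sketch of a vectorized alternative, not the notebook's own method; the `cabins` series is a made-up sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of cabin codes, including missing ones
cabins = pd.Series(['C85', np.nan, 'E46', 'B57 B59', np.nan], name='Cabin')

# .str[0] takes the first character and leaves missing values untouched,
# so fillna can label the missing cabins afterwards
deck = cabins.str[0].fillna('Unknown')
print(deck.tolist())
```

On the real frame this would read `training_set['Cabin'].str[0].fillna('Unknown')`.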

In [57]:
training_set = training_set.assign(Deck = Deck)

In [58]:
training_set['Deck'].values

array(['Unknown', 'C', 'Unknown', 'C', 'Unknown', 'Unknown', 'E',
       'Unknown', 'Unknown', 'Unknown', 'G', 'C', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'D', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown',
       'C', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'D', 'Unknown', 'B', 'C', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'B', 'C', 'Unknown', 'Unknown', 'Unknown',
       'F', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown',
       'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown',
       'Unknow

In [59]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Cabin     204 non-null    object 
 9   Embarked  889 non-null    object 
 10  Deck      891 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


We create a FamilySize column holding the size of the passenger's family (including the passenger), based on the columns Parch (parents and children) and SibSp (siblings and spouses).

In [60]:
training_set['FamilySize'] = training_set['SibSp'] + training_set['Parch'] + 1
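
As a quick worked example of the formula, here is a sketch with three made-up passengers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical passengers: travelling alone, with a spouse,
# and with a spouse plus two children
toy = pd.DataFrame({'SibSp': [0, 1, 1], 'Parch': [0, 0, 2]})

# The +1 counts the passenger, so nobody gets a family size of 0
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1
print(toy)
```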

Name itself isn't informative enough to be fitted into the model, so we derive a Title column from Name. Rare titles are grouped as Others, and Ms, Mme and Mlle are merged into Miss or Mrs.

In [61]:
# Raw string so that \. reaches the regex engine as a literal dot
training_set['Title'] = training_set.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
training_set['Title'] = training_set['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Countess', 'Sir', 'Jonkheer', 'Lady', 'Capt', 'Don'], 'Others')
training_set['Title'] = training_set['Title'].replace('Ms', 'Miss')
training_set['Title'] = training_set['Title'].replace('Mme', 'Mrs')
training_set['Title'] = training_set['Title'].replace('Mlle', 'Miss')

In [62]:
training_set['Title']

0          Mr
1         Mrs
2        Miss
3         Mrs
4          Mr
        ...  
886    Others
887      Miss
888      Miss
889        Mr
890        Mr
Name: Title, Length: 891, dtype: object
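
To see what the regular expression captures, here is a small sketch on three names taken from the head() output earlier plus one made-up name (`Doe, Mlle. Jane` is hypothetical). The dictionary form of `replace` is a compact equivalent of the single replacements used in the cell above.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Doe, Mlle. Jane',  # hypothetical name, only to show the merge step
])

# A space, then letters, then a literal dot: this matches the title
# that sits between the surname and the given names
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Merge the variant spellings into the common titles
merged = titles.replace({'Ms': 'Miss', 'Mme': 'Mrs', 'Mlle': 'Miss'})
print(titles.tolist())
print(merged.tolist())
```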

In [63]:
# Cabin and Name have been replaced by Deck and Title
training_set.drop(columns=['Cabin', 'Name'], inplace=True)

In [64]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    object 
 3   Age         714 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    889 non-null    object 
 8   Deck        891 non-null    object 
 9   FamilySize  891 non-null    int64  
 10  Title       891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


In [65]:
training_set.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Embarked        2
Deck            0
FamilySize      0
Title           0
dtype: int64

Embarked still has 2 missing values, so we fill them with the most frequent value.

In [66]:
training_set['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [67]:
# 'S' is the most frequent port according to value_counts() above
common = 'S'
training_set['Embarked'] = training_set['Embarked'].fillna(common)
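
Rather than typing the most frequent value by hand, it can be read from the data. A minimal sketch, assuming a made-up `embarked` series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing ports
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# mode() returns a Series because ties are possible; take its first entry
common = embarked.mode()[0]
filled = embarked.fillna(common)
print(common, filled.isnull().sum())
```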

In [68]:
training_set.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Embarked        0
Deck            0
FamilySize      0
Title           0
dtype: int64

The Age column still has many missing values, so we fill each one with the mean age of the passenger's title (Mr, Mrs, Miss, Master, Others).

In [69]:
means = training_set.groupby('Title')['Age'].mean()
print(means)

Title
Master     4.574167
Miss      21.845638
Mr        32.368090
Mrs       35.788991
Others    45.545455
Name: Age, dtype: float64


In [70]:
title_list = ['Master', 'Miss', 'Mrs', 'Mr', 'Others']

def ageMissingReplace(means, dframe, title_list):
    for title in title_list:
        temp = dframe['Title'] == title
        dframe.loc[temp, 'Age'] = dframe.loc[temp, 'Age'].fillna(means[title])

ageMissingReplace(means, training_set, title_list)
training_set['Age']

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    21.845638
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
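
The per-title fill can also be written without an explicit loop. The sketch below uses `groupby(...).transform('mean')` on a made-up `toy` frame; it is an alternative formulation, not the notebook's own function.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age per title
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform('mean') returns each row's group mean, aligned to the
# original index, so it can be passed straight to fillna
group_mean = toy.groupby('Title')['Age'].transform('mean')
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```

Here the missing Mr age becomes 35.0 (the mean of 30 and 40) and the missing Miss age becomes 22.0 (the mean of 20 and 24).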

In [71]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    object 
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    object 
 8   Deck        891 non-null    object 
 9   FamilySize  891 non-null    int64  
 10  Title       891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


We convert the categorical values to numbers so that the data can be fitted into a model. For Embarked, Sex and Title we simply use the map function with an explicit dictionary; for Deck we use LabelEncoder.

In [72]:
training_set['Embarked'] = training_set['Embarked'].map({'C':0, 'Q':1, 'S':2})
training_set['Sex'] = training_set['Sex'].map({'male':0, 'female':1})
training_set['Title'] = training_set['Title'].map({'Master':0, 'Miss':1, 'Mr':2, 'Mrs':3, 'Others':4})

In [73]:
le = preprocessing.LabelEncoder()
training_set['Deck'] = le.fit_transform(training_set['Deck'])
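
LabelEncoder numbers the distinct labels in sorted order, which is consistent with the head() output that follows, where C becomes 2 and Unknown becomes 8. The sketch below illustrates that ordering with pandas categorical codes swapped in for LabelEncoder (an assumption made to keep the example dependency-free; for string labels both should assign the same codes). The `deck` series is made up.

```python
import pandas as pd

# Hypothetical deck labels
deck = pd.Series(['Unknown', 'C', 'Unknown', 'E', 'A'])

# Categories are sorted, so the codes follow the alphabetical order of the labels
as_category = deck.astype('category')
codes = as_category.cat.codes

# Keep the code -> label mapping so the numbers can be decoded later
mapping = dict(enumerate(as_category.cat.categories))
print(codes.tolist())
print(mapping)
```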

In [74]:
training_set.head()

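| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 2 | 8 | 2 | 2 |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 2 | 2 | 3 |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 2 | 8 | 1 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 2 | 2 | 2 | 3 |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 2 | 8 | 1 | 2 |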


We then compute the correlation matrix to see how the Survived column correlates with every other column. This information will be useful when we start to build the model.

In [75]:
training_set.corr()

| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Survived | 1.0 | -0.338481 | 0.543351 | -0.089402 | -0.035322 | 0.081629 | 0.257307 | -0.167675 | -0.301116 | 0.016639 | -0.071174 |
| Pclass | -0.338481 | 1.0 | -0.1319 | -0.343799 | 0.083081 | 0.018443 | -0.5495 | 0.162098 | 0.746616 | 0.065997 | -0.181177 |
| Sex | 0.543351 | -0.1319 | 1.0 | -0.117476 | 0.114631 | 0.245489 | 0.182333 | -0.108262 | -0.123076 | 0.200988 | -0.060299 |
| Age | -0.089402 | -0.343799 | -0.117476 | 1.0 | -0.267659 | -0.196902 | 0.091029 | -0.024452 | -0.252426 | -0.281305 | 0.532807 |
| SibSp | -0.035322 | 0.083081 | 0.114631 | -0.267659 | 1.0 | 0.414838 | 0.159651 | 0.06823 | 0.04154 | 0.890712 | -0.209813 |
| Parch | 0.081629 | 0.018443 | 0.245489 | -0.196902 | 0.414838 | 1.0 | 0.216225 | 0.039798 | -0.032548 | 0.783111 | -0.117587 |
| Fare | 0.257307 | -0.5495 | 0.182333 | 0.091029 | 0.159651 | 0.216225 | 1.0 | -0.224719 | -0.523013 | 0.217138 | -0.013273 |
| Embarked | -0.167675 | 0.162098 | -0.108262 | -0.024452 | 0.06823 | 0.039798 | -0.224719 | 1.0 | 0.194255 | 0.066516 | 0.005207 |
| Deck | -0.301116 | 0.746616 | -0.123076 | -0.252426 | 0.04154 | -0.032548 | -0.523013 | 0.194255 | 1.0 | 0.012131 | -0.095789 |
| FamilySize | 0.016639 | 0.065997 | 0.200988 | -0.281305 | 0.890712 | 0.783111 | 0.217138 | 0.066516 | 0.012131 | 1.0 | -0.202145 |
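
Since only the Survived column of the matrix matters for the target, it can help to pull that column out and rank it by absolute strength. A minimal sketch on a made-up, already numeric `toy` frame (the column Noise and all values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic values
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Sex':      [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 1],
    'Noise':    [1, 1, 2, 2, 3, 3],
})

# Take the Survived column of the matrix, drop its self-correlation,
# and order the remaining features by absolute correlation
ranked = (
    toy.corr()['Survived']
    .drop('Survived')
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

In the matrix above, the strongest correlations with Survived are Sex (0.543351), Pclass (-0.338481) and Deck (-0.301116).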


In [76]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    int64  
 8   Deck        891 non-null    int32  
 9   FamilySize  891 non-null    int64  
 10  Title       891 non-null    int64  
dtypes: float64(2), int32(1), int64(8)
memory usage: 73.2 KB


Finally, the data is prepared and ready to be fed into a model.