<h1>Titanic: Machine Learning from Disaster</h1>

<p>This is my solution to the Kaggle Titanic challenge (https://www.kaggle.com/c/titanic). The workflow covers EDA, feature engineering, modeling, hyperparameter tuning, and prediction.</p>

In [246]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

train = pd.read_csv('titanic.csv')
test = pd.read_csv('titanic_test.csv')

In [247]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [248]:
print(train.shape)
print(test.shape)

(891, 12)
(418, 11)


In [249]:
print(train.isnull().sum())
print()
print(test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


Both datasets have <strong>missing values</strong>: Age, Cabin, and Embarked in the training set, and Age, Fare, and Cabin in the test set.
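To put the raw counts in proportion, a small sketch like the following reports the share of missing values per column. The helper name `missing_share` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

shares = missing_share(toy)
print(shares)
```

Applied to the training set, the counts above correspond to roughly 77% missing for Cabin (687 of 891) and roughly 20% for Age (177 of 891).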

<h1>Feature Engineering</h1>

In [250]:
train = train.drop('Cabin', axis=1)
train = train.drop('Ticket', axis=1)
train = train.drop('PassengerId', axis=1)

test = test.drop('Cabin', axis=1)
test = test.drop('Ticket', axis=1)
test = test.drop('PassengerId', axis=1)

I am dropping features that I do not expect to be useful in the analysis. In addition, the 'Cabin' column has too many missing values (687 of 891 rows in the training set).

<strong>Sex</strong>

In [251]:
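# 0 = male, 1 = female
df = [train, test]

for data in df:
    # Map explicitly; assigning the two-column get_dummies result to one column is ambiguous
    data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})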

<strong>Embarked</strong>

In [252]:
df = [train, test]

for data in df:
    data['Embarked'] = data['Embarked'].fillna('S')
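The fill value 'S' is hard-coded in the cell above. A minimal sketch, assuming the intent is to fill with the most frequent port, derives that value from the column instead. The toy series is made up for illustration.

```python
import pandas as pd

# Toy column with one missing port (values are made up).
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

# mode() ignores missing values and returns the most frequent entries, sorted.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.isnull().sum())
```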

In [253]:
#Embarkation
embarked = {
    'S':1,
    'C':2,
    'Q':3
}

df = [train, test]

for data in df:
    data['Embarked'] = data['Embarked'].map(embarked)


<strong>Name (Title)</strong>

In [254]:
#Finding Passenger's Title
df = [train, test]

for data in df:
    data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

train

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.2500,1,Mr
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,2,Mrs
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.9250,1,Miss
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1000,1,Mrs
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.0500,1,Mr
...,...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,13.0000,1,Rev
887,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,30.0000,1,Miss
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,23.4500,1,Miss
889,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,30.0000,2,Mr
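The title extraction can be checked in isolation on a few names from the table above. The pattern matches a space, a run of letters (the captured title), and a literal dot. This is only a sketch of the same regular expression on a standalone series.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout used by the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# A space, then letters (captured), then a literal dot: this picks out the title.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Rev']
```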


In [255]:
train = train.drop('Name', axis=1)
test = test.drop('Name', axis=1)

In [256]:
df = [train, test]

title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, 
                 "Master": 3, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3,"Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona" : 3, "Mme": 3,"Capt": 3,"Sir": 3 }

for data in df:
    data['Title'] = data['Title'].map(title_mapping)

<strong>Missing Values (Age and Fare)</strong>

In [257]:
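# Fill missing ages with the mean age of passengers sharing the same title
train["Age"] = train["Age"].fillna(train.groupby("Title")["Age"].transform("mean"))
test["Age"] = test["Age"].fillna(test.groupby("Title")["Age"].transform("mean"))
# Fill the single missing test fare with the mean fare of its passenger class
test["Fare"] = test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("mean"))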
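The group-wise imputation above can be traced on a toy frame. `transform("mean")` returns one value per row (the mean of that row's group, ignoring missing entries), so it lines up with the original column and can be passed straight to `fillna`. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two title groups, each with one missing age (values are made up).
toy = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# transform("mean") returns one value per row: the mean age of that row's group.
group_mean = toy.groupby("Title")["Age"].transform("mean")

# fillna only touches the missing entries; observed ages stay unchanged.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy["Age"].tolist())  # [20.0, 40.0, 30.0, 10.0, 20.0, 30.0]
```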

<h1>Modeling</h1>

In [258]:
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


In [259]:
X = train.drop('Survived', axis=1)
y = train['Survived']

<strong>KNN</strong>

In [260]:
model = KNeighborsClassifier(n_neighbors=5)
scoring = 'accuracy'
cross_val_score(model, X, y, scoring=scoring, cv=10).mean()

0.7139521620701397

<strong>Decision Tree</strong>

In [261]:
model = DecisionTreeClassifier()
scoring = 'accuracy'
cross_val_score(model, X, y, scoring=scoring, cv=10).mean()

0.7834544887072977

<strong>Random Forest</strong>

In [262]:
model = RandomForestClassifier(n_estimators=5)
scoring = 'accuracy'
cross_val_score(model, X, y, scoring=scoring, cv=10).mean()

0.8059011462944048

<strong>Gaussian Naive Bayes</strong>

In [263]:
model = GaussianNB()
scoring = 'accuracy'
cross_val_score(model, X, y, scoring=scoring, cv=10).mean()

0.8013684598796958

<strong>Support-vector Machine</strong>

In [264]:
model = SVC(gamma='auto')
scoring = 'accuracy'
cross_val_score(model, X, y, scoring=scoring, cv=10).mean()

0.731904153898536
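The five cells above repeat the same cross-validation call, so the comparison can also be written as one loop. The sketch below runs on synthetic data: `X_demo`, `y_demo`, and the `random_state=0` seeds are assumptions added for reproducibility and are not in the original notebook. Its scores will therefore not match the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y; swap in the Titanic features to rerun the comparison.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=5, random_state=0),
    "Gaussian NB": GaussianNB(),
    "SVM": SVC(gamma="auto"),
}

# Mean 10-fold accuracy per model, the same call as in the individual cells.
scores = {
    name: cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=10).mean()
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Seeding the tree-based models makes the ranking repeatable between runs, which matters here because the Random Forest and Gaussian Naive Bayes scores above are within half a percentage point of each other.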

<h1>Selected: Random Forest Classifier</h1>

<h3>Hyperparameter Tuning</h3>

In [265]:
#Training
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators':np.arange(2,10),
    'max_depth':np.arange(2,10),
    'criterion':['gini','entropy']
}

model = RandomForestClassifier()
gscv = GridSearchCV(model, cv=10, param_grid=param_grid, scoring='accuracy')
gscv.fit(X,y)



GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             ii

In [266]:
# Best parameters
gscv.best_params_

{'criterion': 'gini', 'max_depth': 6, 'n_estimators': 5}
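Because `GridSearchCV` refits the best configuration on the full data by default (`refit=True`), the tuned model is available as `best_estimator_` without retyping the parameters. The sketch below shows this on synthetic data with a smaller grid. `X_demo`, `y_demo`, the reduced ranges, and `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

# Smaller grid than in the notebook, to keep the sketch fast.
param_grid = {
    "n_estimators": np.arange(2, 6),
    "max_depth": np.arange(2, 5),
    "criterion": ["gini", "entropy"],
}

# Seeding the forest makes the selected parameters repeatable between runs.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained on all data.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X_demo[:5]))
```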

In [267]:
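# Model training and prediction
# Note: these hyperparameters differ from the best_params_ printed above
# (n_estimators=5, max_depth=6); the forest is unseeded, so the search result varies between runs.
model_improved = RandomForestClassifier(criterion='gini', n_estimators=8, max_depth=7)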
model_improved.fit(X,y)
result = model_improved.predict(test)

result

array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,

In [268]:
# Exporting result: pair each test PassengerId with its predicted label
df_predict = pd.read_csv('titanic_test.csv')
submission = pd.DataFrame({'PassengerId': df_predict['PassengerId'], 'Survived': result})

submission.to_csv('submission.csv', index=False)
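Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the expected layout is a `PassengerId` column and a binary `Survived` column, with one row per test passenger (418 for this test set). The helper `check_submission` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(submission.columns) == ["PassengerId", "Survived"]
    assert len(submission) == expected_rows
    assert submission["PassengerId"].is_unique
    assert submission["Survived"].isin([0, 1]).all()


# Toy submission (made-up IDs and predictions) to show the check passing.
toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(toy, expected_rows=3)
print("submission looks valid")
```

For the real file, the call would be `check_submission(submission, expected_rows=418)`.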

<h1>Kaggle Final Score: 0.79425</h1>