# Machine Learning

### Titanic Survival Exploration

In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive. To complete this project, you will need to implement several conditional predictions and answer the questions below. Your project submission will be evaluated based on the completion of the code and your responses to the questions.


In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

from sklearn.metrics import f1_score, make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [2]:
data = pd.read_csv('titanic_data.csv')
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
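
One simple way to get a first feel for which features relate to survival is to compare survival rates across groups. The sketch below uses only the five sample rows shown in the `data.head()` output, copied by hand into a hypothetical `sample` frame, so the rates are purely illustrative and say nothing about the full dataset.

```python
import pandas as pd

# Five sample rows copied from the data.head() output (illustrative only).
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass':   [3, 1, 3, 1, 3],
    'Sex':      ['male', 'female', 'female', 'female', 'male'],
})

# The mean of a 0/1 outcome column is the fraction of survivors in each group.
rate_by_sex = sample.groupby('Sex')['Survived'].mean()
rate_by_class = sample.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same `groupby` calls on the full `data` frame would give the actual per-group survival rates.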

In [3]:
# Drop features assumed to have little impact on the final results:
# PassengerId, Name, Ticket, Fare, Cabin
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], axis=1)

data.head()

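|   | Survived | Pclass | Sex | Age | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | S |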


In [4]:
# count NaN values across the whole dataset
total_nan = data.isnull().sum().sum()
print('Total NAN values are: {}'.format(total_nan))

# NaN count per column
data.isnull().sum()

Total NAN values are: 179


Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Embarked      2
dtype: int64

In [5]:
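# Replace missing ages with the column median, then drop the
# remaining rows with NaN (the two with no Embarked value)

# assign the result back: inplace fillna on a selected column is
# deprecated in recent pandas and may not modify the frame
data['Age'] = data['Age'].fillna(data['Age'].median())
data.dropna(axis='rows', inplace=True)

# sanity check: every column should now report 0 missing values
data.isnull().sum()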

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Embarked    0
dtype: int64
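
The imputation step can be traced on a small made-up frame. This is a sketch under assumptions: the `toy` frame and its values are hypothetical, and `dropna(subset=...)` is used here as a narrower variant that only drops rows missing `Embarked`, rather than any row containing a `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one missing port.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# median() skips NaN by default: median(22, 26, 35) = 26.
age_median = toy['Age'].median()
toy['Age'] = toy['Age'].fillna(age_median)

# Drop only the rows where Embarked is missing.
toy = toy.dropna(subset=['Embarked'])

print(toy)
```

Restricting `dropna` to a named column makes the intent explicit and avoids silently discarding rows if another column later gains missing values.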

In [6]:
# prepare data for ML model

# one-hot encode the port of embarkation (columns C, Q, S) as 0/1 integers
dummy_vars = pd.get_dummies(data['Embarked'], dtype=int)
data = pd.concat([data, dummy_vars], axis=1).drop(['Embarked'], axis=1)

# encode Sex as an integer: male = 0, female = 1
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# min-max scale Age to the range [0, 1]
data['Age'] = (data['Age'] - data['Age'].min()) / (data['Age'].max() - data['Age'].min())

data.head()

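|   | Survived | Pclass | Sex | Age | SibSp | Parch | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0.271174 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0.472229 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 3 | 1 | 0.321438 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0.434531 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 0 | 0.434531 | 0 | 0 | 0 | 0 | 1 |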
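
To make the encoding and scaling steps concrete, here is a minimal sketch on a made-up frame. The `toy` and `encoded` names and all values are hypothetical; the round numbers are chosen so the min-max result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data with round ages for easy checking.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 40.0, 30.0, 60.0],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

# One-hot encode the port; dtype=int gives 0/1 columns instead of booleans.
ports = pd.get_dummies(toy['Embarked'], dtype=int)
encoded = pd.concat([toy.drop(columns='Embarked'), ports], axis=1)

# Map the two category labels to integers.
encoded['Sex'] = encoded['Sex'].map({'male': 0, 'female': 1})

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
age_min, age_max = encoded['Age'].min(), encoded['Age'].max()
encoded['Age'] = (encoded['Age'] - age_min) / (age_max - age_min)

print(encoded)
```

With a minimum of 20 and a maximum of 60, an age of 30 scales to (30 - 20) / (60 - 20) = 0.25. Decision trees split on thresholds, so this kind of monotonic rescaling is not expected to change which splits are available; it mainly matters for scale-sensitive models.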


In [7]:
# split data into training (80%) and testing (20%) sets
# note: no random_state is passed, so the split and the scores below vary between runs

X = np.array(data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'C', 'Q', 'S']])
y = np.array(data['Survived'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
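
Because the split above is not seeded, each run produces a different partition. One possible variant, sketched here on a small made-up array rather than the Titanic data, passes `random_state` for repeatability and `stratify` to keep the class ratio the same in both parts. The `X_demo` and `y_demo` names and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 15 of class 0 and 5 of class 1.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify preserves the 75/25 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.sum(), y_te.sum())
```

Applying the same two arguments to the notebook's split would change the reported scores, since they depend on which rows land in the test set.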

In [8]:
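# baseline: decision tree with default hyperparameters
clf_dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf_dt.predict(X_test)
# scikit-learn metrics take the true labels first, then the predictions
print('The accuracy with default Decision Tree parameters is: {}%'.format(accuracy_score(y_test, y_pred)*100))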

The accuracy with default Decision Tree parameters is: 79.7752808988764%


In [9]:
# hyperparameter optimization using GridSearch
clf = DecisionTreeClassifier(random_state=42)

# Create the parameters list you wish to tune.
parameters = {'max_depth':[2,4,6,8,10],'min_samples_leaf':[2,4,6,8,10], 'min_samples_split':[2,3,4]}

# Make an F1 scoring object.
scorer = make_scorer(f1_score)

# Perform grid search on the classifier using 'scorer' as the scoring method.
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters.
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator. GridSearchCV refits it on the full training
# set by default (refit=True), so no separate fit call is needed.
best_clf = grid_fit.best_estimator_

# Make predictions using the new model.
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Calculate the F1 score of the new model (true labels first, then predictions).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

# Let's also explore what parameters ended up being used in the new model.
best_clf



The training F1 Score is 0.7855711422845691
The testing F1 Score is 0.761904761904762


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=8, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
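
Beyond `best_estimator_`, a fitted search object also exposes the winning parameter combination and its cross-validated score. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, not the Titanic data), with a smaller hypothetical grid so it runs quickly; `X_demo`, `y_demo`, `param_grid`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A reduced grid: 2 x 2 = 4 candidate parameter combinations.
param_grid = {'max_depth': [2, 4], 'min_samples_leaf': [2, 8]}

# scoring='f1' is the built-in string alias for the binary F1 score.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring='f1', cv=5)
search.fit(X_demo, y_demo)

# Number of parameter combinations that were evaluated.
n_candidates = len(search.cv_results_['params'])

print(search.best_params_)
print(search.best_score_)
print(n_candidates)
```

By the same counting, the grid used in this notebook has 5 x 5 x 3 = 75 candidates, each fitted once per cross-validation fold.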

In [10]:
print('The accuracy after using Grid Search is: {}%'.format(accuracy_score(y_test, best_test_predictions)*100))

The accuracy after using Grid Search is: 83.14606741573034%
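
Accuracy and F1 can both be derived from the four confusion-matrix counts, which helps when comparing the two metrics reported in this notebook. The following sketch uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels: 3 positives, 5 negatives.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Accuracy counts all correct predictions; F1 ignores true negatives.
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
f1_manual = 2 * tp / (2 * tp + fp + fn)

print(accuracy_manual, accuracy_score(y_true_demo, y_pred_demo))
print(f1_manual, f1_score(y_true_demo, y_pred_demo))
```

The F1 formula treats false positives and false negatives symmetrically, so swapping the two arguments of `f1_score` leaves the binary F1 value unchanged, although passing the true labels first remains the documented convention.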
