In [445]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Data Exploration

In [446]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv('test.csv')
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [447]:
train_data.info()   # Summarize the columns: non-null counts and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


The Age, Cabin and Embarked columns have missing values: their non-null counts (714, 204 and 889) fall short of the 891 rows.
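
As a minimal sketch, the missing values can also be counted directly instead of being read off the info() output. The small `toy` frame below is a made-up stand-in for train_data; on the real training set the info() output implies 177 missing ages, 687 missing cabins and 2 missing embarkation ports.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Count missing entries per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

The same two lines applied to train_data would list only the columns that need cleaning.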

The code below shows summary statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, median (50th percentile), and the 25th and 75th percentiles.

In [448]:
train_data.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


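The average passenger age is about 30 (median 28), so the passengers skew young. The average fare is about 32, although the median fare is only about 14.45, so a few expensive tickets pull the mean up. The maximum number of siblings/spouses aboard for a single passenger is 8, while the maximum number of parents/children aboard per passenger is 6.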

In [449]:
train_data['Pclass'].unique()

array([3, 1, 2], dtype=int64)

There are three ticket classes.

In [450]:
#train_data['Embarked'].unique()

There are three different ports of embarkation. The last category, 'nan', indicates that some passengers' ports of embarkation were not recorded.

In [451]:
#train_data['Sex'].unique()

Passengers onboard are recorded as either male or female.

In [452]:
train_data.corr()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
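
One caveat, as far as I know: recent pandas releases no longer drop text columns silently in corr() and may raise an error instead. A version-independent sketch is to select the numeric columns first; the `toy` frame here is illustrative data, not the Titanic set.

```python
import pandas as pd

# Toy frame with one text column, standing in for train_data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)
```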


From the correlation table, the following hypotheses can be drawn:
    1. PassengerId has no meaningful correlation with survival (-0.005).
    2. Pclass has a negative correlation with the Survived column (-0.34), i.e. passengers in class 1 were more likely to survive than passengers in class 3.
    3. Age has a weak negative correlation with survival (-0.08): older passengers were slightly less likely to survive.
    4. SibSp has a very weak negative correlation with survival (-0.04): passengers with more siblings or spouses aboard were marginally less likely to survive.
    5. Parch has a weak positive correlation with survival (0.08): passengers with more parents or children aboard were slightly more likely to survive.
    6. Fare has a moderate positive correlation with survival (0.26), the strongest positive value in the table: passengers who paid higher fares were more likely to survive.
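
A correlation coefficient only hints at these relationships. A simple way to check a hypothesis such as the one about Pclass is to compare survival rates per group. The sketch below uses a tiny made-up frame, so the rates it prints are not the real Titanic rates.

```python
import pandas as pd

# Toy data: survival flag and ticket class for eight illustrative passengers.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2, 2, 3],
})

# The mean of a 0/1 column is the share of survivors within each class.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

Running the same groupby on train_data with 'Sex', 'SibSp' or 'Parch' as the key would test the other hypotheses in the list.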

Feature Engineering and Data Cleaning

The PassengerId, Name and Ticket columns are unlikely to help predict survival, so they can be dropped. The Cabin column is missing most of its values (only 204 of 891 are present), so it can also be dropped.

In [453]:
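# .copy() makes these independent frames, so later column assignments do not trigger SettingWithCopyWarning
main_data = train_data.loc[:,['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']].copy()
main_test = test_data.loc[:,['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']].copy()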

The Sex and Embarked columns are categorical text data. Using LabelEncoder from sklearn, they can be encoded as integer category codes.

In [454]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
main_data['Sex'] = labelencoder.fit_transform(main_data['Sex'])
main_test['Sex'] = labelencoder.transform(main_test['Sex'])  # reuse the mapping fitted on the training data
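
To see what the encoder actually does, here is a small sketch on made-up series (`train_sex` and `test_sex` are illustrative names). LabelEncoder sorts the categories it sees and numbers them from 0, and fitting it once on the training column and reusing it on the test column guarantees that both share one mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-ins for main_data['Sex'] and main_test['Sex'].
train_sex = pd.Series(["male", "female", "female", "male"])
test_sex = pd.Series(["female", "male"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_sex)  # learn the mapping on train only
test_codes = encoder.transform(test_sex)        # apply the same mapping to test

# classes_ holds the sorted categories; the position of each is its code.
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)
```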

In [455]:
main_data['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

To encode the Embarked column, the missing values must be filled first.

In [456]:
main_data['Embarked'].mode()[0]

'S'

In [457]:
main_test['Embarked'].value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

One method is to fill the missing values with the most common category (the mode), i.e. the Southampton port, S.

In [458]:
main_data['Embarked'] = main_data['Embarked'].fillna('S')
embarked_encoder = LabelEncoder().fit(main_data['Embarked'])  # fit the mapping on the training data only
main_data['Embarked'] = embarked_encoder.transform(main_data['Embarked'])

main_test['Embarked'] = main_test['Embarked'].fillna('S')
main_test['Embarked'] = embarked_encoder.transform(main_test['Embarked'])
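
Rather than hard-coding 'S', the fill value can be computed from the data. This is a sketch on a made-up series; `embarked` stands in for main_data['Embarked'].

```python
import numpy as np
import pandas as pd

# Illustrative column with two missing ports.
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() ignores missing values; [0] takes the most frequent category.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(filled.value_counts())
```

Computing the mode this way keeps the cell correct if the training data ever changes.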

In [459]:
main_data.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2


In [460]:
train_data['Age'].mode()[0]

24.0

The Age column is missing some values. A reasonable approach is to fill them with the median age, which is less sensitive to outliers than the mean.

In [461]:
#main_data = preprocessed_data(train_data)
#main_test = preprocessed_data(test_data).drop('Survived',inplace = True)

In [462]:
main_data['Age'].median()

28.0

In [463]:
main_test['Age'].median()

27.0

In [464]:
main_data['Age'] = main_data['Age'].fillna(28)  # 28 is the training median computed above

In [465]:
main_test['Age'] = main_test['Age'].fillna(27)  # 27 is the test median computed above
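
A common alternative, sketched here on made-up series, is to compute the median once from the training data and reuse it for the test data, so that both sets are filled with the same value and nothing is learned from the test set. The names `train_age` and `test_age` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for main_data['Age'] and main_test['Age'].
train_age = pd.Series([22.0, np.nan, 26.0, 35.0, 35.0])
test_age = pd.Series([np.nan, 40.0])

# median() skips missing values by default.
age_median = train_age.median()

train_filled = train_age.fillna(age_median)
test_filled = test_age.fillna(age_median)  # same fill value as the training set
print(age_median)
```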

In [466]:
main_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int32
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null int32
dtypes: float64(2), int32(2), int64(4)
memory usage: 48.8 KB


Model Training

Model training uses the cleaned training data, and producing final predictions requires the test data to be cleaned in the same way. Up to this point I had only worked on the training dataset, so to keep my code from throwing an error I went back and edited the cleaning cells above to include the test dataset too. A better way would have been to create a single function that can be used on both the train and test data. However, I love telling stories with my code so that my readers can fully follow my line of thought.
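
For readers who prefer the function-based approach mentioned above, here is a rough sketch. The helper name `preprocess`, its parameters and the toy frames are all hypothetical; the fixed code dictionaries simply reproduce the alphabetical codes that LabelEncoder assigned earlier (female 0, male 1; C 0, Q 1, S 2).

```python
import numpy as np
import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}

def preprocess(df, age_fill, embarked_fill):
    """Hypothetical helper: select features, fill gaps, encode text columns."""
    out = df[FEATURES].copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    out["Embarked"] = out["Embarked"].fillna(embarked_fill).map(EMBARKED_CODES)
    out["Age"] = out["Age"].fillna(age_fill)
    return out

# Toy frames standing in for train_data and test_data.
toy_train = pd.DataFrame({
    "Survived": [0, 1], "Pclass": [3, 1], "Sex": ["male", "female"],
    "Age": [22.0, np.nan], "SibSp": [1, 1], "Parch": [0, 0],
    "Fare": [7.25, 71.28], "Embarked": ["S", np.nan],
})
toy_test = pd.DataFrame({
    "Pclass": [2], "Sex": ["female"], "Age": [np.nan], "SibSp": [0],
    "Parch": [0], "Fare": [10.5], "Embarked": ["Q"],
})

# Fill values come from the training data and are reused for the test data.
age_fill = toy_train["Age"].median()
embarked_fill = toy_train["Embarked"].mode()[0]

clean_train = preprocess(toy_train, age_fill, embarked_fill)
clean_test = preprocess(toy_test, age_fill, embarked_fill)
print(clean_test)
```

The target column is deliberately left out of the helper, so the same call works for frames with and without Survived.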

In [467]:
from sklearn import preprocessing,neighbors,svm
from sklearn.model_selection import train_test_split

X = main_data.drop(['Survived'],axis=1)   # features
y = main_data['Survived']                 # target
# Hold out 20% of the labelled training data to evaluate the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


In [468]:
from sklearn import tree
clftree = tree.DecisionTreeClassifier(max_depth=6)
clftree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [469]:
from sklearn.metrics import accuracy_score

print('Results')
test_predictions = clftree.predict(X_test)  # predictions on the held-out 20% split
acc = accuracy_score(y_test, test_predictions)
print("Accuracy: {:.4%}".format(acc))

Results
Accuracy: 80.4469%


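The accuracy of the model on the held-out split is 80.4%. For reference, 38.4% of the passengers in the training data survived, so always predicting non-survival would score roughly 61.6% on the training set as a whole. After many trials, I found that a max_depth between 5 and 9 produces an accuracy close to 80%. Because the tree is created without a fixed random_state, the exact figure can vary slightly between runs.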
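
The manual trials over max_depth could be replaced with a small loop. The sketch below is one way to do it, using 5-fold cross-validation instead of a single split. It runs on synthetic data from make_classification rather than the Titanic features, so its scores say nothing about the model above; `X_demo`, `y_demo` and `scores_by_depth` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 7 features (not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate depth.
scores_by_depth = {}
for depth in range(2, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores_by_depth[depth] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

best_depth = max(scores_by_depth, key=scores_by_depth.get)
print(best_depth, scores_by_depth[best_depth])
```

Swapping X_demo and y_demo for X_train and y_train would run the same search on the Titanic features while leaving X_test untouched for the final check.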