# Titanic Dataset Kaggle Competition
#### Farahana, 6/8/2020

1. Problem: Predict which passengers on board survived.
2. Target : Reach above 90% accuracy when submitted to the leaderboard.
3. Data   : A training set with known survival labels and a test set (called the validation set here) with unknown survival.

Resources: [YouTube video](https://www.youtube.com/watch?v=irHhDMbw3xo), [Code 1](https://github.com/justmarkham/scikit-learn-videos/blob/master/10_categorical_features.ipynb), [Code 2](https://github.com/mrdbourke/your-first-kaggle-submission/blob/master/kaggle-titanic-dataset-example-submission-workflow.ipynb)

In [1]:
# import packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Input train and validation data files
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

Let us do a simple EDA: check the shape, the column names and the missing values of the features.

In [3]:
train_data.shape

(891, 12)

In [4]:
test_data.shape

(418, 11)

In [5]:
print(train_data.columns)
print(test_data.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [6]:
train_data.isna().sum()
# checking missing data

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:
# Keep only the rows where 'Embarked' is not missing, with the target and the three features
train_3feature = train_data.loc[train_data.Embarked.notna(), ['Survived', 'Pclass', 'Sex', 'Embarked']]

In [8]:
train_3feature.shape

(889, 4)

In [9]:
train_3feature.isna().sum()

Survived    0
Pclass      0
Sex         0
Embarked    0
dtype: int64

In [10]:
train_3feature.head()

   Survived  Pclass     Sex Embarked
0         0       3    male        S
1         1       1  female        C
2         1       3  female        S
3         1       1  female        S
4         0       3    male        S


Let us classify with a common machine learning model, logistic regression, evaluated with cross-validation.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [12]:
X = train_3feature.loc[:, ['Pclass']]  # selecting with a list keeps X as a 2D DataFrame
y = train_3feature.Survived

In [13]:
logreg = LogisticRegression(solver='lbfgs')
print(cross_val_score(logreg, X, y, cv=5, scoring='accuracy'))

[0.6011236  0.6741573  0.6741573  0.71910112 0.72316384]


In [14]:
y.value_counts(normalize=True)

0    0.617548
1    0.382452
Name: Survived, dtype: float64
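The class proportions above give a baseline: always predicting "did not survive" is already right about 62% of the time, so the `Pclass`-only scores (mean about 0.68) are only a modest improvement. A minimal sketch of measuring that baseline with scikit-learn's `DummyClassifier` follows. The arrays `X_toy` and `y_toy` are made-up stand-ins with a 60/40 class balance, not the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up labels (assumption): 60 non-survivors and 40 survivors,
# roughly the class balance shown above.
y_toy = np.array([0] * 60 + [1] * 40)
# The dummy model ignores the features, so a column of zeros is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5, scoring="accuracy")
baseline_acc = baseline_scores.mean()
print("Baseline accuracy: {:.3f}".format(baseline_acc))
```

On the real data the same call would take `X` and `y` from the cells above; any model worth keeping should beat this number.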

Encode the non-numeric data such as 'Sex' and 'Embarked'.

In [15]:
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
encoderSex = OneHotEncoder(sparse=False)
encoderSex.fit_transform(train_3feature[['Sex']])

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])

In [16]:
encoderSex.categories_

[array(['female', 'male'], dtype=object)]
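The order of `categories_` is the order of the encoded columns, so the first column means 'female' and the second means 'male'. As a small illustration, the sketch below labels the encoded columns on a made-up four-row frame. The frame `toy` and the `Sex_` column prefix are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows (assumption), not taken from train.csv.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = OneHotEncoder()  # the default output is a sparse matrix
encoded = encoder.fit_transform(toy[["Sex"]]).toarray()  # convert to a dense array

# categories_[0] holds the sorted categories of the first (only) input column.
columns = ["Sex_" + category for category in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns)
print(encoded_df)
```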

Let us try with three features: 'Pclass', 'Sex' and 'Embarked'.

In [17]:
# let us drop 'Survived' column to make use of column transformer
X = train_3feature.drop('Survived', axis='columns')

In [18]:
from sklearn.compose import make_column_transformer
# Use this when only some columns need preprocessing, such as the OneHotEncoder above.

column_trans = make_column_transformer((OneHotEncoder(), ['Sex', 'Embarked']), remainder='passthrough')
# Remaining columns such as 'Pclass' keep their values.

In [19]:
column_trans.fit_transform(X) # Sex(2), Embarked(3), Pclass(1)

array([[0., 1., 0., 0., 1., 3.],
       [1., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 3.],
       ...,
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 3.]])

Let us try to use pipeline now.

In [20]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(column_trans, logreg)

In [21]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
# This splits the data into 5 folds and refits the whole pipeline on each training split.

array([0.76404494, 0.79213483, 0.76966292, 0.75280899, 0.78531073])
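Because the encoder sits inside the pipeline, each fold fits it on that fold's training rows only, which keeps the held-out rows out of the preprocessing. The sketch below repeats the idea on made-up data and summarises the fold scores with a mean and standard deviation. The frame `toy` is an assumption, and `handle_unknown="ignore"` is added here as a precaution in case a fold has not seen a category.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the three-feature training frame (assumption).
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 3, 2, 1, 3, 2] * 4,
    "Sex":      ["female", "male", "female", "male", "male",
                 "female", "male", "female", "male", "female"] * 4,
    "Embarked": ["S", "S", "C", "Q", "C", "S", "S", "C", "Q", "S"] * 4,
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1] * 4,
})
X_toy = toy.drop("Survived", axis="columns")
y_toy = toy["Survived"]

toy_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",  # 'Pclass' is passed through unchanged
)
toy_pipe = make_pipeline(toy_trans, LogisticRegression(solver="lbfgs"))

# Each of the 5 folds clones and refits the full pipeline.
scores = cross_val_score(toy_pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print("Mean accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
```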

In [22]:
from sklearn import set_config
set_config(display='diagram')
pipe

Let us try to fit different machine learning algorithms.

In [23]:
import time

from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [24]:
def fit_ml_algo(algo):
    """Return the mean 5-fold cross-validated accuracy of `algo` on X, y."""
    pipe = make_pipeline(column_trans, algo)
    acc = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    return acc

#### KNN

In [25]:
start_time = time.time()
acc_algo = fit_ml_algo(KNeighborsClassifier(n_neighbors=5))

algo_time = (time.time() - start_time)
print ("Mean accuracy :{:.3f}".format(acc_algo))
print ("Duration      :{:.3f} sec".format(algo_time))

Mean accuracy :0.773
Duration      :0.052 sec


#### Gaussian Naive Bayes

In [26]:
start_time = time.time()
acc_algo = fit_ml_algo(GaussianNB())

algo_time = (time.time() - start_time)
print ("Mean accuracy :{:.3f}".format(acc_algo))
print ("Duration      :{:.3f} sec".format(algo_time))

Mean accuracy :0.775
Duration      :0.026 sec


#### Linear SVM

In [27]:
start_time = time.time()
acc_svc = fit_ml_algo(LinearSVC())

algo_time = (time.time() - start_time)
print ("Mean accuracy :{:.3f}".format(acc_svc))
print ("Duration      :{:.3f} sec".format(algo_time))

Mean accuracy :0.776
Duration      :0.053 sec


#### Boosting Classifier

In [28]:
start_time = time.time()
acc_boost = fit_ml_algo(GradientBoostingClassifier())

algo_time = (time.time() - start_time)
print ("Mean accuracy :{:.3f}".format(acc_boost))
print ("Duration      :{:.3f} sec".format(algo_time))

Mean accuracy :0.811
Duration      :0.252 sec
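The four cells above repeat the same timing pattern, so they could be folded into one loop. The sketch below is one possible way to do that. The helper `compare_models` and the frame `toy` are hypothetical, and `sparse_threshold=0` is set so the encoded output is always dense, which `GaussianNB` requires.

```python
import time

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Made-up stand-in for the three-feature training frame (assumption).
toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 3, 2, 1, 3, 2] * 4,
    "Sex":      ["female", "male", "female", "male", "male",
                 "female", "male", "female", "male", "female"] * 4,
    "Embarked": ["S", "S", "C", "Q", "C", "S", "S", "C", "Q", "S"] * 4,
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1] * 4,
})
X_toy = toy.drop("Survived", axis="columns")
y_toy = toy["Survived"]


def compare_models(models, X, y, cv=5):
    """Hypothetical helper: return {name: (mean accuracy, seconds)} per model."""
    results = {}
    for name, model in models.items():
        transformer = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
            remainder="passthrough",
            sparse_threshold=0,  # always return a dense array
        )
        pipe = make_pipeline(transformer, model)
        start = time.perf_counter()
        acc = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
        results[name] = (acc, time.perf_counter() - start)
    return results


models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
    "Linear SVM": LinearSVC(),
    "Gradient Boosting": GradientBoostingClassifier(),
}
results = compare_models(models, X_toy, y_toy)
for name, (acc, seconds) in results.items():
    print("{:<22} accuracy: {:.3f}  duration: {:.3f} sec".format(name, acc, seconds))
```

Timings from a single run like this are noisy, so small differences between the fast models should not be over-interpreted.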


The boosting classifier has the best accuracy so far with the three features. We can try more features later. First, let us submit the result from the boosting classifier.

In [29]:
test_data.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

We are only using three features ('Pclass', 'Sex' and 'Embarked'), and none of them has missing values in the test set. If a feature used by the model did have missing values in the test set, we would need to handle them first (for example by imputing) before predicting with the GradientBoostingClassifier.
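If one of the chosen features did have gaps in the test set, one option would be to impute inside the preprocessing step. The sketch below is a hedged illustration on made-up frames: it fills a missing 'Embarked' with the most frequent training value using `SimpleImputer` before one-hot encoding. The frames `train_toy` and `test_toy` are assumptions, and most-frequent imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up frames (assumption); the test frame has one missing 'Embarked'.
train_toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "female", "male"],
    "Embarked": ["S", "S", "C", "Q"],
})
test_toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Embarked": [np.nan, "C"],
})

# Impute first, then encode, so the encoder never sees a missing value.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
impute_trans = make_column_transformer(
    (categorical, ["Sex", "Embarked"]),
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Learn the fill value and the categories from the training data only.
impute_trans.fit(train_toy)
encoded_test = impute_trans.transform(test_toy)
print(encoded_test)
```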

In [31]:
X_test = test_data.loc[:,['Pclass', 'Sex', 'Embarked']]

In [32]:
# Fit the encoder on the training features, then only transform the test features
column_trans.fit(X).transform(X_test)

array([[0., 1., 0., 1., 0., 3.],
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 0., 1., 0., 2.],
       ...,
       [0., 1., 0., 0., 1., 3.],
       [0., 1., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 3.]])

In [41]:
clf_boost = GradientBoostingClassifier()
# Fit the encoder on the training data only, then reuse it unchanged on the test data
clf_boost.fit(column_trans.fit_transform(X), y)
predictions = clf_boost.predict(column_trans.transform(X_test))

In [42]:
submission = pd.DataFrame()
submission['PassengerId'] = test_data['PassengerId']
submission['Survived'] = predictions

In [46]:
submission.to_csv('submission.csv', index=False)
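The encode-then-predict steps for the submission could also be wrapped in a single pipeline, so the encoder is fitted once on the training data and reused as-is on the test data. The sketch below shows the pattern on made-up frames. `train_toy`, `test_toy` and the passenger IDs are assumptions, and `random_state=0` is added only to make the toy run repeatable.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-ins for train.csv and test.csv (assumption).
train_toy = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 3, 2, 1, 3, 2],
    "Sex":      ["female", "male", "female", "male", "male",
                 "female", "male", "female", "male", "female"],
    "Embarked": ["S", "S", "C", "Q", "C", "S", "S", "C", "Q", "S"],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
})
test_toy = pd.DataFrame({
    "PassengerId": [1001, 1002, 1003],
    "Pclass":      [3, 1, 2],
    "Sex":         ["male", "female", "male"],
    "Embarked":    ["Q", "S", "C"],
})
features = ["Pclass", "Sex", "Embarked"]

submit_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        remainder="passthrough",
    ),
    GradientBoostingClassifier(random_state=0),
)

# fit() learns the categories and the model from the training rows only;
# predict() applies the already-fitted encoder to the test rows.
submit_pipe.fit(train_toy[features], train_toy["Survived"])

toy_submission = pd.DataFrame({
    "PassengerId": test_toy["PassengerId"],
    "Survived": submit_pipe.predict(test_toy[features]),
})
print(toy_submission)
```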

This submission scores 0.77751 accuracy. My first guess is that we have not yet taken advantage of the other features.