The following is an attempt at predicting whether a passenger on the Titanic survived, based on attributes such as:

Passenger Class, Sex, Age, Number of siblings/spouses, Fare, Port of embarkation

More details about the dataset can be found here: https://www.kaggle.com/c/titanic/data

In [531]:
# We import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from collections import Counter

In [532]:
# Load train and test datasets

train = pd.read_csv("/Users/ankit/Desktop/Titanic Dataset/train.csv")
test = pd.read_csv("/Users/ankit/Desktop/Titanic Dataset/test.csv")

In [533]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Since outliers can have a dramatic effect on the prediction, we first deal with them.

We use the Tukey method to detect outliers. It computes the interquartile range (IQR), the distance between the 1st and 3rd quartiles (Q1 and Q3) of a feature's values. A value is an outlier when it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR; the 1.5 * IQR term is the outlier step.

We detect outliers in the numerical features (Age, SibSp, Parch and Fare). Then, we treat as outliers the rows that have outlying values in more than two of these features.
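
Before applying the rule to the whole dataset, here is a minimal sketch of the Tukey fences on a single feature. The values are made up for illustration (loosely fare-like), and the variable names are not part of the notebook.

```python
import numpy as np

# Toy values; the last one is deliberately extreme.
toy_values = np.array([7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 263.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(toy_values, [25, 75])
iqr = q3 - q1

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged
outlier_step = 1.5 * iqr
lower_fence = q1 - outlier_step
upper_fence = q3 + outlier_step

is_outlier = (toy_values < lower_fence) | (toy_values > upper_fence)
toy_outliers = toy_values[is_outlier]
print(lower_fence, upper_fence)
print(toy_outliers)
```

Only the extreme value falls outside the fences here. The function below repeats this check per feature and counts how many features flag each row.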

In [534]:
# Outlier detection 
# Reference - https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling/notebook

def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])




In [535]:
# Show the outliers rows
train.loc[Outliers_to_drop]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
159,160,0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,,S
180,181,0,3,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.55,,S
201,202,0,3,"Sage, Mr. Frederick",male,,8,2,CA. 2343,69.55,,S
324,325,0,3,"Sage, Mr. George John Jr",male,,8,2,CA. 2343,69.55,,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
792,793,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.55,,S
846,847,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.55,,S
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.55,,S


We detect 10 outliers. Passengers 28, 89 and 342 have a very high ticket fare.

The other 7 have very high values of SibSp.

In [536]:
# Drop outliers
train = train.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

## Joining train and test set

In [537]:
## Join train and test datasets in order to obtain the same number of features during categorical conversion
train_len = len(train)
dataset =  pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

## Check for null and missing values

In [538]:
# Fill empty and NaNs values with NaN
dataset = dataset.fillna(np.nan)

# Check for Null values
dataset.isnull().sum()

Age             256
Cabin          1007
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

** The Age and Cabin features have a large share of missing values. **

** The missing Survived values correspond to the joined test dataset (the Survived column doesn't exist in the test set and has been replaced by NaN values when concatenating the train and test sets) **
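
A minimal sketch of this effect on two hypothetical miniature frames (the names and values are made up): when only one frame has a Survived column, pd.concat fills the gap with NaN and the column becomes float, which is why the labels are cast back to int further down.

```python
import pandas as pd

# Hypothetical miniature train/test frames; only the train frame has labels.
toy_train = pd.DataFrame({"Fare": [7.25, 71.28], "Survived": [0, 1]})
toy_test = pd.DataFrame({"Fare": [8.05]})

toy_all = pd.concat(objs=[toy_train, toy_test], axis=0).reset_index(drop=True)

# The test row has no label, so its Survived value is NaN
# and the column is upcast from int to float.
print(toy_all)
print(toy_all["Survived"].isnull().sum(), toy_all["Survived"].dtype)
```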

In [539]:
dataset.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450


In [540]:
# Remove all rows without a 'Survived' value (these are the unlabeled rows that came from the test file)
dataset.dropna(subset = ['Survived'],inplace=True)
dataset.Survived.isnull().sum()

0

In [541]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 881 entries, 0 to 880
Data columns (total 12 columns):
Age            711 non-null float64
Cabin          201 non-null object
Embarked       879 non-null object
Fare           881 non-null float64
Name           881 non-null object
Parch          881 non-null int64
PassengerId    881 non-null int64
Pclass         881 non-null int64
Sex            881 non-null object
SibSp          881 non-null int64
Survived       881 non-null float64
Ticket         881 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 89.5+ KB


** Splitting into features and labels **

In [542]:
# converting the label values from float to int

y = dataset.Survived.astype(int)
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [543]:
x = dataset.drop('Survived',axis = 1)
x.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,373450


## Separating train and test datasets

In [544]:
## Separate train dataset and test dataset

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify = y)
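
The stratify argument keeps the class proportions the same in both parts of the split. A small sketch on synthetic labels (60 zeros and 40 ones, an assumed ratio chosen only for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60 zeros and 40 ones.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both parts keep the 60/40 class ratio.
print(y_tr.mean(), y_te.mean())
```

Without stratify, the share of survivors in the held-out part could drift away from the share in the training part purely by chance.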

In [545]:
x_train.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Ticket
597,35.0,,C,26.55,"Homer, Mr. Harry (""Mr E Haven"")",0,605,1,male,0,111426
478,,,S,25.4667,"Lefebre, Miss. Jeannie",1,486,3,female,3,4133
259,,,Q,7.75,"Henry, Miss. Delia",0,265,3,female,0,382649
251,,,C,79.2,"Thorne, Mrs. Gertrude Maybelle",0,257,1,female,0,PC 17585
241,25.0,,S,7.775,"Lindahl, Miss. Agda Thorilda Viktoria",0,247,3,female,0,347071


** Selecting features on which to train our models **

In [546]:
x_train_cols = x_train[['Fare','Parch','Pclass','Sex','SibSp']]
x_train_cols.head()

Unnamed: 0,Fare,Parch,Pclass,Sex,SibSp
597,26.55,0,1,male,0
478,25.4667,1,3,female,3
259,7.75,0,3,female,0
251,79.2,0,1,female,0
241,7.775,0,3,female,0


In [547]:
# Doing the same with the test set
x_test_cols = x_test[['Fare','Parch','Pclass','Sex','SibSp']]

### Creating dummy variables

In [548]:
# Get dummy variables for categorical feature 'Sex' thus encoding it as 0 and 1
x_train_cat1 = x_train['Sex'].str.get_dummies()
x_train_cat1.head()

Unnamed: 0,female,male
597,0,1
478,1,0
259,1,0
251,1,0
241,1,0


In [549]:
# Doing the same with the test set
x_test_cat1 = x_test['Sex'].str.get_dummies()

In [550]:
# Concatenate both the dataframes along axis = 1 (columns)
x_train_cols = pd.concat(objs = [x_train_cols, x_train_cat1],axis = 1)
x_train_cols.head()

Unnamed: 0,Fare,Parch,Pclass,Sex,SibSp,female,male
597,26.55,0,1,male,0,0,1
478,25.4667,1,3,female,3,1,0
259,7.75,0,3,female,0,1,0
251,79.2,0,1,female,0,1,0
241,7.775,0,3,female,0,1,0


In [551]:
# Doing the same with the test set
x_test_cols = pd.concat(objs = [x_test_cols, x_test_cat1],axis = 1)

In [552]:
# Drop categorical 'Sex' column
x_train_cols.drop(columns='Sex', inplace=True)
x_train_cols.head()

Unnamed: 0,Fare,Parch,Pclass,SibSp,female,male
597,26.55,0,1,0,0,1
478,25.4667,1,3,3,1,0
259,7.75,0,3,0,1,0
251,79.2,0,1,0,1,0
241,7.775,0,3,0,1,0


In [553]:
# Doing the same with the test set
x_test_cols.drop(columns='Sex', inplace=True)
x_test_cols.head()

Unnamed: 0,Fare,Parch,Pclass,SibSp,female,male
685,56.4958,0,3,0,0,1
671,46.9,6,3,1,1,0
161,39.6875,1,3,4,0,1
210,113.275,0,1,1,1,0
593,27.0,1,2,2,1,0
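
The three steps in this section (build the dummies, concatenate, drop the original column) can also be done in one call. This is a sketch on a hypothetical two-row frame, assuming a pandas version in which get_dummies accepts a dtype argument. Note that the resulting columns are prefixed with Sex_ rather than being named female and male.

```python
import pandas as pd

# Hypothetical two-row frame with one numeric and one categorical column.
toy = pd.DataFrame({"Fare": [26.55, 7.75], "Sex": ["male", "female"]})

# Encode 'Sex' and drop the original column in a single step.
toy_encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(toy_encoded)
```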


In [554]:
y_train.head()

597    1
478    0
259    0
251    1
241    0
Name: Survived, dtype: int64

## Decision Tree

In [555]:
from sklearn.tree import DecisionTreeClassifier

In [556]:
tree_clf = DecisionTreeClassifier(random_state=42,max_depth = 3)

In [557]:
tree_clf.fit(x_train_cols,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

In [558]:
y_pred = tree_clf.predict(x_test_cols)

In [559]:
from sklearn.metrics import classification_report,confusion_matrix

In [560]:
print('Confusion Matrix:')
print('\n')
print(confusion_matrix(y_test,y_pred))
print('\n')
print('Classification Report:')
print('\n')
print(classification_report(y_test,y_pred))

Confusion Matrix:


[[97 12]
 [27 41]]


Classification Report:


             precision    recall  f1-score   support

          0       0.78      0.89      0.83       109
          1       0.77      0.60      0.68        68

avg / total       0.78      0.78      0.77       177
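
The figures in the classification report follow directly from the confusion matrix. As a check, this sketch recomputes precision and recall for class 1 (survived) and the overall accuracy from the four counts printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[97, 12],
               [27, 41]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # 41 / 53
recall_1 = tp / (tp + fn)         # 41 / 68
accuracy = (tp + tn) / cm.sum()   # 138 / 177

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```

The low recall for class 1 means the tree misses roughly 40% of the actual survivors, even though most of its "survived" predictions are correct.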



## Bagging

** Reference: **

** Aurélien Géron. “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.” **

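Bagging (bootstrap aggregating) trains every predictor on a different random subset of the training set, sampled with replacement. Below, each of the 500 trees is trained on 100 instances (max_samples=100).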

In [561]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train_cols, y_train)
y_pred_bag = bag_clf.predict(x_test_cols)

## Out-of-Bag Evaluation

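With bagging, some instances may be sampled several times for a given predictor while others are not sampled at all. When each predictor draws as many instances as the training set contains (the default, max_samples=1.0), only about 63% of the training instances are sampled on average for each predictor. The remaining 37% are called out-of-bag (oob) instances, and they are not the same 37% for every predictor. Since a predictor never sees its oob instances during training, it can be evaluated on them without a separate validation set. Setting oob_score=True does this automatically and stores the result in oob_score_.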
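
The 63% figure comes from a short calculation: when m instances are drawn with replacement from a set of m, the chance that a given instance is picked at least once is 1 - (1 - 1/m)^m, which approaches 1 - 1/e as m grows. A small sketch, where the helper name is made up and 704 is the size of the training split here (881 rows minus the 177 held out):

```python
import math

def in_bag_fraction(m):
    """Chance that one instance is drawn at least once in m draws with replacement from m instances."""
    return 1 - (1 - 1 / m) ** m

for m in (10, 100, 704):
    print(m, round(in_bag_fraction(m), 3))

# Value approached as m grows
limit = 1 - math.exp(-1)
print(round(limit, 3))
```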

In [562]:
oob_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,bootstrap=True, n_jobs=-1, oob_score=True)

In [563]:
oob_clf.fit(x_train_cols, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=500, n_jobs=-1, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [564]:
oob_clf.oob_score_

0.80255681818181823

## Random Forest

We are going to train a Random Forest model, an ensemble of decision trees trained via bagging (bootstrap aggregating, i.e. sampling with replacement).

We are also going to specify some hyperparameters as follows:

• n_estimators = number of decision trees in the forest. The forest aggregates the trees' predictions by voting; scikit-learn's implementation averages the predicted class probabilities (soft voting) rather than counting one vote per tree (hard voting).

• max_leaf_nodes = maximum number of leaf nodes per tree

• n_jobs = number of CPU cores to use for the job. n_jobs=-1 means using all available cores.

In [565]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1 ,random_state=42)

In [566]:
rf_clf.fit(x_train_cols,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [567]:
y_pred_rf=rf_clf.predict(x_test_cols)

In [568]:
print('Confusion Matrix:')
print('\n')
print(confusion_matrix(y_test,y_pred_rf))
print('\n')
print('Classification Report:')
print('\n')
print(classification_report(y_test,y_pred_rf))

Confusion Matrix:


[[99 10]
 [27 41]]


Classification Report:


             precision    recall  f1-score   support

          0       0.79      0.91      0.84       109
          1       0.80      0.60      0.69        68

avg / total       0.79      0.79      0.78       177



** Random Forest gives only slightly better results than the single decision tree (accuracy of about 0.79 versus 0.78 on the held-out split) **

## Hyperparameter Tuning

In [569]:
from sklearn.model_selection import GridSearchCV

The model is tuned on the following hyperparameters using a grid search:

• min_samples_split: the minimum number of samples a node must contain before it can be split

• min_samples_leaf: the minimum number of samples required in a leaf node

• bootstrap: whether each tree is trained on a sample drawn with replacement (only False is tried here)

• n_estimators: number of decision trees in the forest

• criterion: the function used to measure the quality of a split. gini (Gini impurity) and entropy both measure how mixed the classes in a node are; a pure node scores 0
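
To illustrate the two criterion options, here is a minimal sketch that computes Gini impurity and entropy from the class counts in a node. The helper functions are written for illustration only and are not scikit-learn functions.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def node_entropy(counts):
    """Shannon entropy (base 2) of a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return np.sum(p * np.log2(1.0 / p))

# A maximally mixed binary node versus a pure node
print(gini_impurity([50, 50]), node_entropy([50, 50]))
print(gini_impurity([100, 0]), node_entropy([100, 0]))
```

Both measures are largest for an evenly mixed node and drop to zero for a pure one, so they usually lead to similar trees.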

In [572]:
rf_param_grid = {"max_depth": [None],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ['gini','entropy']}

In [573]:
gsRFC = GridSearchCV(rf_clf,param_grid = rf_param_grid,n_jobs = -1)
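
A grid search fits one model per parameter combination and per cross-validation fold, so it helps to count the combinations before running it. This sketch repeats the grid from above and counts its candidates with ParameterGrid. The number of folds is not set in the notebook, so the total number of fits depends on the scikit-learn default (3 folds in older releases, 5 since version 0.22).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
rf_param_grid = {"max_depth": [None],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False],
                 "n_estimators": [100, 300],
                 "criterion": ['gini', 'entropy']}

# 1 * 3 * 3 * 1 * 2 * 2 combinations
n_candidates = len(ParameterGrid(rf_param_grid))
print(n_candidates)
```

Parameters that are not in the grid, such as max_leaf_nodes=16, keep the values set on rf_clf.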

In [574]:
gsRFC.fit(x_train_cols,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': [None], 'min_samples_split': [2, 3, 10], 'min_samples_leaf': [1, 3, 10], 'bootstrap': [False], 'n_estimators': [100, 300], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [575]:
# Best Estimator
RFC_best = gsRFC.best_estimator_
RFC_best

RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features='auto',
            max_leaf_nodes=16, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=-1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

In [576]:
# Best score
gsRFC.best_score_

0.79829545454545459

## Boosting

The main idea behind boosting methods is to train predictors sequentially, each trying to correct its predecessor.
Below we use AdaBoost, one of the most popular boosting methods.

## AdaBoost

Here, we train an AdaBoost classifier based on 200 Decision Stumps (Decision trees with the max_depth hyperparameter set to 1).

In [577]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)

In [578]:
ada_clf.fit(x_train_cols, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=0.5, n_estimators=200, random_state=None)
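
To make the idea of each predictor correcting its predecessor concrete, here is a minimal sketch of a single AdaBoost reweighting round on five hypothetical instances. It follows the common textbook formulation (weighted error rate, predictor weight scaled by the learning rate, boosted weights for misclassified instances) and is not scikit-learn's internal SAMME.R code; all arrays are made up for illustration.

```python
import numpy as np

# Hypothetical labels and predictions of the first weak learner (one mistake).
toy_y_true = np.array([1, 1, 0, 0, 1])
toy_y_pred = np.array([1, 0, 0, 0, 1])
learning_rate = 0.5

# Every instance starts with the same weight.
weights = np.full(len(toy_y_true), 1 / len(toy_y_true))
missed = toy_y_pred != toy_y_true

# Weighted error rate of this predictor
error_rate = weights[missed].sum() / weights.sum()

# Predictor weight: larger when the predictor is more accurate
alpha = learning_rate * np.log((1 - error_rate) / error_rate)

# Boost the weights of misclassified instances, then normalize
weights[missed] *= np.exp(alpha)
weights /= weights.sum()

print(error_rate, alpha)
print(weights)
```

After this round the misclassified instance carries twice the weight of each correctly classified one, so the next stump concentrates on it.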

# Thank You!