# Titanic Survival Predictions

### Ever wondered how you would have fared on the Titanic? Well, let's find out!

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

## Data Extraction and Preprocessing 

Let us load the data. The training data is in train.csv; the test data is in test.csv.

In [2]:
train = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col = 0)

Let us take a look at what the raw data looks like:

In [3]:
display(train.head())
display(test.head())

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


###### Let us check if any of the data columns have missing values so we know how to handle them properly.

In [4]:
columns = [train.Name, train.Sex, train.Age, train.SibSp, train.Parch, 
           train.Ticket, train.Fare, train.Cabin, train.Embarked]
str_columns = ["Name", 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

for i in range(len(str_columns)):
    print("No. of missing values in " + str_columns[i] + " : ", end = ' ')
    print(len(train.loc[columns[i].isnull()]))

No. of missing values in Name :  0
No. of missing values in Sex :  0
No. of missing values in Age :  177
No. of missing values in SibSp :  0
No. of missing values in Parch :  0
No. of missing values in Ticket :  0
No. of missing values in Fare :  0
No. of missing values in Cabin :  687
No. of missing values in Embarked :  2
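
The same per-column counts can also be obtained in a single call with `isnull().sum()`. Below is a minimal sketch on a small made-up frame (the `toy` data is purely illustrative and is not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# mean() of the same mask gives the fraction of rows missing per column,
# which is handy when deciding whether a column is worth keeping
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

Looking at the missing fraction rather than the raw count makes it easier to compare columns when deciding what to drop.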


Using common sense, we can assume that the Name and the Ticket no. of a passenger will have no impact on their survival. So we can drop Name and Ticket no.


The Cabin type may have a slight impact on the survival rate, but about 77% of the samples (687 of 891) have a missing Cabin type. Due to this lack of information, we will drop Cabin too.

In [5]:
# Dropping the irrelevant features
train = train.drop(['Name', 'Ticket', 'Cabin'], axis = 1)
test = test.drop(['Name', 'Ticket', 'Cabin'], axis = 1)

In [6]:
display(train.head())
display(test.head())

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,0,3,male,22.0,1,0,7.25,S
2,1,1,female,38.0,1,0,71.2833,C
3,1,3,female,26.0,0,0,7.925,S
4,1,1,female,35.0,1,0,53.1,S
5,0,3,male,35.0,0,0,8.05,S


PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
892,3,male,34.5,0,0,7.8292,Q
893,3,female,47.0,1,0,7.0,S
894,2,male,62.0,0,0,9.6875,Q
895,3,male,27.0,0,0,8.6625,S
896,3,female,22.0,1,1,12.2875,S


###### We now have to deal with the 177 missing Age values and the 2 missing Embarked Values

Since Age is continuous, a simple way to deal with the missing Age values is to replace them with the mean of the Age values.
Similarly, we will replace the missing Embarked values with the mode of the Embarked values as it is categorical.

In [7]:
mean_age = train.Age.mean()
mode_embarked = train.Embarked.value_counts()

print("Mean of Age: ", mean_age)
print("Mode of Embarked : ")
print(mode_embarked)

Mean of Age:  29.69911764705882
Mode of Embarked : 
S    644
C    168
Q     77
Name: Embarked, dtype: int64


###### So, we will replace the missing Age values with 29.7 (1 d.p. since Age data is displayed with 1 d.p.) and the missing Embarked values with "S"

In [8]:
# Keep second names for the DataFrames in case we need them later
# (note: plain assignment does not copy, so the fills below also apply to these)
train_without_replacement = train
test_without_replacement = test

# Replacing the missing Age values
train.Age = train.Age.fillna(29.7)
test.Age = test.Age.fillna(29.7)

# Replacing the missing Embarked values
train.Embarked = train.Embarked.fillna('S')
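
The fill values can also be computed programmatically from the training data and then applied to both sets, instead of typing them in by hand. A minimal sketch, assuming made-up frames `train_toy` and `test_toy` (hypothetical stand-ins for the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test frames (values are made up)
train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0],
                          "Embarked": ["S", "S", None]})
test_toy = pd.DataFrame({"Age": [np.nan, 30.0],
                         "Embarked": ["C", "Q"]})

# Statistics come from the training data only
age_fill = round(train_toy["Age"].mean(), 1)       # mean of 20 and 40
embarked_fill = train_toy["Embarked"].mode()[0]    # most frequent value

# fillna with a dict fills each listed column with its own value
# and returns a new frame, leaving the originals untouched
fill_values = {"Age": age_fill, "Embarked": embarked_fill}
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)

print(train_filled)
print(test_filled)
```

Because `fillna` returns a new frame here, `train_toy` and `test_toy` still contain their missing values afterwards.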

In [9]:
display(train.head())

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,0,3,male,22.0,1,0,7.25,S
2,1,1,female,38.0,1,0,71.2833,C
3,1,3,female,26.0,0,0,7.925,S
4,1,1,female,35.0,1,0,53.1,S
5,0,3,male,35.0,0,0,8.05,S


#### Now that we have filled in the missing values, let us deal with the categorical features in our dataset.

We will use one-hot encoding to properly represent Pclass, Sex and Embarked. We will use the get_dummies function of pandas to get the one-hot-encoded features.

We will also change the Pclass data type to string, because by default get_dummies only encodes string (object) or categorical columns and would leave the integer-valued Pclass as it is.

In [10]:
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Changing Pclass to str dtype, as get_dummies by default only encodes string/categorical columns
train_dummies['Pclass'] = train_dummies['Pclass'].astype(str)
test_dummies['Pclass'] = test_dummies['Pclass'].astype(str)

train_dummies = pd.get_dummies(train_dummies, columns=['Pclass'])
test_dummies = pd.get_dummies(test_dummies, columns=['Pclass'])
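
As a side note, when `columns=` is passed, get_dummies encodes the listed columns whatever their dtype, so an integer column can be encoded directly. The sketch below also reindexes the encoded test frame to the training columns, which guards against a category that is absent from the test set. The `train_toy` and `test_toy` frames are hypothetical:

```python
import pandas as pd

# Toy frames: the test set happens to contain only third-class passengers
train_toy = pd.DataFrame({"Pclass": [1, 3, 2],
                          "Sex": ["male", "female", "male"]})
test_toy = pd.DataFrame({"Pclass": [3, 3],
                         "Sex": ["female", "male"]})

# Listing the columns explicitly encodes the integer Pclass as well
train_enc = pd.get_dummies(train_toy, columns=["Pclass", "Sex"])
test_enc = pd.get_dummies(test_toy, columns=["Pclass", "Sex"])

# Give the test frame exactly the training columns, in the same order;
# indicator columns missing from the test set are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

In the Titanic data every class, sex and port appears in both files, so the column sets already agree; the reindex step only matters when that is not guaranteed.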

We now repeat the same encoding for a version of the data without the Age column. Since we replaced 177 missing Age values with a synthetic value, we will check whether the imputed Age feature really hurts the models.

In [11]:
train_without_age = train_without_replacement.drop(['Age'], axis = 1)
test_without_age = test_without_replacement.drop(['Age'], axis = 1)


# One-hot-encoding
train_without_age_dummies = pd.get_dummies(train_without_age)
test_without_age_dummies = pd.get_dummies(test_without_age)

# Changing Pclass to str dtype, as get_dummies by default only encodes string/categorical columns
train_without_age_dummies['Pclass'] = train_without_age_dummies['Pclass'].astype(str)
test_without_age_dummies['Pclass'] = test_without_age_dummies['Pclass'].astype(str)

train_without_age_dummies = pd.get_dummies(train_without_age_dummies, columns=['Pclass'])
test_without_age_dummies = pd.get_dummies(test_without_age_dummies, columns=['Pclass'])

display(train_without_age_dummies.head())
display(test_without_age_dummies.head())

PassengerId,Survived,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
1,0,1,0,7.25,0,1,0,0,1,0,0,1
2,1,1,0,71.2833,1,0,1,0,0,1,0,0
3,1,0,0,7.925,1,0,0,0,1,0,0,1
4,1,1,0,53.1,1,0,0,0,1,1,0,0
5,0,0,0,8.05,0,1,0,0,1,0,0,1


PassengerId,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
892,0,0,7.8292,0,1,0,1,0,0,0,1
893,1,0,7.0,1,0,0,0,1,0,0,1
894,0,0,9.6875,0,1,0,1,0,0,1,0
895,0,0,8.6625,0,1,0,0,1,0,0,1
896,1,1,12.2875,1,0,0,0,1,0,0,1


In [12]:
display(train_dummies.head())
display(test_dummies.head())

PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
1,0,22.0,1,0,7.25,0,1,0,0,1,0,0,1
2,1,38.0,1,0,71.2833,1,0,1,0,0,1,0,0
3,1,26.0,0,0,7.925,1,0,0,0,1,0,0,1
4,1,35.0,1,0,53.1,1,0,0,0,1,1,0,0
5,0,35.0,0,0,8.05,0,1,0,0,1,0,0,1


PassengerId,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
892,34.5,0,0,7.8292,0,1,0,1,0,0,0,1
893,47.0,1,0,7.0,1,0,0,0,1,0,0,1
894,62.0,0,0,9.6875,0,1,0,1,0,0,1,0
895,27.0,0,0,8.6625,0,1,0,0,1,0,0,1
896,22.0,1,1,12.2875,1,0,0,0,1,0,0,1


There is only one NaN value for 'Fare' in the test dataset; we will deal with it in the same way as we did for Age in the training dataset and replace it with the mean value.

In [13]:
test_dummies.loc[test_dummies.Fare.isnull()]

PassengerId,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
1044,60.5,0,0,,0,1,0,0,1,0,0,1


In [14]:
mean_fare = train_dummies.Fare.mean()
print(mean_fare)

32.2042079685746


In [15]:
test_dummies.Fare = test_dummies.Fare.fillna(32.2042)
test_without_age_dummies.Fare = test_without_age_dummies.Fare.fillna(32.2042)

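###### Although it might seem counter-intuitive to fill it with the training Fare mean, it is the right thing to do: preprocessing statistics should come from the training data only, so that the test set is treated as unseen data.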

### Let us take a look at the data once again before we build the models. We will take a step back and try to discuss what the features really mean.

In [16]:
display(train_dummies.head())

PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
1,0,22.0,1,0,7.25,0,1,0,0,1,0,0,1
2,1,38.0,1,0,71.2833,1,0,1,0,0,1,0,0
3,1,26.0,0,0,7.925,1,0,0,0,1,0,0,1
4,1,35.0,1,0,53.1,1,0,0,0,1,1,0,0
5,0,35.0,0,0,8.05,0,1,0,0,1,0,0,1


### Discussion

Let us go feature by feature:

1) Age - This may help: very young or very old passengers may have had a higher survival rate because they got more priority for the lifeboats; or it may be that the higher the age, the lower the chances of survival, due to lack of physical energy.

2) Siblings & Spouses On Board - Maybe the higher this value, the higher the chance of survival, due to priority on lifeboats.

3) Parents, Children & Sex - Parents, children and females may have had a higher chance of survival, again due to priority on lifeboats.

4) Embarked - This may not make a very big difference, but we may be wrong. Maybe the place where you embarked was a rich area and you had priority on the lifeboats.

5) Passenger Class - This should make a difference: the higher the passenger class, the higher the survival rate.

##### We will now convert the data into NumPy arrays so that scikit-learn can handle it.

In [17]:
X_train = train_dummies.loc[:, 'Age':].values
y_train = train_dummies.loc[:, 'Survived'].values
X_test = test_dummies.loc[:, 'Age':].values

In [18]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(891, 12)
(891,)
(418, 12)


In [19]:
X_train_without_age = train_without_age_dummies.loc[:, 'SibSp':].values
X_test_without_age = test_without_age_dummies.loc[:, 'SibSp':].values

In [20]:
print(X_train_without_age.shape)
print(X_test_without_age.shape)

(891, 11)
(418, 11)


###### So we have 12 features for now.

## Models before any Feature Engineering

##### Let us see how our models perform before we carry out any sort of feature engineering.

###### We will use GridSearchCV to tune the hyperparameters.

### Logistic Regression

In [21]:
# The hyperparameters to search over
param_grid = {'C':[0.01, 0.1, 1 ,10, 100]}

# Instantiating the model
grid_logreg = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)

In [22]:
# Fitting the model
grid_logreg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

##### Let us see what was the best score while doing grid search and also the best hyperparameters found.

In [23]:
print("Best hyperparameters: {}".format(grid_logreg.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_logreg.best_score_))

Best hyperparameters: {'C': 0.1}
Best cross-validation score for those hyperparameters: 0.80


### Support Vector Machine

###### We will repeat the same process by using grid search to select the best kernel AND hyperparameters.

In [24]:
# The kernel and associated hyperparameters to search over
param_grid = [{'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10, 100]},
                {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10, 100]}]

# Instantiating the model
grid_svm = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)

In [25]:
# Fitting the model
grid_svm.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
                          'gamma': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf']},
                         {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [26]:
print("Best hyperparameters:  {}".format(grid_svm.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_svm.best_score_))

Best hyperparameters:  {'C': 100, 'kernel': 'linear'}
Best cross-validation score for those hyperparameters: 0.79


### Random Forest

In [27]:
forest = RandomForestClassifier(n_estimators=10)

# We do not tune any hyperparameters here, so we will just use the cross_val_score function
scores_forest = cross_val_score(forest, X_train, y_train)

In [28]:
print("Mean cross-validation score for a forest of 10 trees: {:.2f}".format(np.mean(scores_forest)))

Mean cross-validation score for a forest of 10 trees: 0.81


In [29]:
# Fitting the model
forest.fit(X_train, y_train)
# We will need this to generate test data predictions

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## Feature Engineering

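We will make copies of the training and test data so that we don't corrupt the original data.

In [30]:
# .copy() gives independent arrays; plain assignment would only add a second name
X_train2 = X_train.copy()
X_test2 = X_test.copy()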

print(X_train2.shape, X_test2.shape)
print(y_train.shape)

(891, 12) (418, 12)
(891,)
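
For reference, plain assignment of a NumPy array only binds a second name to the same data, while `.copy()` produces an independent array. A minimal sketch with made-up values (the names `original`, `alias` and `independent` are illustrative):

```python
import numpy as np

original = np.array([[22.0, 7.25],
                     [38.0, 71.2833]])

alias = original               # second name for the same underlying array
independent = original.copy()  # separate array with its own memory

alias[0, 0] = -1.0             # this also changes `original`
independent[1, 1] = 0.0        # this does not

print(original)
print(np.shares_memory(original, alias))
print(np.shares_memory(original, independent))
```

The same distinction holds for pandas DataFrames, which also offer a `.copy()` method.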


### Binning/Discretization

###### Let us look at the statistics of Age and Fare (the only two continuous features)

In [31]:
print("Age Statistics:")
display(train_dummies.Age.describe())

print("Fare Statistics:")
display(train_dummies.Fare.describe())

Age Statistics:


count    891.000000
mean      29.699293
std       13.002015
min        0.420000
25%       22.000000
50%       29.700000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

Fare Statistics:


count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

###### We immediately notice from the interquartile range and the max value for Fare that the data is very skewed; so, equal-width binning does not make sense for Fare values.

###### We will instead try binning for Age, since it seems to be more nearly normal.

In [32]:
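# Making the bins: 16 equally spaced edges between 0 and 80 (the Age range)
bins = np.linspace(0, 80, 16)

# Assigning bin indices. Note: np.digitize is applied to the whole feature
# matrix here, so every column (not only Age) is binned with these edges
which_bin_train = np.digitize(X_train2, bins=bins)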

print("\nData points:\n", X_train2[:5])
print("\nBin membership for data points:\n", which_bin_train[:5])


Data points:
 [[22.      1.      0.      7.25    0.      1.      0.      0.      1.
   0.      0.      1.    ]
 [38.      1.      0.     71.2833  1.      0.      1.      0.      0.
   1.      0.      0.    ]
 [26.      0.      0.      7.925   1.      0.      0.      0.      1.
   0.      0.      1.    ]
 [35.      1.      0.     53.1     1.      0.      0.      0.      1.
   1.      0.      0.    ]
 [35.      0.      0.      8.05    0.      1.      0.      0.      1.
   0.      0.      1.    ]]

Bin membership for data points:
 [[ 5  1  1  2  1  1  1  1  1  1  1  1]
 [ 8  1  1 14  1  1  1  1  1  1  1  1]
 [ 5  1  1  2  1  1  1  1  1  1  1  1]
 [ 7  1  1 10  1  1  1  1  1  1  1  1]
 [ 7  1  1  2  1  1  1  1  1  1  1  1]]
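
Since np.digitize works element-wise on whatever array it receives, binning only Age means passing just that column. A minimal sketch using the five Age values printed above (the manual one-hot step is an illustrative alternative to OneHotEncoder, not the notebook's method):

```python
import numpy as np

# The Age values of the first five training rows shown above
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

# Same bin edges as in the notebook
bins = np.linspace(0, 80, 16)

# Bin index for each age; index i means bins[i-1] <= age < bins[i]
age_bin = np.digitize(ages, bins=bins)
print(age_bin)  # -> [5 8 5 7 7]

# One indicator column per possible index (0 .. len(bins))
age_one_hot = np.zeros((len(ages), len(bins) + 1))
age_one_hot[np.arange(len(ages)), age_bin] = 1.0
print(age_one_hot.shape)
```

The indices match the first column of the bin membership output above, which is the Age column.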


In [33]:
# Transform using the OneHotEncoder
# (newer scikit-learn releases name this parameter sparse_output)
encoder = OneHotEncoder(sparse=False)

# encoder.fit finds the unique values that appear in which_bin_train
encoder.fit(which_bin_train)

# Transform creates the one-hot encoding
X_train_binned = encoder.transform(which_bin_train)

print(X_train_binned[:5])

[[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]]


In [34]:
print("Shape of binned training set: {}".format(X_train_binned.shape))

Shape of binned training set: (891, 43)


Let us now run logistic regression with the binned training set:

In [35]:
# The hyperparameters to search over
param_grid = {'C':[0.01, 0.1, 1 ,10, 100]}

# Instantiating the model
grid_logreg_binned = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)

# Fitting the model
grid_logreg_binned.fit(X_train_binned, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [36]:
print("Best hyperparameters: {}".format(grid_logreg_binned.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_logreg_binned.best_score_))

Best hyperparameters: {'C': 10}
Best cross-validation score for those hyperparameters: 0.71


###### Binning the features (all columns were passed through np.digitize with the Age bin edges) decreased the score to 71% from 80%. So binning did not help!

### Interactions

##### Let us see if adding interactions between the features helps the model.

First, we will check the score by just adding the old features to the binned features.

In [37]:
X_combined = np.hstack([X_train2, X_train_binned])
print(X_combined.shape)

(891, 55)


In [38]:
# The hyperparameters to search over
param_grid = {'C':[0.01, 0.1, 1 ,10, 100]}

# Instantiating the model
grid_logreg_combined = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)

# Fitting the model
grid_logreg_combined.fit(X_combined, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [39]:
print("Best hyperparameters: {}".format(grid_logreg_combined.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_logreg_combined.best_score_))

Best hyperparameters: {'C': 1}
Best cross-validation score for those hyperparameters: 0.81


###### Combining the old features with the binned features helps! The score increased, though only slightly, to 81% from 80%.


The score with only binned data was 71%. The new score only confirms our intuition from the previous model that binning is not useful: the higher score now is due to the old features being added back to the feature vector.

##### Let us just try interactions between the data to see if it helps:

In [40]:
X_product = np.hstack([X_train_binned, X_train2[:,0].reshape(891,1) * X_train_binned]) 

print(X_product.shape)

(891, 86)
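
To make the construction above concrete: multiplying a continuous column by one-hot bin indicators yields one "feature times bin" column per bin, so a linear model can learn a separate slope in each bin. A minimal sketch with made-up values (`age`, `age_binned` and `X_product_toy` are illustrative names):

```python
import numpy as np

# Three samples: a continuous feature and made-up one-hot bin indicators
age = np.array([22.0, 38.0, 26.0])
age_binned = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0]])

# Broadcasting a (3, 1) column against a (3, 2) matrix gives (3, 2):
# each row keeps its age only in the column of the bin it falls into
age_by_bin = age[:, np.newaxis] * age_binned

# Indicators provide a per-bin offset, the products a per-bin slope
X_product_toy = np.hstack([age_binned, age_by_bin])
print(X_product_toy)
```

Using `[:, np.newaxis]` (or `reshape(-1, 1)`) avoids hard-coding the number of rows.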


In [41]:
# The hyperparameters to search over
param_grid = {'C':[0.01, 0.1, 1 ,10, 100]}

# Instantiating the model
grid_logreg_interactions = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)

# Fitting the model
grid_logreg_interactions.fit(X_product, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [42]:
print("Best hyperparameters: {}".format(grid_logreg_interactions.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_logreg_interactions.best_score_))

Best hyperparameters: {'C': 0.01}
Best cross-validation score for those hyperparameters: 0.71


##### This doesn't help either.

### Getting lower scores from binning and interactions shows that this sort of feature engineering is not helping and it is best for us to simply stick with the original feature set.

### Rescaling the Data (Highest Kaggle Score: 0.78468)

##### It is known that SVMs benefit from having all the features on a similar scale. Let us see if that applies to this dataset.

We will use the MinMaxScaler, which scales the data such that all training values are in the interval [0,1].

In [43]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train2)
X_test_scaled = scaler.transform(X_test2)
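
The arithmetic behind min-max scaling is simple enough to write out by hand. The sketch below uses made-up values and assumes no column is constant (MinMaxScaler handles that case itself); it shows that the minimum and range come from the training data and are then reused for the test data:

```python
import numpy as np

# Made-up training and test rows with two features on different scales
X_tr = np.array([[22.0, 7.25],
                 [38.0, 71.25],
                 [30.0, 39.25]])
X_te = np.array([[46.0, 7.25]])

# Per-column statistics from the training data only
col_min = X_tr.min(axis=0)
col_range = X_tr.max(axis=0) - col_min  # assumes no constant columns

# x_scaled = (x - min) / (max - min), applied column-wise
X_tr_scaled = (X_tr - col_min) / col_range
X_te_scaled = (X_te - col_min) / col_range

print(X_tr_scaled)
print(X_te_scaled)
```

Because the test rows are scaled with the training statistics, test values can fall outside [0, 1], as the first test value here does.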

In [44]:
print(X_train_scaled.shape, X_test_scaled.shape)
print(y_train.shape)

(891, 12) (418, 12)
(891,)


In [45]:
# Hyperparameters to search over together with their associated kernels
param_grid = [{'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
                {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}]

# Instantiating the model
grid_svm_scaled = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)

In [46]:
# Fitting the model
grid_svm_scaled.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid=[{'C': [0.001, 0.01, 0.1, 1, 10, 100],
                          'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
                          'kernel': ['rbf']},
                         {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                          'kernel': ['linear']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [47]:
print("Best hyperparameters: {}".format(grid_svm_scaled.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_svm_scaled.best_score_))

Best hyperparameters: {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}
Best cross-validation score for those hyperparameters: 0.82
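
In the cells above, the scaler was fit on the whole training set before the grid search, so each validation fold also influenced the scaling. One common way to avoid this is to put the scaler and the SVM into a scikit-learn Pipeline, so that the scaler is refit on the training part of every fold. The sketch below is not what the notebook ran; it uses synthetic data (`X_toy`, `y_toy`) and a smaller, illustrative grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data with features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_toy = (X_toy[:, 0] + X_toy[:, 1] / 10.0 > 0).astype(int)

# The scaler is now part of the model and is fit inside each CV fold
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# Step parameters are addressed as <step name>__<parameter name>
param_grid_pipe = {"svm__C": [0.1, 1, 10, 100],
                   "svm__gamma": [0.01, 0.1, 1, 10]}

grid_pipe = GridSearchCV(pipe, param_grid_pipe, cv=5)
grid_pipe.fit(X_toy, y_toy)

print(grid_pipe.best_params_)
print(round(grid_pipe.best_score_, 2))
```

With min-max scaling on a dataset of this size the difference is often small, but the pipeline keeps the cross-validation estimate free of information from the validation folds.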


#### This is the best cross-validation score so far.
### It also happens to give us the highest Kaggle Score overall of 0.78468
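
The notebook does not show how the Kaggle submission was produced. A minimal sketch, assuming the competition's two-column format (PassengerId, Survived); the helper `make_submission`, the ids and the predictions below are hypothetical:

```python
import numpy as np
import pandas as pd

def make_submission(passenger_ids, predictions):
    """Build a two-column submission frame from ids and 0/1 predictions."""
    return pd.DataFrame({
        "PassengerId": np.asarray(passenger_ids),
        "Survived": np.asarray(predictions).astype(int),
    })

# In the notebook these would presumably be test.index and
# grid_svm_scaled.predict(X_test_scaled); here they are made up
toy_ids = [892, 893, 894]
toy_predictions = np.array([0, 1, 0])

submission = make_submission(toy_ids, toy_predictions)
print(submission)

# Uncomment to write the file (the file name is an assumption):
# submission.to_csv("submission.csv", index=False)
```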

## The Effect of the Age Feature

###### Remember that our training data had 177 samples with missing Age values.
We replaced these with the mean value, which was reasonable as the Age distribution was not skewed and was nearly normal.

Let us see how the models perform if we remove the Age values completely.

Remember from the discussion that we hypothesized Age would be helpful, as a very high or very low age could mean a higher chance of survival due to higher priority on lifeboats.

### Logistic Regression

In [48]:
# The hyperparameters to search over
param_grid = {'C':[0.01, 0.1, 1 ,10, 100]}

# Instantiating the model
grid_logreg_without_age = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5)

grid_logreg_without_age.fit(X_train_without_age, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=10000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [49]:
print("Best hyperparameters: {}".format(grid_logreg_without_age.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_logreg_without_age.best_score_))

Best hyperparameters: {'C': 0.1}
Best cross-validation score for those hyperparameters: 0.79


### Support Vector Machine

In [50]:
# The kernel and associated hyperparameters to search over
param_grid = [{'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10, 100]},
                {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10, 100]}]

# Instantiating the model
grid_svm_without_age = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)

# Fitting the model
grid_svm_without_age.fit(X_train_without_age, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
                          'gamma': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf']},
                         {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [51]:
print("Best hyperparameters:  {}".format(grid_svm_without_age.best_params_))
print("Best cross-validation score for those hyperparameters: {:.2f}".format(grid_svm_without_age.best_score_))

Best hyperparameters:  {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
Best cross-validation score for those hyperparameters: 0.79


# Overall, the model that achieved the highest accuracy was Support Vector Machines with Normalized Data (Min-Max Scaling) with a Kaggle Score of 0.78468.

### With this we reach the end of the notebook. There is definitely room for more exploration and growth and it is possible to obtain a higher accuracy with more preprocessing and feature engineering.