# Building a first model for the Titanic Survival Problem

First we will import the necessary packages.

In [1]:
import pandas as pd
import awswrangler as wr
import boto3

from titanic.jobs.prepare import add_gender_feature, add_age_feature, add_family_size_feature, add_has_cabin_feature, add_categorical_fare_feature, add_title_feature

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)

## Feature Engineering

We will start by using only a limited subset of features based on our exploration:
- Sex
- Age
- Family size
- Cabin
- Fare
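
The helper functions imported from `titanic.jobs.prepare` are not shown in this notebook. As a rough sketch of what two of them might do, the following assumes that family size is the sum of `SibSp` and `Parch` and that the cabin flag marks a non-missing `Cabin` value. The function names `sketch_family_size` and `sketch_has_cabin` and the sample rows are hypothetical, and the real implementations may differ.

```python
import pandas as pd


def sketch_family_size(df: pd.DataFrame) -> None:
    # Assumption: family size = siblings/spouses + parents/children aboard.
    df["FamilySize"] = df["SibSp"] + df["Parch"]


def sketch_has_cabin(df: pd.DataFrame) -> None:
    # Assumption: 1 if a cabin value is recorded, 0 if it is missing.
    df["HasCabin"] = df["Cabin"].notna().astype(int)


# Tiny made-up sample, not taken from the real dataset.
sample = pd.DataFrame(
    {"SibSp": [1, 0, 1], "Parch": [0, 0, 2], "Cabin": [None, "C85", None]}
)
sketch_family_size(sample)
sketch_has_cabin(sample)
```

Like the real helpers, these sketches modify the DataFrame in place, which is why the notebook calls them without reassigning `df`.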

In [2]:
boto3.setup_default_session()
ssm = boto3.client('ssm', region_name='eu-west-1')
parameter = ssm.get_parameter(Name='/conveyor-samples/bucket/name')
bucket = parameter['Parameter']['Value']

df = wr.s3.read_csv(path="s3://datafy-cp-artifacts/conveyor-samples/titanic/data/train.csv")

add_gender_feature(df)
add_age_feature(df)
add_family_size_feature(df)
add_has_cabin_feature(df)
add_categorical_fare_feature(df)
add_title_feature(df)
df = df.drop(
    ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'SibSp', 'Parch', 'PassengerId'], axis=1
)
df


Unnamed: 0,Survived,Pclass,Age,Fare,SexNumerical,FamilySize,HasCabin,CategoricalFare,Title
0,0,3,22.0,7.2500,1,1,0,0,3
1,1,1,38.0,71.2833,0,1,1,3,2
2,1,3,26.0,7.9250,0,0,0,1,4
3,1,1,35.0,53.1000,0,1,1,3,2
4,0,3,35.0,8.0500,1,0,0,1,3
...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,13.0000,1,0,0,1,4
887,1,1,19.0,30.0000,0,0,1,2,4
888,0,3,21.5,23.4500,0,3,0,2,4
889,1,1,26.0,30.0000,1,0,1,2,3


Now that we have our dataset, we separate the target variable from the features.

In [3]:
target = 'Survived'
features = [ col for col in df.columns if col != target ]
X, y = df[features], df[target]

Now it's time to initialize our models. After initializing them, I fit each one and collect the feature importances it computes.

Note that I am passing the same random state to all the models. Why?

Fixing this seed makes the randomized parts of training repeatable, so anyone who re-runs the code on the same data with the same library versions should get the same outputs. Reproducibility like this is an extremely important concept in data science.
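
To make the idea concrete, here is a minimal illustration that uses only Python's standard `random` module rather than scikit-learn. The function name `noisy_sample` is made up for this sketch; the point is only that the same seed produces the same sequence of "random" numbers, while a different seed does not.

```python
import random


def noisy_sample(seed, n=5):
    # A private generator seeded explicitly, so the result depends only on `seed`.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = noisy_sample(seed=2)
second = noisy_sample(seed=2)  # same seed: identical numbers
other = noisy_sample(seed=3)   # different seed: different numbers
```

The `random_state` argument of the scikit-learn estimators plays the same role for the randomness used while training them.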

In [4]:
parameters = {
    "random_state": 2,
}

rf = RandomForestClassifier(**parameters)
et = ExtraTreesClassifier(**parameters)
ada = AdaBoostClassifier(**parameters)
gb = GradientBoostingClassifier(**parameters)

models = [rf, et, ada, gb]
model_names = ['RandomForest', 'ExtraTrees', 'Ada', 'GradientBoost']
for m in models:
    m.fit(X, y)
feature_importances = { name: m.feature_importances_ for name, m in zip(model_names, models) }

Now we take the mean of the feature importances calculated by the four models.

In [5]:
feature_df = pd.DataFrame(feature_importances)
feature_df.insert(0, 'features', features)

feature_df['mean'] = feature_df.mean(axis=1, numeric_only=True)
feature_df

Unnamed: 0,features,RandomForest,ExtraTrees,Ada,GradientBoost,mean
0,Pclass,0.066796,0.079641,0.04,0.122411,0.077212
1,Age,0.24436,0.236823,0.34,0.125614,0.236699
2,Fare,0.239435,0.19544,0.32,0.136967,0.222961
3,SexNumerical,0.188492,0.224235,0.02,0.480577,0.228326
4,FamilySize,0.075484,0.076792,0.14,0.0748,0.091769
5,HasCabin,0.041184,0.046937,0.04,0.043106,0.042807
6,CategoricalFare,0.036276,0.048421,0.02,0.001482,0.026545
7,Title,0.107973,0.09171,0.08,0.015044,0.073682


From the above data, we can see that Age, Sex and Fare have the highest mean importance, so they play the most important role in predicting the target variable Survived.
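
The ranking can be checked directly from the table above. This small sketch recomputes the mean column from the four per-model values and sorts the features by it; the variable names are illustrative only.

```python
# Per-feature importances copied from the table above, in the order
# RandomForest, ExtraTrees, Ada, GradientBoost.
importances = {
    "Pclass": [0.066796, 0.079641, 0.04, 0.122411],
    "Age": [0.24436, 0.236823, 0.34, 0.125614],
    "Fare": [0.239435, 0.19544, 0.32, 0.136967],
    "SexNumerical": [0.188492, 0.224235, 0.02, 0.480577],
    "FamilySize": [0.075484, 0.076792, 0.14, 0.0748],
    "HasCabin": [0.041184, 0.046937, 0.04, 0.043106],
    "CategoricalFare": [0.036276, 0.048421, 0.02, 0.001482],
    "Title": [0.107973, 0.09171, 0.08, 0.015044],
}

# Mean importance per feature across the four models.
mean_importance = {name: sum(vals) / len(vals) for name, vals in importances.items()}

# Features sorted from most to least important on average.
ranked = sorted(mean_importance, key=mean_importance.get, reverse=True)
top3 = ranked[:3]
```

Note that the individual models disagree noticeably: gradient boosting leans heavily on `SexNumerical`, while AdaBoost gives it very little weight, so the mean smooths over quite different views.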

## Model training and evaluation

Now we use train_test_split from the sklearn package (imported above) to split the data into train and test sets.

Here I am taking 20% of the data for testing and the remaining 80% for training.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
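
As a rough picture of what a hold-out split does, the sketch below shuffles row indices and sets aside a fraction of them for testing. It is a simplified stand-in written with the standard library, not scikit-learn's implementation; the function name `holdout_split`, the seed, and the rounding rule are assumptions made for illustration.

```python
import math
import random


def holdout_split(n_rows, test_size, seed):
    # Shuffle the row indices with a seeded generator so the split is repeatable.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test set size is rounded up to a whole number of rows.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(n_rows=10, test_size=0.2, seed=2)
```

With 10 rows and `test_size=0.2`, two rows end up in the test set and eight in the training set, and no row appears in both.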

A voting classifier is a machine learning model that ensembles the predictions of several other models and predicts the output by voting.

So we are going to use the VotingClassifier from sklearn for prediction.

Let's initialize the model and then fit the data with the voting classifier.

In [7]:
vc = VotingClassifier(voting='soft', estimators=[
    (name, m) for name, m in zip(model_names, models)
])
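
With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average. The following is a minimal sketch of that idea for a single passenger, assuming equal weights; the function name `soft_vote` and the probability values are made up for illustration.

```python
def soft_vote(probabilities):
    # probabilities: one [p(class 0), p(class 1), ...] list per model.
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    # Average each class probability over the models (equal weights assumed).
    averaged = [
        sum(p[c] for p in probabilities) / n_models for c in range(n_classes)
    ]
    # Predict the class with the highest averaged probability.
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


# Hypothetical probabilities from four models for one passenger.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55], [0.2, 0.8]]
label, averaged = soft_vote(model_probs)
```

Here the averaged probability of class 1 is higher than that of class 0, so the ensemble predicts class 1 even though the first model on its own would have predicted class 0.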

In [8]:
vc.fit(X, y)

Next, let's calculate the cross-validation scores for the voting classifier.

In [9]:
scores = cross_val_score(vc, X, y, cv=5, scoring='accuracy')
scores

array([0.82122905, 0.80337079, 0.87078652, 0.76966292, 0.84269663])
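
Five separate fold scores are easier to compare when summarized. As a small sketch, the mean and spread of the scores printed above can be computed with the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# The five fold accuracies printed above.
fold_scores = [0.82122905, 0.80337079, 0.87078652, 0.76966292, 0.84269663]

mean_score = statistics.mean(fold_scores)     # average accuracy over the folds
score_std = statistics.pstdev(fold_scores)    # population standard deviation
```

The mean comes out at roughly 0.82, with individual folds ranging from about 0.77 to 0.87, which gives a feel for how much a single train/test split can vary.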

Now we fit the voting classifier on the training split and compare its predictions on the test split with the actual labels to get the accuracy.

In [10]:
vc.fit(X_train, y_train)

In [11]:
pred = vc.predict(X_test)
100. * (pred == y_test).mean()

81.00558659217877
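
The accuracy expression above is simply the percentage of test rows where the prediction equals the true label. A minimal plain-Python sketch of the same calculation, using made-up labels and a hypothetical function name `accuracy_percent`:

```python
def accuracy_percent(predicted, actual):
    # Count positions where the prediction matches the true label.
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * matches / len(actual)


# Made-up labels: four of the five predictions are correct.
toy_pred = [1, 0, 1, 1, 0]
toy_actual = [1, 0, 0, 1, 0]
toy_accuracy = accuracy_percent(toy_pred, toy_actual)
```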

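We get an accuracy of about 81.0%, which isn't great. Because the train/test split above is not seeded, this figure will vary somewhat from run to run. But that's okay. Progress happens in small increments. Let's stop the experiment and save the data.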