# Kaggle's Titanic Data Science Challenge

This notebook is a walkthrough of a basic workflow for solving data science competitions on sites like Kaggle.

**Resources**:
- [Kaggle - Titanic](https://www.kaggle.com/c/titanic).
- [Kaggle - Titanic dataset](https://www.kaggle.com/competitions/titanic/data).
- [Kaggle - Titanic Data Science Solutions](https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/notebook).
- [YouTube - Beginner Kaggle Data Science Project Walk-Through (Titanic)](https://www.youtube.com/watch?v=I3FBJdiExcg).
- [YouTube - Kaggle Titanic Survival Prediction Competition Part 1/2](https://www.youtube.com/watch?v=GSk-EEu1zkA&t=31s).
- [YouTube - Kaggle Titanic Survival Prediction Competition Part 2/2](https://www.youtube.com/watch?v=i5E2hruuLaQ).

## Setup

In [None]:
# Imports for data analysis.

import pandas as pd
import numpy as np
import random as rnd

In [None]:
# Imports for visualizations.

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# Imports for machine learning.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

## Data exploration

In [None]:
# Acquire data.

training_df = pd.read_csv('./data/train.csv')
testing_df = pd.read_csv('./data/test.csv')

In [None]:
# Peek into the training dataframe.

training_df.describe()

In [None]:
# Peek into the testing dataframe.

testing_df.describe()

In [None]:
# Preview training dataframe.

training_df.head()

In [None]:
# Check the type of each feature in the training dataframe.

training_df.info()

In [None]:
# Check the type of each feature in the test dataframe.

testing_df.info()

### Distribution of categorical and non-categorical features

How representative is the training dataset of the actual problem domain?
* The training set has 891 samples, or 40% of the actual number of passengers on board the Titanic.
* Survived is a categorical feature with 0 or 1 values.
* Around 38% of the samples survived, which is representative of the actual survival rate of 32%.
* Most passengers (>75%) did not travel with parents or children.
* Fares varied significantly, with few passengers (<1%) paying as much as $512.
* Few elderly passengers (<1%) fall within the age range 65-80.

What is the distribution of categorical features?
- Names are unique across the dataset.
- The Sex variable has two possible values, with 65% males.
- Cabin values have several duplicates across samples (several passengers shared a cabin).
- Embarked takes three possible values depending on the port.
- Ticket feature has high ratio (22%) of duplicate values.

In [None]:
training_df.describe(include=['O'])
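The ratios quoted in the two lists above can be recomputed directly from a dataframe. The sketch below shows one way to do that. `summarize_claims` is a hypothetical helper, and the five-row frame is toy data for illustration only, not the Titanic data; in this notebook we would pass `training_df` instead.

```python
import pandas as pd


def summarize_claims(df: pd.DataFrame) -> dict:
    """Recompute a few of the ratios discussed above (hypothetical helper)."""
    return {
        # Share of samples with Survived == 1.
        'survival_rate': df['Survived'].mean(),
        # Share of passengers travelling without parents or children.
        'no_parch_rate': (df['Parch'] == 0).mean(),
        # Share of male passengers.
        'male_rate': (df['Sex'] == 'male').mean(),
        # Share of ticket values that are repeats: 1 - unique / count.
        'ticket_duplicate_ratio': 1 - df['Ticket'].nunique() / df['Ticket'].count(),
    }


# Toy data for illustration only (not the real dataset).
toy_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0],
    'Parch': [0, 0, 1, 0, 2],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Ticket': ['A1', 'A1', 'B2', 'C3', 'D4'],
})

claims = summarize_claims(toy_df)
print(claims)
```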

### Assumptions based on the data analysis

> The following assumptions are validated further before taking appropriate actions.

**Completing**:
1. Complete the Age feature since it is definitely correlated with survival.
1. Complete the Embarked feature since it is also correlated with survival.

**Correcting**:
1. The Ticket feature is dropped since it contains a high ratio of duplicates and there may not be a correlation between Ticket and survival.
1. The Cabin feature is dropped as it is highly incomplete and contains many `Null`s.
1. PassengerId is dropped from the training dataset as it doesn't contribute to survival.
1. The Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

**Creating**:
1. Create a new feature called Family based on Parch and SibSp to get total count of family members on board.
1. Engineer the Name feature to extract Title as a new feature.
1. Create a new feature for Age bands.
1. Create a Fare range feature.


## Analysis

In [None]:
training_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We observe a significant correlation between Pclass=1 and Survived (mean survival rate > 0.5).

In [None]:

training_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

We confirm that Sex=female had a very high survival rate at 74%.

In [None]:
training_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:

training_df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

As mentioned before, SibSp and Parch features have zero correlation for certain values. So it may be best to derive a feature or a set of features from these.

### Data visualization

A histogram chart is useful for analyzing continuous numerical variables like Age where banding or ranges will help identify useful patterns. The histogram can indicate distributions of samples using automatically defined bins or equally ranged bands.
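To make the idea of equally ranged bands concrete, the sketch below computes the counts that a histogram would draw, using `np.histogram`. The ages are toy values for illustration only, and the 0-80 range with four bands is an assumption made for the example.

```python
import numpy as np

# Toy ages for illustration only (not taken from the dataset).
ages = np.array([2, 15, 22, 24, 29, 35, 41, 58, 63, 71])

# Four equal-width bands over a fixed 0-80 range.
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Print each band with the number of samples that fall into it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:>4.0f} to {right:<4.0f}: {count}')
```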

In [None]:
grid = sns.FacetGrid(training_df, col='Survived')
grid.map(plt.hist, 'Age', bins=20)

In [None]:
grid = sns.FacetGrid(
    training_df,
    col='Survived',
    row='Pclass',
    height=2.2,
    aspect=1.6,
)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

### Correlating categorical features

Now we can correlate categorical features with our solution goal.

**Observations**
- Female passengers had much better survival rate than males.
- The exception is Embarked=C, where males had a higher survival rate. This could reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived.
- Males had a better survival rate in Pclass=3 than in Pclass=2 for the C and Q ports.
- Ports of embarkation have varying survival rates for Pclass=3 and among male passengers.

**Decisions**
- Add Sex feature to model training.
- Complete and add Embarked feature to model training.

In [None]:
grid = sns.FacetGrid(
    training_df,
    row='Embarked',
    height=2.2,
    aspect=1.6
)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (categorical non-numeric), Sex (categorical non-numeric) and Fare (numeric continuous) with Survived (categorical numeric).

**Observations**
- Passengers who paid higher fares had better survival rates.
- Port of embarkation correlates with survival rates.

In [None]:
grid = sns.FacetGrid(
    training_df,
    row='Embarked',
    col='Survived',
    height=2.2,
    aspect=1.6
)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, errorbar=None)
grid.add_legend()

## Correction

### Dropping features
Based on our assumptions we're going to drop the Cabin and Ticket features.

> We're going to do both operations for the training and testing datasets to stay consistent.

In [None]:
training_df = training_df.drop(['Ticket', 'Cabin'], axis=1)
testing_df = testing_df.drop(['Ticket', 'Cabin'], axis=1)

### Creating new features

We want to analyze whether the Name feature can be engineered to extract titles, and test the correlation between titles and survival, before dropping the Name and PassengerId features.

**Observations**:
- Most titles band Age groups accurately.
- Survival among Title bands varies slightly.
- Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev,...)

**Decisions**:
- Retain the Title feature for model training.

In [None]:
# Raw strings avoid an invalid escape sequence warning for `\.`.
training_df['Title'] = training_df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
testing_df['Title'] = testing_df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(training_df['Title'], training_df['Sex'])

In [None]:
# Replace titles with a more common name.

for dataset in [training_df, testing_df]:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

training_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()


In [None]:
# Convert categorical titles to ordinal.

title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}

for dataset in [testing_df, training_df]:
    dataset['Title'] = dataset['Title'].map(title_map)
    dataset['Title'] = dataset['Title'].fillna(0)

training_df.head()

In [None]:
# Drop Name feature from testing and training datasets and
# PassengerId from training.

training_df = training_df.drop(['Name', 'PassengerId'], axis=1)
testing_df = testing_df.drop(['Name'], axis=1)

Now we can also convert categorical features to numerical values. This is required by most algorithms, and doing so also helps us achieve the feature completing goal.

In [None]:
# Map Sex feature.

for dataset in [training_df, testing_df]:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

training_df.head()
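Integer maps like the ones used above give each category a single numeric code, which also implies an ordering between categories. An alternative that some workflows use is one-hot encoding, sketched below with `pd.get_dummies`. This is not what this notebook does; the frame is toy data for illustration only.

```python
import pandas as pd

# Toy frame for illustration only.
toy_df = pd.DataFrame({'Title': ['Mr', 'Miss', 'Mrs', 'Mr']})

# One indicator column per category instead of a single ordinal code.
encoded = pd.get_dummies(toy_df, columns=['Title'], dtype=int)
print(encoded)
```

The trade-off is more columns in exchange for not imposing an order on the categories.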

### Completing numerical continuous feature

We should start estimating and completing features with missing or null values. The first will be the Age feature.

The method considered consists of using other correlated features. In our case we note a correlation among Age, Sex and Pclass, so we guess Age values using the median Age across each combination of Pclass and Sex: the median Age for Pclass=1 and Sex=0, Pclass=1 and Sex=1, and so on.

In [None]:
grid = sns.FacetGrid(training_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

In [None]:
# Preparing an empty array to contain guessed Age values
# based on Pclass times Gender combinations.

guess_ages = np.zeros((2, 3))

In [None]:
# Iterate over Sex and Pclass to calculate guessed values of 
# Age for the six combinations.

for dataset in [training_df, testing_df]:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j + 1)]['Age'].dropna()

            age_guess = guess_df.median()

            # Round the median age to the nearest 0.5.
            guess_ages[i, j] = int(age_guess/0.5 + 0.5) * 0.5

    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j + 1), 'Age'] = guess_ages[i, j]

    dataset['Age'] = dataset['Age'].astype(int)

training_df.head()
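The nested loops above can also be expressed with a grouped median. The sketch below is a compact alternative under the same idea, filling each missing Age with the median of its (Sex, Pclass) group. It uses toy data for illustration only, and it skips the rounding to the nearest 0.5 and the integer cast done in the loop version.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; Sex is already mapped to 0/1.
toy_df = pd.DataFrame({
    'Sex':    [0, 0, 0, 1, 1, 1],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each (Sex, Pclass) group, aligned to the original rows.
group_median = toy_df.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Fill only the missing ages with their group's median.
toy_df['Age'] = toy_df['Age'].fillna(group_median)
print(toy_df)
```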

In [None]:
# Create Age bands and determine correlations with Survived.

training_df['AgeBand'] = pd.cut(training_df['Age'], 5)
training_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

In [None]:
# Replace Age ordinals based on bands.

for dataset in [training_df, testing_df]:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

training_df.head()
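The chain of `.loc` assignments above can also be written as a single `pd.cut` call with explicit band edges. The sketch below assumes right-closed bands at 16, 32, 48 and 64, with ages above 64 mapped to band 4; the ages are toy values for illustration only.

```python
import pandas as pd

# Toy ages for illustration only.
ages = pd.Series([4, 16, 17, 32, 40, 48, 60, 64, 70])

# Right-closed bins: (-1, 16], (16, 32], (32, 48], (48, 64], (64, inf).
edges = [-1, 16, 32, 48, 64, float('inf')]

# labels=False returns the integer index of the bin for each value.
age_codes = pd.cut(ages, bins=edges, labels=False)
print(age_codes.tolist())
```

Because every band is assigned in one step, there is no risk of a later condition matching a value that an earlier assignment already replaced.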

In [None]:
# Remove AgeBand feature.

training_df = training_df.drop(['AgeBand'], axis=1)

training_df.head()

We can now create a new feature, FamilySize, which combines Parch and SibSp. This will enable us to drop both from the datasets.

In [None]:
for dataset in [training_df, testing_df]:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

training_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# Create IsAlone feature.

for dataset in [testing_df, training_df]:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

training_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

In [None]:
# Drop Parch, SibSp, and FamilySize in favor of IsAlone.

training_df = training_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
testing_df = testing_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)

training_df.head()

To complete the categorical Embarked feature, which takes S, Q and C values based on the port of embarkation, we will simply fill the missing values with the most common occurrence.

In [None]:
freq_port = training_df.Embarked.dropna().mode()[0]
freq_port

In [None]:
for dataset in [training_df, testing_df]:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

training_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Now we can also convert the Embarked feature into a numeric feature.

In [None]:
embarked_map = {'S': 0, 'C': 1, 'Q': 2}

for dataset in [testing_df, training_df]:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_map)

training_df.head()

We now complete the Fare feature for the single missing value in the test dataset, using the median of this feature.

In [None]:
# Assign the result back instead of using inplace=True on a column selection.
testing_df['Fare'] = testing_df['Fare'].fillna(testing_df['Fare'].median())

testing_df.head()

In [None]:
# Create FareBand feature.

training_df['FareBand'] = pd.qcut(training_df['Fare'], 4)
training_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
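AgeBand was built with `pd.cut` while FareBand uses `pd.qcut`. The sketch below illustrates the difference on toy fares with one large outlier (values chosen for illustration only): `pd.cut` splits the value range into equal-width intervals, while `pd.qcut` splits at quantiles so that each band holds roughly the same number of samples.

```python
import pandas as pd

# Toy fares for illustration only, with one large outlier.
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 13.0, 20.0, 30.0, 500.0])

# pd.cut: four equal-width intervals, so the outlier stretches the bands.
equal_width = pd.cut(fares, 4, labels=False)

# pd.qcut: four quantile-based intervals with roughly equal counts.
equal_count = pd.qcut(fares, 4, labels=False)

# Number of samples per band for each approach.
print(equal_width.value_counts().sort_index().tolist())
print(equal_count.value_counts().sort_index().tolist())
```

For a skewed feature such as Fare, quantile-based bands avoid putting almost all samples into the first band.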

In [None]:
# Convert Fare feature to ordinal values based on FareBand.

for dataset in [testing_df, training_df]:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

training_df = training_df.drop(['FareBand'], axis=1)

In [None]:
training_df.head()

In [None]:
testing_df.head()

## Training

Our problem is a classification and regression problem: we want to identify the relationship between the output (Survived or not) and the other variables or features. We are going to use supervised learning to train our model.

Logistic regression is a useful model to run early in the workflow.

In [None]:
# Logistic regression.

X_train = training_df.drop("Survived", axis=1)
Y_train = training_df["Survived"]
X_test  = testing_df.drop("PassengerId", axis=1).copy()

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

acc_log

In [None]:
# Calculate coefficient of the features.

coeff_df = pd.DataFrame(training_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df['Correlation'] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
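As a small worked example of the statement above, the sketch below converts log-odds to probabilities with the logistic function and shows that a positive coefficient raises the probability. The intercept and coefficient are hypothetical values chosen for illustration; they are not taken from the fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for illustration; not taken from the fitted model.
intercept = -1.0
coefficient = 2.0

# Probability when the feature is 0 and when it is 1.
p_at_0 = sigmoid(intercept)
p_at_1 = sigmoid(intercept + coefficient)

# exp(coefficient) is the factor by which the odds are multiplied
# for a one-unit increase in the feature.
odds_ratio = math.exp(coefficient)

print(f'P(y=1 | x=0) = {p_at_0:.3f}')
print(f'P(y=1 | x=1) = {p_at_1:.3f}')
print(f'odds ratio   = {odds_ratio:.3f}')
```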

**Observations**:
- Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
- Inversely, as Pclass increases, the probability of Survived=1 decreases the most.
- Age*Class may be a good artificial feature to model; it is not created in this notebook, so its coefficient does not appear in the table above.
- Title has the second highest positive coefficient and Embarked the third.

### Model evaluation

In [None]:
# Support Vector Machines.

svc = SVC()
svc.fit(X_train, Y_train)

Y_pred = svc.predict(X_test)

acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

In [None]:
# KNN score.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)

acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

In [None]:
# Gaussian Naïve Bayes.

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)

acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

In [None]:
# Perceptron.

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

In [None]:
# Linear SVC.

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

In [None]:
# Stochastic Gradient Descent.

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)

Y_pred = sgd.predict(X_test)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

In [None]:
# Decision Tree.

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

In [None]:
# Random forest.

random_forest = RandomForestClassifier()
random_forest.fit(X_train, Y_train)

Y_pred = random_forest.predict(X_test)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

While both Decision Tree and Random Forest score the same, we choose Random Forest, since random forests correct for decision trees' habit of overfitting to the training set.
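The scores above are computed on the same rows that were used for fitting, so they cannot show overfitting by themselves. One common way to estimate performance on unseen data is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_classification`, which stands in for `X_train` and `Y_train` for illustration only; the fold count of 5 and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train, for illustration only.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy: each fold is scored on held-out rows.
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(f'{name}: {scores.mean():.3f} (+/- {scores.std():.3f})')
```

In this notebook, the same call could be made with `X_train` and `Y_train` in place of the synthetic arrays.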

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naïve Bayes', 'Perceptron', 'Stochastic Gradient Descent', 'Linear SVC', 'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, acc_random_forest, acc_gaussian, acc_perceptron, acc_sgd, acc_linear_svc, acc_decision_tree]})

models.sort_values(by='Score', ascending=False)

## Submission

Submission to Kaggle.

In [None]:
submission = pd.DataFrame({
    'PassengerId': testing_df['PassengerId'],
    'Survived': Y_pred,
})

submission

In [None]:
# Export.

submission.to_csv('./output/submission.csv', index=False)
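Before uploading, it can help to sanity-check the submission frame. The sketch below is a minimal check under the assumption that a valid file has exactly the `PassengerId` and `Survived` columns, no missing values, unique ids, and 0/1 predictions. `check_submission` is a hypothetical helper and the three-row frame is toy data for illustration only.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ['PassengerId', 'Survived'], 'unexpected columns'
    assert not df.isnull().any().any(), 'missing values present'
    assert df['PassengerId'].is_unique, 'duplicate PassengerId values'
    assert df['Survived'].isin([0, 1]).all(), 'Survived must be 0 or 1'


# Toy submission for illustration only.
toy_submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
check_submission(toy_submission)
print('submission looks well-formed')
```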