# Titanic - Machine Learning from Disaster
## Predicting survivors of the Titanic shipwreck

<p align='center'>
    <img src='img/titanic.jpg'>
</p>

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we want to build a model that predicts which passengers survived the Titanic shipwreck, using passenger data (e.g., name, age, gender, socio-economic class).

To solve the problem, we will follow these steps:

- **0.0.** Data collection
- **1.0.** Data description
- **2.0.** Feature engineering
- **3.0.** Filtering the features
- **4.0.** Exploratory data analysis
- **5.0.** Data preparation
- **6.0.** Feature selection
- **7.0.** Machine learning modelling
- **8.0.** Hyperparameter fine-tuning
- **9.0.** Translation and interpretation of the error
- **10.0.** Deploy model to production

# 0.0. Imports

In [1]:
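import pandas as pd
import numpy  as np
import seaborn as sns            # used by jupyter_settings()
import matplotlib.pyplot as plt  # used by jupyter_settings()

from sklearn.ensemble import RandomForestClassifier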

In [2]:
train = pd.read_csv( 'datasets/titanic/train.csv' )
test = pd.read_csv( 'datasets/titanic/test.csv' )

## 0.1. Helper functions

In this step, we'll define the helper functions that will be used to solve the problem.

In [21]:
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()

In [20]:
jupyter_settings()

Populating the interactive namespace from numpy and matplotlib


## 0.2. Loading Data

### Overview

Here we will import the data we will need to predict the survivors of the wreck.

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set will be used to build our machine learning models. Our model will be based on “characteristics” such as gender and class of passengers. We will also use feature engineering to create new features.

The test set will be used to see how well our model performs on unseen data. For each passenger in the test set, we'll use the model we've trained to predict whether or not they survived the sinking of the Titanic.

We also have gender_submission.csv, a set of predictions that assume all and only women survive, and we will use that as the baseline for our solution.
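
The "all and only women survive" baseline is easy to reproduce by hand. The sketch below is an illustration only: `toy_test` is a hypothetical four-row stand-in for test.csv, and `gender_baseline` is a helper name introduced here, not something shipped with the dataset.

```python
import pandas as pd

# Hypothetical toy sample standing in for test.csv (only the columns needed here).
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})


def gender_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Predict Survived = 1 for women and 0 for everyone else."""
    return pd.DataFrame({
        "PassengerId": df["PassengerId"],
        "Survived": (df["Sex"] == "female").astype(int),
    })


baseline = gender_baseline(toy_test)
print(baseline)
```

Any model we train later should score at least as well as this rule; otherwise it adds nothing over the `Sex` column alone.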

### Data Dictionary

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |


### Data assumptions

pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
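
The age convention above (fractional below 1, xx.5 when estimated) can be turned into two boolean flags. This is a minimal sketch on a hypothetical toy `ages` series, not the real data; the names `is_infant` and `is_estimated` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ages: an infant, an exact age, two estimated ages, one missing value.
ages = pd.Series([0.42, 22.0, 28.5, 45.5, None], name="Age")

# Ages below 1 are recorded as fractions of a year.
is_infant = ages < 1

# Estimated ages end in .5. We require age >= 1 so infants are not flagged;
# an estimated age below 1 cannot be told apart under this rule.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(pd.DataFrame({"Age": ages, "is_infant": is_infant, "is_estimated": is_estimated}))
```

Missing ages compare as False in both conditions, so they are flagged as neither infant nor estimated.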

We'll load **"train.csv"** into a dataframe that we will call **"df_raw"**. The **"test.csv"** dataset will only be used in the final sections.

In [22]:
df_raw = pd.read_csv( 'datasets/titanic/train.csv' )

df_raw.sample()

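|     | PassengerId | Survived | Pclass | Name                       | Sex  | Age  | SibSp | Parch | Ticket   | Fare    | Cabin | Embarked |
|-----|-------------|----------|--------|----------------------------|------|------|-------|-------|----------|---------|-------|----------|
| 544 | 545         | 0        | 1      | Douglas, Mr. Walter Donald | male | 50.0 | 1     | 0     | PC 17761 | 106.425 | C86   | C        |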


# 1.0. Data Description

In this step, we will perform the following tasks:

- Data dimensions
- Data types
- Check missing values
- Fill in missing values
- Change types
- Descriptive statistics

This step is important because it tells us how challenging the problem is.

In [23]:
df1 = df_raw.copy()

## 1.1. Data Dimensions

Let's check the dimensions of the dataset.

In [25]:
print('Number of rows:', df1.shape[0])
print('Number of columns:', df1.shape[1])

Number of rows: 891
Number of columns: 12


## 1.2. Data Types

In this step, we want to see what the data types are. This is important so that we treat each column correctly and change its type if necessary.

In [26]:
df1.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
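
Once the types are known, a common next move is to split the columns into numerical and categorical groups, since each group is treated differently later. The sketch below assumes a hypothetical two-row `toy` frame with a few of the columns listed above; `num_cols` and `cat_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical two-row sample with a mix of numeric and text columns.
toy = pd.DataFrame({
    "Pclass": [1, 3],
    "Age": [50.0, 22.0],
    "Sex": ["male", "female"],
    "Embarked": ["C", "S"],
})

# select_dtypes filters columns by their dtype family.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:  ", num_cols)
print("categorical:", cat_cols)
```

Note that a column such as `Pclass` is stored as an integer but encodes a category, so a dtype-based split is only a starting point.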

## 1.3. Check NA

In this step, we want to see if there are any missing values in the dataset. If so, we must deal with them, because many machine learning algorithms cannot handle null values.

In [31]:
df1.isna().sum().sort_values( ascending=False )

Cabin          687
Age            177
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
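
Raw counts are easier to judge as a share of the rows: with 891 rows, Cabin is missing in about 77% of them (687/891) and Age in about 20% (177/891). The sketch below shows one way to build such a summary; `toy` is a hypothetical four-row frame, and the column names `n_missing` and `pct_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna().sum() counts missing values; isna().mean() gives their fraction.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

The same two expressions applied to `df1` would give the percentages for the real training set.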

## 1.4. Fill in NA

As we saw in the previous step, three variables in the dataset have missing values: Cabin, Age and Embarked.
We'll explain how we handle each of them.