# Predicting Survivors of the Titanic 

This notebook tackles one of the most popular Kaggle competitions: building a machine learning model to predict whether a passenger on the Titanic would survive.

For this project we're going to take my personal approach to solving ML problems: 

 1. Problem Definition 
 2. Data 
 3. Evaluation (Metrics)
 4. Features 
 5. Modelling 
 6. Iterative Experimentation 
 7. Presenting Analysis
 
# 1. Problem Definition
In one sentence:
> Given information about the passengers aboard the Titanic, can we predict who will survive the shipwreck?

# 2. Data 

The data can be found on Kaggle at the following link: [Titanic Data](https://www.kaggle.com/c/titanic/data)

The data is split into two separate files: 
 1. train.csv (What we build the model on) 
 2. test.csv (What we test our model on) 
 
Since this is a Kaggle competition, we will submit a **submissions.csv** file to Kaggle. It contains the passenger IDs from test.csv along with a column indicating whether our model predicts that each passenger survived.
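
As a rough sketch of what that file looks like, the snippet below builds a submission from placeholder data. The column names `PassengerId` and `Survived` follow the Kaggle Titanic format; the three rows and the predictions are made up for illustration and are not real model output.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: only the ID column matters here.
test_df = pd.DataFrame({"PassengerId": [892, 893, 894]})

# Placeholder predictions (1 = survived, 0 = did not survive).
# In the real workflow these would come from the fitted model.
predictions = [0, 1, 0]

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": predictions,
})

# Write to an in-memory buffer so the sketch has no side effects;
# pass a path such as "submissions.csv" instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```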

# 3. Evaluation 

Our goal is the following:

>Predict whether a passenger survived the sinking of the Titanic. For each passenger in the test set, we must predict a 0 (did not survive) or a 1 (survived).

The evaluation metric for our model is:
>**Accuracy**

Formally:
> Accuracy is the fraction of predictions our model got right. 

Mathematically: 

$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$

For our problem:
> The accuracy score is the percentage of passengers whose outcome (survived or not) we predict correctly.
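
To make the formula concrete, here is a minimal sketch of the calculation in plain Python. The labels below are made up; scikit-learn's `accuracy_score` computes the same quantity and is what we would normally reach for.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)  # 4 of 5 predictions match -> 0.8
```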

# 4. Features 

**Data Dictionary** 

This describes the features (columns) in our data. We will use these features, along with ones we *engineer* ourselves later on, to predict whether a passenger will survive:

1. survival - Flag indicating whether the passenger survived
    - 0 = No
    - 1 = Yes
2. pclass - Ticket class 
    - 1 = 1st Class 
    - 2 = 2nd Class 
    - 3 = 3rd Class 
3. sex - Gender 
4. Age - Age in years 
5. sibsp - # of siblings/spouses aboard the Titanic
6. parch - # of parents/children aboard the Titanic
7. ticket - Ticket Number
8. fare - Passenger fare
9. cabin - Cabin number 
10. embarked - Port of Embarkation
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton

**Additional Notes on Features**

> pclass: A proxy for socio-economic status (SES)
>
> - 1st = Upper
> - 2nd = Middle
> - 3rd = Lower

> age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

> sibsp: The dataset defines family relations in this way...
>
> - Sibling = brother, sister, stepbrother, stepsister
> - Spouse = husband, wife (mistresses and fiancés were ignored)

> parch: The dataset defines family relations in this way...
>
> - Parent = mother, father
> - Child = daughter, son, stepdaughter, stepson
>
> Some children travelled only with a nanny, therefore parch=0 for them.
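
As an illustration of what engineering a feature from these columns could look like, the sketch below combines `SibSp` and `Parch` (the column names as they appear in train.csv) into a family size. `FamilySize` and `IsAlone` are hypothetical features chosen for this example, not necessarily the ones used later in the project, and the rows are made up.

```python
import pandas as pd

# Made-up rows using the train.csv column names SibSp and Parch.
df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# Hypothetical engineered features, for illustration only.
# The +1 counts the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```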



# 5. Modelling 

Since the outcome is a Yes/No answer to whether a passenger survived, this is a supervised machine learning problem, specifically binary classification.

We'll employ the following methods and models to predict a binary (0,1) outcome: 

1. Logistic Regression (baseline model, for comparing our later models' scores against)
2. Regularized Logistic Regression (Lasso, Ridge and Elastic Net)
3. Random Forest Classifier
4. XGBoost Classifier
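
A minimal sketch of how these candidates could be set up and compared with scikit-learn is shown below. It assumes synthetic data from `make_classification` in place of the prepared Titanic features, and the hyperparameter values are arbitrary. XGBoost is left out to avoid an extra dependency; `xgboost.XGBClassifier` exposes the same fit/predict interface and could be added to the dictionary. Depending on the scikit-learn release, the `penalty` argument may emit a deprecation warning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    # Note: scikit-learn's default LogisticRegression already applies
    # an L2 penalty (C=1.0), so this baseline is mildly regularized.
    "baseline_logreg": LogisticRegression(max_iter=1000),
    "lasso_logreg": LogisticRegression(penalty="l1", solver="liblinear"),
    "ridge_logreg": LogisticRegression(penalty="l2", C=0.1, max_iter=1000),
    "elasticnet_logreg": LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every model with 5-fold cross-validated accuracy.
scores = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```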

# 6. Iterative Experimentation 

After building our baseline Logistic Regression model, we will try to improve on it using regularized methods and ensemble classifiers (including gradient boosted trees). During this stage we will use hyperparameter tuning and feature engineering to see whether we can improve the model evaluation metric: **Accuracy**.

Once we have settled on the best model, we will generate the submission file from its predictions and pickle the model for future analysis.
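
The tuning and pickling steps could look roughly like the sketch below. It assumes synthetic data, a Random Forest as the model being tuned, and an arbitrary two-by-two parameter grid; none of these are the project's final choices.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space, kept tiny so the sketch runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))

# Round-trip the model in memory; pickle.dump with an open file
# handle would write it to disk instead.
model_bytes = pickle.dumps(best_model)
restored_model = pickle.loads(model_bytes)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.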

# 7. Presenting Analysis

Once our model has been chosen and finalised, we will take the key findings from this project and place them into a single notebook explaining the methods and techniques that led us to the best-performing model.