###### Setup:

In [2]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### 1.) Load Data

In [3]:
train_data = pd.read_csv('../../data/train.csv')
test_data = pd.read_csv('../../data/test.csv')

### 2.) Bird's-Eye View

Let's take a peek at the top few rows:

In [4]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's explicitly set the `PassengerId` column as the index column:

In [5]:
train_data = train_data.set_index('PassengerId')
test_data = test_data.set_index('PassengerId')
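As a side note, the same index can be set while reading the file through the `index_col` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory CSV (made-up rows, not the real files) so it runs on its own; `csv_text`, `after_load` and `at_load` are illustrative names only.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical rows, for illustration only).
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n3,1,3\n"

# Approach used in this notebook: read first, then set the index.
after_load = pd.read_csv(io.StringIO(csv_text)).set_index("PassengerId")

# Alternative: let read_csv set the index directly.
at_load = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(at_load)
```

Both approaches should give the same frame, so which one to use is mostly a matter of taste.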

Let's get more info about the ***training data***:
* How much data is missing?
* How many entries does the training data have?

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


**Data Dictionary:**
| Variable | Definition | Key | DType | E.G. |
| -------- | ---------- | --- | ----- | ---- |
| `Survived` | Survival **(TARGET)** | 0 = No, 1 = Yes | int | 1 |
| `Pclass` | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | int | 1 |
| `Name` | Passenger's name | | object | Braund, Mr. Owen Harris |
| `Sex` | Sex | 'male', 'female' | object | male |
| `Age` | Age in years | | float | 35.0 |
| `SibSp` | # of siblings / spouses aboard the Titanic | | int | 1 |
| `Parch` | # of parents / children aboard the Titanic | | int | 1 |
| `Ticket` | Ticket number | | object | PC 17599 |
| `Fare` | Passenger fare | | float | 71.2833 |
| `Cabin` | Cabin number | | object | C85 |
| `Embarked` | Port of Embarkation | C = Cherbourg,<br> Q = Queenstown,<br> S = Southampton | object | C |

**Variable Notes:**
* `Pclass`: A proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower
* `Age`: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
* `SibSp`: The dataset defines family relations in this way...
  * Sibling = brother, sister, stepbrother, stepsister
  * Spouse = husband, wife (mistresses and fiancés were ignored)
* `Parch`: The dataset defines family relations in this way...
  * Parent = mother, father
  * Child = daughter, son, stepdaughter, stepson
  * Some children travelled only with a nanny, therefore `Parch` = 0 for them.

**NaN Value Count from training Data:**
* `Age` - 177
* `Cabin` - 687
* `Embarked` - 2
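These counts can be obtained with `isna().sum()`, and `isna().mean()` gives the share of missing values per column. Here is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in, not the real `train_data`); on the real data the same two calls would be applied to `train_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number of missing values per column.
missing_count = toy.isna().sum()

# Percentage of missing values per column.
missing_share = toy.isna().mean() * 100

missing_report = pd.DataFrame({
    "missing": missing_count,
    "percent": missing_share.round(1),
})
print(missing_report)
```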

Okay, **`Age`**, **`Cabin`** and **`Embarked`** have null values: `Cabin` is about 77% null (687 of 891) and `Age` about 20% null (177 of 891), while `Embarked` is missing only 2 values. How will we handle these NaNs?

For the `Age` attribute we can either
* use the median value, or
* use a reasonably simple regression model to predict it (***Use this***)
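Both options for `Age` can be sketched on a small made-up frame. The choice of predictors (`Pclass`, `SibSp`, `Parch`) and the plain least-squares fit via `np.linalg.lstsq` are assumptions made for illustration; the notebook has not yet fixed which features or which regression model to use. `toy` and `design_matrix` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "SibSp":  [1, 1, 0, 1, 0, 0, 0, 1],
    "Parch":  [0, 0, 0, 1, 0, 0, 2, 2],
    "Age":    [22.0, 38.0, 26.0, 35.0, np.nan, 30.0, 54.0, np.nan],
})

# Option 1: fill missing ages with the median of the known ages.
age_median = toy["Age"].fillna(toy["Age"].median())

# Option 2: fit a simple linear regression on rows where Age is known,
# then predict Age for the rows where it is missing.
features = ["Pclass", "SibSp", "Parch"]  # assumed predictors
known = toy["Age"].notna()


def design_matrix(frame):
    """Return the feature matrix with a leading column of ones (intercept)."""
    values = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), values])


coef, *_ = np.linalg.lstsq(
    design_matrix(toy[known]),
    toy.loc[known, "Age"].to_numpy(),
    rcond=None,
)

age_regressed = toy["Age"].copy()
age_regressed.loc[~known] = design_matrix(toy[~known]) @ coef

print(pd.DataFrame({"median_fill": age_median, "regression_fill": age_regressed}))
```

The median fill gives every missing passenger the same age, while the regression fill lets the estimate vary with the other columns, which is the motivation for preferring it here.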

For the `Cabin` attribute there are a lot of nulls. We can either
* delete the column (probably not preferable), or
* extract the deck letters (A, B, C, ...) from `Cabin` into another column and predict the missing deck letters with a machine learning model based on the passenger's *age*, *gender* and *fare*. This could be a useful column for the model (**Use this**), or
* create a `known cabin` column that is *yes* if `Cabin` is not null and *no* if it is null.
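The two feature-engineering options for `Cabin` could look roughly like the sketch below. It covers only the extraction of the deck letter and the yes/no flag, not the model that would predict the missing decks. The column names `Deck` and `KnownCabin` are placeholders chosen for this example, and `toy` is a made-up frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column (hypothetical values).
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "E46"]})

# Deck letter = first character of the cabin string; stays NaN where Cabin is NaN.
toy["Deck"] = toy["Cabin"].str[0]

# Flag column: "yes" if the cabin is known, "no" otherwise.
toy["KnownCabin"] = np.where(toy["Cabin"].notna(), "yes", "no")

print(toy)
```

The rows where `Deck` is still NaN are the ones a later model would have to predict.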

For `Embarked` we can either
* fill with the most common value, or
* use a simple classifier to predict it (**Use this**)
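For reference, the most-common-value baseline for `Embarked` is a two-liner with `Series.mode` and `fillna`. The sketch uses a made-up column (`toy` is hypothetical); the classifier option is not shown here because its features and model have not been chosen yet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column (hypothetical values).
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S"]})

# mode() ignores NaN by default and returns a Series; take its first entry.
most_common = toy["Embarked"].mode().iloc[0]

embarked_filled = toy["Embarked"].fillna(most_common)
print(embarked_filled)
```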

---