# Titanic

## Problem
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage,
the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough
lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some
element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Your job is to figure out who will survive.

### Data Dictionary
| Variable | Definition | Key |
| :------- | :--------- | :-- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


### Load the dataset


In [1]:
import pandas as pd

titanic_ds = pd.read_csv("data/titanic/train.csv")


In [2]:
titanic_ds.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### What fields are definitely irrelevant?
- __Passenger name__ : Will not help in determining the survival chances.
- __Ticket Number__ : Does not tell us anything. It is an arbitrary alphanumeric string.
- __Passenger Id__ : This is just a serial number.
- __Cabin Number__ : Not required. Looking at the data, it seems that cabins are allotted to first class passengers; third class passengers don't get personal cabins.
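
The observation about cabins can be checked rather than eyeballed. The following is a minimal sketch, assuming only the five rows shown by `head()` above; the names `sample` and `cabin_missing_share` are illustrative and not part of the original notebook. On the full dataset the same expression would be applied to `titanic_ds`.

```python
import pandas as pd

# Toy sample: Pclass and Cabin for the five rows shown by head() above
# (not the full dataset).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Cabin": [None, "C85", None, "C123", None],
})

# Share of rows with a missing Cabin value, per ticket class.
cabin_missing_share = sample["Cabin"].isnull().groupby(sample["Pclass"]).mean()
print(cabin_missing_share)
```

A share close to 1.0 for a class would support dropping `Cabin` for that class; five rows are far too few to draw that conclusion on their own.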

  
### Drop the irrelevant fields.


In [6]:
# `columns=` already selects the axis, so `axis=1` is redundant.
titanic_ds_rel = titanic_ds.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
titanic_ds_rel.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


### Data analysis
- How many fields need to be encoded?
- Is the data missing any values?
- What are the basic facts about the data?

   #### How many fields need to be encoded?
   - _Pclass_ : is already encoded as 1, 2, 3 but can be further expanded into a one-hot encoding.
   - _Embarked_ : Needs to be encoded.
   - _Sex_ : Needs to be encoded.

   But before we start encoding, we need to check whether there are any missing values. We also need to understand the different categorical values each feature can take.
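
As a rough illustration of the encoding step, the sketch below lists the distinct categories of each feature and then one-hot encodes them with `pd.get_dummies`. It assumes a toy frame built from the five rows shown earlier; `sample` and `encoded` are illustrative names, and the original notebook may well encode the columns differently.

```python
import pandas as pd

# Toy sample: the categorical columns of the five rows shown by head() above.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# List the distinct categories of each feature before encoding.
for col in ["Pclass", "Sex", "Embarked"]:
    print(col, sorted(sample[col].dropna().unique()))

# One-hot encode: each category becomes its own indicator column.
encoded = pd.get_dummies(sample, columns=["Pclass", "Sex", "Embarked"])
print(encoded.columns.tolist())
```

By default `pd.get_dummies` gives a row with a missing category all-zero indicators instead of raising an error, which is one reason to inspect missing values first.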

   #### Is the dataset missing any values?
     

In [7]:
titanic_ds_rel.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

### Can we fill in the missing ages by working out whether a passenger is a child or a parent?  
__sibsp__ : The number of siblings or spouses travelling with the passenger.  
__parch__ : The number of parents or children travelling with the passenger. A nonzero value suggests, but does not prove, that the passenger is a minor, since a parent travelling with a child also has a nonzero value.

Let's check whether this understanding is correct by displaying the rows where Age is missing and looking at their Parch values.


In [26]:
titanic_ds_rel.loc[titanic_ds_rel['Age'].isnull()]


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
5,0,3,male,,0,0,8.4583,Q
17,1,2,male,,0,0,13.0000,S
19,1,3,female,,0,0,7.2250,C
26,0,3,male,,0,0,7.2250,C
28,1,3,female,,0,0,7.8792,Q
...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C
863,0,3,female,,8,2,69.5500,S
868,0,3,male,,0,0,9.5000,S
878,0,3,male,,0,0,7.8958,S
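
The link between Parch and being a minor can also be tested on passengers whose age is known. The sketch below is one possible way to do that: it counts how often each Parch value occurs among passengers younger than a cutoff. The helper `parch_counts_for_minors`, the cutoff of 16, and the rows in `toy` are all assumptions made for illustration; the toy rows are invented and are not taken from the dataset.

```python
import pandas as pd

def parch_counts_for_minors(df, age_limit=16):
    """Count how often each Parch value occurs among passengers younger than age_limit."""
    # Rows with a missing Age compare as False and are excluded.
    minors = df.loc[df["Age"] < age_limit]
    return minors["Parch"].value_counts().sort_index()

# Hypothetical rows, invented purely to exercise the function.
toy = pd.DataFrame({
    "Age": [4.0, 9.0, 14.0, 30.0, None],
    "Parch": [1, 2, 0, 0, 0],
})
print(parch_counts_for_minors(toy))
```

On the real data this would be called as `parch_counts_for_minors(titanic_ds_rel)`. A large count at Parch = 0 would mean that many minors travelled without a parent, so Parch alone cannot identify them.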


### Facts about children
- Some children were travelling with only their nanny, with no parents accompanying them. Hence __Parch__ = 0.

__Assumption__ : A third class passenger will never be able to afford a nanny. 

### Exclude missing-age passengers where __Pclass__ == 3

In [32]:
# Rows with a missing age, excluding third class.
missing_age_not_third = titanic_ds_rel['Age'].isnull() & (titanic_ds_rel['Pclass'] != 3)
print(titanic_ds_rel.loc[missing_age_not_third].count())
titanic_ds_rel.loc[missing_age_not_third]

Survived    41
Pclass      41
Sex         41
Age          0
SibSp       41
Parch       41
Fare        41
Embarked    41
dtype: int64


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
17,1,2,male,,0,0,13.0,S
31,1,1,female,,1,0,146.5208,C
55,1,1,male,,0,0,35.5,S
64,0,1,male,,0,0,27.7208,C
166,1,1,female,,0,1,55.0,S
168,0,1,male,,0,0,25.925,S
181,0,2,male,,0,0,15.05,C
185,0,1,male,,0,0,50.0,S
256,1,1,female,,0,0,79.2,C
270,0,1,male,,0,0,31.0,S


__The total number of rows where age is missing reduces from 177 to 41 once Pclass == 3 is excluded.__

### Fact: Sibsp
If SibSp is nonzero, we treat the passenger as an adult because, as per the data definition, SibSp = # of siblings / spouses. (Strictly, a child travelling with a sibling also has a nonzero SibSp, so this is a simplification.)

__Assumption__ : If we exclude the missing-age rows where SibSp is nonzero, we narrow the list down to people who may or may not be minors.

In [36]:
# Rows with a missing age, excluding third class and anyone travelling with a sibling or spouse.
missing_age_candidates = (
    titanic_ds_rel['Age'].isnull()
    & (titanic_ds_rel['Pclass'] != 3)
    & (titanic_ds_rel['SibSp'] == 0)
)
print(titanic_ds_rel.loc[missing_age_candidates].count())

titanic_ds_rel.loc[missing_age_candidates]

Survived    35
Pclass      35
Sex         35
Age          0
SibSp       35
Parch       35
Fare        35
Embarked    35
dtype: int64


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
17,1,2,male,,0,0,13.0,S
55,1,1,male,,0,0,35.5,S
64,0,1,male,,0,0,27.7208,C
166,1,1,female,,0,1,55.0,S
168,0,1,male,,0,0,25.925,S
181,0,2,male,,0,0,15.05,C
185,0,1,male,,0,0,50.0,S
256,1,1,female,,0,0,79.2,C
270,0,1,male,,0,0,31.0,S
277,0,2,male,,0,0,0.0,S


__So now we have 35 people who could be either minors or adults.__

### Can we further narrow down the list?
One option is to check whether a child's fare is lower than an adult's. If, among passengers with a known age, those under 16 pay a noticeably different fare from adults with the same __Pclass__ and the same __Embarked__ port, then we can use the fare to decide which of the missing-age passengers are likely minors and fill in their age with the mean age of minors.

### Let's find the mean fare paid by minors aged 10 and under who embarked from the same port and have the same Pclass.

__Fare paid by Class 1 minor and adult boarding from S, Q or C.__
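
One way this comparison might be computed is sketched below: passengers with a known age are labelled as minor or adult and the mean fare is taken per class, port and age group. The helper `mean_fare_by_group`, the `AgeGroup` column, the cutoff of 10 and the rows in `toy` are assumptions for illustration only; the toy rows are invented and are not taken from the dataset.

```python
import pandas as pd

def mean_fare_by_group(df, minor_max_age=10):
    """Mean fare per (Pclass, Embarked, AgeGroup), using only rows with a known age."""
    known_age = df.dropna(subset=["Age"]).copy()
    # Label each passenger as a minor or an adult using the assumed cutoff.
    is_minor = known_age["Age"] <= minor_max_age
    known_age["AgeGroup"] = is_minor.map({True: "minor", False: "adult"})
    return known_age.groupby(["Pclass", "Embarked", "AgeGroup"])["Fare"].mean()

# Hypothetical first-class rows, invented purely to exercise the function.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1],
    "Embarked": ["S", "S", "S", "S", "C"],
    "Age": [8.0, 40.0, 50.0, None, 35.0],
    "Fare": [30.0, 60.0, 80.0, 55.0, 70.0],
})
result = mean_fare_by_group(toy)
print(result)
```

On the real data this would be called as `mean_fare_by_group(titanic_ds_rel)`. The fare is only useful for separating minors from adults if the minor and adult means within a group differ clearly.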