# Investigate Titanic's Data
## Questions to ask ourselves
### What factors made people more likely to survive?
* Sex
* Class
* Age
* How much they paid

In [1]:
#imports
import pandas as pd
import numpy as np

In [22]:
raw_data = pd.read_csv('titanic_data.csv')

In [26]:
raw_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data Wrangling
We need to find out how many nulls our data has.

The `describe` function might be useful.

In [9]:
raw_data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


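However, we realise that `describe` only summarises numeric columns by default, so this way we cannot see NA values in non-numeric columns. The only hint is the `count` row: Age has 714 non-null values out of 891.

We move to another option: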

In [12]:
raw_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
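
To judge how serious each gap is, the counts above can also be expressed as a share of all rows. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from `titanic_data.csv`); the same expression should work on `raw_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share)
```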

### How do we treat nulls?
#### In Age
Out of 891 rows, 177 are *NaN*, which is roughly 20%. If we replace these NaN with some other value, it should be a guard (sentinel) value that cannot be mistaken for a real age, so that it is easy to exclude and does not affect the rest of the values.
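
A guard value only stays harmless if it is filtered out before computing statistics. The sketch below uses made-up ages (an assumption, not the Titanic data) to show how a `-1` guard pulls the mean down unless it is masked first.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; one value is missing.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])
filled = ages.fillna(-1)

# Naive mean treats the guard value as a real age: (20 + 30 - 1 + 40) / 4.
naive_mean = filled.mean()

# Masking out the guard value recovers the mean of the known ages.
masked_mean = filled[filled >= 0].mean()

print(naive_mean, masked_mean)
```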

#### In Cabin
Out of 891 rows, 687 are nulls, an astounding 77%. **Ignoring this column** altogether makes more sense.

#### In Embarked
With only 2 NaN in this column, we could simply **ignore these rows**. We could also fill in another value and see how they behave.
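
One possible "other value" is the most frequent port. This is only a sketch of that option on a made-up series (the values are assumptions, not taken from the dataset), and it is not necessarily the choice made in this notebook.

```python
import pandas as pd

# Made-up embarkation codes for illustration; two entries are missing.
embarked = pd.Series(["S", "C", "S", None, "Q", None])

# mode() ignores missing values and returns the most frequent entries.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(filled.tolist())
```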

### Code
#### Age

In [24]:
clean_data = raw_data.copy()
clean_data['Age'] = clean_data['Age'].fillna(-1)

#### Cabin

In [25]:
clean_data.drop('Cabin', axis=1, inplace=True)

#### Embarked
Before deleting anything, let's check the rows.

In [27]:
raw_data[raw_data['Embarked'].isnull()]

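| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NaN |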


It looks a bit strange that both passengers survived, travelled on the same ticket, shared the same cabin, and are both missing their Embarked information.

Instead of deleting them, we will leave the rows in for now.
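
To see whether other passengers also travelled on a shared ticket, rows with a repeated `Ticket` value can be selected. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions); on the real data the same call could be applied to `raw_data`.

```python
import pandas as pd

# Made-up passengers for illustration; ticket "A1" appears twice.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "B2", "A1", "C3"],
})

# keep=False marks every row whose Ticket occurs more than once,
# not only the second and later occurrences.
shared = toy[toy.duplicated("Ticket", keep=False)]

print(shared)
```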

## Data Exploration

In [None]:
# Survival by age
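
One way this exploration could start is by binning ages and taking the mean of `Survived` per bin. The sketch below is an assumption about the approach, not the notebook's own analysis: the frame, the bin edges and the group labels are all made up for illustration, and rows carrying the `-1` guard value are excluded first.

```python
import pandas as pd

# Made-up passengers for illustration; -1 marks an unknown age.
toy = pd.DataFrame({
    "Age": [4.0, 10.0, 25.0, 40.0, 70.0, -1.0],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Drop the guard value so unknown ages do not land in any bin.
known = toy[toy["Age"] >= 0]

# Hypothetical bin edges and labels; intervals are closed on the left.
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
known = known.assign(
    AgeGroup=pd.cut(known["Age"], bins=bins, labels=labels, right=False)
)

# Survived is 0/1, so the group mean is the survival rate.
# observed=True leaves out age groups with no passengers.
rate = known.groupby("AgeGroup", observed=True)["Survived"].mean()

print(rate)
```

On the real data, `toy` would be replaced by `clean_data`, and the bin edges would need to be chosen with the actual age distribution in mind.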