# Titanic Kaggle Competition

## Data Analysis
The first phase is to analyze the dataset in order to learn what information the available data contains.

Context of the dataset:
- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate).
- There were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In [1]:
import pandas as pd
import numpy as np
import random as rnd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

Let's take a look at the feature names:

In [3]:
train.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

Features type:
- Categorical:
    - Nominal:
        - Survived
        - Sex
        - Embarked
    - Ordinal:
        - Pclass
- Numerical:
    - Continuous:
        - Age
        - Fare
    - Discrete:
        - SibSp
        - Parch

We need to know whether there are null values, so that we can fix or ignore the affected features.

In [4]:
print(train.info())
print("-"*50)
print(test.info())
print("-"*50)
print(train.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket      

- The Cabin feature is mostly incomplete in both the training and test datasets. It could be useful, since cabin position may be correlated with survival, but there may not be enough information to complete it correctly, and cabin position is probably correlated with fare. So it may be dropped.
- There may not be a correlation between Ticket and survival.
- We can complete the Embarked feature (only 2 null values).
- We have to complete the Age feature, as we know it is correlated with survival.
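
The null-value check and the Embarked fix listed above can be sketched on a toy frame. The `sample` data below is hypothetical, and filling Embarked with the most frequent port is an assumption, since the notebook does not say which value it uses.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the training set (not the real data).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number and share of missing values per column.
missing = sample.isnull().sum()
missing_share = sample.isnull().mean()

# Assumption: fill the few missing Embarked values with the most frequent port.
most_common_port = sample["Embarked"].mode()[0]
sample["Embarked"] = sample["Embarked"].fillna(most_common_port)
```

Looking at the share of missing values, rather than only the count, makes it easier to decide between completing a feature and dropping it.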

In [5]:
train[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False)

|   | Pclass | Survived |
|---|--------|----------|
| 0 | 1 | 0.62963 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |


In [6]:
train[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)

|   | Sex | Survived |
|---|-----|----------|
| 0 | female | 0.742038 |
| 1 | male | 0.188908 |


In [7]:
train[["SibSp", "Survived"]].groupby(["SibSp"], as_index=False).mean().sort_values(by="Survived", ascending=False)

|   | SibSp | Survived |
|---|-------|----------|
| 1 | 1 | 0.535885 |
| 2 | 2 | 0.464286 |
| 0 | 0 | 0.345395 |
| 3 | 3 | 0.25 |
| 4 | 4 | 0.166667 |
| 5 | 5 | 0.0 |
| 6 | 8 | 0.0 |


In [8]:
train[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

|   | Parch | Survived |
|---|-------|----------|
| 3 | 3 | 0.6 |
| 1 | 1 | 0.550847 |
| 2 | 2 | 0.5 |
| 0 | 0 | 0.343658 |
| 5 | 5 | 0.2 |
| 4 | 4 | 0.0 |
| 6 | 6 | 0.0 |


This confirms that there is a correlation between Pclass/Sex and Survived.
There may also be a correlation between SibSp/Parch and Survived, but some values have a survival rate of 0.
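
A survival rate of exactly 0 or 1 is hard to interpret without knowing how many passengers are in the group. As a minimal sketch, the hypothetical helper `survival_by` below reports the group size next to the rate; the `toy` frame is made-up data, not the real training set.

```python
import pandas as pd

def survival_by(df, feature):
    """Survival rate and group size for each value of `feature` (hypothetical helper)."""
    return (
        df.groupby(feature)["Survived"]
        .agg(rate="mean", count="size")
        .sort_values(by="rate", ascending=False)
        .reset_index()
    )

# Hypothetical toy data, not the real training set.
toy = pd.DataFrame({
    "SibSp": [0, 0, 0, 0, 1, 1, 8],
    "Survived": [1, 0, 0, 0, 1, 1, 0],
})
summary = survival_by(toy, "SibSp")
```

In this toy example the group with `SibSp == 8` has a rate of 0 but contains a single passenger, so the rate alone says little.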

To complete the Age feature we can consider each person's title rather than using the overall average age. So we need to add this new feature, extracting it from the Name feature. Extracting the title may also be useful for obtaining additional information about social status.

## Data wrangling

To summarize what we discovered from the data analysis:
- ...

First of all, we extract and remove the Survived feature and combine the two sets, so that new features can be engineered on both at once.

In [9]:
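# Keep the target aside, then drop it so train and test share the same columns
survived = train['Survived']
train.drop(['Survived'], axis=1, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
titanic = pd.concat([train, test])
titanic.reset_index(inplace=True)
titanic.drop(['index', 'PassengerId'], inplace=True, axis=1)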
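
Once the two sets are combined, they eventually have to be separated again for training and prediction. The notebook does not show that step, so the following is only a sketch on hypothetical miniature frames (`train_small`, `test_small`). It relies on the training rows coming first in the combined frame.

```python
import pandas as pd

# Hypothetical miniature train/test frames standing in for the real CSVs.
train_small = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})
test_small = pd.DataFrame({"PassengerId": [4, 5], "Age": [34.5, 47.0]})

labels = train_small["Survived"]
features = train_small.drop(columns=["Survived"])
n_train = len(features)

# ignore_index=True renumbers the rows 0..n-1 in one step.
combined = pd.concat([features, test_small], ignore_index=True)

# ... feature engineering on `combined` would go here ...

# Training rows come first, so position alone recovers the split.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
```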

Now we can extract the passenger title and map the titles to categories.

In [10]:
titanic["Title"] = titanic["Name"].map(lambda name:name.split(',')[1].split('.')[0].strip())
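
To see what the expression above does, here is the same logic applied to a few names written in the dataset's `Surname, Title. Given names` format. The function name `extract_title` is introduced here only for illustration.

```python
def extract_title(name):
    # Take the text between the first comma and the period that follows it.
    return name.split(',')[1].split('.')[0].strip()

examples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in examples]
print(titles)  # ['Mr', 'Miss', 'the Countess']
```

The last example shows why the mapping dictionary needs the key `"the Countess"` rather than `"Countess"`.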

Let's see what the different titles are:

In [14]:
titanic.groupby(['Title'], as_index=False).size()

Title
Master      61
Miss       262
Mr         757
Mrs        200
Officer     23
Royalty      6
dtype: int64

Some titles apply to only a few people, so we can combine them into a few broader categories.

In [12]:
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Dona": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}
titanic['Title'] = titanic['Title'].map(Title_Dictionary)
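
One thing to watch with this approach: `Series.map` with a dictionary returns NaN for any value that is not a key, so a title missing from the dictionary would silently become a null. A small check, sketched here on hypothetical data (`mapping` and `raw_titles` are made up, and "Dame" is deliberately left out of the mapping), can list such titles.

```python
import pandas as pd

# Hypothetical mapping and titles; "Dame" is deliberately absent from the mapping.
mapping = {"Mr": "Mr", "Mlle": "Miss", "Capt": "Officer"}
raw_titles = pd.Series(["Mr", "Mlle", "Capt", "Dame"])

mapped = raw_titles.map(mapping)

# Titles that had no entry in the mapping end up as NaN after map().
unmapped = sorted(raw_titles[mapped.isnull()].unique())
print(unmapped)  # ['Dame']
```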

Let's see the mean age of these categories:

In [17]:
titanic[["Title", "Age"]].groupby(['Title'], as_index=False).mean().sort_values(by='Age', ascending=False)

|   | Title | Age |
|---|-------|-----|
| 4 | Officer | 46.272727 |
| 5 | Royalty | 41.166667 |
| 3 | Mrs | 36.866279 |
| 2 | Mr | 32.252151 |
| 1 | Miss | 21.795236 |
| 0 | Master | 5.482642 |


We use this data to fill missing ages.

In [15]:
titanic[["Title", "Age"]].groupby(['Title'], as_index=False).mean()

|   | Title | Age |
|---|-------|-----|
| 0 | Master | 5.482642 |
| 1 | Miss | 21.795236 |
| 2 | Mr | 32.252151 |
| 3 | Mrs | 36.866279 |
| 4 | Officer | 46.272727 |
| 5 | Royalty | 41.166667 |
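
The notebook does not show the filling step itself. One possible way to do it, sketched on a hypothetical `people` frame, is to broadcast each title's mean age back to the rows with `groupby(...).transform("mean")` and use it only where Age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real combined dataset.
people = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan],
})

# Mean age of each title, aligned row by row (NaN ages are skipped in the mean).
title_mean_age = people.groupby("Title")["Age"].transform("mean")

# Only the missing ages are replaced; known ages are left untouched.
people["Age"] = people["Age"].fillna(title_mean_age)
print(people["Age"].tolist())  # [30.0, 40.0, 35.0, 4.0, 4.0]
```

Using the median instead of the mean would be an equally reasonable choice if the age distribution within a title is skewed.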


## Model training

...

## Conclusion

...