# Titanic Kaggle Competition

## Data Analysis
The first phase is to analyze the dataset, in order to find out what information the available data contains.

Context of dataset:
- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate).
- There were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In [2]:
import pandas as pd
import numpy as np
import random as rnd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
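# Load the training and test sets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
# Keep both frames in a list so the same transformation can be applied to each
titanic = [train, test]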

Let's take a look at the feature names:

In [11]:
train.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

Feature types:
- Categorical:
    - Nominal:
        - Survived
        - Sex
        - Embarked
    - Ordinal:
        - Pclass
- Numerical:
    - Continuous:
        - Age
        - Fare
    - Discrete:
        - SibSp
        - Parch
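
This classification can be made explicit in pandas through the `category` dtype. The following is only a sketch on a few made-up rows (the `toy` frame stands in for `train`); declaring class 1 as the highest level of `Pclass` is an assumption made here for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `train`.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.0, 8.0, 13.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
})

# Nominal features: categories with no intrinsic order.
for col in ["Sex", "Embarked"]:
    toy[col] = toy[col].astype("category")

# Ordinal feature: declare the order explicitly, from 3rd class up to 1st class.
pclass_type = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy["Pclass"] = toy["Pclass"].astype(pclass_type)

# Survived stays numeric so that its mean can be read as a survival rate.
print(toy.dtypes)
```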

We need to know whether there are null values, in order to decide which features to complete and which to ignore.

In [10]:
print(train.info())
print("-"*50)
print(test.info())
print("-"*50)
print(train.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket      

- The Cabin feature is mostly incomplete in both the training and the test dataset (only 204 of 891 training values are present). It could be useful, as there may be a correlation between cabin position and survival. However, there may not be enough information to complete it correctly, and cabin position is probably correlated with Fare, so it may be dropped.
- There may not be a correlation between Ticket and survival.
- We can complete the Embarked feature (only 2 null values in the training set).
- We have to complete the Age feature, since age is known to be correlated with survival.
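
The points above could be turned into code roughly as follows. This is a sketch on made-up rows (the `toy` frame stands in for `train`), and filling Embarked with the most frequent port is an assumption: the notebook does not state which value is used.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `train`.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "X1", np.nan, np.nan, np.nan],
    "Ticket": ["T1", "T2", "T3", "T4", "T5"],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)

# Drop the columns judged not worth completing.
cleaned = toy.drop(columns=["Cabin", "Ticket"])

# Assumption: fill the few missing Embarked values with the most frequent port.
most_common_port = cleaned["Embarked"].mode()[0]
cleaned["Embarked"] = cleaned["Embarked"].fillna(most_common_port)
print(cleaned)
```

Age is left untouched here because it needs a more careful strategy than a single constant.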

In [6]:
print(train[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False))
print("-"*10)
print(train[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False))
print("-"*10)
print(train[["SibSp", "Survived"]].groupby(["SibSp"], as_index=False).mean().sort_values(by="Survived", ascending=False))

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
----------
      Sex  Survived
0  female  0.742038
1    male  0.188908
----------
   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000


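This confirms that there is a correlation between Pclass/Sex and Survived: first-class passengers and women survived at clearly higher rates.
There may also be a correlation between SibSp and Survived, although the relationship is not monotonic.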
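
Because the SibSp relationship is less clear-cut, it may help to look at how many passengers each group contains next to its survival rate, since a rate computed on a handful of rows is not very reliable. A minimal sketch on made-up rows (the `toy` frame stands in for `train`; the column names `survival_rate` and `passengers` are chosen here for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `train`.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 5],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0],
})

# Survival rate and group size per SibSp value, highest rate first.
summary = (
    toy.groupby("SibSp")["Survived"]
       .agg(survival_rate="mean", passengers="count")
       .sort_values(by="survival_rate", ascending=False)
)
print(summary)
```

A group with very few passengers, such as `SibSp == 5` in the toy data, should be read with caution.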

To complete the Age feature we can use each person's title (Mr, Mrs, Miss, ...) rather than filling in the overall average age. So we need to add this new Title feature, extracting it from the Name feature.
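
One possible way to do this is sketched below. It assumes names follow the pattern "Surname, Title. Given names" and uses the median age per title as the fill value; both the regular expression and the choice of the median are assumptions, not necessarily what the rest of the notebook does. The rows are made up and the `toy` frame stands in for `train`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; in the notebook this would be `train`.
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann",
             "Poe, Mr. Sam", "Roe, Miss. Eve"],
    "Age": [40.0, 38.0, 4.0, np.nan, np.nan],
})

# Title = the first word that is preceded by a space and followed by a period.
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Median age of each title, aligned row by row with the original frame.
title_median = toy.groupby("Title")["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged.
toy["Age"] = toy["Age"].fillna(title_median)
print(toy)
```

A title whose ages are all missing would still produce a null value, so a fallback such as the overall median may be needed on the real data.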