<h2>Exploratory Data Analysis Practice</h2>

<h4>Titanic: Machine Learning from Disaster <a href="https://www.kaggle.com/c/titanic/data">On Kaggle</a></h4>

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

In [1]:
# 1. Importing libraries and viewing the different variables/columns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic = pd.read_csv("./titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<i><b>From the data above, PassengerId is only a row identifier, and Name, Sex, Ticket, Cabin and Embarked are non-numeric features, so we drop all of them and keep the remaining numeric columns.</b></i>
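Rather than reading the column types off the preview by eye, they can be checked programmatically. The following is a minimal sketch, assuming a small stand-in frame named `sample` (hypothetical, built from a few of the rows shown above) in place of the full `titanic` DataFrame; `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `titanic` DataFrame,
# using a few columns and rows from the preview above.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 26.0, 35.0],
    "Embarked": ["S", "S", "S"],
})

# Columns whose dtype is not numeric (text or categorical).
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Note that PassengerId does not appear in this list because it is stored as an integer; it is dropped for a different reason, namely that it only identifies the row.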

In [2]:
# Drop the identifier column and the non-numeric (text or categorical) features

drop_list = ["PassengerId", "Name", "Ticket", "Sex", "Cabin","Embarked"]
titanic.drop(drop_list, axis=1, inplace=True)
titanic.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
0,0,3,22.0,1,0,7.25
1,1,1,38.0,1,0,71.2833
2,1,3,26.0,0,0,7.925
3,1,1,35.0,1,0,53.1
4,0,3,35.0,0,0,8.05


<i><b>We have dropped the unwanted features. Now we need details about the available numeric features.</b></i>

In [3]:
titanic.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


<i><b>We can observe from the above data: 
    <ol>
        <li>Age has missing values: its count is only 714, whereas every other feature has a count of 891.</li>
        <li>Since Survived is binary (0 or 1), its mean tells us that 38.38% of the 891 passengers in this dataset survived.</li>
        <li>Pclass is either 1, 2 or 3. Similarly, SibSp ranges from 0 to 8 and Parch from 0 to 6.</li>
    </ol>
    Now we need to determine which features are strong indicators of survival, i.e. look at each feature and compare the passengers who survived with those who did not.
</b></i>
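The first two observations can also be confirmed directly instead of being inferred from the `describe()` table. This is a minimal sketch, assuming a hypothetical stand-in frame `sample`; one Age value is deliberately set to `NaN` for illustration and is not real data. In the notebook, the same calls would be applied to `titanic`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `titanic`; the NaN in Age is inserted
# on purpose to illustrate the missing-value check.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
survival_rate = sample["Survived"].mean()

print(missing_per_column)
print(f"Survival rate: {survival_rate:.2%}")
```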

In [4]:
# Group the data into two: survived and not survived. Find mean of all features for each group.
titanic.groupby('Survived').mean()

Survived,Pclass,Age,SibSp,Parch,Fare
0,2.531876,30.626179,0.553734,0.32969,22.117887
1,1.950292,28.34369,0.473684,0.464912,48.395408


<ul><li><i><b>A simple observation from the above is that the average age of survivors is 28.34, versus 30.63 for those who did not survive. But Age has missing values, so we need to look at it more closely.</b></i></li>
    <li><i><b>The differences in Fare and Pclass between the two groups stand out, so these features might be useful.</b></i></li></ul>
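One way to follow up on Fare and Pclass is to look at survival from the other direction: the survival rate within each class, and a typical fare for each outcome. The sketch below assumes a hypothetical stand-in frame `sample` built from the five preview rows; the median is used for Fare as an assumption on my part, since the `describe()` output suggests fares are strongly skewed (maximum 512.33 against a median of 14.45).

```python
import pandas as pd

# Hypothetical stand-in for `titanic`, taken from the five preview rows.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Fraction of passengers who survived within each passenger class.
survival_by_class = sample.groupby("Pclass")["Survived"].mean()

# Median fare paid by non-survivors (0) and survivors (1).
fare_by_outcome = sample.groupby("Survived")["Fare"].median()

print(survival_by_class)
print(fare_by_outcome)
```

With only five rows the numbers are not meaningful; on the full dataset the same two group-bys would show whether the gap seen in the means holds per class.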

In [None]:
# Check whether a missing Age is related to the other features: group rows by whether Age is null and compare the means
titanic.groupby(titanic["Age"].isnull()).mean()