<h1>Purpose: To identify which attributes contributed
to survival.</h1>

<h3>Attributes</h3>
<ul>
    <li>Age, Sex</li>
    <li>Class</li>
    <li>Embarked</li>
</ul>

In [None]:
from IPython.display import Image
Image(url='http://1.bp.blogspot.com/-Fvx5ut4Tezw/VLryIq1oiuI/AAAAAAAABY4/XuEDMiT3mJE/s1600/titanic-7.jpg',width=400)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Download the data from <a href="https://www.kaggle.com/c/titanic">Kaggle</a>, save it locally, and load it as follows:

In [None]:
#data
test=pd.read_csv("data/Titanic/test.csv")
train=pd.read_csv("data/Titanic/train.csv")

In [None]:
#General info.
train.describe()

In [None]:
train.info()

In [None]:
test.describe()

In [None]:
test.info()

In [None]:
train.head()

Data Cleaning
---
**Age, Sex**

1. Remove the rows with missing data. This is the simplest way, but is it always reasonable?
2. Replace the missing data with a reasonable value, i.e. one within **one standard deviation of the average**, $\bar X\pm 1\cdot S$,
   using the sample mean $\bar X$ and sample standard deviation $S$ as estimates of $\mu$ and $\sigma$ respectively.
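
The two strategies can be compared on a small example. The sketch below is only an illustration: the toy `ages` Series and the helper name `fill_within_one_std` are hypothetical and not part of the Titanic data. It assumes integer replacements drawn uniformly between the sample mean minus and plus one sample standard deviation.

```python
import numpy as np
import pandas as pd


def fill_within_one_std(series, seed=0):
    """Return a copy with NaNs replaced by random integers drawn
    uniformly from [mean - std, mean + std) of the observed values."""
    rng = np.random.default_rng(seed)
    mean, std = series.mean(), series.std()   # NaNs are skipped by default
    low, high = int(mean - std), int(mean + std)
    missing = series.isnull()
    filled = series.copy()
    filled.loc[missing] = rng.integers(low, high, size=missing.sum())
    return filled


# Hypothetical toy ages, not taken from the Titanic data
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0])

dropped = ages.dropna()              # option 1: discard the missing rows
filled = fill_within_one_std(ages)   # option 2: impute plausible values

print(len(dropped), "rows kept after dropping")
print(filled.tolist())
```

Dropping rows shrinks the sample, while filling keeps every row at the cost of adding artificial values, so plots of the filled column partly reflect the imputation rule.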

In [None]:
from numpy import exp,pi,sqrt
from scipy.stats import gaussian_kde
import seaborn as sns 

plt.figure(figsize=(10,4))

# sample data
p=1000
x=np.random.normal(loc=0.0, scale=1.0, size=p)
sns.kdeplot(x, bw_method=0.5, label='Sample Data')

# pdf of standard normal random variable
t=np.linspace(np.min(x),np.max(x),p)
f=1/sqrt(2*pi)*exp(-t**2/2)
plt.plot(t,f,'r--',label='pdf of $N(0,1)$')

plt.plot([1,1],[0,exp(-1/2)/sqrt(2*pi)],'r-')
plt.plot([-1,-1],[0,exp(-1/2)/sqrt(2*pi)],'r-')
plt.text(-0.5,0.1,'68.2%',size=16,color='red')
plt.text(-1.4,-0.03,r'$-1\cdot\sigma$',color='red')
plt.text(0.8,-0.03,r'$1\cdot\sigma$',color='red')
plt.legend()

In [None]:
# Count the number of missing values.
missing_values=train["Age"].isnull().sum()
missing_values

In [None]:
# There are 177 missing values. Compute the mean and standard deviation used to fill them in.
age_mean=train["Age"].mean()
age_std=train["Age"].std()

print("mean : "+str(age_mean))
print("std : "+str(age_std))

In [None]:
# Generate one random integer age per missing value, within one standard deviation of the mean.
random_values=np.random.randint(int(age_mean-age_std),int(age_mean+age_std),size=missing_values)

In [None]:
# Assign the random values to the rows where Age is null.
train.loc[train["Age"].isnull(),"Age"]=random_values

In [None]:
# FacetGrid: age distribution by sex
fig = sns.FacetGrid(train, hue="Sex",aspect=3)
fig.map(sns.kdeplot,'Age',fill=True)

# set the limit of X-axis
oldest = train['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

Create an additional column, **person**, which labels each passenger as a child (age &lt; 15) or otherwise by sex:

In [None]:
def male_female_child (passenger):
    age, sex = passenger
    if age < 15:
        return "child"
    else:
        return sex

In [None]:
# Add a new column.
train["person"]=train[["Age","Sex"]].apply(male_female_child, axis=1)
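
The same rule can also be written without a row-wise `apply`. The following is a minimal sketch on a hypothetical toy frame (the `toy` rows are made up for illustration), using `np.where` to apply the age threshold to the whole column at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers, for illustration only
toy = pd.DataFrame({
    "Age": [4.0, 30.0, 14.0, 15.0, 62.0],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Same rule as male_female_child: under 15 is "child", otherwise keep Sex
toy["person"] = np.where(toy["Age"] < 15, "child", toy["Sex"])

print(toy)
```

As in the row-wise function, a missing age compares as not less than 15, so such a passenger keeps the value of `Sex`.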

In [None]:
train.head(10)

In [None]:
fig = sns.FacetGrid(train, hue="person",aspect=3)
fig.map(sns.kdeplot,'Age',fill=True)
oldest = train['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

In [None]:
# The relationship between "Age" and "Survived".
sns.lmplot(x="Age", y="Survived", data=train)

<p>Generally speaking, the younger a passenger was, the more likely they were to survive.</p>
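
A regression line on a 0/1 outcome can be hard to read, so the trend can be cross-checked by grouping ages into bands. The sketch below is an assumption-laden illustration: the helper name `survival_rate_by_age_band`, the bin edges, and the toy rows are all hypothetical and not results from the Titanic data.

```python
import pandas as pd


def survival_rate_by_age_band(df, bins=(0, 15, 30, 45, 60, 100)):
    """Mean of Survived within each age band (bin edges are an assumption)."""
    bands = pd.cut(df["Age"], bins=list(bins), right=False)
    return df.groupby(bands, observed=False)["Survived"].mean()


# Hypothetical toy rows, not the real Titanic data
toy = pd.DataFrame({
    "Age": [3, 10, 22, 28, 40, 44, 50, 70],
    "Survived": [1, 1, 1, 0, 0, 1, 0, 0],
})

rates = survival_rate_by_age_band(toy)
print(rates)
```

Calling `survival_rate_by_age_band(train)` on the real data would give one survival rate per band, which can be compared with the slope of the fitted line.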

In [None]:
sns.lmplot(x="Age", y="Survived", hue="Sex", data=train, palette="winter")

<h2>Summary</h2>
<p>There were three groups: male, female, and child. (1) Male: the older the passenger, the lower the chance of survival. (2) Female: the opposite of male; the older the passenger, the higher the chance of survival. (3) Child: survived with high probability. We can guess that children were given first priority in the evacuation.</p>

<h1>Class</h1>

In [None]:
# Count the number of passengers in each class.
print(train.Pclass.value_counts(dropna=False))

In [None]:
sns.countplot(x="Pclass", data=train)

In [None]:
# Broken down by sex.
sns.countplot(x="Pclass", data=train, hue="Sex")

In [None]:
# Broken down by person (male, female, child).
sns.countplot(x="Pclass", data=train, hue="person")

In [None]:
# The relationship between "Pclass" and "Survived".
sns.catplot(x="Pclass", y="Survived", data=train, order=[1,2,3], kind="point", aspect=2)

In [None]:
sns.catplot(x="Pclass", y="Survived", hue="person", data=train, order=[1,2,3], kind="point", aspect=2)

<h2>Summary</h2>
<p>Class 1 had the highest survival rate and Class 3 the lowest: Class 1 passengers were roughly three times as likely to survive as Class 3 passengers. However, when the data are split by "person" (male, female, child), the picture is slightly different.</p>
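
The comparison between classes can be made numeric with a group-wise mean of `Survived`. This is a minimal sketch: the helper name `survival_rate_by` and the toy rows are hypothetical, chosen only so the arithmetic is easy to follow, and they are not the real Titanic figures.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Share of survivors within each value of `column`."""
    return df.groupby(column)["Survived"].mean()


# Hypothetical toy rows, not the real Titanic data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

rates = survival_rate_by(toy, "Pclass")
ratio = rates.loc[1] / rates.loc[3]   # how many times higher class 1 is than class 3

print(rates)
print("class 1 vs class 3 ratio:", ratio)
```

On the real data, `survival_rate_by(train, "Pclass")` gives the heights of the points in the plot above, and `survival_rate_by(train, ["Pclass", "person"])` gives the split by person.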

<h1>Place</h1>

In [None]:
sns.countplot(x="Embarked", data=train)

In [None]:
sns.countplot(x="Embarked", data=train, hue="Pclass")

In [None]:
sns.catplot(x="Embarked", y="Survived", data=train, kind="point", aspect=2)

<h2 style="font-family:georgia;background:black;color:yellow;">Question</h2>
<p style="font-family:georgia;font-size:1.4em">Passengers who embarked at "C" had the highest survival rate in this shipwreck. Can we conclude that this is because a larger share of the passengers who embarked at "C" held Class 1 tickets than of those who embarked elsewhere?</p>
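
One way to approach this question is to look at the class mix at each port and then compare ports within a single class. The sketch below uses hypothetical toy rows (not the real Titanic data), and the variable names `class_mix` and `rate_by_class_and_port` are illustrative.

```python
import pandas as pd

# Hypothetical toy rows, not the real Titanic data
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Share of each ticket class among the passengers from each port (rows sum to 1)
class_mix = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

# Survival rate for each (class, port) pair; comparing ports within one
# class holds the ticket class fixed
rate_by_class_and_port = toy.pivot_table(
    index="Pclass", columns="Embarked", values="Survived", aggfunc="mean"
)

print(class_mix)
print(rate_by_class_and_port)
```

If the ports have similar survival rates within each class, the difference between ports is mostly explained by the class mix; if "C" remains higher within a class, something else is also at work.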

In [None]:
from IPython.display import Image
Image(url='http://1.bp.blogspot.com/-E4r73C2lSyw/UFDGu4_9jBI/AAAAAAAAANc/ao7fY_fLotQ/s1600/TITANIC.jpg')

In [None]:
sns.catplot(x='Pclass', y='Survived', data=train, kind='point', aspect=2)

In [None]:
sns.catplot(x='Embarked', data=train, hue='Pclass', kind='count', aspect=2)

In [None]:
# Deck is the first letter of the cabin number; deck "T" is treated as missing.
train["Deck"] = train.Cabin.str[0].map(lambda s: np.nan if s == "T" else s)
train["Deck"] = train["Deck"].fillna('Null')

sns.catplot(x='Deck', data=train, kind='count', palette='spring_d', aspect=2)