<p><h1><span style="color:#2f4f4f"><u>Project Titanic Data Analysis</u></span></h1></p>

The findings in the report presented below are tentative.

## Introduction
This is my final project: a data analysis of the Titanic data provided for 891 passengers on board the ship.

## Questions Regarding This Dataset
1. What is the survival rate for each gender?
2. Does the class of travel determine the survival rate? 
3. Does age affect the survival rate?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import csv

In [None]:
# read the titanic_data.csv file into a dataframe
titanic_df = pd.read_csv('titanic_data.csv')

# show the first rows to see which fields titanic_data.csv contains
print (titanic_df.head())

In [None]:
# shows the number of missing 'age' values in the data
titanic_df[titanic_df['Age'].isnull()]['PassengerId'].count()

In [None]:
# shows the number of missing 'age' values by Pclass
print ("Pclass 1:")
print (titanic_df[(titanic_df['Age'].isnull()) & (titanic_df['Pclass']==1)]['Sex'].value_counts())
print ("")
print ("Pclass 2:")
print (titanic_df[(titanic_df['Age'].isnull()) & (titanic_df['Pclass']==2)]['Sex'].value_counts())
print ("")
print ("Pclass 3:")
print (titanic_df[(titanic_df['Age'].isnull()) & (titanic_df['Pclass']==3)]['Sex'].value_counts())

### Cleaning up missing values under 'Age' column
The missing values under the 'Age' column in this dataset are replaced by the median age of each class and sex group. This is an <em><b>assumption</b></em> made for lack of information.
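As a minimal sketch of this kind of group-wise imputation, the toy example below fills each missing age with the median of its (Sex, Pclass) group. The `toy` frame and its values are made up for illustration and are not taken from titanic_data.csv; the `Age_filled` column name is likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from titanic_data.csv
toy = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age':    [20.0, 30.0, np.nan, 34.0, np.nan, 36.0],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Keep the known ages and fill only the missing ones with their group's median
toy['Age_filled'] = toy['Age'].fillna(group_median)
print(toy)
```

Writing the result to a separate column keeps the original values available, which makes it easy to compare the distribution before and after filling.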

In [None]:
# using Pclass and Sex to find the median age for each Pclass and Sex
print ("Age median values by Pclass and Sex:")
print (titanic_df.groupby(['Sex','Pclass'], as_index=False)['Age'].median())

# then fill the missing Age values with the median of the matching Pclass and Sex group
titanic_df['Age'] = titanic_df.groupby(['Sex','Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))

### Question 1: What is the survival rate for each gender?

In [None]:
# Categorising by age, where a Child is between 0-13 yo, an Adult 14-40 yo and a Senior 41-90 yo
titanic_df['Age_Cust'] = pd.cut(titanic_df['Age'], bins = [0,13,40,90] , labels=['Childs','Adults','Seniors'])

In [None]:
sns.catplot(x='Age_Cust', data = titanic_df, kind='count')
plt.ylabel('Number of people on board ship')
plt.title('Age distribution in ship')

#### The age distribution shows that a large share of passengers belongs to the adult group, aged 14 to 40.
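To make the bin boundaries concrete, here is a small sketch of how `pd.cut` assigns ages to the three groups with the same bin edges and labels as above. The sample ages are hypothetical. By default the bins are closed on the right, so 13 counts as a child and 40 as an adult, while an age of exactly 0 would fall outside the first bin.

```python
import pandas as pd

# Hypothetical sample ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 13, 14, 40, 41, 80])

# Same edges and labels as the Age_Cust column: (0, 13], (13, 40], (40, 90]
groups = pd.cut(ages, bins=[0, 13, 40, 90], labels=['Childs', 'Adults', 'Seniors'])
print(groups.tolist())
```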

### Now we look at the survival rate across the ages

In [None]:
figure = plt.figure(figsize=(15,6))
plt.hist([titanic_df[titanic_df['Survived']==1]['Age'], titanic_df[titanic_df['Survived']==0]['Age']], stacked=True, bins =50, color = ['g','r'], label = ['Survived','Dead'])
plt.title('Survive/Fatality counts across all ages')
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.grid(True)
plt.legend()

In [None]:
sns.barplot(x='Age_Cust', y='Survived', data = titanic_df, ci=None)
plt.title('Survival by Age Category')
plt.ylabel('Mean of Survival')

#### The first chart shows the survival and fatality counts across all ages. A larger share of passengers under the age of 7 survived compared to the older passengers.

#### The second chart shows clearly that a child (0-13 yo) is much more likely to survive than an adult or a senior.

In [None]:
sns.catplot(x='Age_Cust', y='Survived', data = titanic_df, hue='Sex', kind='bar', ci=None)
plt.title('Survival by Age and Sex')
plt.ylabel('Mean of Survival')

#### Comparing survival by age and sex suggests that children and women were given priority to escape during the shipwreck.

In [None]:
sns.catplot(x='Age_Cust', y='Survived', data = titanic_df, hue='Sex', kind='bar', col='Pclass', ci=None)

#### It also seems that passengers in the higher classes, who paid a higher fare, had a better chance of survival.

### Just Curious - Which are the 3 Biggest Families Onboard?
Assumption: The ones with the same last name are part of the same family.

In [None]:
total_counts = titanic_df['Sex'].value_counts()
survived = titanic_df[titanic_df['Survived']==1]['Sex'].value_counts()
dead = titanic_df[titanic_df['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived,dead])
df.index = ['Survived','Dead']
df.plot(title="Survival Count by Gender", kind='barh',stacked=True)

print ("Total on Board:")
print (total_counts)
print ("")
print ("Total Survived:")
print (survived)

In [None]:
survival_rates_by_gender = survived / total_counts

print ("Survival Rates:")
print (survival_rates_by_gender)

#### Of the 577 males on board the ship, 109 survived the shipwreck, a 19% survival rate. Of the 314 females on board, 233 survived, a 74% survival rate.

#### Purely by looking at the survival rates between genders, females are more likely to survive than males. This could be largely due to the evacuation protocol when the shipwreck happened. Women were likely given first priority to evacuate, before men.
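The rates quoted above follow directly from the counts. As a quick check, the arithmetic can be reproduced in a few lines using only the figures stated in the text (the variable names are illustrative):

```python
# Counts as quoted in the text above
males_total, males_survived = 577, 109
females_total, females_survived = 314, 233

# Survival rate = survivors / passengers of that gender
male_rate = males_survived / males_total
female_rate = females_survived / females_total

print(f"male: {male_rate:.1%}, female: {female_rate:.1%}")
```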

### Question 2: Does the class of travel determine the survival rate?

In [None]:
# calculating the survived class and dead class
survived_class = titanic_df[titanic_df['Survived'] == 1]['Pclass'].value_counts()
dead_class = titanic_df[titanic_df['Survived'] == 0]['Pclass'].value_counts()

In [None]:
# sort the class columns as 1, 2, 3 so that they match the legend labels below
df = pd.DataFrame([survived_class, dead_class]).sort_index(axis=1)
df.index = ['Survived', 'Dead']
df.plot(kind='bar', title= 'Survival Count by PClass', figsize=(9, 8))
plt.legend(['1st class', '2nd class', '3rd class'], loc='upper left')

In [None]:
print (survived_class/(survived_class+dead_class))

#### The chart above shows the survival counts for the different classes, and Pclass 1 passengers clearly have the highest survival count.
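Counts alone depend on how many passengers travelled in each class, so it can help to look at counts and rates side by side. The sketch below does this on a small made-up frame (not the Titanic data). Because `Survived` is coded 0/1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from titanic_data.csv
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
})

# Per class: number of survivors, number of passengers, and survival rate
summary = toy.groupby('Pclass')['Survived'].agg(['sum', 'count', 'mean'])
summary.columns = ['survived', 'total', 'rate']
print(summary)
```

In this toy data classes 1 and 3 have the same survival count, yet class 3 has the lowest rate, which is why the ratio printed in the cell above is the better basis for comparing classes.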

In [None]:
titanic_df.groupby(['Pclass'])['Fare'].mean().plot(kind='bar')
plt.ylabel('Fare')
plt.title('Mean fare by Pclass')

#### The above chart shows the mean fare of each class. Class 1 passengers paid a distinctly higher fare, which seems to suggest that a higher fare meant a higher chance of survival.

### Question 3: Does age affect the survival rate?
Earlier, the 177 missing 'Age' fields were filled in with the median age of passengers of the same Pclass and Sex. <br>

<span style = "color:gray">
<b>Note:</b>
<br>
<em><b>177 passengers missing age: </b><br>
Pclass 1 = 21m, 9f <br>
Pclass 2 = 9m, 2f <br> 
Pclass 3 = 94m, 42f
<br><br>
<b>Median age of passengers to be filled in: </b><br>
Female, PClass 1 = 35 years old<br>
Female, PClass 2 = 28 years old<br>
Female, PClass 3 = 21.5 years old<br>
Male, PClass 1 = 40 years old<br>
Male, PClass 2 = 30 years old<br>
Male, PClass 3 = 25 years old</em>
</span>
<br><br>
The chart below compares the "Default" ages, where the null fields were dropped from the statistics, with the imputed ages, where the median age was pre-filled based on the class and sex of the passenger.

In [None]:
# take a look at the defaulted age vs. the one we filled using the median of Pclass and Sex
figure = plt.figure(figsize=(15,6))
titanic_df2 = pd.read_csv("titanic_data.csv")
# histplot replaces the deprecated distplot (needs seaborn 0.11 or later)
sns.histplot(titanic_df2.Age.dropna(), kde = False, label = 'Default')
sns.histplot(titanic_df.Age, kde = False, label = 'Imputed')
plt.title('Comparison between Default and Imputed Age Values')
plt.ylabel('Number of passengers')
plt.legend()

#### Imputing the age this way did affect the age distribution, producing tremendous peaks at the 25-yo and 40-yo bars.

#### The results are skewed because the ages of 94 male passengers in Pclass 3 were pre-filled with that group's median age of 25 years. Likewise, there is an unusual spike at the 40-yo bar, because the ages of 21 male passengers in Pclass 1 were pre-filled with that group's median age of 40 years.
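The spike effect can be reproduced on a tiny made-up series: every missing value receives the same fill value, so the count at that single age jumps. This is only an illustrative sketch with hypothetical numbers, not the Titanic ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three missing values, NOT taken from titanic_data.csv
ages = pd.Series([22.0, 25.0, 28.0, np.nan, np.nan, np.nan, 19.0, 31.0])

# The median ignores missing values; here it is the middle of 19, 22, 25, 28, 31
fill_value = ages.median()
imputed = ages.fillna(fill_value)

# Number of passengers recorded at exactly the fill value, before and after
before = int((ages == fill_value).sum())
after = int((imputed == fill_value).sum())
print(fill_value, before, after)
```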

In [None]:
survived = titanic_df[titanic_df['Survived']==1]['Age'].describe()
dead = titanic_df[titanic_df['Survived']==0]['Age'].describe()

print ("Survived Age:")
print (survived)
print ("")
print ("Dead Age:")
print (dead)

#### The mean age of survivors is 28 years, not hugely different from the mean age of fatalities at 30 years. This could also mean that a large percentage of passengers falls within the adult age range of 14 - 40 years old.

#### Exploring that further, we categorize the age of a child, adult and senior accordingly.

In [None]:
def last_name(name):
    split_name = name.split(",")
    last_name = split_name[0]
    return last_name
    
family = titanic_df['Name'].apply(last_name)
print (family.value_counts().sort_values(ascending=False).head())

#### The 3 biggest families on board the Titanic are the Andersson, Sage and Johnson families.

### Going deeper into Andersson's Family...

In [None]:
titanic_df[(titanic_df['Name'].apply(last_name))=='Andersson']

#### Supposing that all 9 passengers named "Andersson" are from the same family, there are only 2 survivors: a female aged 17 and a male aged 27. Ironically, the fares paid by these two are the lowest in the family, and both hold a different ticket number.

### Comparing Survival with Gender, Age and Fare