# Titanic Data Analysis
## Overview
The data were obtained from the Kaggle website and contain demographic and passenger information for 891 of the 2224 passengers and crew on board the Titanic.
### Data Dictionary Table
|  Variable |  Definition                                  |  Key                                            |
|-----------|----------------------------------------------|-------------------------------------------------|
|  Survival |  Survival                                    |  0 = No, 1 = Yes                                |
|  Pclass   |  Ticket class                                |  1 = 1st, 2 = 2nd, 3 = 3rd                      |
|  Sex      |  Sex                                         |                                                 |
|  Age      |  Age in years                                |                                                 |
|  Sibsp    |  # of siblings / spouses aboard the Titanic  |                                                 |
|  Parch    |  # of parents / children aboard the Titanic  |                                                 |
|  Ticket   |  Ticket number                               |                                                 |
|  Fare     |  Passenger fare                              |                                                 |
|  Cabin    |  Cabin number                                |                                                 |
|  Embarked |  Port of Embarkation                         |  C = Cherbourg, Q = Queenstown, S = Southampton |
### Variable Notes

**Pclass:** A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

**Age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

**Sibsp:** The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

**Parch:** The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

## Exploring the factors that made people more likely to survive
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. Based on the Data Dictionary Table, the following questions serve as sanity checks on the chance of survival:
1. SES: How did socio-economic status affect survival?
2. Sex: Did females have priority for the lifeboats?
3. Age: Did children have a better chance of surviving than adults?
4. Family: How did passengers travelling alone fare compared with those travelling with family?
5. Fare: How does the fare relate to SES?
6. Crew: What happened to the crew?

Finally, a hypothesis test will be used to make a more rigorous judgement about one of the above factors.

## Load Data from CSV

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
import plotly.figure_factory as ff
from scipy.stats import ttest_ind

# Initiate the Plotly Notebook mode.
init_notebook_mode()

# Show plots inline (avoids %pylab, which floods the namespace with numpy and matplotlib names).
%matplotlib inline

# Read in the data from titanic-data.csv and store the results in a variable.
# Then look at the first 5 rows of the dataframe.
titanic_df = pd.read_csv('titanic-data.csv')
titanic_df.head()

In [None]:
# Generate some statistics.
titanic_df.describe()

## Investigating the Data
The tables above reveal some data quality problems, such as missing values in the Cabin column as well as in the Age column. From the statistics table it can be computed that about 20% of the age data are missing. The following code cells investigate:
* the number of missing values,
* the number of duplicated entries,
* the type of data, and
* if every value is in accordance with the Data Dictionary Table.
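
As a rough illustration of the 20% figure, the share of missing values per column can also be computed directly instead of being read off the `describe()` counts. The small frame below is made-up stand-in data, not the Kaggle file, and `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for titanic_df; only the column names mirror the real file.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 4.0, np.nan],
    'Cabin': [None, 'C85', None, None, None],
})

def missing_share(df):
    """Return the percentage of missing values in each column."""
    # isna() gives a boolean frame; the mean of each column is its missing fraction.
    return (df.isna().mean() * 100).round(2)

print(missing_share(toy_df))
```

Applied to the real data, the same call would be `missing_share(titanic_df)`.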

In [None]:
# Count the NaN missing values in column Cabin.
cabin_missing_values = titanic_df['Cabin'].isnull().sum()
print('Number of missing values on Cabin column: {}'.format(cabin_missing_values))

In [None]:
# Detect if there are more missing values.
missing_values = titanic_df.isnull().sum()
print('Number of missing values: \n{}'.format(missing_values))

In [None]:
# Find if there are any duplicated records
duplicated_records = titanic_df.duplicated().sum()
print('Number of duplicated records: {}'.format(duplicated_records))

In [None]:
# Check the data types
titanic_df.dtypes

In [None]:
# Make a stripped down df with columns of interest, in order to look into their unique values.
check_columns_df = titanic_df.drop(['PassengerId', 'Name', 'Age', 'Ticket', 'Fare', 'Cabin'], axis='columns')

# A function that finds the unique values.
def find_unique_entries(columnName):
    print('Unique values of {} column'.format(columnName))
    print(check_columns_df[columnName].unique())
    print('')

# Iterate over the check_columns df.
for column in check_columns_df:
    find_unique_entries(column)

**Note:** The previous analysis shows that the Cabin column has too few values to support any statistical analysis. Additionally, alphanumeric columns and columns irrelevant to the analysis can be removed. On the other hand, there are still enough Age entries for statistical computations. There is also no need to fix any data types, as all the values are as expected.

## Cleaning the Data
Remove the columns that are considered irrelevant to the analysis.

In [None]:
# Remove the Name, Ticket, Cabin and Embarked columns and rename PassengerId to Passengers in a new stripped down df.
titanic_cleaned_df = titanic_df.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis='columns')
titanic_cleaned_df = titanic_cleaned_df.rename(columns={'PassengerId':'Passengers'})
titanic_cleaned_df.head()

## Descriptive Statistics
### Visualised Overview
Create some plots to familiarise with the data.

In [None]:
# Make a histogram of the passenger fares.
plt.figure(figsize=(15,5)) # define the plots size
plt.subplot(1,2,1) # Put first plot in first column of line.
plt.hist(titanic_cleaned_df['Fare'], bins=70, alpha=0.7)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Distribution of Fares')

# Same histogram grouped by gender.
plt.subplot(1,2,2) # Put second plot in second column of line.
for Sex, Fare in titanic_cleaned_df.groupby('Sex')['Fare']:
    Fare.hist(bins=70, alpha=0.5, label=Sex)
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Distribution of Fares by Gender')
plt.legend()

The histograms show that the distribution of fares is positively skewed, as expected. There are also several zero-fare values. These entries may belong to crew members.
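
The skew visible in the histogram can also be checked numerically. The sketch below uses a handful of made-up fares (not values taken from the dataset) and relies on two common indicators of positive skew: a sample skewness above zero and a mean that exceeds the median.

```python
import pandas as pd

# Made-up fares: many cheap tickets and a few very expensive ones.
toy_fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 71.28, 512.33])

skewness = toy_fares.skew()        # sample skewness; > 0 indicates a long right tail
mean_fare = toy_fares.mean()
median_fare = toy_fares.median()

print('Skewness: {:.2f}'.format(skewness))
print('Mean: {:.2f}, median: {:.2f}'.format(mean_fare, median_fare))
```

On the notebook's data, the same check would be `titanic_cleaned_df['Fare'].skew()`.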

In [None]:
# Calculate the '0' fares.
zero_fares = titanic_df.loc[titanic_df['Fare'] == 0]['Fare'].count()
print('Number of zero fares: {}'.format(zero_fares))

Create some bar charts to present the number of passengers by ticket class and family relations.

In [None]:
# Make a function that creates a plot.
def create_plot(column, byVariable=None):
    sns.countplot(x=column, hue=byVariable, data=titanic_cleaned_df, palette='Set2', alpha=0.7)
    plt.ylabel('Number of Passengers')
    
# Call the plot function to create several graphs.
plt.figure(figsize=(15,10))
plt.subplot2grid((2,2), (0,0))
create_plot('Pclass')
plt.xlabel('Ticket Class')
plt.subplot2grid((2,2), (0,1))
create_plot('Pclass', 'Sex')
plt.xlabel('Ticket Class')
plt.subplot2grid((2,2), (1,0))
create_plot('SibSp')
plt.xlabel('No. of Siblings / Spouses aboard the Titanic')
plt.subplot2grid((2,2), (1,1))
create_plot('Parch')
plt.xlabel('No. of Parents / Children aboard the Titanic')

Create a box plot of age by ticket class.

In [None]:
# Create the box plot for Pclass and Age.
plt.figure(figsize=(10,5))
sns.boxplot(x='Pclass', y='Age', data=titanic_cleaned_df, palette='Set2')
plt.xlabel('Ticket Class')

The box plot shows that the median age in ticket class 1 is higher than in classes 2 and 3, meaning that passengers in classes 2 and 3 tended to be younger than those in class 1. Additionally, several outliers appear in classes 2 and 3.
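
The medians read off the box plot can be confirmed with a group-by. This is a minimal sketch on made-up ages; with the notebook's data the frame would be `titanic_cleaned_df`.

```python
import numpy as np
import pandas as pd

# Made-up ages per ticket class, including one missing value.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Age': [38.0, 54.0, np.nan, 29.0, 34.0, 14.0, 22.0, 26.0, 2.0],
})

# median() skips NaN ages by default, so missing values do not need to be dropped first.
median_age_by_class = toy_df.groupby('Pclass')['Age'].median()
print(median_age_by_class)
```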

### Answering the questions
#### 1. Looking into the socio-economic status factor

In [None]:
# Make a function that summarizes a df.
def summarise_df(df, column):
    # Count the passengers and sum the survivors per group (named aggregation, pandas >= 0.25).
    sum_df = df.groupby(column, as_index=False).agg(Passengers=('Survived', 'count'),
                                                    Survived=('Survived', 'sum'))
    # Calculate the survival percentage per group.
    sum_df['Percentage'] = round((sum_df['Survived']*100/sum_df['Passengers']), 2)
        
    return sum_df

# Create a summarised ticket class table using plotly.
plotly.offline.iplot(ff.create_table(summarise_df(titanic_cleaned_df, 'Pclass')))

In [None]:
# Use cufflinks with plotly to make interactive and visually better graphics.
# Enable cufflinks offline mode.
cf.go_offline()

# Change cufflinks theme to pearl.
cf.set_config_file(theme='pearl')

# Make a function that creates a plot of a summarized df.
def create_sum_plot(df, column, kind, xLabel):
    summarised_df = summarise_df(df, column)
    summarised_df.iplot(kind=kind, fill=True, x=column,\
                        y=['Passengers', 'Survived'],\
                        xTitle=xLabel, yTitle='Number of Passengers',\
                        title='Passengers Survival by {}'.format(xLabel))

In [None]:
# Call the previous function to create a plot that depicts the passengers survival by ticket class
create_sum_plot(titanic_cleaned_df, 'Pclass', 'bar', 'Ticket Class')

The previous table and plot show that first-class passengers were almost three times as likely to survive as third-class passengers, and second-class passengers about twice as likely.
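
A "times as likely" statement is a ratio of survival rates. The sketch below shows one way to compute such ratios relative to third class; the twelve rows are made up so that the arithmetic is easy to follow and are not the notebook's results.

```python
import pandas as pd

# Made-up outcomes: 3 of 4 survive in class 1, 2 of 4 in class 2, 1 of 4 in class 3.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate of each class.
rate_by_class = toy_df.groupby('Pclass')['Survived'].mean()

# Express each rate relative to the third-class rate.
relative_to_third = rate_by_class / rate_by_class.loc[3]
print(relative_to_third)
```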
#### 2. Looking into the gender factor
The following table shows that the female survival rate was about 4 times that of males.

In [None]:
# Create a summarised by Sex table.
plotly.offline.iplot(ff.create_table(summarise_df(titanic_cleaned_df, 'Sex')))

#### 3. Looking into the age factor

In [None]:
# Create a line graph of the summarized age df.
create_sum_plot(titanic_cleaned_df, 'Age', 'line', 'Age')

At first glance, children appear to have had a better chance of survival than adults. A closer comparison of the survival rates of children and adults confirms this difference.
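
A per-age line graph can be hard to read when individual ages have only a few passengers. One possible alternative, sketched here on made-up rows, is to group ages into bands with `pd.cut` before summarising. The band edges and labels are arbitrary assumptions for this sketch, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up ages and outcomes, including one missing age.
toy_df = pd.DataFrame({
    'Age':      [0.8, 5.0, 15.0, 17.0, 22.0, 30.0, 45.0, 61.0, np.nan],
    'Survived': [1,   1,   0,    1,    0,    1,    0,    0,    1],
})

# Arbitrary band edges chosen for illustration only.
age_edges = [0, 12, 18, 40, 60, np.inf]
age_labels = ['0-11', '12-17', '18-39', '40-59', '60+']

# right=False makes each band include its lower edge and exclude its upper edge.
toy_df['AgeBand'] = pd.cut(toy_df['Age'], bins=age_edges, labels=age_labels, right=False)

# Count the passengers and compute the survival rate in each band.
band_summary = (toy_df.dropna(subset=['Age'])
                .groupby('AgeBand', observed=True)['Survived']
                .agg(['count', 'mean']))
print(band_summary)
```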

In [None]:
# First, rows with a missing age are removed to have more reliable statistics.
age_df = titanic_cleaned_df.dropna(subset=['Age'])

# Group passengers into children and adults.
children_df = age_df[age_df['Age'] < 18]
adults_df = age_df[age_df['Age'] >= 18]

# Calculate the survival proportion of children and adults.
children_survival_rate = round(len(children_df[children_df['Survived'] == 1])*100/len(children_df), 2)
adults_survival_rate = round(len(adults_df[adults_df['Survived'] == 1])*100/len(adults_df), 2)

print('Survival proportion of children: {}%'.format(children_survival_rate))
print('Survival proportion of adults: {}%'.format(adults_survival_rate))

#### 4. What happened to the families?

In [None]:
# Create a summarised by SibSp table.
plotly.offline.iplot(ff.create_table(summarise_df(titanic_cleaned_df, 'SibSp')))

In [None]:
# Create a summarised by Parch table.
plotly.offline.iplot(ff.create_table(summarise_df(titanic_cleaned_df, 'Parch')))

In [None]:
# Visualise the previous two tables.
create_sum_plot(titanic_cleaned_df, 'SibSp', 'bar', 'No. of Siblings / Spouses Aboard the Titanic')
create_sum_plot(titanic_cleaned_df, 'Parch', 'bar', 'No. of Parents / Children Aboard the Titanic')

It seems that passengers travelling with family had a better chance of survival. The following comparison between solitary travellers and families clarifies that.
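
The family versus solitary split can also be expressed as a single flag column followed by a group-by. This is a sketch on made-up rows; `TravelGroup` is a hypothetical column name introduced only for this illustration.

```python
import pandas as pd

# Made-up passengers: three with relatives aboard, three travelling alone.
toy_df = pd.DataFrame({
    'SibSp':    [1, 0, 0, 3, 0, 0],
    'Parch':    [0, 0, 2, 1, 0, 0],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# A passenger travels with family if either relative count is above zero.
with_family = (toy_df['SibSp'] + toy_df['Parch']) > 0
toy_df['TravelGroup'] = with_family.map({True: 'family', False: 'alone'})

# Survival percentage per group.
survival_by_group = toy_df.groupby('TravelGroup')['Survived'].mean() * 100
print(survival_by_group.round(2))
```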

In [None]:
# Group to passengers travelling with families and solitary passengers.
families_df = titanic_cleaned_df[(titanic_cleaned_df['SibSp'] != 0) | (titanic_cleaned_df['Parch'] != 0)]
solitary_df = titanic_cleaned_df[(titanic_cleaned_df['SibSp'] == 0) & (titanic_cleaned_df['Parch'] == 0)]

# Calculate the survival rate.
families_survival_rate = round(len(families_df[families_df['Survived'] == 1])*100/len(families_df), 2)
solitary_survival_rate = round(len(solitary_df[solitary_df['Survived'] == 1])*100/len(solitary_df), 2)

print('Survival proportion of passengers travelling with a family member: {}%'.format(families_survival_rate))
print('Survival proportion of passengers travelling alone: {}%'.format(solitary_survival_rate))

#### 5. Checking the correlation between fare and SES

In [None]:
# Find the correlation of the Fare with Pclass series.
titanic_cleaned_df['Fare'].corr(titanic_cleaned_df['Pclass'])

There is a moderate negative correlation between the fare and the ticket class. Because class 1 is the highest class, this means that higher-class tickets were more expensive. Therefore, a correspondingly better survival rate can presumably be expected for higher fares.
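
`Series.corr` computes the Pearson coefficient by default, which measures linear association and can be affected by a few very large fares. Since the ticket class is an ordinal variable, a rank-based (Spearman) coefficient is a possible complement. The sketch below, on made-up values, obtains it as the Pearson correlation of the ranks; this is a swapped-in technique, not what the notebook computes.

```python
import pandas as pd

# Made-up fares per ticket class, with one extreme first-class fare.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Fare':   [512.33, 71.28, 26.0, 13.0, 8.05, 7.9, 7.25],
})

# Pearson correlation, as in the notebook cell.
pearson_r = toy_df['Fare'].corr(toy_df['Pclass'])

# Spearman correlation computed as Pearson on ranks (ties receive the average rank).
spearman_r = toy_df['Fare'].rank().corr(toy_df['Pclass'].rank())

print('Pearson:  {:.3f}'.format(pearson_r))
print('Spearman: {:.3f}'.format(spearman_r))
```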

#### 6. What happened to the crew?

In [None]:
# Select the passengers with a zero fare.
titanic_df.loc[titanic_df['Fare'] == 0]

A quick look at the above table shows that all these passengers were male, embarked at Southampton, and were travelling alone, and that only one of them survived. Additional information is required to confirm that these passengers were crew of the Titanic.

## Inferential Statistics
### Hypothesis Testing
The significance of the Age variable will be checked using an unpaired t-test for children and adults. Assuming that children had a better chance of survival than adults, the hypotheses are:

H0: There is no significant difference in the chances of survival of children and adults.

H1: There is a better chance of survival for children.

In [None]:
# Summarise the age grouped dfs.
children_sum_df = summarise_df(children_df, 'Age')
adults_sum_df = summarise_df(adults_df, 'Age')

# Find the variance.
var_children = round(children_sum_df['Percentage'].var(), 0)
var_adults = round(adults_sum_df['Percentage'].var(), 0)

print('Variance of children survival percentage: {}'.format(var_children))
print('Variance of adults survival percentage: {}'.format(var_adults))

# Welch's unpaired t-test (equal variances are not assumed).
t_statistic, p_value = ttest_ind(children_sum_df['Percentage'], adults_sum_df['Percentage'], equal_var=False)

print('')
print('The t-statistic is {:.5f} and the p-value is {:.5f}.'.format(t_statistic, p_value))

Since the p-value is < .001, the null hypothesis can be rejected and the result is statistically significant. Consequently, children had a better chance of survival than adults.
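
As a possible cross-check, the two survival proportions can also be compared directly with a one-sided two-proportion z-test. This is a different technique from the t-test on per-age percentages used above, and the counts in the sketch are illustrative placeholders rather than the notebook's output. `two_proportion_z_test` is a hypothetical helper written for this example.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of H1: the proportion in group a exceeds that in group b."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under H0 (both groups share the same rate).
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Illustrative counts only: 60 of 110 children and 230 of 600 adults survive.
z_stat, p_val = two_proportion_z_test(60, 110, 230, 600)
print('z = {:.3f}, one-sided p = {:.5f}'.format(z_stat, p_val))
```

With the real data, the inputs would be the survivor and passenger counts of `children_df` and `adults_df`.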

## Conclusions
From the previous analysis it was deduced that:
* Passengers in the higher ticket classes were about two to three times as likely to survive as third-class passengers.
* Females were about 4 times as likely to survive as males.
* Children had a survival rate 1.4 times that of adults.
* Passengers travelling with family had a survival rate 1.7 times that of solitary passengers.

However, a statistically significant conclusion can be made only for the age factor, since it was the only one tested: the t-test indicated that children were more likely to survive (p < .001).

## References
https://www.kaggle.com/c/titanic

https://www.python.org/doc/

http://pandas.pydata.org/pandas-docs/stable/

http://matplotlib.org

https://seaborn.pydata.org/

https://plot.ly/python/

http://stackoverflow.com