## Passenger Age vs. Survival

This tutorial is based on the following resources:
* Introduction to Logistic Regression: https://medium.com/@anishsingh20/logistic-regression-in-python-423c8d32838b
* Data Visualization with Seaborn: https://www.datacamp.com/courses/data-visualization-with-seaborn?tap_a=5644-dce66f&tap_s=210732-9d6bbf
* Data Cleaning in Python: https://www.datacamp.com/courses/cleaning-data-in-python?tap_a=5644-dce66f&tap_s=210732-9d6bbf

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('titanic.csv')
df.head()

In [None]:
df.shape

## Data Dictionary

1. PassengerID - unique identifier of each passenger (integer)
2. Survived - whether the passenger survived (0 = no, 1 = yes)
3. Pclass - travel class of the passenger (1, 2 or 3)
4. Name - name of the passenger
5. Sex - sex of the passenger
6. Age - age of the passenger
7. SibSp - number of siblings/spouses aboard
8. Parch - number of parents/children aboard
9. Ticket - ticket number
10. Fare - price paid for the ticket
11. Cabin - cabin number
12. Embarked - port at which the passenger embarked (C - Cherbourg, S - Southampton, Q - Queenstown)

In [None]:
df.count()

## Missing data

In [None]:
# We can use seaborn to create a simple heatmap to see where we are missing data
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

* Roughly 20 percent of the Age data is missing.
* That proportion is likely small enough to fill in reasonably with imputation.
* The Cabin column is missing too much data to impute reliably. We can drop the column, or convert it to a binary feature such as **"Cabin Known: 1 or 0"**
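
The missing-data share and the "Cabin Known" idea above can be sketched as follows. This is a minimal example on a small made-up frame (the `toy` values and the `CabinKnown` column name are assumptions for illustration, not the real data); the same two expressions should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = toy.isnull().mean()
print(missing_share)

# Binary indicator: 1 if a cabin is recorded, 0 otherwise
toy['CabinKnown'] = toy['Cabin'].notnull().astype(int)
print(toy)
```

Keeping an indicator like this preserves the information that a cabin was recorded even if the raw Cabin column is dropped later.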

In [None]:
# Count plot of survivors and non-survivors, split by sex
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=df, palette='RdBu_r')

* People who did not survive were much more likely to be male
* People who did survive were almost twice as likely to be female as male
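
To put numbers on what the count plot suggests, one option is to compute rates with pandas. The sketch below uses a small made-up frame (the `toy` values are hypothetical); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'male'],
    'Survived': [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of survivors within each group
survival_by_sex = toy.groupby('Sex')['Survived'].mean()
print(survival_by_sex)

# Share of each sex among non-survivors (row 0) and survivors (row 1);
# normalize='index' makes every row sum to 1
sex_within_outcome = pd.crosstab(toy['Survived'], toy['Sex'], normalize='index')
print(sex_within_outcome)
```

The first table answers "what share of each sex survived", while the second answers "what share of survivors were female", which is the comparison the bullets above make.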

In [None]:
#no. of people who survived according to their Passenger Class
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=df)

* People who did not survive were more likely to belong to third class, i.e. the lowest class
* People who did survive were more likely to belong to the higher classes
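
The same comparison can be read off a table instead of a plot. A minimal sketch, assuming a frame with `Pclass` and `Survived` columns (the `toy` values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# Raw counts: rows are passenger classes, columns are Survived = 0 / 1
counts = pd.crosstab(toy['Pclass'], toy['Survived'])
print(counts)

# Survival rate per class (mean of the 0/1 Survived column)
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
print(rate_by_class)
```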

In [None]:
# Distribution of passenger ages (histplot replaces the deprecated distplot)
sns.histplot(df['Age'].dropna(), kde=True, bins=30, color='green')

* Most passengers are between 20 and 30 years old
* The number of passengers falls off steadily at older ages; this plot shows all passengers, so it does not by itself show how survival changes with age
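
The age histogram covers all passengers, so relating age to survival takes one more step. One possible approach is to bin ages and compute the survival rate per bin. The bin edges and labels below are arbitrary assumptions, and the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Age':      [4.0, 25.0, 28.0, 45.0, 70.0, np.nan],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Bin ages into right-closed intervals (0, 18], (18, 40], (40, 80];
# rows with a missing Age get no bin and drop out of the groupby
toy['AgeGroup'] = pd.cut(toy['Age'], bins=[0, 18, 40, 80],
                         labels=['child', 'adult', 'senior'])

# Survival rate within each age group
rate_by_age = toy.groupby('AgeGroup', observed=False)['Survived'].mean()
print(rate_by_age)
```

On the real data, `sns.histplot(data=df, x='Age', hue='Survived', bins=30)` would give a comparable visual view.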

In [None]:
#countplot of the people having siblings or spouses
sns.countplot(x='SibSp',data=df)

* Most passengers travelled with no siblings or spouse aboard (value 0)
* The second most common group is passengers with exactly one sibling or spouse aboard (value 1)


In [None]:
#distribution plot of the ticket fare
df['Fare'].hist(color='green',bins=40,figsize=(8,4))

* Most fares are between 0 and 50
* The distribution is skewed toward cheaper fares (most passengers travelled in the cheaper third class)
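
The link between fare and class can be checked with a quick summary. A minimal sketch on made-up values (the `toy` frame is hypothetical); the threshold of 50 comes from the observation above.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Fare':   [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Share of tickets that cost less than 50
share_under_50 = (toy['Fare'] < 50).mean()
print(share_under_50)

# Number of passengers and median fare in each class
class_counts = toy['Pclass'].value_counts()
median_fare = toy.groupby('Pclass')['Fare'].median()
print(class_counts)
print(median_fare)
```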

## Data Cleaning

In [None]:
#boxplot with age on y-axis and Passenger class on x-axis.
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=df,palette='winter')

* The wealthier passengers in the higher classes tend to be older
* We'll fill each missing Age with the average age of that passenger's Pclass

In [None]:
df.query('Pclass==1')['Age'].mean()

In [None]:
df.query('Pclass==2')['Age'].mean()

In [None]:
df.query('Pclass==3')['Age'].mean()

In [None]:
df.groupby('Pclass').agg({'Age':'mean'})

In [None]:
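def impute_age(cols):
    # Access values by column label; positional access such as cols[0]
    # is deprecated for label-indexed Series in recent pandas versions
    age = cols['Age']
    p_class = cols['Pclass']

    if pd.isnull(age):
        # Approximate mean age of each class, from the cells above
        if p_class == 1:
            return 38
        elif p_class == 2:
            return 29
        else:
            return 25
    else:
        return age  # Keep known ages unchanged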

In [None]:
# Apply the impute_age function to the training dataset
df['Age'] = df[['Age','Pclass']].apply(impute_age,axis=1)

# Check the heatmap again
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
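
As an alternative to hard-coding the class means, the same imputation could be written with `groupby` and `transform`, which computes the means from the data itself. This is a sketch on a made-up frame (the `toy` values are hypothetical), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age of each row's class, aligned with the original index
# (missing ages are skipped when the mean is computed)
class_mean_age = toy.groupby('Pclass')['Age'].transform('mean')

# Fill only the missing ages; known ages stay unchanged
toy['Age'] = toy['Age'].fillna(class_mean_age)
print(toy)
```

This version avoids a row-by-row `apply` and stays correct if the data, and therefore the class means, change.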

In [None]:
# Drop the Cabin column and the row in Embarked that is NaN.
df.drop('Cabin',axis=1,inplace=True)
df.dropna(inplace=True)
df.head()