# **Titanic Survivors with Decision Trees**

The code below is taken from Manav Sehgal's submission on [kaggle.com](https://www.kaggle.com/startupsci/titanic-data-science-solutions/notebook).

You are encouraged to go to the link above and check the full code. In this lab, you will do the necessary steps to explore the data and prepare it for sklearn algorithms.

**About the data set**

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. That translates to a survival rate of about 32%.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.

Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

**Import libraries**

In [212]:
# data analysis and wrangling
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Acquire data

In [213]:
# Acquire the training data set
train_df = pd.read_csv('SupervisedLearning/Titanic_survivors/train.csv')

# Acquire the testing data set 
test_df = pd.read_csv('SupervisedLearning/Titanic_survivors/test.csv')

# Combine these datasets to run certain operations on both datasets together.
combine = [train_df, test_df]

# Inspect data

In [214]:
#TODO: Write code to display the features available in train_df
#Hint: Use columns.values


In [0]:
#TODO: Write code to inspect the first five rows of train_df


**Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

In the text box below, list all the categorical features.


**answer here:**

**Which features are numerical?** 

These values change from sample to sample. Within numerical features, are the values discrete, continuous, or timeseries based? Among other things, this helps us select the appropriate plots for visualization.

In the text box below, list all the continuous and discrete features.

**answer here:**

**Which features are mixed data types?**

These features hold numerical and alphanumeric data within the same column. They are candidates for the correcting goal.

* Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.



**Which features may contain errors or typos?**

This is harder to review for a large dataset; however, reviewing a few samples from a smaller dataset may tell us outright which features require correcting.

* Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.

**Which features contain blank, null or empty values?**

Write code below to find the answer.
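As a small illustration of how missing values show up in pandas, the sketch below builds a made-up three-row frame (the values are invented for the example, not taken from the Titanic files) and counts the missing entries per column with isna().sum(). Calling info() on the same frame reports the complementary non-null counts.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only; not rows from train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 8.05],
})

# Count missing entries in every column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```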

In [215]:
#TODO: Write code to get the information on train_df  


In [216]:
#TODO: Write code to get the information on test_df 


**answer here:** 

# Clean data

**1. Correcting**

This is a good starting goal to execute. By dropping features we deal with fewer data points, which speeds up our notebook and eases the analysis.

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

* Ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
* Cabin feature may be dropped as it is highly incomplete, with many null values in both the training and test datasets.
* PassengerId may be dropped from the training dataset as it does not contribute to survival.
* Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped. We drop it later, after creating some new features out of it.

In [217]:
# Display the structure of all data frames before dropping two columns
print("Before", train_df.shape, test_df.shape, combine[0].shape, 
       combine[1].shape)

# Drop Ticket and Cabin columns from train_df
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)

#TODO: Write code to drop Ticket and Cabin columns from test_df

# Reset the combine data frame with the new values in both datasets
combine = [train_df, test_df]

#TODO: Write code to display the new structure of all data frames after dropping two columns


In [218]:
#TODO: Write code to drop the PassengerId feature in the training dataset.


# Reset the combine data frame
combine = [train_df, test_df]

#TODO: Write code to display the shape of train_df and test_df


**2. Creating**

We want to analyze if Name feature can be engineered to extract titles and test correlation between titles and survival, before dropping the Name feature.

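In the following code we extract the Title feature using regular expressions. The RegEx pattern ' ([A-Za-z]+)\.' matches the first run of letters that follows a space and ends with a dot character within the Name feature. Because the pattern has a single capture group, the expand=False flag returns a Series rather than a DataFrame.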

* Most titles band Age groups accurately.
* Survival among Title Age bands varies slightly.
* Passengers with certain titles mostly survived (Mme, Lady, Sir) or mostly did not (Don, Rev, Jonkheer).

We decide to retain the new Title feature for model training.

The crosstab() function is used to compute a simple cross tabulation of two (or more) factors.
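To make the extraction and cross tabulation concrete, here is a minimal sketch on four invented names that follow the same "Surname, Title. Given names" layout as the Name column. The frame toy and its contents are assumptions for illustration only.

```python
import pandas as pd

# Invented names in the same layout as the Name column (not real passengers)
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Doe, Mrs. Jane (Ann Lee)",
        "Brown, Miss. Amy",
        "Jones, Mr. Tom",
    ],
    "Sex": ["male", "female", "female", "male"],
})

# One capture group with expand=False returns a Series of the captured text
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Count how often each Title/Sex combination occurs
counts = pd.crosstab(toy["Title"], toy["Sex"])
print(counts)
```

The raw-string prefix r keeps Python from treating \. as a string escape, so the backslash reaches the regular expression engine unchanged.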

In [219]:
# Loop through each item in the combined data set and extract
# titles from the Name column
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])

In [220]:
# Replace titles with a more 
# common name or classify them as Rare.
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    #TODO: Write code to replace the remaining titles with 'Rare'.
    #Note: In the replace function, you can group multiple items using brackets.
    # Example: replace(['Rev','Major'], 'Rare')
    
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [221]:
# Convert the categorical titles to ordinal.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

#TODO: Write code to inspect the first five rows of train_df


In [222]:
# Safely drop the Name feature from training and testing datasets. 
train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)

# Recreate the combine data frame with the new values of train_df and test_df
combine = [train_df, test_df]

#TODO: Write code to inspect the new shape of both data frames


**3. Converting**

Convert features that contain strings to numerical values. This is required by most model algorithms. Doing so will also help us achieve the feature completing goal.



In [223]:
# Convert the Sex feature in place to numeric codes where
# female=1 and male=0.
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

#TODO: Write code to inspect the first five rows of train_df


**4. Completing**

**4.a.** Complete the Age feature for null values, as it is definitely correlated to survival.
* Guess missing values using other correlated features.
* We note correlation among Age, Gender (the numeric Sex column), and Pclass.
* Guess Age values using median values for Age across sets of Pclass and Gender feature combinations.
* So we use the median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on.
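The idea of filling each missing Age with the median of its Sex and Pclass group can be sketched on a tiny invented frame. This version uses groupby with transform, a vectorized alternative to the explicit loops in the notebook, and it skips the notebook's rounding to the nearest 0.5. The frame toy and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows: two (Sex, Pclass) groups, each with one missing Age
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each (Sex, Pclass) group, aligned to the original rows;
# missing ages are ignored when the median is computed
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only rows with a missing Age receive their group's median
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```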

In [224]:
# Prepare an empty array to contain guessed Age values based on 
# Pclass x Gender combinations.
guess_ages = np.zeros((2,3))
guess_ages

In [225]:
# Iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed
# values of Age for the six combinations.
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            age_guess = guess_df.median()

            # Round the median age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

#TODO: Write code to inspect the first five rows of train_df


In [226]:
# Create Age bands and determine correlations with Survived.
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], 
as_index=False).mean().sort_values(by='AgeBand', ascending=True)

In [227]:
# Replace Age with ordinals based on these bands.
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    
#TODO: Write code to inspect the first five rows of train_df


In [228]:
# Now it's safe to remove the AgeBand feature from the training data
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]

#TODO: Write code to inspect the first five rows of train_df


**4.b.** Complete the Embarked feature as it may also correlate with survival or another important feature.

The Embarked feature takes S, Q, C values based on port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.
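The order of operations matters here: mapping port letters to integers before filling leaves NaN behind, and astype(int) cannot convert NaN. The sketch below shows this on an invented Series of port codes; the name ports and its values are assumptions for illustration.

```python
import pandas as pd

# Invented port codes with one missing entry (not taken from the Titanic data)
ports = pd.Series(["S", "S", "C", None, "Q"])
port_codes = {"S": 0, "C": 1, "Q": 2}

# mode() ignores missing values and returns the most frequent value(s)
most_common = ports.mode()[0]

# Mapping before filling leaves a NaN where the value was missing
unfilled_codes = ports.map(port_codes)

# Filling first gives every row a valid integer code
codes = ports.fillna(most_common).map(port_codes).astype(int)
print(codes.tolist())
```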

In [229]:
freq_port = train_df.Embarked.dropna().mode()[0]

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [230]:
# Convert the filled Embarked feature to numeric port codes in place.
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

#TODO: Write code to inspect the first five rows of train_df


In [231]:
#TODO: Write code to inspect the first five rows of test_df


**Creating**

There are more features that can be created to facilitate the analysis of this data set:

* We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
* We may want to engineer the Name feature to extract Title as a new feature.
* We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
* We may also want to create a Fare range feature if it helps our analysis.
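The notebook bands Age with pd.cut and Fare with pd.qcut. The difference between the two can be seen on a small invented Series: cut splits the value range into equal-width intervals, while qcut picks interval edges so that each band holds roughly the same number of samples. The name values and its contents are assumptions for illustration.

```python
import pandas as pd

# Invented, skewed values for illustration only
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 100])

# Equal-width bands: each interval spans about a quarter of the range
equal_width = pd.cut(values, 4)

# Equal-count bands: each interval holds about a quarter of the samples
equal_count = pd.qcut(values, 4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With skewed data such as fares, equal-width bands can leave most samples in the lowest band, which is one reason to prefer quantile-based bands there.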


In [232]:
# create a new feature for FamilySize which combines Parch and SibSp. 
# This will enable us to drop Parch and SibSp from our datasets.
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], 
as_index=False).mean().sort_values(by='Survived', ascending=False)

In [233]:
# Create the IsAlone feature
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

In [234]:
# Drop Parch, SibSp, and FamilySize features in favor of IsAlone.
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

#TODO: Write code to inspect the first five rows of train_df


In [235]:
# We can also create an artificial feature combining Pclass and Age.
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

In [236]:
# Fill missing Fare values in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
#TODO: Write code to inspect the first five rows of train_df


In [237]:
# We can now create FareBand.
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

In [238]:
# Convert the Fare feature to ordinal values based on the FareBand.
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
#TODO: Write code to inspect the first five rows of train_df


In [239]:
#TODO: Write code to inspect the first five rows of test_df


# Earn Your Wings

Use a decision tree classifier on the cleaned data set to predict 'Survived' for the given data. Report the accuracy score. Add comments in your code to explain each step that you take in your implementation.
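As a starting point, the general shape of such a workflow in scikit-learn can be sketched on a small invented frame. Everything in toy is made up for illustration, and the choices of max_depth=3 and a 25% hold-out split are assumptions, not requirements of the lab. Because Kaggle's test.csv has no Survived column, accuracy has to be measured on rows held out from the training data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented, already-numeric rows standing in for the cleaned training data
toy = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3, 1, 2, 3, 2, 1, 3],
    "Sex":      [1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "IsAlone":  [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})

# Separate the features from the target column
X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# Hold out a quarter of the labeled rows to estimate accuracy
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a shallow tree; limiting the depth is one way to reduce overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Compare predictions on the held-out rows with their true labels
accuracy = accuracy_score(y_val, tree.predict(X_val))
print(f"Validation accuracy: {accuracy:.2f}")
```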