# Titanic - Experiment 12

 1. Frame Problem and Objective
 2. Describe and Wrangle Data
 3. Process Data
 4. Explore Data
 5. Modeling and Evaluation

### Import libraries

In [1]:
import os
import sys
import warnings

import scipy
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())
from Titanic.Code.DataPrep.titanic import Titanic
from Titanic.Code.DataPrep.helpers import score_impute_strategies

### Change notebook settings

In [2]:
warnings.filterwarnings('ignore')
np.random.seed(17)
InteractiveShell.ast_node_interactivity = "all"
plt.style.use('classic')
%matplotlib inline

## 1. Frame Problem and Objective

#### Problem:
On April 15, 1912, the [RMS Titanic](https://en.wikipedia.org/wiki/RMS_Titanic) sank after colliding with an iceberg, killing more than 1,500 of an *estimated* 2,224 passengers and crew.

***

#### Objective:
Given a set of passenger records from the RMS Titanic, our objective is to build a model that predicts whether a passenger survived the disaster. Because the output can take only one of two discrete values, this is a **[binary classification](https://en.wikipedia.org/wiki/Binary_classification)** problem.

There are many machine learning models available from the [scikit-learn](https://scikit-learn.org/stable/) library we can leverage for a **binary classification** problem. A handful of these models are:
 * [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
 * [Decision trees](https://scikit-learn.org/stable/modules/tree.html#classification)
 * [Random forests](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)
 * [Support vector machines](https://scikit-learn.org/stable/modules/svm.html#classification)

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
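
The four model families listed above share scikit-learn's estimator interface, so they can be compared with the same cross-validation call. The following is a minimal sketch on synthetic data, not the Titanic records; the `candidates` dictionary, the hyperparameters, and the choice of 5-fold accuracy are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Titanic records): 200 rows, 2 classes.
X, y = make_classification(n_samples=200, n_features=6, random_state=17)

# Hypothetical candidate set; hyperparameters are illustrative defaults.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=17),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=17),
    "svm": SVC(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Scores on synthetic data say nothing about which model suits the Titanic data; the point is only that one loop can evaluate all four.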

## 2. Describe and Wrangle Data

The data for this analysis project comes from [Kaggle](https://www.kaggle.com/c/titanic). The following variables are available in the data:
 * PassengerId
    * This appears to be some Kaggle/system-generated identifier column
 * Survived
    * Whether the passenger survived: 0 (No) or 1 (Yes)
 * Pclass
    * Ticket class of the passenger: first-class (1), second-class (2), or third-class (3)
 * Name
    * Name of the passenger of the form: {last name}, {title} {first name} {middle name}
    * Not all passengers have a middle name
 * Sex
    * The sex of the passenger: 'male' or 'female'
 * Age
    * Age of passenger (in years)
    * Passengers less than 1 year old have their age expressed as a fraction (a *float*)
    * *Estimated* ages are given in the form xx.5
    * Some records are missing a value for this variable
 * SibSp
    * An aggregated count of the passenger's siblings **and** spouses aboard the RMS Titanic
 * Parch
    * An aggregated count of the passenger's parents **and** children aboard the RMS Titanic
 * Ticket
    * The ticket number of the passenger
 * Fare
    * The fare charged to the passenger for the ticket
 * Cabin
    * Cabin number of the passenger
    * This variable has missing values
 * Embarked
    * The port of embarkation of the passenger: Cherbourg (C), Queenstown (Q), or Southampton (S)
    * This variable has missing values
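
The `Name` format described above can be split into its parts with a regular expression. This is a minimal sketch: the pattern and the column names `last_name`, `title`, and `given_names` are assumptions for illustration, and the sample names are copied from the first rows of the training data.

```python
import pandas as pd

# Sample names copied from the first rows of the training data.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Hypothetical pattern: text before the comma is the last name, text up to
# the first period is the title, and the remainder is the given names.
NAME_PATTERN = (
    r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<given_names>.*)$"
)

# Named groups become the column names of the resulting DataFrame.
name_parts = names.str.extract(NAME_PATTERN)
print(name_parts)
```

Names that deviate from the described form would need a closer look before relying on a pattern like this.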

In [4]:
# Read data: train
# Build the path with os.path.join so it also works outside Windows
train = pd.read_csv(os.path.join('Titanic', 'Data', 'Raw', 'train.csv'))

In [5]:
train.describe(include='all')

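| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | | | | 891 | 2 | | | | 681 | | 147 | 3 |
| top | | | | Taussig, Miss. Ruth | male | | | | CA. 2343 | | C23 C25 C27 | S |
| freq | | | | 1 | 577 | | | | 7 | | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | | | 29.699118 | 0.523008 | 0.381594 | | 32.204208 | | |
| std | 257.353842 | 0.486592 | 0.836071 | | | 14.526497 | 1.102743 | 0.806057 | | 49.693429 | | |
| min | 1.0 | 0.0 | 1.0 | | | 0.42 | 0.0 | 0.0 | | 0.0 | | |
| 25% | 223.5 | 0.0 | 2.0 | | | 20.125 | 0.0 | 0.0 | | 7.9104 | | |
| 50% | 446.0 | 0.0 | 3.0 | | | 28.0 | 0.0 | 0.0 | | 14.4542 | | |
| 75% | 668.5 | 1.0 | 3.0 | | | 38.0 | 1.0 | 0.0 | | 31.0 | | |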


In [6]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Notes:
 - A couple of the columns in this data could be classified as categories: 'Sex' and 'Embarked'.
 - There are a handful of text fields with a high number of unique values: 'Name', 'Ticket', 'Cabin'.
    - These are not categories, but may have some useful information buried in them.
 - We have five numeric columns that are exclusively integer values: 'PassengerId', 'Survived', 'Pclass', 'SibSp', 'Parch'.
 - We have two numeric float columns: 'Age' and 'Fare'. The 'Age' column includes estimated values.

In [7]:
train = train.set_index('PassengerId', verify_integrity=True)

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [9]:
train['Sex'].unique()

array(['male', 'female'], dtype=object)

In [10]:
train['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [11]:
sex_category_dtype = pd.CategoricalDtype(["male", "female"])
embarked_category_dtype = pd.CategoricalDtype(["C", "Q", "S"])

In [12]:
train['Sex'] = train['Sex'].astype(sex_category_dtype)
train['Embarked'] = train['Embarked'].astype(embarked_category_dtype)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null category
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null category
dtypes: category(2), float64(2), int64(4), object(3)
memory usage: 71.5+ KB
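
As a small sketch of what the conversion above does, the example below casts a toy series to the same `Embarked` categories. The toy values are made up for illustration. Missing entries stay missing after the cast, which is consistent with `Embarked` still showing 889 non-null values, and the integer codes follow the order in which the categories were declared.

```python
import pandas as pd

# Same category order as the notebook's embarked_category_dtype.
embarked_dtype = pd.CategoricalDtype(["C", "Q", "S"])

# Toy values (made up for illustration), including one missing entry.
raw = pd.Series(["S", "C", None, "Q"], dtype="object")
converted = raw.astype(embarked_dtype)

# Missing entries are preserved by the cast.
n_missing = int(converted.isnull().sum())

# Codes index into the declared categories; -1 marks a missing value.
codes = converted.cat.codes.tolist()

print(n_missing)
print(codes)
```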


## 3. Process Data

We will examine the number of missing values in each column of our train data set.

In [13]:
# Count of missing values
train.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [14]:
# Percentage of missing values
train.isnull().sum()/len(train)*100

Survived     0.000000
Pclass       0.000000
Name         0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Ticket       0.000000
Fare         0.000000
Cabin       77.104377
Embarked     0.224467
dtype: float64

The analysis above shows that we have three columns with missing values:
 - Age: about 20% of values are missing, which is low enough that imputation is reasonable.
    - *We will impute the median.*
 - Cabin: about 77% of values are missing, so we will drop this column rather than impute it.
 - Embarked: only 2 observations are missing a value, so imputation is reasonable here as well.
    - *We will impute the most common value.*
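
The count and percentage views above can be combined into a single table, which makes the per-column decision easier to read. This is a minimal sketch on a made-up frame; the name `missing_summary` and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull().mean() gives the fraction missing per column; scale to percent.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```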

In [15]:
# Impute 'Age' with median
age_median = train['Age'].median()
train['Age'] = train['Age'].fillna(value=age_median)
train['Age'].isnull().sum()

0

In [16]:
# Drop 'Cabin'
train = train.drop(columns=['Cabin'])
train.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Embarked'],
      dtype='object')

In [17]:
# Impute 'Embarked' with mode
embarked_mode = train['Embarked'].mode().values[0]
train['Embarked'] = train['Embarked'].fillna(value=embarked_mode)
train['Embarked'].isnull().sum()

0
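
When a separate test set is imputed later, one common choice is to reuse the fill values computed on the training data rather than recomputing them. The sketch below assumes that approach and uses made-up frames; `toy_train`, `toy_test`, and `fill_values` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the train and test data.
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Embarked": [np.nan, "Q"],
})

# Fill values are learned from the training rows only.
fill_values = {
    "Age": toy_train["Age"].median(),
    "Embarked": toy_train["Embarked"].mode()[0],
}

# fillna accepts a dict that maps each column name to its fill value.
toy_test_filled = toy_test.fillna(value=fill_values)
print(toy_test_filled)
```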