# Titanic Dataset

The **question** we are trying to answer with this dataset is: given the features recorded for each passenger, can we predict who aboard the Titanic survived and who perished?

This is a **Binary Classification** problem. We will use Python along with the packages Pandas, Numpy, Scipy, and Scikit-Learn, plus two visualization-focused packages: Matplotlib and Seaborn.

Personally, I am using this modest dataset to practice my Pandas, Scikit-Learn, and Markdown.

> The Challenge
>
> The sinking of the Titanic is one of the most infamous shipwrecks in history.

>On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

>While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

>In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Source: [*Kaggle*](https://www.kaggle.com/c/titanic/overview)

## Table of Contents <div id='toc'/>
- [Data Dictionary](#Dictionary)
- [Load Data](#Load-Data)
- EDA
  - [Dataset Overview](#Dataset-Overview)
    - [PassengerId](#PassengerId)
    - [Survived](#Survived)
    - [Pclass](#Pclass)
    - [Name](#Name)
    - [Sex](#Sex)
    - [Age](#Age)
    - [SibSp](#SibSp)
    - [Parch](#Parch)
    - [Ticket](#Ticket)
    - [Fare](#Fare)
    - [Cabin](#Cabin)
    - [Embarked](#Embarked)
- [Preprocessing](#Preprocessing)
- [Model](#Model)
- [Scikit-Learn Pipeline](#Pipeline)
- [Submit to Kaggle](#Submit-to-Kaggle)  
 

### Data Dictionary <a name="Dictionary"></a>

Variable | Definition | Key
:---:|:---:|:---:
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

#### Variable Notes
**pclass**: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way...
- **Sibling** = brother, sister, stepbrother, stepsister
- **Spouse** = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
- **Parent** = mother, father
- **Child** = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

Back to [Table Of Contents](#toc)
***

In [None]:
# Data Wrangling/Munging libraries
import pandas as pd
import numpy as np
import scipy as sp

# Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Loading Data

Kaggle already provides the data, so there is nothing to gather. We only need to load the CSV files.

In [None]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [None]:
train_data.info()

In [None]:
train_data.head()

In [None]:
train_data.describe()

Back to [Table Of Contents](#toc)
***

In [None]:
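def small_multiples_survived(df, base_cat, survived='Survived'):
    """Plot a feature's histogram next to its normalized value counts
    for non-survivors and for survivors."""
    color = ['#377eb8','#ff7f00','#4daf4a','#4daf4a','#984ea3']
    
    plt.figure(figsize=(16,4))

    # Panel 1: distribution of the feature across all passengers.
    plt.subplot(1,3,1)
    plt.title(f'Histogram of {base_cat} values')
    df[base_cat].hist()
    
    # Panel 2: share of each value among passengers who did not survive.
    # sort_index() keeps the bars in the same order in panels 2 and 3.
    plt.subplot(1,3,2)
    plt.title("'NO' Survive")
    df.loc[df[survived] == 0, base_cat].value_counts(normalize=True).sort_index().plot(kind='bar', color=color)
    
    # Panel 3: share of each value among passengers who survived.
    plt.subplot(1,3,3)
    plt.title("'YES' Survive")
    df.loc[df[survived] == 1, base_cat].value_counts(normalize=True).sort_index().plot(kind='bar', color=color)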
    
# plt.figure(figsize=(16,4))
# d =train_data.groupby('Survived')['Sex'].value_counts(normalize=True)
# plt.subplot(1,2,1)
# d[0].plot(kind='bar')
# plt.subplot(1,2,2)
# d[1].plot(kind='bar')

## PassengerId ##

This column is only a row identifier, so it carries no predictive information. We will not use this column in our model.

Back to [Table Of Contents](#toc)
***

## Survived ##

These are the **target** (y) values.
- Values are nominal/binary.
- 0 = 'NO' Survive
- 1 = 'YES' Survive

We will assign these values to the y variable once we decide how we are going to handle our null values.
```python
y_train = train_data['Survived']
```
First is a **Scatter Plot Matrix** of our continuous values. _There are so few continuous features that I included 'Pclass', since it is ordinal in nature_.

Second is a **Correlation Heatmap** of our entire dataset. 

In [None]:
cont = ['Age', 'Pclass', 'Fare']

In [None]:
sns.pairplot(train_data[cont],diag_kind='kde')

In [None]:
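# numeric_only=True skips the string columns (Name, Sex, Ticket, ...),
# which recent pandas versions no longer drop silently.
corr = train_data.corr(numeric_only=True)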
colormap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(corr, annot=True, cmap=colormap)

In [None]:
print(train_data.Survived.value_counts())

plt.title('Distribution of Survived Values')
train_data.Survived.hist()

Back to [Table Of Contents](#toc)
***

## Pclass ##

**Ticket Class**

- Values are categorical, ordinal in type:
  - 1 = First Class Ticket
  - 2 = Second Class Ticket
  - 3 = Third Class Ticket
 
- No NaN

Need to **OneHotEncode** values.
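
As a minimal sketch of what one-hot encoding does to a column like this, the toy example below uses `pd.get_dummies` on a made-up `Pclass` column. The four values are invented for illustration and are not rows from `train.csv`.

```python
import pandas as pd

# Toy stand-in for the Pclass column (values are made up for illustration).
toy = pd.DataFrame({'Pclass': [1, 3, 2, 3]})

# One indicator column per ticket class; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=['Pclass'], prefix='Pclass', dtype=int)
print(encoded)
```

Each row ends up with exactly one `1` across the `Pclass_1`, `Pclass_2`, and `Pclass_3` columns, so the model no longer treats the class numbers as magnitudes.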

In [None]:
small_multiples_survived(train_data, 'Pclass')
print(train_data.Pclass.value_counts())

In [None]:
for x in [1,2,3]:
    train_data.Survived[train_data.Pclass == x].plot(kind='kde')
plt.legend(('1st','2nd','3rd'))

Back to [Table Of Contents](#toc)
***

## Name ##

**Name of the passenger.**
- Values are strings.
- No NaN

This column doesn't have much use in its current form. We might come back to it to perform some feature engineering.

*__Dropping__ this column for now.*

In [None]:
train_data.Name.head()

In [None]:
train_data.Name.describe()

Back to [Table Of Contents](#toc)
***

## Sex ##

**Gender of passenger**
- Values are categorical, stored as strings
  - male
  - female
- No NaN

Need to __OneHotEncode__ values.

In [None]:
print(train_data.Sex.value_counts())
small_multiples_survived(train_data, 'Sex')

In [None]:
train_data.Survived.index

Back to [Table Of Contents](#toc)
***

## Age ##

**Age of the Passenger.**
- Values are Numeric, as float64.
- NaN = 177 missing entries. Options for handling them:
  1. Drop Entire Column.
  2. Drop the rows with the NaN.
  3. Fill the NaN with a value(most likely the mean).
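
The three options above can be sketched on a toy DataFrame. The values below are made up for illustration, and the variable names (`no_age_column`, `complete_rows`, `filled`) are hypothetical. The median is used for option 3 here, but the mean would work the same way.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (values are made up for illustration).
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0],
                    'Fare': [7.25, 8.05, 71.28, 7.92]})

# Option 1: drop the entire column.
no_age_column = toy.drop(columns='Age')

# Option 2: drop only the rows where Age is missing.
complete_rows = toy.dropna(subset=['Age'])

# Option 3: fill the gaps with a summary statistic (median shown here).
filled = toy.assign(Age=toy['Age'].fillna(toy['Age'].median()))

print(no_age_column.shape, complete_rows.shape, filled['Age'].tolist())
```

Option 1 loses the feature, option 2 loses rows (177 of 891 in our case), and option 3 keeps everything at the cost of some invented values.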

In [None]:
train_data.Age.isna().sum()

In [18]:
train_data.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
train_data.Age.hist()
plt.subplot(1,2,2)
train_data.Age.plot(kind='box')  # box plot of Age, to match the histogram beside it

In [None]:
train_data.Age.describe()

In [None]:
# mode() returns a Series (there can be ties), so convert it to a list for printing.
print(f'Median: {train_data.Age.median()}, Mode: {train_data.Age.mode().tolist()}')

__Is there a relationship between Age and Ticket Class?__

In [None]:
for x in [1,2,3]:
    train_data.Age[train_data.Pclass == x].plot(kind='kde')
    
plt.legend(('1st', '2nd', '3rd'))

**What kind of ticket did the passengers with missing ages have?**

In [None]:
missing_ages = train_data.loc[train_data.Age.isna()]
missing_ages.Pclass.value_counts().plot(kind='bar')

**Is there a relationship between Age and Fare paid?**

In [None]:
plt.scatter(train_data.Age, train_data.Fare, alpha=.2)

Back to [Table Of Contents](#toc)
***

## SibSp ## 

**Number of siblings / spouses aboard the Titanic**
- Values are categorical.
  - 0-5 & 8

In [None]:
print(train_data.SibSp.value_counts())
small_multiples_survived(train_data, 'SibSp')

Back to [Table Of Contents](#toc)
***

## Parch ## 

**# of parents / children aboard the Titanic**
- Values are integer counts, treated here as nominal categories
  - 0-6
- No NaN

In [None]:
train_data.Parch.value_counts()

In [None]:
print(train_data.Parch.value_counts())
small_multiples_survived(train_data, 'Parch')

Back to [Table Of Contents](#toc)
***

## Ticket ##

**Ticket number**
- Values seem to be random with no way to group them.

_**Drop these values**_

In [None]:
train_data.Ticket.describe()

Back to [Table Of Contents](#toc)
***

## Fare ##

**Passenger fare**
- Values are continuous, stored as float
- Contains outliers at the high end
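
One common way to put a number on "outliers" is the 1.5 × IQR rule that box plots use for their whiskers. The sketch below applies it to a handful of made-up fares. It is an illustration of the rule, not a claim about how outliers should be treated in this dataset.

```python
import pandas as pd

# Toy fares (made-up values); the last one is deliberately extreme.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name='Fare')

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print(outliers.tolist())
```

The same lines could be pointed at `train_data.Fare` to count how many fares fall outside the fences.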

In [None]:
train_data.Fare.describe()

In [None]:
train_data.Fare.value_counts()

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
train_data.Fare.plot(kind='box')
plt.subplot(1,2,2)
train_data.Fare.hist()

Back to [Table Of Contents](#toc)
***

## Cabin ##

**Cabin Number**
- Values are categorical
- NaN = 687 missing entries that need to be handled. Options:
  1. Drop the entire column.
  2. Drop the rows with NaN.
  3. Fill the NaN with a value.
- Could try to feature engineer values that are more useful.
  
_**Dropping entire column for now**_
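
If we do come back to this column, two simple engineered features are possible. The sketch below is hypothetical: the cabin strings are made up, and `has_cabin` and `cabin_letter` are illustrative names. Whether either feature helps the model is untested here.

```python
import numpy as np
import pandas as pd

# Toy cabin values (made up for illustration); NaN marks a missing cabin.
cabin = pd.Series(['C85', np.nan, 'E46', np.nan, 'B28'], name='Cabin')

# Hypothetical feature 1: was a cabin recorded at all?
has_cabin = cabin.notna().astype(int)

# Hypothetical feature 2: the leading letter of the cabin, 'U' when unknown.
cabin_letter = cabin.str[0].fillna('U')

print(has_cabin.tolist(), cabin_letter.tolist())
```

Both features turn a mostly empty column with many distinct values into something with only a few levels, which is easier to one-hot encode.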

In [None]:
train_data.Cabin.isna().sum()

In [None]:
train_data.Cabin.describe()

In [None]:
train_data.Cabin.value_counts()

Back to [Table Of Contents](#toc)
***

## Embarked ##
**Port of Embarkation**
- Values are Nominal as strings
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton
- NaN = 2 nulls that need to be handled.

In [None]:
train_data.Embarked.describe()

In [None]:
train_data.Embarked.isna().sum()

In [None]:
print(train_data.Embarked.value_counts())
small_multiples_survived(train_data, 'Embarked')

Back to [Table Of Contents](#toc)
***

# Preprocessing <a name='Preprocessing'></a>

In [58]:
def normalize(df, continuous):
    """Min-max scale each listed column to the [0, 1] range (modifies df)."""
    for feature in continuous:
        min_value = df[feature].min()
        max_value = df[feature].max()
        # Alternative (z-score standardization):
        # df[feature] = (df[feature] - df[feature].mean()) / df[feature].std()
        df[feature] = (df[feature] - min_value) / (max_value - min_value)
    return df
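
The `normalize` function above scales each column with that DataFrame's own minimum and maximum. When a test set has to share the training scale, the statistics are usually recorded from the training data and reused. The sketch below shows one way to do that. `fit_min_max` and `apply_min_max` are hypothetical helpers and the ages are made up.

```python
import pandas as pd

def fit_min_max(df, continuous):
    """Record each column's minimum and range from the training data."""
    return {col: (df[col].min(), df[col].max() - df[col].min())
            for col in continuous}

def apply_min_max(df, stats):
    """Return a scaled copy using previously recorded statistics."""
    scaled = df.copy()
    for col, (min_value, value_range) in stats.items():
        scaled[col] = (scaled[col] - min_value) / value_range
    return scaled

# Made-up ages: the training range is 20 to 40.
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
test = pd.DataFrame({'Age': [25.0, 50.0]})

stats = fit_min_max(train, ['Age'])
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)
print(train_scaled['Age'].tolist(), test_scaled['Age'].tolist())
```

Test values outside the training range land outside [0, 1], which is expected: the point is that both sets are measured on the same scale.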

In [59]:
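# New DataFrames (explicit copies) keep the original values intact
# and avoid pandas' SettingWithCopyWarning.
categories = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
X_train = train_data[categories].copy()
y_train = train_data['Survived']
X_test = test_data[categories].copy()

# Fill the 2 NaN Embarked entries with the training mode.
embarked_mode = train_data.Embarked.mode()[0]
X_train['Embarked'] = X_train['Embarked'].fillna(embarked_mode)
X_test['Embarked'] = X_test['Embarked'].fillna(embarked_mode)
# Fill any missing test fares with the training mean.
X_test['Fare'] = X_test['Fare'].fillna(train_data.Fare.mean())

# Fill the 177 NaN Age entries with the training median.
median = train_data.Age.median()
X_train['Age'] = X_train['Age'].fillna(median)
X_test['Age'] = X_test['Age'].fillna(median)

# Apply a min/max normalization to our continuous values.
# The test set is scaled with the training min/max so both share one scale.
continuous = ['Age','Fare']
train_min = X_train[continuous].min()
train_max = X_train[continuous].max()
X_test[continuous] = (X_test[continuous] - train_min) / (train_max - train_min)
X_train = normalize(X_train, continuous)

# Use get_dummies as our category encoder.
categories = ['Pclass','Sex','SibSp','Parch','Embarked']
X_train = pd.get_dummies(X_train, columns=categories, prefix=categories)
X_test = pd.get_dummies(X_test, columns=categories, prefix=categories)
# Align the test columns with the training columns: this drops levels seen only
# in the test set (e.g. Parch_9) and adds any missing ones as all-zero columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)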

Back to [Table Of Contents](#toc)
***

### Model  <a name="Model"></a>

In [60]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

In [61]:
# RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
clf.fit(X_train,y_train)
print(clf.score(X_train, y_train))
predictions = clf.predict(X_test)
output0 = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})

0.8428731762065096


In [62]:
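# AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=200, random_state=1)
clf.fit(X_train,y_train)
# Accuracy on the training data itself (an optimistic estimate).
print(clf.score(X_train, y_train))
predictions = clf.predict(X_test)
# 5-fold cross-validated accuracy. It is stored in a variable because a bare
# expression in the middle of a cell is not displayed.
scores = cross_val_score(clf, X_train, y_train, cv=5)
cv_accuracy = scores.mean()
output1 = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})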

0.8529741863075196
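
The scores printed above are accuracies on the training data, which tend to flatter the model. A minimal sketch of comparing training accuracy against a cross-validated estimate is shown below. It uses synthetic data from `make_classification` rather than the Titanic features, so the numbers it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

clf = AdaBoostClassifier(n_estimators=50, random_state=1)

# Accuracy measured on the same rows the model was fitted on.
train_accuracy = clf.fit(X, y).score(X, y)

# Accuracy measured on held-out folds (the estimator is cloned and refitted per fold).
cv_scores = cross_val_score(clf, X, y, cv=5)

print(f'training accuracy: {train_accuracy:.3f}')
print(f'5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})')
```

A large gap between the two numbers is a hint that the model is overfitting.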


Back to [Table Of Contents](#toc)
***

## Scikit-Learn Pipeline  <a name='Pipeline'></a>
We will use this once we are happy with the data **Preprocessing** steps and **Feature Extraction**.

It makes model selection and hyperparameter tuning easier.

In [None]:
'''
y_train = train_data.Survived
columns = ['Age','Fare','Embarked','Sex','Pclass']
X_train = train_data[columns]
X_test = test_data[columns]
'''

In [None]:
'''
numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['Embarked', 'Sex', 'Pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
])
'''

In [None]:
'''
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('classifier', LogisticRegression())])
fit_model_pipeline = model_pipeline.fit(X_train,y_train)
print('model score: %.3f' % fit_model_pipeline.score(X_train,y_train))
'''

In [None]:
'''
predict = model_pipeline.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predict})
'''
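
To illustrate why a pipeline helps with hyperparameter tuning, here is a small runnable sketch that wraps the same kind of preprocessing in a `Pipeline` and hands it to `GridSearchCV`. The eight rows are made up, the column subset is reduced for brevity, and the grid over `C` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Eight made-up rows shaped like the Titanic columns (enough for 2-fold CV).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0],
    'Fare': [7.25, 8.05, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13],
    'Sex': ['male', 'male', 'female', 'female',
            'female', 'male', 'male', 'female'],
    'Survived': [0, 0, 1, 1, 1, 0, 0, 1],
})
X, y = toy.drop(columns='Survived'), toy['Survived']

# Impute and scale numeric columns; one-hot encode the categorical one.
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['Age', 'Fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex']),
])
model = Pipeline([('preprocessor', preprocessor),
                  ('classifier', LogisticRegression())])

# The "classifier__" prefix routes the parameter to that pipeline step.
search = GridSearchCV(model, param_grid={'classifier__C': [0.1, 1.0, 10.0]}, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Because imputation and scaling live inside the pipeline, they are refitted on each training fold, so no information from the held-out fold leaks into preprocessing.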

## Submit to Kaggle  <a name='Submit-to-Kaggle'></a>

In [None]:
# output.to_csv('/submissions/AdaBoostClassifier.csv', index=False)

Back to [Table Of Contents](#toc)
***