# Titanic Data Preparation and Predictive Modeling
**Example by David Cochran**

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview) and works through the following steps:

1. Read in the original data.
2. Perform preparation steps (the rationale is explained in a separate EDA notebook).
3. Split the data into Train (60%) / Validate (20%) / Test (20%).
4. Train, tune, evaluate, and compare models built with several classification algorithms.

References and Resources:
- [Churn Modeling Notebook by cutterback](https://github.com/cutterback/p03-telco-churn-model/blob/master/Telco-Churn-Classification-Model.ipynb)
- [GridSearchCV Docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

Additional Resources:
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html
- https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65


# Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook', font_scale = 1.1)
%matplotlib inline
from time import time

# Machine Learning Training, Scoring, and Metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import make_scorer, precision_recall_curve, classification_report
from sklearn.metrics import plot_confusion_matrix, confusion_matrix
from sklearn.metrics import roc_curve, auc, f1_score, roc_auc_score

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Data Preparation

In [2]:
df = pd.read_csv('data/titanic.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
# Drop irrelevant columns
df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)

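# Fill missing age values with average age
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())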

# Create Family_count from SibSp and Parch
df['Family_count'] = df['SibSp'] + df['Parch']

# Drop SibSp and Parch
df.drop(['SibSp','Parch'], axis=1, inplace=True)

# Create Cabin_ind
df['Cabin_ind'] = np.where(df['Cabin'].isnull(), 0, 1)

# Drop Cabin and Embarked
df.drop(['Cabin','Embarked'], axis=1, inplace=True)

# Convert sex to numeric indicator
gender_map = {'male': 0, 'female': 1}
df['Sex'] = df['Sex'].map(gender_map)

# View Updated Dataframe Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Sex           891 non-null    int64  
 3   Age           891 non-null    float64
 4   Fare          891 non-null    float64
 5   Family_count  891 non-null    int64  
 6   Cabin_ind     891 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 48.9 KB
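
The preparation steps above modify `df` in place, so re-running the cell on an already prepared frame would fail on the dropped columns. As a minimal sketch, the same steps could be collected in a function that returns a new DataFrame. The function name `prepare_titanic` and the three-row sample frame are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd


def prepare_titanic(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the preparation steps from this notebook to a copy of `raw`."""
    out = raw.copy()
    # Fill missing ages with the mean age of the rows provided
    out['Age'] = out['Age'].fillna(out['Age'].mean())
    # Combine siblings/spouses and parents/children into one count
    out['Family_count'] = out['SibSp'] + out['Parch']
    # Flag whether a cabin was recorded
    out['Cabin_ind'] = np.where(out['Cabin'].isnull(), 0, 1)
    # Encode sex as a numeric indicator
    out['Sex'] = out['Sex'].map({'male': 0, 'female': 1})
    # Drop columns that are no longer needed
    drop_cols = ['PassengerId', 'Name', 'Ticket', 'SibSp', 'Parch', 'Cabin', 'Embarked']
    return out.drop(columns=drop_cols)


# Tiny made-up sample with the same columns as the raw Titanic file
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Name': ['A', 'B', 'C'],
    'Sex': ['male', 'female', 'female'],
    'Age': [20.0, np.nan, 40.0],
    'SibSp': [1, 1, 0],
    'Parch': [0, 2, 0],
    'Ticket': ['t1', 't2', 't3'],
    'Fare': [7.25, 71.28, 7.92],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})
prepared = prepare_titanic(sample)
print(prepared)
```

Because the function works on a copy, it could also be applied to the training rows alone, so that the mean age is not computed from validation or test rows.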


In [5]:
# View first records
df.head()

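| | Survived | Pclass | Sex | Age | Fare | Family_count | Cabin_ind |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |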


# Split into Train, Validation, and Test Sets
See the [train_test_split docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

- Train = 60%
- Validation = 20%
- Test = 20%

In [6]:
features = df.drop('Survived', axis=1)
labels = df['Survived']
# First split: 60% train / 40% held out
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
# Second split: divide the held-out 40% evenly into test and validation sets
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

# Report the number of records and labels in each new split:
print('Training Features, Labels:')
print(f'{X_train.shape[0]} Records, {len(y_train)} Labels')
print('-----------')
print('Validation Features, Labels:')
print(f'{X_val.shape[0]} Records, {len(y_val)} Labels')
print('-------')
print('Test Features, Labels:')
print(f'{X_test.shape[0]} Records, {len(y_test)} Labels')

Training Features, Labels:
534 Records, 534 Labels
-----------
Validation Features, Labels:
179 Records, 179 Labels
-------
Test Features, Labels:
178 Records, 178 Labels
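
The split above does not stratify on the label, so the survival rate can differ slightly between the three sets. A minimal sketch of the same two-step 60/20/20 split with `stratify=` is shown here on made-up data; the arrays and the name `X_hold` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# First split: 60% train, 40% holdout, preserving the class ratio
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Second split: holdout into validation and test halves
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

for name, part in [('train', y_train), ('val', y_val), ('test', y_test)]:
    print(f'{name}: {len(part)} rows, positive rate {part.mean():.2f}')
```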


In [7]:
# Update Variable Names for Machine Learning
# Training Set (60%)
tr_features = X_train
tr_labels = y_train

# Validation Set (20%)
val_features = X_val
val_labels = y_val

# Test Set (20%)
test_features = X_test
test_labels = y_test

# Train and Tune Models using 5-Fold Cross-Validation

See the [GridSearchCV docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

## Setup
- Create an empty list to hold our best models
- Set fundamental parameters
- Define functions to implement gridsearch and report results

In [8]:
# Create an empty list to hold our best models
models = []

# Establish number of splits for k-fold cross-validation
k = 5

# Specify random seed value
seed = 42
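
To illustrate what `k` and `seed` control, here is a small sketch of how `StratifiedKFold` divides rows into folds while keeping the class ratio roughly constant in each one. The labels are made up, and the feature array is a placeholder.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

k = 5
seed = 42

# Made-up labels: 100 rows, 40% positive class
y = np.array([1] * 40 + [0] * 60)
X = np.zeros((100, 1))  # placeholder features; only the labels drive stratification

cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
fold_sizes = []
fold_rates = []
for train_idx, val_idx in cv.split(X, y):
    # Each iteration yields the row positions for one train/validation pair
    fold_sizes.append(len(val_idx))
    fold_rates.append(y[val_idx].mean())

print(fold_sizes)
print([round(r, 2) for r in fold_rates])
```

Fixing `random_state` makes the shuffled fold assignment repeatable, which keeps results comparable across the algorithms tuned later.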

### instantiate_grid
- Set up defaults for GridSearchCV, including the scorers we want and which one takes priority
- Default to prioritizing roc_auc when fitting and selecting the best model

In [9]:
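# Set up defaults for GridSearchCV, including the scorers we want and which one takes priority
# Defaults to prioritizing roc_auc
def instantiate_grid(algorithm, param_grid, refit='roc_auc'):

    # Set up scorers
    # Note: make_scorer passes hard class predictions (from predict) to roc_auc_score,
    # so this 'roc_auc' is computed from 0/1 labels rather than predicted probabilities
    scoring = {
                'roc_auc': make_scorer(roc_auc_score, greater_is_better=True),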
                'f1': make_scorer(f1_score),
                'accuracy': make_scorer(accuracy_score),
                'precision': make_scorer(precision_score),
                'recall': make_scorer(recall_score)
              }
    
    gs = GridSearchCV(
                        estimator=algorithm,
                        param_grid=param_grid, 
                        scoring=scoring,
                        refit=refit, 
                        cv=StratifiedKFold(n_splits=k, random_state=seed, shuffle=True)
                     )
    
    return gs
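
Since the `roc_auc` scorer above is built with `make_scorer` on hard predictions, it may help to see how that differs from ROC AUC computed on predicted probabilities. The sketch below uses made-up labels and probabilities. It also lists scikit-learn's built-in scorer names, which `GridSearchCV` accepts as strings. Switching to them is only a suggestion and would change the numbers this notebook reports.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.6, 0.55, 0.7, 0.9])

# Hard 0/1 predictions at a 0.5 threshold, similar to what predict() returns
y_pred = (y_prob >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_pred)
auc_from_probs = roc_auc_score(y_true, y_prob)
print(f'ROC AUC from hard predictions: {auc_from_labels:.3f}')
print(f'ROC AUC from probabilities:    {auc_from_probs:.3f}')

# Built-in scorer names that GridSearchCV accepts in place of make_scorer objects;
# the 'roc_auc' string scores continuous outputs rather than 0/1 labels
scoring_by_name = {
    'roc_auc': 'roc_auc',
    'f1': 'f1',
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
}
```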

### show_grid_metrics
- Specify metrics we desire from GridsearchCV results
- Sort by desired metric

In [10]:
# Display grid results ranked
def show_grid_metrics(cv_results, sort_by, top_n=10):

    # Specify evaluation metrics that we desire from GridSearchCV results
    metrics = ['params',
               'rank_test_roc_auc',
               'mean_test_roc_auc',
               'rank_test_f1',
               'mean_test_f1',
               'mean_test_accuracy', 
               'mean_test_precision',
               'mean_test_recall'
              ]
    
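    # Select the metric columns and sort them, highest value first
    # Note: top_n is accepted but not applied here, so every row is returned
    cv_results_metrics = cv_results.loc[:, metrics].sort_values(by=sort_by, ascending=False)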

    return cv_results_metrics
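
As a hedged sketch of how a row limit could be applied to grid results, the helper below sorts a `cv_results_`-style frame and keeps only the first `top_n` rows. The name `top_grid_results` and the toy scores are hypothetical and are not taken from this notebook's results.

```python
import pandas as pd


def top_grid_results(cv_results: pd.DataFrame, sort_by: str, top_n: int = 10) -> pd.DataFrame:
    """Return the top_n rows of a cv_results_ frame, best score first."""
    columns = ['params', sort_by]
    ranked = cv_results.loc[:, columns].sort_values(by=sort_by, ascending=False)
    return ranked.head(top_n)


# Made-up scores, shaped like a small slice of GridSearchCV.cv_results_
toy_results = pd.DataFrame({
    'params': [{'C': 0.01}, {'C': 1}, {'C': 100}],
    'mean_test_roc_auc': [0.65, 0.78, 0.77],
    'mean_fit_time': [0.02, 0.03, 0.05],
})
best_two = top_grid_results(toy_results, 'mean_test_roc_auc', top_n=2)
print(best_two)
```

Sorting descending suits score columns such as `mean_test_roc_auc`; a rank column such as `rank_test_roc_auc` would need ascending order instead.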

## Logistic Regression: Train and Tune 
- [GridSearchCV Docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [LogisticRegression Docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [11]:
# Execute GridSearch and Report Results

# Specify Algorithm shortname
name = 'LR'

# Specify algorithm with desired default parameters
algorithm = LogisticRegression(random_state=seed, fit_intercept=False, max_iter=500, n_jobs=-1)

# Set parameters for GridSearch, to identify best parameters
param_grid = {
    'C': [.001, .01, 1, 10, 100, 1000]
}

# /////////////////////////////////////////////////////////
# Standard process for cross-validating using Gridsearch
# /////////////////////////////////////////////////////////

# Instantiate gridsearch object for this algorithm
gs = instantiate_grid(algorithm, param_grid)

start = time()

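# Activate gridsearch
# Note: this fits on the full dataset (features, labels), not only the 60% training split.
# To keep the validation and test sets unseen during tuning, fit on tr_features and tr_labels instead.
gs.fit(features, labels)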

end = time()
latency = round((end-start), 2)

# Create dataframe from gridsearch cv_results_
gs_results = pd.DataFrame.from_dict(gs.cv_results_)

# Print heading
print(f'\nTOP-PERFORMING {name} MODELS\n')
print(f'Training Latency: {latency}s')

# Display the results sorted by mean ROC-AUC, best first
show_grid_metrics(gs_results, 'mean_test_roc_auc', top_n=5)


TOP-PERFORMING LR MODELS

Training Latency: 5.62s


| | params | rank_test_roc_auc | mean_test_roc_auc | rank_test_f1 | mean_test_f1 | mean_test_accuracy | mean_test_precision | mean_test_recall |
|---|---|---|---|---|---|---|---|---|
| 4 | {'C': 100} | 1 | 0.780835 | 1 | 0.728283 | 0.797973 | 0.751291 | 0.707374 |
| 5 | {'C': 1000} | 1 | 0.780835 | 1 | 0.728283 | 0.797973 | 0.751291 | 0.707374 |
| 3 | {'C': 10} | 3 | 0.779385 | 3 | 0.726369 | 0.796855 | 0.750547 | 0.704476 |
| 2 | {'C': 1} | 4 | 0.775912 | 4 | 0.721795 | 0.794608 | 0.750555 | 0.695695 |
| 1 | {'C': 0.01} | 5 | 0.652323 | 5 | 0.493244 | 0.72165 | 0.815924 | 0.353836 |
| 0 | {'C': 0.001} | 6 | 0.613164 | 6 | 0.413579 | 0.689109 | 0.748943 | 0.286445 |
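
For comparison, here is a compact sketch of the same tuning pattern run on the training rows only, with the held-out rows scored afterwards. It uses synthetic data from `make_classification` rather than the Titanic data, a shorter `C` grid, and a single accuracy scorer; all of these are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=500, random_state=42),
    param_grid={'C': [0.01, 1, 100]},
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)

# Fit on the training rows only so the holdout rows stay unseen
grid.fit(X_train, y_train)

print('Best C:', grid.best_params_['C'])
print('Mean CV accuracy:', round(grid.best_score_, 3))
print('Holdout accuracy:', round(grid.score(X_hold, y_hold), 3))
```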


### NOTE: _If you'd like to see the entire gridsearch results ..._

In [12]:
# If desired, use this to view the entire GridSearch CV results object
gs.cv_results_

{'mean_fit_time': array([0.40565615, 0.18146744, 0.19117002, 0.10433559, 0.18913121,
        0.01751814]),
 'std_fit_time': array([0.39522465, 0.20742013, 0.21426678, 0.17711049, 0.21368359,
        0.00310894]),
 'mean_score_time': array([0.00548997, 0.00435882, 0.00456181, 0.00475326, 0.00463352,
        0.00511203]),
 'std_score_time': array([0.00152757, 0.00024977, 0.00027124, 0.00046999, 0.00058875,
        0.00074634]),
 'param_C': masked_array(data=[0.001, 0.01, 1, 10, 100, 1000],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.001},
  {'C': 0.01},
  {'C': 1},
  {'C': 10},
  {'C': 100},
  {'C': 1000}],
 'split0_test_roc_auc': array([0.61765481, 0.66567852, 0.77964427, 0.77964427, 0.78689065,
        0.78689065]),
 'split1_test_roc_auc': array([0.61804813, 0.64465241, 0.76270053, 0.77005348, 0.77005348,
        0.77005348]),
 'split2_test_roc_auc': array([0.60227273, 0.62713904, 0.75815508, 0.7