# Exploring Pipelines and Evaluating Classification Models

## Why Pipeline?

Pipelines keep our code neat and clean all the way from cleaning and transforming our data to fitting models and fine-tuning them!

**Advantages**: 
- Reduces complexity
- Convenient 
- Flexible 
- Can help prevent mistakes (like data leakage between the train and test sets)

Pipelines make it easy to chain transformers and estimators, and to cross-validate the whole chain at once!
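
As a rough sketch of what that looks like, here is a two-step pipeline run through `cross_validate`. The data is synthetic (`make_classification`), and the names `demo_pipe` and `demo_cv` are made up for this example; none of it is the Titanic workflow built later in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the Titanic set used later)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is a pipeline step, so within each CV split it is fit on the
# training folds only and merely applied to the validation fold
demo_pipe = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

demo_cv = cross_validate(demo_pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(demo_cv["test_score"].mean())
```

Because the scaler lives inside the pipeline, nothing learned from a validation fold leaks into the preprocessing, which is the kind of mistake the list above refers to.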

<img src="images/grid_search_cross_validation.png" alt="cross validation image from sklearn's documentation" width=500>

Why might cross-validation (CV) be useful when we're searching for optimal hyperparameters?

- 


In [None]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
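# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements
from sklearn.metrics import roc_auc_score, ConfusionMatrixDisplay, RocCurveDisplay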

First, let's start with the Titanic dataset from earlier today. Our target is `Survived`: we want to predict whether each passenger survived.

In [None]:
# Grab, then explore data
df = pd.read_csv('data/titanic.csv')  

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Exploring numeric cols
df.describe()

In [None]:
# Exploring object cols
df.select_dtypes(include='object').describe()

### Baseline Understanding (a.k.a. Model-less Baseline)

In [None]:
# Exploring target distribution
df['Survived'].value_counts(normalize=True)

In [None]:
df['Survived'].value_counts().plot(kind='bar');
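
One way to read those proportions: always guessing the most common class gives an accuracy equal to that class's share of the data. A minimal sketch with made-up labels (a 60/40 split chosen purely for illustration, not the real Titanic proportions), using scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a 60/40 split (an assumption for illustration)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = dummy.score(X_toy, y_toy)
print(baseline_acc)  # 0.6, the share of the majority class
```

Any real model should beat this number before we get excited about it.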

Evaluate: any thoughts on our model-less understanding?

- 


### Baseline Model

Let's find out how hard our problem is, by throwing things at it and seeing what sticks!

The biggest thing to think about: which types of columns need to be treated differently?

Reference: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
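
To make the idea concrete, here is a small sketch in the spirit of that reference: numeric columns get imputed and scaled, categorical columns get imputed and one-hot encoded. The frame `toy`, the column groupings, and the imputation strategies are all assumptions for illustration, not the choices this notebook necessarily lands on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny made-up frame with Titanic-like column names (illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

num_cols = ["Age", "Fare"]       # hypothetical grouping
cat_cols = ["Sex", "Embarked"]   # hypothetical grouping

num_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
cat_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Each (name, transformer, columns) triple handles one group of columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", num_steps, num_cols),
    ("cat", cat_steps, cat_cols),
])

toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)  # (4, 6): 2 scaled numeric + 2 Sex + 2 Embarked columns
```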

In [None]:
df.head()

In [None]:
# First, define the columns we'll use
# Let's leave out ['PassengerId', 'Name', 'Ticket', 'Survived']


In [None]:
# Define our X and y

X = None
y = None

# Then train-test split, to create our validation holdout set!
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
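
If it helps to see the shape of this step before filling it in, here is one possible version on a made-up frame. The rows in `toy` are invented, and `unused_cols`, `X_toy`, and `y_toy` are names chosen just for this sketch; only the column names and the split settings come from the cells above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows using the Titanic column names (illustrative only)
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 2, 3],
    "Name": [f"Passenger {i}" for i in range(1, 11)],
    "Sex": ["male", "female"] * 5,
    "Ticket": [f"T{i}" for i in range(1, 11)],
})

# Identifiers and free text are dropped; the target must not stay in X
unused_cols = ["PassengerId", "Name", "Ticket", "Survived"]
X_toy = toy.drop(columns=unused_cols)
y_toy = toy["Survived"]

X_tr, X_ho, y_tr, y_ho = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_ho.shape)  # (8, 2) (2, 2)
```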

In [None]:
# Set up lists for columns requiring different treatment


In [None]:
# Check our work


In [None]:
# Now, setting up the preprocessing steps for each type of col


In [None]:
# Package those pieces together using ColumnTransformer
preprocessor = None

In [None]:
# Just out of curiosity, let's see what this looks like 
X_tr_transformed = preprocessor.fit_transform(X_train)
X_tr_transformed.shape

In [None]:
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.


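# Now - cross_validate!
# (pass scoring='roc_auc' so the printout below really is ROC-AUC;
# the default score for a classifier is accuracy)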
output = None

# Print our test scores to show average and a measure of variation
print(f"Average ROC-AUC: {output['test_score'].mean()} +/- {output['test_score'].std()}")
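
For reference, a full preprocessing-plus-classifier pipeline passed to `cross_validate` might look roughly like the sketch below. It runs on synthetic stand-in data, and every name in it (`X_toy`, `toy_preprocessor`, `toy_clf`, `toy_output`) is hypothetical. Note the explicit `scoring="roc_auc"`: without it, `test_score` would hold accuracy for a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the training data (not the real Titanic rows)
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.choice(["male", "female"], n),
})
X_toy.loc[::10, "Age"] = np.nan  # sprinkle in some missing values
# Target loosely tied to Sex, with about 20% of labels flipped
y_toy = ((X_toy["Sex"] == "female").to_numpy() ^ (rng.random(n) < 0.2)).astype(int)

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Preprocessing and model in one estimator
toy_clf = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_output = cross_validate(toy_clf, X_toy, y_toy, cv=5, scoring="roc_auc")
scores = toy_output["test_score"]
print(f"Average ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```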

Evaluate:

- 


### Try With Adjusted Hyperparameters

In [None]:
# Time for a new pipeline!


# Now - cross_validate! (again with scoring='roc_auc')
output = None

# Test scores
print(f"Average ROC-AUC: {output['test_score'].mean()} +/- {output['test_score'].std()}")
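
As one illustration of "adjusted hyperparameters", the sketch below compares a default `LogisticRegression` with a variant using `class_weight="balanced"` and a smaller `C`. Those particular settings, the synthetic imbalanced data, and the names `candidates` and `results` are assumptions for the example, not the notebook's intended answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Imbalanced synthetic data (an assumption; roughly 70% negatives)
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     weights=[0.7, 0.3], random_state=0)

candidates = {
    "default": LogisticRegression(max_iter=1000),
    "balanced, C=0.1": LogisticRegression(class_weight="balanced", C=0.1,
                                          max_iter=1000),
}

results = {}
for label, model in candidates.items():
    # A fresh pipeline per candidate keeps the comparison like-for-like
    pipe = Pipeline(steps=[("scale", MinMaxScaler()), ("classifier", model)])
    cv_out = cross_validate(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[label] = cv_out["test_score"]
    print(f"{label}: {results[label].mean():.3f} +/- {results[label].std():.3f}")
```

Comparing the means alongside their spread helps us judge whether a difference between the two settings is more than fold-to-fold noise.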

Evaluate:

- 


### Validate

How does this perform on our holdout set?

First off - what might we want to check to evaluate our model?

- 


Reference: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
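
A quick sketch of the fit, predict, and score flow on a holdout set, using a few functions from that module. The data is synthetic and the variable names are made up; in the notebook the same calls would apply to a fitted pipeline and `X_hold` / `y_hold`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and split (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

ho_preds = model.predict(X_ho)              # hard class labels
ho_probs = model.predict_proba(X_ho)[:, 1]  # probability of the positive class

print("Accuracy: ", accuracy_score(y_ho, ho_preds))
print("Precision:", precision_score(y_ho, ho_preds))
print("Recall:   ", recall_score(y_ho, ho_preds))
print("F1:       ", f1_score(y_ho, ho_preds))
print("ROC-AUC:  ", roc_auc_score(y_ho, ho_probs))  # needs probabilities, not labels

cm = confusion_matrix(y_ho, ho_preds)  # rows: actual, columns: predicted
print(cm)
```

For plots, `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions` (scikit-learn 1.0+) take the same inputs.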

In [None]:
# We can reuse a pipeline we defined earlier, but cross_validate only fits
# clones of it, so we still need to fit it ourselves


In [None]:
# Grab predictions


In [None]:
# What do we want to check first?


In [None]:
# More space to check more metrics


In [None]:
# Might want to plot a few things to help us visualize our metrics


### Now - Let's Build an Evaluate Function for Classification!

In [None]:
def evaluate():
    '''
    
    '''
    pass

In [None]:
# Use our evaluate function!
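
One possible shape for such a function, offered as a sketch rather than the intended solution: it fits the estimator, then reports the same metrics on the train and test sets side by side. The name `evaluate_sketch`, its signature, and the choice of metrics are all assumptions, and it is demonstrated on synthetic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_sketch(estimator, X_tr, X_te, y_tr, y_te):
    """Fit `estimator`, then return train and test metrics side by side."""
    estimator.fit(X_tr, y_tr)
    report = {}
    for split, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        preds = estimator.predict(X_part)
        probs = estimator.predict_proba(X_part)[:, 1]
        report[split] = {
            "accuracy": accuracy_score(y_part, preds),
            "precision": precision_score(y_part, preds),
            "recall": recall_score(y_part, preds),
            "f1": f1_score(y_part, preds),
            "roc_auc": roc_auc_score(y_part, probs),
        }
    # Outer keys become columns (train/test), inner keys become rows (metrics)
    return pd.DataFrame(report)


# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)
demo_report = evaluate_sketch(LogisticRegression(max_iter=1000),
                              X_tr, X_te, y_tr, y_te)
print(demo_report.round(3))
```

A large gap between the train and test columns is a quick signal of overfitting.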

### Bonus Visualization

Code adapted from: https://stackoverflow.com/questions/45715018/scikit-learn-how-to-plot-probabilities

In [None]:
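# A bonus, for making it to the end of the notebook...
# Note: `clf_bal` is not defined above; it is assumed here to be a fitted
# pipeline (e.g. the one from the adjusted-hyperparameters section)
train_target = pd.DataFrame(y_train)  # Create a df with our actual train y values
# Add the predicted probabilities for class 1 as a column
train_target['Predicted Probability'] = clf_bal.predict_proba(X_train)[:, 1]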

# Plot the two
plt.figure(figsize=(12,6))
plt.hist(train_target[train_target['Survived']==0]['Predicted Probability'], 
         bins=50, label='Negatives')
plt.hist(train_target[train_target['Survived']==1]['Predicted Probability'], 
         bins=50, label='Positives', alpha=0.7, color='r')
plt.xlabel('Probability of being Positive Class')
plt.ylabel('Number of records in each bin')
plt.legend()
plt.tick_params(axis='both')
plt.show() 

## Level Up - What to Do With Too Many Options in Categorical Columns?

Data we can use for this: https://www.kaggle.com/c/cat-in-the-dat-ii (this file is in this GitHub repo, in the data folder)

New library you can install with more encoding techniques: https://contrib.scikit-learn.org/category_encoders/

- These encoders work within scikit-learn pipelines, since they're written in the scikit-learn transformer style!


## Resources

Check out Aurélien Géron's notebook of an [end-to-end ML project](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) on his GitHub repo, based around his book [_Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed)_](https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/).