# Introduction to Pipelines in Python

## Objectives
- Understand what a pipeline is and why it is used
- Implement a pipeline to streamline the preprocessing and modeling workflow

## Why Pipeline?

A pipeline defines a series of sequential steps that data must "flow" through. Pipelines keep our code neat and organized all the way from gathering and cleaning our data to building and fine-tuning models!

**Advantages**: 
- Reduces complexity
- Convenient 
- Flexible 
- Can help prevent mistakes (like data leakage between the train and test sets)


## Today's Agenda

We'll introduce pipelines through the lens of simplifying the whole classification workflow!

Our data: https://www.kaggle.com/competitions/spaceship-titanic/data

The goal is to predict the `Transported` column (whether or not the passenger was transported to another dimension).

The competition's main metric is accuracy.

#### Agenda:
- ML Workflow without Pipeline
- Pipeline Architecture
- Implement Pipeline into workflow

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import ConfusionMatrixDisplay, recall_score, roc_auc_score,\
    accuracy_score, precision_score, f1_score
from sklearn import set_config

In [None]:
df = pd.read_csv('data/space-titanic.csv')
df.head()

Perform a train-test split.

In [None]:
# Separate target and features
X = df.drop(columns=['Transported'])
y = df['Transported']

In [None]:
# 80/20 Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

## ML Workflow without Pipeline

Before we dive into pipelines, let's explore the data and outline a few of the preprocessing steps we'll use later!

### EDA

Explore the **training** data, checking both numerical and categorical features.
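A few quick checks usually cover most of this exploration. The sketch below runs on a small synthetic stand-in for `X_train` (the column names are borrowed from the Spaceship Titanic schema, but every value is made up), so treat the numbers as placeholders rather than findings about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train: real column names, made-up values
rng = np.random.default_rng(42)
n = 100
X_train = pd.DataFrame({
    "Age": rng.integers(1, 80, size=n).astype(float),
    "RoomService": rng.exponential(200.0, size=n),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
})
# Inject missing values so the checks have something to find
X_train.loc[rng.choice(n, size=10, replace=False), "Age"] = np.nan
X_train.loc[rng.choice(n, size=10, replace=False), "HomePlanet"] = np.nan

# Numerical features: summary statistics (count, mean, quartiles, ...)
num_summary = X_train.describe()

# Categorical features: frequency of each category, including missing values
planet_counts = X_train["HomePlanet"].value_counts(dropna=False)

# Missing values per column, which drives the imputation strategy
missing = X_train.isna().sum()

print(num_summary)
print(planet_counts)
print(missing)
```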

### EDA Notes:

What are a few important things you noticed when exploring the data?

### Data Preprocessing

Outline our data preprocessing strategy.
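As a reference point, here is one way the workflow could look without a pipeline. It is a sketch under assumptions: the data is synthetic, and the choices of median imputation, standard scaling, most-frequent imputation, and one-hot encoding are illustrative rather than the strategy this lesson settles on. Notice how every transformer has to be fit on the training data and then re-applied, in the same order, to the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset: real column names, made-up values
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(1, 80, size=n).astype(float),
    "RoomService": rng.exponential(200.0, size=n),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
    "Transported": rng.choice([True, False], size=n),
})
df.loc[rng.choice(n, size=20, replace=False), "Age"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "HomePlanet"] = np.nan

X = df.drop(columns=["Transported"])
y = df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_cols = ["Age", "RoomService"]
cat_cols = ["HomePlanet"]

# One object per preprocessing step (assumed strategies)
num_imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
cat_imputer = SimpleImputer(strategy="most_frequent")
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit every step on the training data only
X_train_num = scaler.fit_transform(num_imputer.fit_transform(X_train[num_cols]))
X_train_cat = encoder.fit_transform(cat_imputer.fit_transform(X_train[cat_cols])).toarray()
X_train_prepared = np.hstack([X_train_num, X_train_cat])

# Re-apply the already-fitted steps to the test data (transform, never fit)
X_test_num = scaler.transform(num_imputer.transform(X_test[num_cols]))
X_test_cat = encoder.transform(cat_imputer.transform(X_test[cat_cols])).toarray()
X_test_prepared = np.hstack([X_test_num, X_test_cat])

model = LogisticRegression(max_iter=1000)
model.fit(X_train_prepared, y_train)
test_accuracy = model.score(X_test_prepared, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Accidentally calling `fit_transform` on the test set at any of these steps would leak information, which is exactly the kind of bookkeeping a pipeline takes over.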

In [None]:
# List of numerical and categorical column names to process the data separately
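
One possible way to build these lists is to split on dtype with `select_dtypes`. The frame below is a synthetic stand-in for `X_train`; on the real data, identifier-like text columns (for example `PassengerId` or `Name`) would also land in the categorical list, so the result should be reviewed by hand rather than used blindly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train: real column names, made-up values
X_train = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0, 33.0],
    "RoomService": [0.0, 109.0, 43.0, np.nan],
    "HomePlanet": ["Earth", "Europa", np.nan, "Mars"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e", np.nan],
})

# Numeric dtypes go to one list, everything else to the other
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()

print("Numerical:  ", num_cols)
print("Categorical:", cat_cols)
```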

## Introducing ML Pipeline

Let's begin exploring pipelines and how we can implement them into our workflow. We'll start by reviewing the process we went through above and discuss how we should construct our pipeline architecture.

Two fundamental components:
- Transformer(s)
- Estimator(s)
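
The difference between the two shows up in their methods: a transformer learns something from the data with `fit` and reshapes the data with `transform`, while an estimator learns with `fit` and produces outputs with `predict`. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: one feature, binary target
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit learns the mean and standard deviation, transform rescales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator: fit learns the model coefficients, predict outputs class labels
clf = LogisticRegression()
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)

print(X_scaled.ravel())
print(preds)
```

In a `Pipeline`, every step except the last must be a transformer; the last step is typically the estimator.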

![Pipeline Architecture Diagram](./Pipeline_Architecture.png)


The primary tool we will use is sklearn's [Pipeline object](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). 
Since the preprocessing steps differ for numerical and categorical data, we will also utilize sklearn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to specify the correct steps for different columns.

### Create and Explore a Pipeline

The first thing we should do is handle the preprocessing steps. Let's create a pipeline for preprocessing the categorical columns we didn't get to earlier.

In [None]:
# Create pipeline to preprocess categorical data
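
A minimal sketch of such a pipeline is shown below. The two steps (most-frequent imputation followed by one-hot encoding) and the step names `imputer` and `encoder` are assumptions made for illustration, and the data is a hand-written stand-in for the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hand-written stand-in for the categorical columns of X_train
X_train_cat = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", np.nan, "Mars", "Earth"],
    "Destination": ["TRAPPIST-1e", np.nan, "55 Cancri e", "TRAPPIST-1e", "TRAPPIST-1e"],
})

# Each step is a (name, transformer) tuple; steps run in the order listed
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill missing categories
    ("encoder", OneHotEncoder(handle_unknown="ignore")),    # one column per category
])

# fit_transform fits each step in turn and passes its output to the next one
encoded = cat_pipe.fit_transform(X_train_cat)

print(cat_pipe.named_steps)
print(encoded.toarray())
```

Individual steps stay accessible by name through `named_steps`, which is handy for inspecting what a fitted step learned.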

### Complete Process

- Pipeline to process numerical columns
- ColumnTransformer
- Final pipeline
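
The three pieces could fit together as sketched below. Everything specific here is an assumption: the data is synthetic, the step names (`num`, `cat`, `preprocessor`, `model`) are arbitrary labels, and logistic regression simply stands in for whichever classifier is chosen.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset: real column names, made-up values
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(1, 80, size=n).astype(float),
    "RoomService": rng.exponential(200.0, size=n),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
    "Transported": rng.choice([True, False], size=n),
})
df.loc[rng.choice(n, size=20, replace=False), "Age"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "HomePlanet"] = np.nan

X = df.drop(columns=["Transported"])
y = df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_cols = ["Age", "RoomService"]
cat_cols = ["HomePlanet"]

# 1. Pipeline for the numerical columns
num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Pipeline for the categorical columns
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# 2. ColumnTransformer routes each group of columns to its own pipeline
preprocessor = ColumnTransformer(transformers=[
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

# 3. Final pipeline: preprocessing followed by the model
model_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# A single fit call fits every transformer and the model on the training data
model_pipe.fit(X_train, y_train)
test_accuracy = model_pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Calling `score` or `predict` on the test set only applies `transform` inside the pipeline, so nothing is re-fit on test data.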

The pipeline is visualized above in a text format. You can display the pipeline in a more interactive format using [set_config](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html#sphx-glr-auto-examples-miscellaneous-plot-pipeline-display-py).
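
A short sketch of switching the display mode follows; the two-step pipeline is a placeholder. Recent scikit-learn releases may already default to the diagram, in which case the call is harmless and `set_config(display="text")` switches back to the text representation.

```python
from sklearn import get_config, set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Render estimators as interactive HTML diagrams in notebook output
set_config(display="diagram")

# Placeholder pipeline, used only to have something to display
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

display_mode = get_config()["display"]
print(display_mode)

# In a notebook, leaving the pipeline as the last expression of a cell renders the diagram
pipe
```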

### Iterate

Having the steps predefined and modularized makes it easy to experiment and fine-tune each part of the process!
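
For example, trying a different model only requires swapping the last step while the preprocessing stays untouched. The sketch below compares two candidates with cross-validation; the data is synthetic (numeric columns only, to keep it short) and the candidate models and their settings are illustrative choices, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data: real column names, made-up values
rng = np.random.default_rng(42)
n = 150
X_train = pd.DataFrame({
    "Age": rng.integers(1, 80, size=n).astype(float),
    "RoomService": rng.exponential(200.0, size=n),
})
X_train.loc[rng.choice(n, size=15, replace=False), "Age"] = np.nan
y_train = pd.Series(rng.choice([True, False], size=n), name="Transported")

# Candidate models to try as the final step (illustrative choices)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Same preprocessing steps every time; only the last step changes
    pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("model", estimator),
    ])
    # Preprocessing is re-fit inside each fold, so validation folds never leak into it
    scores[name] = cross_val_score(pipe, X_train, y_train, cv=3, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```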

## Grid Search

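Grid searching with a pipeline works almost exactly as it does with a single estimator. We just need to adjust the parameter dictionary: each key is prefixed with the name of the pipeline step it belongs to, joined by a double underscore (`step_name__parameter_name`).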
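
In the parameter dictionary, each key is built from the step names joined by double underscores, down to the parameter itself. The sketch below assumes the step names `preprocessor`, `num`, `imputer`, and `model`, uses synthetic data, and searches a deliberately tiny grid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data: real column names, made-up values
rng = np.random.default_rng(42)
n = 150
X_train = pd.DataFrame({
    "Age": rng.integers(1, 80, size=n).astype(float),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], size=n),
})
X_train.loc[rng.choice(n, size=15, replace=False), "Age"] = np.nan
y_train = pd.Series(rng.choice([True, False], size=n), name="Transported")

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", num_pipe, ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["HomePlanet"]),
])
model_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Keys follow <step>__<nested step>__...__<parameter>
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__C": [0.1, 1.0, 10.0],
}

# Accuracy matches the competition metric
grid = GridSearchCV(model_pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

Because the whole pipeline is the estimator being searched, preprocessing choices can be tuned alongside model hyperparameters in a single search.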