# Introduction
In this notebook, we take some of the techniques from previous notebooks and rebuild them in a more robust and reusable way by introducing the concept of "pipelines" (just think of "data in" and "predictions out").

First, let's import all the packages we are going to need and load our data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
URL = 'https://github.com/data-analytics-in-business/lasagna-pipeline-demo/raw/main/data/sample_lasagna.csv'
df = pd.read_csv(URL)
df.head()

# Pipeline
Now let's reuse some of the code from the previous notebook, but organise our preprocessing into a `Pipeline`. The `ColumnTransformer` will split the numeric and categorical columns for us, pass them to their respective preprocessing transformers (i.e., instances of `MinMaxScaler` and `OneHotEncoder`), and then combine the results into a final input matrix.

The `Pipeline` then combines the `preprocessor` with a `LogisticRegression` classifier to create a *classification pipeline*.

In [None]:
numeric_features = ["Age", "Weight","Income","Car Value","CC Debt","Mall Trips"]
categorical_features = ["Pay Type","Gender","Live Alone","Dwell Type","Nbhd"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), numeric_features),
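        # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2 (`sparse` was removed in 1.4)
        ("cat", OneHotEncoder(drop='first', handle_unknown="ignore", sparse_output=False), categorical_features),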
    ],
)

clf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
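
To see what a `ColumnTransformer` produces, here is a minimal sketch on a made-up three-row table. The `toy` data and the `toy_` names are illustrative assumptions, not part of the lasagna dataset. The numeric column is rescaled to the range 0 to 1, and the categorical column becomes a single 0/1 column because `drop="first"` removes the first category.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data (not the lasagna dataset): one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [20, 30, 40],
    "Gender": ["Female", "Male", "Female"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), ["Age"]),
        ("cat", OneHotEncoder(drop="first"), ["Gender"]),
    ],
)

toy_matrix = toy_preprocessor.fit_transform(toy)

# Depending on the scikit-learn version and settings, the result may be a sparse matrix
if hasattr(toy_matrix, "toarray"):
    toy_matrix = toy_matrix.toarray()

# Column 0 is the scaled Age; column 1 is the indicator for Gender == "Male"
print(toy_matrix)
```

Each row of the printed matrix corresponds to one row of `toy`, with the transformed numeric columns first and the encoded categorical columns after them, in the order the transformers were listed.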


# Predictions
Now that we have set up our classification pipeline, we can train (`fit`) it on some data and use it to generate some predictions.

Run the code below to train our classification pipeline and to add some new columns to our input matrix $X$ showing the true values of our *target* variable (`Have Tried`) and the predictions resulting from our classification pipeline.

In [None]:
y = df['Have Tried']
X = df.drop(columns=['Person','Have Tried'])
clf_pipeline.fit(X,y)

X['y_true'] = y
X['y_pred'] = clf_pipeline.predict(X)
X.head(30)

# Confusion (Training)
We can see that `y_true` and `y_pred` often match, but not always. Let's create a *confusion matrix* to see a summary of the classification errors our pipeline makes on our *training data* (i.e., the data we used to train the pipeline).

In [None]:
ConfusionMatrixDisplay.from_predictions(y, clf_pipeline.predict(X))
plt.show()
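
If you want the four counts behind the plot as plain numbers, `confusion_matrix` from `sklearn.metrics` returns them as an array. The sketch below uses made-up labels (the `toy_true` and `toy_pred` lists are assumptions for illustration) and fixes the label order to `["No", "Yes"]`, so that `Yes` is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true values and predictions, for illustration only
toy_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
toy_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

# Rows are true classes and columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(toy_true, toy_pred, labels=["No", "Yes"])

# With the order ["No", "Yes"], flattening the 2x2 array gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```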

# Performance (Training)
Based on the values in the confusion matrix, we can calculate a variety of performance metrics using the `sklearn.metrics` module.

Run the code below to see how our classification pipeline performs on our <ins>training</ins> data.

In [None]:
y_true = y.values
y_pred = clf_pipeline.predict(X)

A = accuracy_score(y_true, y_pred)
print(f'Accuracy = (TP + TN) / (TP + FP + FN + TN) = {A:.4f}')

P = precision_score(y_true, y_pred, pos_label='Yes')
print(f'Precision = TP / (TP + FP) = {P:.4f}')

R = recall_score(y_true, y_pred, pos_label='Yes')
print(f'Recall = TP / (TP + FN) = {R:.4f}')

F1 = f1_score(y_true, y_pred, pos_label='Yes')
print(f'F1 Score = 2 * (P * R) / (P + R) = {F1:.4f}')
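
As a sanity check on the formulas printed above, the following sketch counts TP, FP, FN and TN by hand on a small made-up example and compares the results with the `sklearn.metrics` functions. The `toy_` lists and lowercase variable names are illustrative assumptions, chosen so they do not overwrite the notebook's own variables.

```python
import math
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true values and predictions, for illustration only
toy_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
toy_pred = ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No"]

# Count the four outcomes by hand, treating "Yes" as the positive class
pairs = list(zip(toy_true, toy_pred))
tp = sum(t == "Yes" and p == "Yes" for t, p in pairs)
fp = sum(t == "No" and p == "Yes" for t, p in pairs)
fn = sum(t == "Yes" and p == "No" for t, p in pairs)
tn = sum(t == "No" and p == "No" for t, p in pairs)

# Apply the formulas directly
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# The hand-computed values should agree with scikit-learn
assert math.isclose(accuracy, accuracy_score(toy_true, toy_pred))
assert math.isclose(precision, precision_score(toy_true, toy_pred, pos_label="Yes"))
assert math.isclose(recall, recall_score(toy_true, toy_pred, pos_label="Yes"))
assert math.isclose(f1, f1_score(toy_true, toy_pred, pos_label="Yes"))

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.4f}, Precision={precision:.4f}, Recall={recall:.4f}, F1={f1:.4f}")
```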

# Exercise
Create a new classification pipeline that uses an input matrix $X$ containing only the columns `Age`, `Weight`, `Income`, `Pay Type`, and `Gender`; a `StandardScaler` to process the numeric variables; and a `DecisionTreeClassifier` as the classification "head" of the pipeline. Then train the pipeline on the training data, generate predictions for the training data, and calculate the *accuracy*, *precision*, *recall*, and *F1* scores for the classification pipeline on the training data.

**Question**: What do you notice about the performance of this new classification pipeline?

Below, there is some code to get you started and steps in the comments to follow.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

y = df['Have Tried']
X = df[['Age', 'Weight', 'Income', 'Pay Type', 'Gender']]

# (SOLUTION)
numeric_features = ["Age", "Weight","Income"]
categorical_features = ["Pay Type","Gender"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
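        # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2 (`sparse` was removed in 1.4)
        ("cat", OneHotEncoder(drop='first', handle_unknown="ignore", sparse_output=False), categorical_features),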
    ],
)

clf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

clf_pipeline.fit(X,y)

y_true = y.values
y_pred = clf_pipeline.predict(X)

A = accuracy_score(y_true, y_pred)
print(f'Accuracy = (TP + TN) / (TP + FP + FN + TN) = {A:.4f}')

P = precision_score(y_true, y_pred, pos_label='Yes')
print(f'Precision = TP / (TP + FP) = {P:.4f}')

R = recall_score(y_true, y_pred, pos_label='Yes')
print(f'Recall = TP / (TP + FN) = {R:.4f}')

F1 = f1_score(y_true, y_pred, pos_label='Yes')
print(f'F1 Score = 2 * (P * R) / (P + R) = {F1:.4f}')


# Bonus Exercise
Experiment with the code above to identify which of the changes you made to the pipeline caused the biggest change in scores.
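
One possible way to organise this experiment is to change one component at a time and compare the scores side by side. The sketch below is only an illustration under stated assumptions: `build_pipeline` is a hypothetical helper, and `toy_X` and `toy_y` are made-up data standing in for the lasagna dataset. To run it on the real data, substitute your own `X` and `y`; the choice of input columns could be added as a third factor in the same way.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the lasagna dataset
toy_X = pd.DataFrame({
    "Age": [22, 35, 47, 51, 29, 63, 38, 44],
    "Income": [21000, 54000, 38000, 72000, 30000, 45000, 61000, 27000],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
})
toy_y = pd.Series(["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"])


def build_pipeline(scaler, classifier):
    """Hypothetical helper: build a pipeline that differs only in the parts passed in."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", scaler, ["Age", "Income"]),
            ("cat", OneHotEncoder(drop="first"), ["Gender"]),
        ],
    )
    return Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", classifier),
    ])


# Every combination of the two scalers and the two classifiers
variants = {
    "MinMaxScaler + LogisticRegression": (MinMaxScaler(), LogisticRegression()),
    "StandardScaler + LogisticRegression": (StandardScaler(), LogisticRegression()),
    "MinMaxScaler + DecisionTreeClassifier": (MinMaxScaler(), DecisionTreeClassifier()),
    "StandardScaler + DecisionTreeClassifier": (StandardScaler(), DecisionTreeClassifier()),
}

# Fit each variant and record its accuracy on the data it was trained on
toy_scores = {}
for name, (scaler, classifier) in variants.items():
    pipeline = build_pipeline(scaler, classifier)
    pipeline.fit(toy_X, toy_y)
    toy_scores[name] = accuracy_score(toy_y, pipeline.predict(toy_X))
    print(f"{name}: training accuracy = {toy_scores[name]:.4f}")
```

Comparing rows that differ in only one component shows how much of the change in score that component accounts for.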