# 01 - Classification pipelines in Python
We are going to use some of the things we learned in the previous notebook, but develop them in a more robust and reusable way by introducing the concept of "pipelines" (just think of "data in" and "predictions out"). We will also explore how to evaluate and compare classification pipelines using the data.

## Import dataset
Let's import the same dataset from before.

In [None]:
import pandas as pd
df = pd.read_csv('../data/Lasagna Triers Logistic Regression.csv')
df.head()

Let's specify our attribute/predictor variables $X$ and the target variable $y$ that we wish to predict.

Note: To create $X$, we drop the ID and target variables.

In [None]:
X = df.drop(columns=['Person','Have Tried'])
y = df['Have Tried']

## Pre-process dataset
To start to create a pipeline, we need to consider how we will pre-process the data. As was seen in the previous notebook, different data types are processed in different ways.

Run the code below to show the data types of each column in $X$.

In [None]:
X.dtypes

We can consider the columns of type `int64` to be numerical and the columns of type `object` to be categorical.

Run the following code to select the columns containing numerical values.

In [None]:
X_num = X.select_dtypes(include=['int64'])
X_num.head()

Run the following code to select the columns containing categorical values.

In [None]:
X_cat = X.select_dtypes(include=['object'])
X_cat.head()

In the previous notebook we used the `get_dummies` method to encode the categorical variables in a form that can be used in logistic regression. Here, we will do the same thing, but we will use the `OneHotEncoder` from the `sklearn` package.

Run the code below to see what the `OneHotEncoder` does.

In [None]:
from sklearn.preprocessing import OneHotEncoder
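# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
hot_encoder = OneHotEncoder(drop='first', handle_unknown="ignore", sparse_output=False)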
hot_encoder.fit(X_cat)
X_cat_onehot = pd.DataFrame(hot_encoder.transform(X_cat), 
                                  columns=hot_encoder.get_feature_names_out(X_cat.columns))
X_cat_onehot.head()

Note that we have dropped the first value from each categorical variable (remember linear dependency?), and we have told our `OneHotEncoder` instance to ignore any values it sees in the future that are unknown to it.

We are also going to scale the numeric variables to be more similar in values to the coded categorical values. To do this, run the code below.

In [None]:
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_num_scaled = pd.DataFrame(minmax_scaler.fit_transform(X_num), columns=X_num.columns)
X_num_scaled.head()
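
As a rough illustration of what `MinMaxScaler` computes with its default settings, the sketch below rescales a made-up column (hypothetical values, not from the dataset) and compares the result with the formula $(x - \min) / (\max - \min)$ applied by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column, not taken from the lasagna dataset
toy_values = np.array([[20.0], [30.0], [50.0], [70.0]])

# Scale with scikit-learn (default feature range is 0 to 1)
toy_scaled = MinMaxScaler().fit_transform(toy_values)

# The same result computed by hand: (x - min) / (max - min)
toy_manual = (toy_values - toy_values.min()) / (toy_values.max() - toy_values.min())

print(toy_scaled.ravel())
print(np.allclose(toy_scaled, toy_manual))
```

After scaling, the numeric columns lie between 0 and 1, the same range as the one-hot encoded columns.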

Finally, let's join the processed columns back together.

In [None]:
X_train_preprocessed = pd.concat([X_num_scaled, X_cat_onehot], axis=1)
X_train_preprocessed.head()

## Let's create a pipeline
Now, all the above pre-processing steps can be included in a pipeline, which can then be used to process new data, in the same way, without re-writing all the code again.

To create our first pipeline, please run the code below.

In [None]:
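import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

# Create a custom pipeline; inheriting from scikit-learn's base classes
# lets it behave like a standard transformer inside a `Pipeline`
class PreprocessingPipeline(TransformerMixin, BaseEstimator):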
    def __init__(self):
        # Initialise the encoder and scaler
        self.hot_encoder = OneHotEncoder(drop='first', handle_unknown="ignore", sparse=False)
        self.minmax_scaler = MinMaxScaler()
        
        # Prepare to store the column names
        self.scaled_columns = []
        self.coded_columns = []
    
    def fit(self, X, y=None, **fit_params):
        # Split X into object/categorical and int64/numeric columns
        X_cat, X_num = self._split_dtypes(X)
        
        # Fit the encoder to object/categorical data 
        self.hot_encoder.fit(X_cat)
        self.coded_columns = self.hot_encoder.get_feature_names_out(X_cat.columns)
        
        # Fit the scaler to int64/numeric data
        self.minmax_scaler.fit(X_num)
        self.scaled_columns = X_num.columns
        
        return self

    def transform(self, X, **transform_params):
        # Split X into object/categorical and int64/numeric columns
        X_cat, X_num = self._split_dtypes(X)
        
        # Transform the object/categorical data with the fitted encoder
        X_cat_out = pd.DataFrame(self.hot_encoder.transform(X_cat), columns=self.coded_columns)
        
        # Transform the int64/numeric data with the fitted scaler (no re-fitting here)
        X_num_out = pd.DataFrame(self.minmax_scaler.transform(X_num), columns=self.scaled_columns)
        
        # Return the full processed data-frame
        return pd.concat([X_num_out, X_cat_out], axis=1)
    
    def _split_dtypes(self, X):
        return X.select_dtypes(include=['object']), X.select_dtypes(include=['int64'])

We can now fit the pre-processing pipeline to our input data $X$, and use it to transform our input data $X$.

To do this, run the code below.

In [None]:
my_pipe = PreprocessingPipeline()
my_pipe.fit(X)
my_pipe.transform(X).head()

Looks the same as before?

Great, BUT we can also pass *new* data through the pipeline.

Let's import the `New_Customers.csv` from the previous notebook, and create an `X_new`.

In [None]:
df_new = pd.read_csv('../data/New_Customers.csv')
X_new = df_new.drop(columns=['New_Person'])
X_new.head()

Now let's pass the new data through our existing pipeline.

In [None]:
my_pipe.transform(X_new).head()

It works!?
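
One thing worth checking when new data passes through a fitted pipeline is that the scaling learned from the training data is reused rather than re-learned. The sketch below uses made-up numbers (hypothetical, not from either CSV file) to show how the two approaches differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: a training column and a new column
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_new = np.array([[5.0], [20.0]])

# Fit once on the training data
toy_scaler = MinMaxScaler().fit(toy_train)

# Reuse the training min (0) and max (10): 5 maps to 0.5 and 20 maps to 2.0
reused = toy_scaler.transform(toy_new)

# Re-fitting on the new data uses its own min (5) and max (20) instead,
# so the same values land on a different scale
refitted = MinMaxScaler().fit_transform(toy_new)

print(reused.ravel())
print(refitted.ravel())
```

Only the first approach keeps new rows on the scale the classifier was trained on, which is why a pipeline's `transform` step should call `transform` on its fitted components and never `fit_transform`.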

## Let's create a classification pipeline
Our pre-processing pipeline is great, but we actually want to classify examples at the end of it. 

Well, we can easily extend our pre-processing pipeline to become a classification pipeline.

Run the code below to do exactly that.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

clf_pipeline = Pipeline([
    ('preprocessor', PreprocessingPipeline()),
    ('classifier', LogisticRegression())
])
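
To see what a `Pipeline` does when it is fitted and asked to predict, the following sketch compares a two-step pipeline with the same steps chained by hand. The data and the `toy_` names are hypothetical, and a plain `MinMaxScaler` stands in for our custom pre-processor to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, not taken from the lasagna dataset
toy_X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
toy_y = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])
toy_X_new = np.array([[0.0], [6.0], [13.0]])

# Pipeline: fit() fits each step in order, predict() transforms then predicts
toy_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', LogisticRegression()),
])
toy_pipe.fit(toy_X, toy_y)
pipe_pred = toy_pipe.predict(toy_X_new)

# The same steps written out by hand
manual_scaler = MinMaxScaler().fit(toy_X)
manual_clf = LogisticRegression().fit(manual_scaler.transform(toy_X), toy_y)
manual_pred = manual_clf.predict(manual_scaler.transform(toy_X_new))

print(pipe_pred)
print(np.array_equal(pipe_pred, manual_pred))
```

The pipeline saves us from carrying the fitted scaler and classifier around separately and from forgetting a step when new data arrives.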

Now let's use our classification pipeline to create some predictions for our new customers.

Run the code below.

In [None]:
clf_pipeline.fit(X,y)
df_new['Have Tried'] = clf_pipeline.predict(X_new)
df_new.head(30)

Looks good? But how do we know if our classification pipeline is performing well? Let's explore below.

## Let's evaluate our classification pipeline
As in the previous notebook, we can view the confusion matrix of our classification pipeline.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(clf_pipeline, X, y)
plt.show()

We see True Negatives (TN) of 289, False Negatives (FN) of 74, False Positives (FP) of 72, and True Positives (TP) of 421.

There are several ways we can use the values in the confusion matrix to arrive at a single performance metric. Let's import some now and define our true and predicted labels.

Run the code below.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_true = y.values
y_pred = clf_pipeline.predict(X)

A = accuracy_score(y_true, y_pred)
print(f'Accuracy = (TP + TN) /(TP + FP + FN + TN) = {A:.4f}')

P = precision_score(y_true, y_pred, pos_label='Yes')
print(f'Precision = TP / (TP + FP) = {P:.4f}')

R = recall_score(y_true, y_pred, pos_label='Yes')
print(f'Recall = TP / (TP + FN) = {R:.4f}')

F1 = f1_score(y_true, y_pred, pos_label='Yes')
print(f'F1 Score = 2 * (P * R) / (P + R) = {F1:.4f}')
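
As a cross-check, the four formulas can be applied directly to the confusion matrix counts quoted above (TN = 289, FN = 74, FP = 72, TP = 421). The sketch below does the arithmetic by hand and then rebuilds label lists with those counts, assuming `'Yes'` is the positive class, to confirm that scikit-learn returns the same values.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts quoted from the confusion matrix above
TN, FN, FP, TP = 289, 74, 72, 421

# Metrics computed by hand from the counts
accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# Rebuild true/predicted labels that produce exactly these counts
labels_true = ['No'] * TN + ['Yes'] * FN + ['No'] * FP + ['Yes'] * TP
labels_pred = ['No'] * TN + ['No'] * FN + ['Yes'] * FP + ['Yes'] * TP

print(f'Accuracy  = {accuracy:.4f} (sklearn: {accuracy_score(labels_true, labels_pred):.4f})')
print(f'Precision = {precision:.4f} (sklearn: {precision_score(labels_true, labels_pred, pos_label="Yes"):.4f})')
print(f'Recall    = {recall:.4f} (sklearn: {recall_score(labels_true, labels_pred, pos_label="Yes"):.4f})')
print(f'F1 Score  = {f1:.4f} (sklearn: {f1_score(labels_true, labels_pred, pos_label="Yes"):.4f})')
```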


## Let's estimate test performance
That's all great, but what we care about is model performance on new, unseen data. How can we estimate that?

One way is through *training* and *testing*.

Let's split our $X$ and $y$ data into training and test datasets, then fit our classification pipeline to the training data and test it on the test data.

Run the code below.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

clf_pipeline = Pipeline([
    ('preprocessor', PreprocessingPipeline()),
    ('classifier', LogisticRegression())
])

clf_pipeline.fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(clf_pipeline, X_test, y_test)
plt.show()
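
For a closer look at what `train_test_split` returns, here is a small sketch on made-up data (the `toy_` names and values are hypothetical). It passes `random_state=0`, which is an addition for illustration only: it fixes the shuffle so the split is repeatable, whereas the call above omits it and therefore gives a different split on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with eight rows
toy_X = pd.DataFrame({'Age': [21, 34, 45, 52, 29, 61, 38, 47]})
toy_y = pd.Series(['No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No'])

# Hold out 25% of the rows for testing; random_state makes the shuffle repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# Number of rows in each part
print(len(toy_X_train), len(toy_X_test))

# No row appears in both the training and the test set
print(set(toy_X_train.index) & set(toy_X_test.index))

# Features and labels stay aligned row by row
print(list(toy_X_train.index) == list(toy_y_train.index))
```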

And there you have it. We have:
1. Created a pre-processing pipeline
2. Extended the pre-processing pipeline to create a classification pipeline
3. Used training data to fit the classification pipeline

### Task - Variation over training testing splits
Run the above code a few times to get a feel for how the results change for different training and test splits.

### Task - Compute scores on test set
Calculate the Accuracy, Precision, Recall, and F1 Score of the classification pipeline (`clf_pipeline`) on the test set (`X_test`, `y_test`).

In [None]:
# (SOLUTION)

### Task - Compare Logistic Regression to Decision Tree Classifier
Import the `DecisionTreeClassifier`, create a `LogisticRegression` pipeline and a `DecisionTreeClassifier` pipeline, fit both pipelines to the same training data, and show the confusion matrix for each pipeline tested on the test data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# (SOLUTION)

### Task - Final
Compute Accuracy, Precision, Recall, and F1 Score for both pipelines and identify which pipeline performs best for each score.

In [None]:
# (SOLUTION)