# Demo notebook for pipeline functions

This notebook demonstrates how to use scikit-learn's [`Pipeline`](http://scikit-learn.org/stable/modules/pipeline.html). A pipeline is a tool in machine learning that simplifies the workflow by encapsulating all the steps involved in a single object. It offers advantages such as simplicity, reproducibility, efficiency, flexibility, and integration with other scikit-learn tools.

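A `Pipeline` is built from a list of (key, value) pairs, where the key is a string with the name you want to give the step and the value is an estimator object (the method to be executed). Every step except the last must be a transformer, that is, an object that implements both `fit` and `transform`.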
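
As a minimal sketch of this structure, the snippet below builds a two-step pipeline and looks its steps up by name. The step names `scale` and `reduce` and the variable `demo_pipe` are hypothetical labels chosen for illustration; they do not appear elsewhere in this notebook.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Each step is a (name, estimator) pair; the names are arbitrary labels.
demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("reduce", PCA(n_components=2)),
])

# The step names are kept in order and can be used to look up each estimator.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)
print(type(demo_pipe.named_steps["scale"]).__name__)
```

The names matter later on: they are the prefixes used to address the parameters of a step, for example during a grid search.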

This notebook demonstrates several use cases, each time comparing the manual, step-by-step approach with the equivalent pipeline.


In [13]:
import pandas as pd
import numpy as np
from time import time


from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV


## The data

For this notebook we use Lifelines data. Lifelines is a large, multi-generational, prospective cohort study that includes over 167,000 participants (10% of the population) from the northern Netherlands. The participants in this cohort study are followed over a 30-year period. Every five years, participants visit one of the Lifelines sites in the north of the Netherlands for an assessment. During these visits, several physical measurements are taken and different biomaterials are collected.

In [14]:
df = pd.read_csv('../data/lifelines.txt', sep = '\t').set_index('ZIPCODE')
#df.info()

In [15]:
df.shape

(480, 71)

In [16]:
#make a copy to compare manual way with pipeline way
data_manual = df.copy()

## Use case preprocessing

First we demonstrate the manual steps: the log transformation and the scaling. Then we demonstrate the same steps as a pipeline.

In [17]:
def transform_2log(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    # Select the columns whose skewness exceeds 0.75
    skew = df.skew()
    skew_columns = skew[skew > 0.75].index
    # Apply log(1 + x) to the skewed columns only
    df[skew_columns] = np.log1p(df[skew_columns])
    return df

In [18]:
t0 = time()
# single step log transformation
data_manual = transform_2log(data_manual)
# single step scaling
mms = MinMaxScaler()
data_manual = mms.fit_transform(data_manual)
print(time() - t0)

0.004747867584228516


In [19]:
t0 = time()
# The pipeline
prep = Pipeline([('log1p', FunctionTransformer(transform_2log)), 
                 ('minmaxscale', MinMaxScaler())]
               )

# Convert the original data
data_pipe = prep.fit_transform(df)
print(time() - t0)

0.003876209259033203


Note that the pipeline calls `fit_transform()`, which requires every step to provide `fit` and `transform` methods. `MinMaxScaler` has these methods, but a plain function such as `transform_2log` or `np.log1p` does not. We therefore wrap the function in scikit-learn's `FunctionTransformer`, a class in the `sklearn.preprocessing` module that lets you apply an arbitrary function to your data. It is commonly used as a preprocessing step in sklearn pipelines.

The FunctionTransformer constructor takes as input a callable, which can be any Python function or lambda expression. When applied to a dataset, the FunctionTransformer simply calls the specified function on the input data and returns the transformed output.
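
A minimal sketch of this wrapping, using a small made-up array (`toy` is hypothetical and not part of the Lifelines data): the wrapped function gains the `fit_transform` method that a pipeline step needs, and the output matches a direct call to the function.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, used only for illustration
toy = np.array([[0.0, 1.0], [3.0, 7.0]])

# Wrap a plain NumPy function so it exposes fit/transform
log_tf = FunctionTransformer(np.log1p)
toy_logged = log_tf.fit_transform(toy)

# The transformer returns exactly what the wrapped function returns
same_as_direct_call = np.array_equal(toy_logged, np.log1p(toy))
print(same_as_direct_call)
```

Fitting a `FunctionTransformer` learns nothing from the data; `fit` only exists so that the object satisfies the transformer interface.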


`np.allclose` is a function in the NumPy library that compares two arrays element-wise and returns True if all corresponding pairs of elements are within a specified tolerance of each other, and False otherwise.
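
The short sketch below illustrates the tolerance on made-up arrays (the names `a` and `b` are hypothetical). Two arrays that differ by a tiny floating-point amount are not exactly equal, but they are close under the default tolerances; a larger difference only passes when the absolute tolerance `atol` is widened.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9  # tiny difference, far below the default tolerances

exactly_equal = np.array_equal(a, b)           # strict element-wise equality
close_default = np.allclose(a, b)              # default rtol=1e-05, atol=1e-08
close_far = np.allclose(a, a + 0.1)            # difference exceeds the tolerance
close_loose = np.allclose(a, a + 0.1, atol=0.2)  # wider absolute tolerance

print(exactly_equal, close_default, close_far, close_loose)
```

A tolerance-based comparison is the appropriate check here because the manual route and the pipeline route may differ by floating-point rounding.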


In [20]:
#check if values are the same, pipeline works same as the manual steps
np.allclose(data_pipe, data_manual)

True

## Extend pipeline with PCA
We can reuse the preprocessing pipeline as a step in a new pipeline.

In [21]:
df = pd.read_csv('../data/lifelines.txt', sep = '\t').set_index('ZIPCODE')
data_manual = df.copy()
# single step log transformation
data_manual = transform_2log(data_manual)
# single step scaling
mms = MinMaxScaler()
data_scaled = mms.fit_transform(data_manual)
#single step PCA
pca = PCA(2, random_state=42)
result_manual_pca = pca.fit_transform(data_scaled)

In [22]:
#Use case PCA

pipe = Pipeline([
    ('prep', prep),
    ('pca', PCA(2, random_state=42))
])

results_pipe_pca = pipe.fit_transform(df)

In [23]:
np.allclose(result_manual_pca, results_pipe_pca)

True
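
When one pipeline is nested inside another like this, the inner steps remain reachable by name. The sketch below rebuilds a comparable nested pipeline on random stand-in data (`toy`, `inner` and `outer` are hypothetical names, and plain `np.log1p` replaces `transform_2log` to keep the example self-contained). It shows the double-underscore convention for addressing a parameter of an inner step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Hypothetical stand-in data: 20 rows, 4 non-negative features
rng = np.random.default_rng(0)
toy = rng.uniform(0, 10, size=(20, 4))

inner = Pipeline([
    ("log1p", FunctionTransformer(np.log1p)),
    ("minmaxscale", MinMaxScaler()),
])
outer = Pipeline([
    ("prep", inner),
    ("pca", PCA(2, random_state=42)),
])
outer.fit(toy)

# Fitted inner steps can be looked up by name, one level at a time
fitted_scaler = outer.named_steps["prep"].named_steps["minmaxscale"]

# Parameters of nested steps are addressed as <outer step>__<inner step>__<parameter>
outer.set_params(prep__minmaxscale__feature_range=(-1, 1))
new_range = outer.get_params()["prep__minmaxscale__feature_range"]
print(new_range)
```

The same double-underscore names are what a parameter grid uses to refer to the settings of individual steps.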

## Use a pipeline for classification

With pipelines we can follow a structured approach to model development, starting with data preprocessing and feature scaling. To check that our model generalizes well to unseen data, we split the dataset into training and test sets. We then build a machine learning pipeline that combines a `StandardScaler` for feature scaling with a `RandomForestClassifier` for classification.

`Pipeline` is often used in combination with `GridSearchCV` for hyperparameter tuning because it allows us to encapsulate all the preprocessing and modeling steps into a single object. This can be useful for several reasons:

1. Convenience: Instead of manually performing each preprocessing step and modeling step separately, we can combine them into a single object and pass that object as the estimator argument to `GridSearchCV`. This can make our code simpler and easier to read.

2. Reproducibility: By encapsulating all the preprocessing and modeling steps into a single object, we can ensure that the same steps are applied consistently to the data, regardless of how many times we run the code.

3. Avoiding data leakage: During hyperparameter tuning, each parameter combination must be evaluated on validation data that played no part in fitting. If a preprocessing step such as a scaler is fitted on the full dataset before cross-validation, statistics from the validation folds leak into training. With the preprocessing steps inside the `Pipeline` object, `GridSearchCV` refits every step on the training folds only and then applies the fitted steps to the validation fold, which prevents this leakage.
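
The leakage point can be checked by hand. The sketch below mimics, under simplifying assumptions, what a cross-validated search does for one parameter combination: it clones the pipeline, fits it on the training part of each fold, and scores it on the validation part. The names `X_demo`, `y_demo` and `leak_pipe` are hypothetical, and `LogisticRegression` is used only to keep the example fast.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

leak_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

fold_checks = []
folds = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X_demo):
    # A fresh copy of the whole pipeline is fitted on the training fold only
    fitted = clone(leak_pipe).fit(X_demo[train_idx], y_demo[train_idx])
    scaler_mean = fitted.named_steps["scaler"].mean_
    # The scaler's statistics come from the training rows, not the full data
    fold_checks.append(np.allclose(scaler_mean, X_demo[train_idx].mean(axis=0)))
    # The validation fold is transformed with those training statistics
    fitted.score(X_demo[val_idx], y_demo[val_idx])

print(all(fold_checks))
```

Scaling the full dataset before splitting would instead bake the validation rows into the scaler's mean and variance.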

`GridSearchCV` searches over a grid of hyperparameters to find the combination that maximizes a chosen performance metric (see `sklearn.metrics`). By using `Pipeline` in combination with `GridSearchCV`, we can search over a grid of hyperparameters for both the preprocessing and modeling steps simultaneously. This can be a powerful technique for optimizing the performance of machine learning models.

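We can configure `GridSearchCV` with:
- estimator: the estimator object to tune, which can be a pipeline
- param_grid: dictionary that maps each parameter name to the list of values to try; parameters of a pipeline step are named `<step name>__<parameter>`
- scoring: evaluation metric to use when ranking results; if omitted, the estimator's own `score` method is used
- cv: the cross-validation strategy, for example the number of folds used to evaluate each combination of parameters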
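
The four arguments above can be sketched together as follows. This is an illustrative configuration rather than the one used in the next cell: the `pca` step, the small grid, and the names `demo_pipeline`, `demo_grid` and `demo_search` are assumptions made for the example. The grid tunes one preprocessing parameter and one model parameter at the same time.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys follow the <step name>__<parameter> convention
demo_grid = {
    "pca__n_components": [2, 3],          # preprocessing hyperparameter
    "classifier__n_estimators": [10, 50],  # model hyperparameter
}

demo_search = GridSearchCV(
    estimator=demo_pipeline,
    param_grid=demo_grid,
    scoring="accuracy",  # metric used to rank the candidates
    cv=5,                # five folds per candidate
)
demo_search.fit(X_demo, y_demo)

# 2 x 2 grid values give 4 candidate combinations
n_candidates = len(demo_search.cv_results_["params"])
print(n_candidates)
print(demo_search.best_params_)
```

After fitting, `best_params_` holds the winning combination and `best_score_` its mean cross-validated score.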

In [24]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

#split train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building a Pipeline
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter Tuning using GridSearchCV
param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__max_depth': [2, 4, 6],
    'classifier__min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Evaluation
# Best parameters
print("\nBest Parameters from GridSearchCV:")
print(grid_search.best_params_)

# Predictions
y_pred = grid_search.predict(X_test)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Fitting 5 folds for each of 27 candidates, totalling 135 fits

Best Parameters from GridSearchCV:
{'classifier__max_depth': 4, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 50}

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
