# *Pipelines* in *Scikit-Learn*

## Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv")

# Stratify the split by income category so both sets keep the same income distribution
income_cat = pd.cut(housing["median_income"],
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=income_cat, random_state=42)

X_train = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"].copy()

X_train_num = X_train.select_dtypes(include=[np.number]) # select numerical columns

## Preprocessing pipelines

A ***pipeline*** is a sequence of data processing components. The `Pipeline` class from scikit-learn lets us build objects that represent these sequences, so we can apply them later to any dataset.

All **estimators** except the last one must be **transformers**. When we call the `fit` method of a `Pipeline`, it calls `fit_transform` on each transformer in turn, passing each output to the next step, and finally calls `fit` on the last estimator. The last estimator can be of any type (transformer, classifier, regressor, etc.).
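
To illustrate that the last step can be a predictor, here is a minimal sketch. The toy data and the `LinearRegression` final step are assumptions made for this example and are not part of the housing workflow in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: y = 2 * x + 1, with one missing value in X
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # transformer
    ("standardize", StandardScaler()),             # transformer
    ("regress", LinearRegression()),               # final estimator: a regressor
])

# fit() runs fit_transform on the two transformers, then fit on the regressor
model_pipeline.fit(X_toy, y_toy)

# predict() runs transform on the two transformers, then predict on the regressor
pred = model_pipeline.predict(np.array([[3.0]]))
```

Because the final step is a regressor, the pipeline as a whole exposes `predict`. The missing value is imputed with the median of the observed values (3.0), so the toy relationship stays exactly linear.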

It's important to be clear about what each [type of estimator in scikit-learn](./types_estimators.md) means.

Let's build a *pipeline* that preprocesses the numerical predictors.

The constructor of the `Pipeline` class receives a list of `(name, estimator)` tuples, where each name identifies its step. As noted above, every step except the last must be a **transformer**.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")), # replace missing values with the column median
    ("standardize", StandardScaler()), # standardize the values (zero mean, unit variance)
])
num_pipeline.steps

We can also use the `make_pipeline` function, which builds the same kind of *pipeline* but names each step automatically, using the lowercased class name of each estimator.

In [None]:
from sklearn.pipeline import make_pipeline
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
num_pipeline.steps
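
The automatic naming can be checked directly. The following sketch is an illustration, not one of the notebook's cells; the `two_scalers` pipeline is hypothetical and only shows how repeated estimator types are disambiguated.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Each step is named after its class, in lowercase
step_names = [name for name, _ in auto_named.steps]

# Hypothetical pipeline with the same estimator type twice:
# a numeric suffix keeps the step names unique
two_scalers = make_pipeline(StandardScaler(), StandardScaler())
duplicate_names = [name for name, _ in two_scalers.steps]
```

These generated names are the ones used later with `named_steps` and `set_params`.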

The `fit()` method of the *pipeline* calls `fit_transform()` on each transformer, passing each output to the next, and finally calls `fit()` on the last estimator. The `fit_transform()` method does the same, but calls `fit_transform()` on the last estimator, so it returns the transformed data.

In [None]:
X_train_num_prepared = num_pipeline.fit_transform(X_train_num)
X_train_num_prepared[:2].round(2)
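
As a sanity check on the chaining described above, this sketch compares the pipeline's output with the same two steps applied by hand. It uses a small made-up array instead of the housing data so that it runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with missing values in both columns
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 40.0]])

# Through the pipeline
pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
via_pipeline = pipe.fit_transform(X_toy)

# By hand: the output of the imputer feeds the scaler
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
by_hand = scaler.fit_transform(imputer.fit_transform(X_toy))

same = np.allclose(via_pipeline, by_hand)
```

Both paths should produce the same array; the pipeline simply packages the sequence into one object.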

To better visualize what the *pipeline* does, we can rebuild a dataframe with its results.

In [None]:
pd.DataFrame(X_train_num_prepared,
            columns=num_pipeline.get_feature_names_out(), # get column names after transform
            index=X_train_num.index).head(2)

In [None]:
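num_pipeline[1]  # index by position: returns the second step (the StandardScaler)

In [None]:
num_pipeline[:-1]  # slice: returns a new Pipeline with every step except the last

In [None]:
num_pipeline.named_steps["simpleimputer"]  # access a step by its name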
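
The three access styles used in the cells above can be summarized in one sketch. The variable names are illustrative, and the pipeline is rebuilt here so the example is self-contained.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

by_position = pipe[1]                 # integer index: a single estimator
by_name = pipe["standardscaler"]      # string index: the step with that name
sub_pipeline = pipe[:-1]              # slice: a new Pipeline holding a subset of steps
via_named_steps = pipe.named_steps["simpleimputer"]  # attribute-style lookup by name
```

Slicing makes a shallow copy: the sub-pipeline shares its estimator objects with the original, so fitting one affects the other.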

With the `set_params` method we can change a parameter of any estimator in the *pipeline*, using the syntax `<step name>__<parameter>` (note the double underscore).

In [None]:
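# Note: this modifies num_pipeline in place, so later cells that reuse it will impute with the mean
num_pipeline.set_params(simpleimputer__strategy="mean")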
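
Since `set_params` changes the pipeline in place, one option when experimenting is to work on a copy. The sketch below uses `sklearn.base.clone` for that; this is a suggestion, not something the notebook itself does.

```python
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# get_params() exposes nested parameters with the <step>__<parameter> syntax
before = pipe.get_params()["simpleimputer__strategy"]

# set_params() modifies the pipeline in place (and returns it)
pipe.set_params(simpleimputer__strategy="mean")
after = pipe.named_steps["simpleimputer"].strategy

# clone() builds an unfitted copy, so the original keeps its settings
variant = clone(pipe).set_params(simpleimputer__strategy="most_frequent")
original_strategy = pipe.named_steps["simpleimputer"].strategy
variant_strategy = variant.named_steps["simpleimputer"].strategy
```

The same double-underscore names are what tools such as `GridSearchCV` use to tune parameters of individual steps.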

To create a *pipeline* that preprocesses both numerical and categorical predictors, we need a transformer that applies each preprocessing pipeline to the appropriate columns. Scikit-learn provides the `ColumnTransformer` class for this.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"), # replace missing values with the most frequent category
    OneHotEncoder(handle_unknown="ignore")) # one-hot encode the categories

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs), # apply the numerical pipeline to numerical columns
    ("cat", cat_pipeline, cat_attribs)], # apply the categorical pipeline to categorical columns
    remainder="passthrough" # remaining columns are kept unchanged
)

The `remainder='passthrough'` parameter indicates that columns not selected by any transformer are appended to the output unchanged. If it is not specified, unselected columns are dropped (the default is `remainder='drop'`). Here every column is assigned to one of the two pipelines, so the setting makes no difference.

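To assign pipelines to columns based on their data type instead of listing column names, we can combine `make_column_selector` with the `make_column_transformer` function, which also names each transformer automatically.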

In [None]:
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)
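
A selector created by `make_column_selector` is a callable that takes a dataframe and returns the matching column names. The sketch below calls two selectors directly on a made-up dataframe; the text column is built with an explicit `object` dtype, which is an assumption made here so that `dtype_include=object` matches regardless of the pandas default string dtype.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical dataframe with two numeric columns and one text column
df = pd.DataFrame({
    "rooms": [3, 4],
    "income": [2.5, 3.1],
    "proximity": pd.Series(["INLAND", "NEAR BAY"], dtype=object),
})

select_numeric = make_column_selector(dtype_include=np.number)
select_object = make_column_selector(dtype_include=object)

num_cols = select_numeric(df)  # names of the numeric columns
cat_cols = select_object(df)   # names of the object-dtype columns
```

`ColumnTransformer` calls the selector during `fit`, so the column lists are resolved from the data rather than hard-coded.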

We can now use the `fit_transform()` method of the `ColumnTransformer` to transform the training data.

In [None]:
housing_prepared = preprocessing.fit_transform(X_train)

In [None]:
housing_prepared_fr = pd.DataFrame(
    housing_prepared,
    columns=preprocessing.get_feature_names_out(),
    index=X_train.index)
housing_prepared_fr.head(7).T

In [None]:
preprocessing.get_feature_names_out()

## Next steps

This notebook introduced the fundamental concepts of scikit-learn pipelines. The preprocessing pipeline is further developed in:

- [e2e051 - Custom Transformers](e2e051_custom_transformers.ipynb): `FunctionTransformer` for feature ratios and logarithmic transformations
- [e2e060 - Spatial Clustering](e2e060_spatial_clustering.ipynb): `ClusterSimilarity` transformer for geospatial features

The complete preprocessing pipeline is consolidated in [`utils/housing_preprocessing.py`](utils/housing_preprocessing.py) for reuse across model training notebooks.