# *Pipelines* in *Scikit-Learn*

## Previous steps

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

housing = pd.read_csv("./data/housing.csv")
# Stratify the split by income category so both sets keep a similar income distribution
income_cat = pd.cut(housing["median_income"],
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=income_cat, random_state=42)

X_train = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"].copy()

X_train_num = X_train.select_dtypes(include=[np.number]) # select numerical columns

## Preprocessing pipelines

A ***pipeline*** is a sequence of data processing components. The `Pipeline` class from scikit-learn allows us to create objects that represent these sequences, so we can apply them later to any dataset.

Every **estimator** except the last one must be a **transformer**. When we call the `fit` method of a `Pipeline` object, it calls the `fit_transform` method of each transformer in turn, passing each output to the next step, and finally calls the `fit` method of the last estimator. The last estimator can be of any type (transformer, classifier, regressor, etc.).
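As a minimal sketch of a pipeline whose last step is not a transformer, the example below chains two transformers with a regressor. The toy data, the `regress` step name and the choice of `LinearRegression` are assumptions made only for illustration; they are not part of the housing example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): y = 2 * x + 1, with one missing value in X
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # transformer
    ("standardize", StandardScaler()),             # transformer
    ("regress", LinearRegression()),               # final estimator (a regressor)
])

# fit() runs fit_transform on the two transformers, then fit on the regressor
model_pipeline.fit(X_toy, y_toy)

# predict() runs transform on the two transformers, then predict on the regressor
predictions = model_pipeline.predict(np.array([[6.0]]))
print(predictions)
```

Because the last step is a regressor, this pipeline exposes `predict` rather than `transform`.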

It's important to be clear about what each [type of estimator in scikit-learn](./types_estimators.md) means.

Let's build a *pipeline* that preprocesses the numerical predictors.

The constructor of the `Pipeline` class receives a list of `(name, estimator)` tuples, where the name identifies each step. Names must be unique and must not contain double underscores (`__`).

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")), # impute the median for unavailable values
    ("standardize", StandardScaler()), # standardize the values
])
num_pipeline.steps

[('impute', SimpleImputer(strategy='median')),
 ('standardize', StandardScaler())]

We can also use the `make_pipeline` function, which creates a *pipeline* like the previous one but names each step automatically after its class, in lowercase.

In [3]:
from sklearn.pipeline import make_pipeline
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
num_pipeline.steps

[('simpleimputer', SimpleImputer(strategy='median')),
 ('standardscaler', StandardScaler())]
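The automatic naming can be sketched with two small pipelines. When the same class appears more than once, `make_pipeline` appends a numeric suffix to keep the names unique. The pipelines below are illustrative only and have no use beyond showing the names.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Different classes: each step is named after its lowercased class name
unique_pipeline = make_pipeline(StandardScaler(), MinMaxScaler())
unique_names = [name for name, _ in unique_pipeline.steps]

# Same class twice: a numeric suffix is appended to disambiguate
repeated_pipeline = make_pipeline(StandardScaler(), StandardScaler())
repeated_names = [name for name, _ in repeated_pipeline.steps]

print(unique_names)
print(repeated_names)
```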

The `fit()` method of the *pipeline* calls the `fit_transform()` method of each transformer, passing the output of each one to the next, and finally calls the `fit()` method of the last estimator.
The `fit_transform()` method of the *pipeline* does the same, but ends by calling the `fit_transform()` method of the last estimator, so it returns the transformed data.


In [4]:
X_train_num_prepared = num_pipeline.fit_transform(X_train_num)
X_train_num_prepared[:2].round(2)

array([[-0.94,  1.35,  0.03,  0.58,  0.64,  0.73,  0.56, -0.89],
       [ 1.17, -1.19, -1.72,  1.26,  0.78,  0.53,  0.72,  1.29]])
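The sequential behaviour of `fit_transform()` can be checked by hand. The sketch below, on a small made-up array rather than the housing data, applies the two transformers one after the other and compares the result with the output of the equivalent pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy array (made up) with one missing value per column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 40.0]])

# Pipeline version
toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
pipeline_output = toy_pipeline.fit_transform(X_toy)

# Manual version: the output of each fit_transform feeds the next step
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
imputed = imputer.fit_transform(X_toy)
manual_output = scaler.fit_transform(imputed)

same_result = np.allclose(pipeline_output, manual_output)
print(same_result)
```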

To better visualize what the *pipeline* does, we can rebuild a dataframe with its results.

In [5]:
pd.DataFrame(X_train_num_prepared,
            columns=num_pipeline.get_feature_names_out(), # get column names after transform
            index=X_train_num.index).head(2)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 12655 | -0.94135 | 1.347438 | 0.027564 | 0.584777 | 0.640371 | 0.732602 | 0.556286 | -0.893647 |
| 15502 | 1.171782 | -1.19244 | -1.722018 | 1.261467 | 0.781561 | 0.533612 | 0.721318 | 1.292168 |


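A *pipeline* also supports indexing: an integer returns the estimator at that position, a slice returns a sub-pipeline, and the `named_steps` attribute gives access to a step by its name.

In [6]: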
num_pipeline[1]

StandardScaler()


In [7]:
num_pipeline[:-1]

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median'))])


In [8]:
num_pipeline.named_steps["simpleimputer"]

SimpleImputer(strategy='median')
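Accessing a step is mostly useful after fitting, to inspect what it has learned. The following sketch, on a made-up array, reads the medians stored by the imputer and shows that access by position and access by name return the same object.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy array (made up): one missing value in the first column
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 60.0]])

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_pipeline.fit(X_toy)

# Fitted attribute of the imputer: the median of each column
learned_medians = toy_pipeline.named_steps["simpleimputer"].statistics_

# Indexing by position or by name returns the same estimator object
same_step = toy_pipeline[0] is toy_pipeline["simpleimputer"]

print(learned_medians, same_step)
```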


With the `set_params` method we can change the value of a parameter of an estimator.

In [9]:
num_pipeline.set_params(simpleimputer__strategy="mean")

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='mean')),
                ('standardscaler', StandardScaler())])
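The parameter names accepted by `set_params` follow the pattern `<step name>__<parameter>`. A minimal sketch that lists them with `get_params()` and changes two at once (the choice of parameters is only illustrative):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Nested parameters are exposed as "<step name>__<parameter>"
nested_params = sorted(name for name in toy_pipeline.get_params() if "__" in name)
print(nested_params)

# Several parameters, from different steps, can be set in one call
toy_pipeline.set_params(simpleimputer__strategy="mean",
                        standardscaler__with_mean=False)
new_strategy = toy_pipeline.named_steps["simpleimputer"].strategy
print(new_strategy)
```

The same naming pattern is what hyperparameter search tools in scikit-learn expect when tuning the steps of a pipeline.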


To create a *pipeline* that preprocesses both numerical and categorical predictors, we need a transformer that selects the columns we want to transform. Scikit-learn provides the `ColumnTransformer` class for this.

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"), # impute the mode for unavailable values
    OneHotEncoder(handle_unknown="ignore")) # encode the categories

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs), # apply the numerical pipeline to numerical columns
    ("cat", cat_pipeline, cat_attribs)], # apply the categorical pipeline to categorical columns
    remainder="passthrough" # remaining columns are kept unchanged
)

The `remainder='passthrough'` parameter means that columns not selected by any transformer are appended unchanged to the output. If it is not specified, unselected columns are dropped (the default is `remainder='drop'`). Here every column of `X_train` is already handled by one of the two pipelines, so the setting makes no difference.

To assign pipelines to columns by data type instead of listing them by name, we can pass `make_column_selector` objects to the `make_column_transformer` function, which also names each transformer automatically.

In [11]:
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)
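A column selector is a callable that receives a dataframe and returns the names of the matching columns. The sketch below calls two selectors directly on a made-up dataframe; the explicit `dtype=object` is an assumption of the example, used so that the result does not depend on how pandas infers the type of string columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy dataframe (made up) with two numerical columns and one object column
toy_df = pd.DataFrame({
    "income": [1.5, 2.5],
    "rooms": [3, 4],
    "ocean_proximity": pd.Series(["INLAND", "NEAR BAY"], dtype=object),
})

# Each selector returns the list of column names with a matching dtype
numeric_columns = make_column_selector(dtype_include=np.number)(toy_df)
object_columns = make_column_selector(dtype_include=object)(toy_df)

print(numeric_columns)
print(object_columns)
```

Inside a `ColumnTransformer`, the selector is called on the data passed to `fit`, so the column lists do not have to be written out.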

We can now use the `fit_transform()` method of the `ColumnTransformer` to transform the training data.

In [12]:
housing_prepared = preprocessing.fit_transform(X_train)

In [13]:
housing_prepared_fr = pd.DataFrame(
    housing_prepared,
    columns=preprocessing.get_feature_names_out(),
    index=X_train.index)
housing_prepared_fr.head(7).T

| | 12655 | 15502 | 2908 | 14053 | 20496 | 1481 | 18125 |
|---|---|---|---|---|---|---|---|
| pipeline-1__longitude | -0.94135 | 1.171782 | 0.267581 | 1.221738 | 0.437431 | -1.231094 | -1.226099 |
| pipeline-1__latitude | 1.347438 | -1.19244 | -0.125972 | -1.351474 | -0.635818 | 1.085499 | 0.790817 |
| pipeline-1__housing_median_age | 0.027564 | -1.722018 | 1.22046 | -0.370069 | -0.131489 | -0.051963 | -0.449595 |
| pipeline-1__total_rooms | 0.584777 | 1.261467 | -0.469773 | -0.348652 | 0.427179 | -0.661977 | 0.74752 |
| pipeline-1__total_bedrooms | 0.638183 | 0.779415 | -0.547672 | -0.038752 | 0.270495 | -0.688903 | 0.331371 |
| pipeline-1__population | 0.732602 | 0.533612 | -0.674675 | -0.467617 | 0.37406 | -0.623583 | 0.324761 |
| pipeline-1__households | 0.556286 | 0.721318 | -0.524407 | -0.037297 | 0.220898 | -0.652174 | 0.383269 |
| pipeline-1__median_income | -0.893647 | 1.292168 | -0.525434 | -0.865929 | 0.325752 | -0.094224 | 1.895358 |
| pipeline-2__ocean_proximity_<1H OCEAN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| pipeline-2__ocean_proximity_INLAND | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [14]:
preprocessing.get_feature_names_out()

array(['pipeline-1__longitude', 'pipeline-1__latitude',
       'pipeline-1__housing_median_age', 'pipeline-1__total_rooms',
       'pipeline-1__total_bedrooms', 'pipeline-1__population',
       'pipeline-1__households', 'pipeline-1__median_income',
       'pipeline-2__ocean_proximity_<1H OCEAN',
       'pipeline-2__ocean_proximity_INLAND',
       'pipeline-2__ocean_proximity_ISLAND',
       'pipeline-2__ocean_proximity_NEAR BAY',
       'pipeline-2__ocean_proximity_NEAR OCEAN'], dtype=object)