# ML @ Scikit-Learn / 20241001

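!!! 🐍 Due to the copyright of the original work, these study notes are for personal study and exchange only and may not be used for any commercial or broadcast purpose.  
!!! 🐍 Reference: --> https://scikit-learn.org/stable/index.html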

## §6.1. - [Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html) / 20241015

To build a composite estimator, transformers are usually combined with other transformers or with `predictors` (such as classifiers or regressors). The most common tool for composing estimators is a `Pipeline`. A pipeline requires every step except the last to be a `transformer`. The last step can be anything: a transformer, a predictor, or a clustering estimator, which may or may not have a `.predict(...)` method. A pipeline exposes all methods provided by the last estimator: if the last step provides a `transform` method, the pipeline has a `transform` method and behaves like a transformer. If the last step provides a `predict` method, the pipeline exposes that method: given data `X`, it uses all steps except the last to transform the data and then passes the transformed data to the `predict` method of the last step. `Pipeline` is often used in combination with `ColumnTransformer` or `FeatureUnion`, which concatenate the output of transformers into a composite feature space. `TransformedTargetRegressor` deals with transforming the `target` (e.g. log-transforming `y`).
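As a minimal sketch of the target transformation mentioned above, `TransformedTargetRegressor` can fit a linear model on `log(y)` and map the predictions back with `exp`. The synthetic exponential data and the variable names are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y grows exponentially with x,
# so log(y) is exactly linear in x
X = np.linspace(0, 2, 20).reshape(-1, 1)
y = np.exp(0.5 * X.ravel() + 1.0)

# The regressor is fitted on log(y); predictions are mapped back through exp
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)

pred = model.predict(X)
print(np.allclose(pred, y))
```

Because the transformation lives inside the estimator, the caller works with `y` on its original scale for both `fit` and `predict`.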

### 6.1.1. Pipeline: chaining estimators

`Pipeline` can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. `Pipeline` serves multiple purposes here:

**Convenience and encapsulation**<br>
You only have to call `fit` and `predict` once on your data to fit a whole sequence of estimators.

**Joint parameter selection**<br>
You can `grid search` over parameters of all estimators in the pipeline at once.

**Safety**<br>
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that `the same samples` are used to train the transformers and predictors.
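
The safety point can be sketched with cross-validation. This is only an illustration under assumed choices (the iris data, `StandardScaler`, and `SVC` are picked here for brevity): because the scaler is a pipeline step, it is re-fitted on the training part of every fold, so the held-out part never contributes to the scaling statistics.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The whole pipeline is cloned and fitted per fold, so the scaler only
# ever sees the training samples of that fold
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)

print(scores.shape)  # one accuracy score per fold
```

Scaling `X` once up front and then cross-validating only the classifier would, by contrast, let the mean and variance of the test folds leak into training.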

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).

**<blockquote>⚠️Note:**

Calling `fit` on the pipeline is the same as calling `fit` on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the `Pipeline` can be used as a classifier. If the last estimator is a transformer, then so is the pipeline.</blockquote>
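
The note above can be checked with a small sketch that fits the same steps once through a `Pipeline` and once by hand. The step names, the iris data, and the chosen estimators are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline version: a single fit call handles every step in order
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Manual version: fit each transformer, transform, pass the result on
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
selector = SelectKBest(k=2).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
clf = LogisticRegression(max_iter=1000).fit(X_selected, y)

# Both routes should produce the same predictions
same = np.array_equal(pipe.predict(X), clf.predict(X_selected))
print(same)
```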

#### 6.1.1.1. Usage

##### 6.1.1.1.1. Build a pipeline

The [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) is built using a list of `(key, value)` pairs, where the `key` is a string containing the name you want to give this step and `value` is an estimator object:

In [17]:
# build a pipeline
# d06_1_2_build_pipeline.py

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# every step except the last must be a transformer, so the classifier goes last
estimators = [('reduce_dim', PCA()), ('feature_selection', SelectKBest()), ('clf', SVC())]

pipe = Pipeline(estimators)
pipe

**Shorthand version using** [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline)<br><br>
The utility function `make_pipeline` is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

In [19]:
# make pipeline
# d06_1_1_make_pipeline.py

from sklearn.pipeline import make_pipeline

# the classifier goes last; the transformers come before it
make_pipeline(PCA(), SelectKBest(), SVC())
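
To see which names `make_pipeline` fills in, here is a short sketch (the choice of estimators is an assumption for illustration). The generated name is the lowercased class name, and repeated estimator types get a numeric suffix.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Names are derived from the lowercased class names
pipe = make_pipeline(StandardScaler(), PCA(), SVC())
names = [name for name, _ in pipe.steps]
print(names)  # ['standardscaler', 'pca', 'svc']

# Two steps of the same type are disambiguated with a suffix
dup = make_pipeline(StandardScaler(), StandardScaler(), SVC())
dup_names = [name for name, _ in dup.steps]
print(dup_names)  # ['standardscaler-1', 'standardscaler-2', 'svc']
```

These generated names are the ones to use in the `<estimator>__<parameter>` syntax, for example `svc__C`.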

##### 6.1.1.1.2. Access pipeline steps

The estimators of a pipeline are stored as a list in the `steps` attribute. A sub-pipeline can be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

In [20]:
# access pipeline steps
# d06_1_3_access_pipeline_steps.py

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# every step except the last must be a transformer, so the classifier goes last
estimators = [('reduce_dim', PCA()), ('feature_selection', SelectKBest()), ('clf', SVC())]
pipe = Pipeline(estimators)

# []: access the steps by slicing
a = pipe[:2]
b = pipe[-2:]

print(a)
print(b)

Pipeline(steps=[('reduce_dim', PCA()), ('feature_selection', SelectKBest())])
Pipeline(steps=[('feature_selection', SelectKBest()), ('clf', SVC())])


**<blockquote>Accessing a step by name or position</blockquote>**<br>
&emsp;A specific step can also be accessed by index or name by indexing (with `[idx]`) the pipeline:

In [30]:
# access pipeline steps
# d06_1_3_access_pipeline_steps.py

# pipe.steps[idx] returns the (name, estimator) tuple
c = pipe.steps[0]
# pipe[idx] and pipe[name] return the estimator itself
d = pipe[0]
e = pipe['clf']

print(c)
print(d)
print(e)

# access pipeline steps
steps_slice = pipe.steps[0:3]

for name, step in steps_slice:
    print(f"Step name --> object:\t {name} --> {step}")

('reduce_dim', PCA())
PCA()
SVC()
Step name --> object:	 reduce_dim --> PCA()
Step name --> object:	 feature_selection --> SelectKBest()
Step name --> object:	 clf --> SVC()


`Pipeline`’s `named_steps` attribute allows accessing steps by name with tab completion in interactive environments:

In [7]:
pipe.named_steps.reduce_dim is pipe['reduce_dim']

True
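
Slicing is most useful on a fitted pipeline. The sketch below, using assumed step names and the iris data, applies only the transformer steps and then inverts them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce_dim", PCA(n_components=2)),
    ("clf", SVC()),
]).fit(X, y)

# pipe[:-1] is a Pipeline holding the already-fitted transformers only
X_reduced = pipe[:-1].transform(X)
print(X_reduced.shape)  # (150, 2)

# inverse_transform runs the transformers backwards, back to the input space
X_back = pipe[:-1].inverse_transform(X_reduced)
print(X_back.shape)  # (150, 4)
```

Since PCA kept only two of four components, `X_back` is an approximation of `X`, not an exact copy.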

##### 6.1.1.1.3. Tracking feature names in a pipeline

To enable model inspection, `Pipeline` has a `get_feature_names_out()` method, just like all transformers. You can use pipeline slicing to get the feature names going into each step:

In [51]:
# track feature names
# d06_1_4_feature_names_tracking.py

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipe = Pipeline(steps=[
    ('reduce_dim', PCA(n_components=2)),
    ('select', SelectKBest(k=2)),
    ('clf', LogisticRegression())])
print(pipe)

pipe.fit(iris.data, iris.target)

# feature names for steps with get_feature_names_out
for name, step in pipe.steps:
    if hasattr(step, 'get_feature_names_out'):
        print(f"Step name --> object + feature_names:\t {name} --> {step} + {step.get_feature_names_out()}")
    else:
        print(f"Step name --> object:\t {name} --> {step}")

a = pipe[0].get_feature_names_out()
b = pipe[1].get_feature_names_out()
print(a)
print(b)

Pipeline(steps=[('reduce_dim', PCA(n_components=2)),
                ('select', SelectKBest(k=2)), ('clf', LogisticRegression())])
Step name --> object + feature_names:	 reduce_dim --> PCA(n_components=2) + ['pca0' 'pca1']
Step name --> object + feature_names:	 select --> SelectKBest(k=2) + ['x0' 'x1']
Step name --> object:	 clf --> LogisticRegression()
['pca0' 'pca1']
['x0' 'x1']


**<blockquote>Customize feature names</blockquote>**<br>
&emsp;You can also provide custom feature names for the input data by passing them to `get_feature_names_out`:

In [58]:
# track feature names
# d06_1_5_feature_names_customize.py

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

iris = load_iris()
pipe = Pipeline(steps=[
    ('select', SelectKBest(k=2)),
    ('clf', LogisticRegression())])
print(pipe)

pipe.fit(iris.data, iris.target)

# customize feature names
pipe[:-1].get_feature_names_out(iris.feature_names)

Pipeline(steps=[('select', SelectKBest(k=2)), ('clf', LogisticRegression())])


array(['petal length (cm)', 'petal width (cm)'], dtype=object)

##### 6.1.1.1.4. Access to nested parameters

It is common to adjust the parameters of an estimator within a pipeline. This parameter is therefore nested because it belongs to a particular sub-step. Parameters of the estimators in the pipeline are accessible using the `<estimator>__<parameter>` syntax:

In [61]:
# access to nested parameters
# d06_1_6_nested_parameters_access.py

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])

# parameter C of the clf estimator
print(pipe.set_params(clf__C=10))

Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])
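
The same nested names can be read back through `get_params()`. A minimal sketch, reusing the step names from the cell above:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])
pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# get_params() returns a flat dict keyed by <estimator>__<parameter>
params = pipe.get_params()
print(params["clf__C"])                    # 10
print(params["reduce_dim__n_components"])  # 2

# The change is applied to the step object itself
print(pipe.named_steps["clf"].C)           # 10
```

Listing `pipe.get_params().keys()` is a quick way to find the exact keys that a grid search will accept.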


**<blockquote>When does it matter?</blockquote>**<br>
&emsp;This is particularly important for doing grid searches:

In [62]:
# access to nested parameters
# d06_1_7_nested_parameters_gridsearchcv.py

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])

param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                  clf__C=[0.1, 10, 100])

grid_search = GridSearchCV(pipe, param_grid=param_grid)
print(grid_search)

GridSearchCV(estimator=Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())]),
             param_grid={'clf__C': [0.1, 10, 100],
                         'reduce_dim__n_components': [2, 5, 10]})


Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to `'passthrough'`:

In [63]:
# access to nested parameters
# d06_1_8_nested_parameters_passthrough.py

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])

param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
print(grid_search)

GridSearchCV(estimator=Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())]),
             param_grid={'clf': [SVC(), LogisticRegression()],
                         'clf__C': [0.1, 10, 100],
                         'reduce_dim': ['passthrough', PCA(n_components=5),
                                        PCA(n_components=10)]})
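
The cell above only constructs the search. As a hedged sketch of actually running it, the variant below fits a similar grid on the iris data. The dataset, the smaller PCA sizes (iris has only 4 features), `cv=3`, and `max_iter=1000` are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline(steps=[("reduce_dim", PCA()), ("clf", SVC())])

# Whole steps and nested parameters can be searched together;
# both candidate classifiers accept a C parameter
param_grid = dict(
    reduce_dim=["passthrough", PCA(2), PCA(3)],
    clf=[SVC(), LogisticRegression(max_iter=1000)],
    clf__C=[0.1, 10, 100],
)
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_["params"])
print(n_candidates)                     # 18 (3 * 2 * 3 combinations)
print(sorted(grid_search.best_params_))  # ['clf', 'clf__C', 'reduce_dim']
```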


**<blockquote>See also</blockquote>**
- [Composite estimators and parameter spaces](https://scikit-learn.org/stable/modules/grid_search.html#composite-grid-search)

**Examples**

- [Pipeline ANOVA SVM](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection_pipeline.html#sphx-glr-auto-examples-feature-selection-plot-feature-selection-pipeline-py)

- [Sample pipeline for text feature extraction and evaluation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-plot-grid-search-text-feature-extraction-py)

- [Pipelining: chaining a PCA and a logistic regression](https://scikit-learn.org/stable/auto_examples/compose/plot_digits_pipe.html#sphx-glr-auto-examples-compose-plot-digits-pipe-py)

- [Explicit feature map approximation for RBF kernels](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_approximation.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-approximation-py)

- [SVM-Anova: SVM with univariate feature selection](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py)

- [Selecting dimensionality reduction with Pipeline and GridSearchCV](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py)

- [Displaying Pipelines](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html#sphx-glr-auto-examples-miscellaneous-plot-pipeline-display-py)

#### 6.1.1.2. Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its `memory` parameter set, `Pipeline` caches each transformer after calling `fit`. This avoids refitting the transformers within a pipeline when the parameters and input data are identical. A typical example is a grid search, in which the transformers can be fitted only once and reused for each configuration. The last step is never cached, even if it is a transformer.

The `memory` parameter is required in order to cache the transformers. `memory` can be either a string containing the directory in which to cache the transformers or a [joblib.Memory](https://joblib.readthedocs.io/en/latest/memory.html) object:

In [64]:
# Example of pipeline with PCA and SVC
# d06_1_9_pipeline_memory.py

from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

estimators = [('reduce_dim', PCA()), ('clf', SVC())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)

print(pipe)
print(cachedir)

# Clear the cache directory when you don't need it anymore
rmtree(cachedir)

Pipeline(memory='C:\\Users\\fuy3pk\\AppData\\Local\\Temp\\tmpy3r8h31q',
         steps=[('reduce_dim', PCA()), ('clf', SVC())])
C:\Users\fuy3pk\AppData\Local\Temp\tmpy3r8h31q


**<blockquote>Side effect of caching transformers</blockquote>**<br>
Using a `Pipeline` without cache enabled, it is possible to inspect the original instance such as:

In [11]:
# Example of non-cached pipeline with PCA and SVC
# d06_1_10_non_cached_pipeline.py

from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)
pca1 = PCA(n_components=10)
svm1 = SVC()

# Without caching, parameter memory=None
pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])

pipe_fit = pipe.fit(X_digits, y_digits)
print(pipe_fit)

# The pca instance can be inspected directly
pca1.components_.shape

Pipeline(steps=[('reduce_dim', PCA(n_components=10)), ('clf', SVC())])


(10, 64)

Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing `pca2.components_` would raise an `AttributeError`, since `pca2` remains an unfitted transformer. Instead, use the attribute `named_steps` to inspect estimators within the pipeline:

In [12]:
# Example of cached pipeline with PCA and SVC
# d06_1_11_cached_pipeline.py

from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_digits

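# load the data here so the script also runs on its own
X_digits, y_digits = load_digits(return_X_y=True)
cachedir = mkdtemp()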
pca2 = PCA(n_components=10)
svm2 = SVC()

# with caching, parameter memory=cachedir
cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
                       memory=cachedir)
cached_pipe_fit = cached_pipe.fit(X_digits, y_digits)
print(cached_pipe_fit)

# pca2 itself stays unfitted, so the next line would raise AttributeError
# pca2.components_.shape

# .named_steps to access the fitted transformers of the cached pipeline
cached_pipe.named_steps['reduce_dim'].components_.shape

# Remove the cache directory
# rmtree(cachedir)

Pipeline(memory='C:\\Users\\fuy3pk\\AppData\\Local\\Temp\\tmp5aerqkno',
         steps=[('reduce_dim', PCA(n_components=10)), ('clf', SVC())])


(10, 64)
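
The clone side effect can be verified without triggering the `AttributeError`. A minimal sketch (the variable names are illustrative) that checks for the fitted attribute with `hasattr` and cleans up the temporary cache afterwards:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)
cachedir = mkdtemp()
pca = PCA(n_components=10)
try:
    cached_pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())],
                           memory=cachedir)
    cached_pipe.fit(X_digits, y_digits)

    # The instance passed in was cloned, so it has no fitted attributes
    original_is_fitted = hasattr(pca, "components_")
    # The fitted clone is the one stored inside the pipeline
    fitted_step = cached_pipe.named_steps["reduce_dim"]
    step_is_fitted = hasattr(fitted_step, "components_")
    same_object = fitted_step is pca
finally:
    rmtree(cachedir)  # remove the temporary cache directory

print(original_is_fitted, step_is_fitted, same_object)  # False True False
```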

**Examples**

- [Selecting dimensionality reduction with Pipeline and GridSearchCV](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py)