# Pipeline

This notebook takes a closer look at pipelines, which represent a sequence of processing steps as a single object. Pipelines are provided by the `pipeline` module of the `scikit-learn` library.

In [2]:
from sklearn import pipeline

In [3]:
from sklearn import datasets
from sklearn import linear_model
from sklearn import model_selection
from sklearn import preprocessing

In the previous lab session we saw that some steps have to be repeated when building a model. For example, we compute the standardization statistics on the training set and then apply them to the test set by calling the `transform` function. The same holds for other attribute preparation steps: the attributes are selected on the training set, and then both the training set and the test set are reduced to those attributes.

A pipeline is a mechanism that defines a series of steps to be carried out together, one after the other. Every step except the last must provide `fit` and `transform` functions; the last step only needs `fit` and is typically the model itself. A pipeline exposes the same `fit`/`transform`/`predict` interface as its steps, so we can fit the whole pipeline on the training data and then apply it to the test data without handling each step individually.
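
As a rough illustration of chaining several preparation steps, the sketch below builds a three-step pipeline on the breast cancer dataset that the rest of this notebook works with. The use of `SelectKBest` with `f_classif`, the choice of `k=10`, and the name `selection_pipeline` are assumptions made for this example only.

```python
from sklearn import datasets
from sklearn import feature_selection
from sklearn import linear_model
from sklearn import model_selection
from sklearn import pipeline
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

# Hypothetical three-step pipeline: scale, keep 10 attributes, classify.
selection_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    feature_selection.SelectKBest(feature_selection.f_classif, k=10),
    linear_model.LogisticRegression(),
)

# Scaling statistics and the selected attributes are learned on the training set only.
selection_pipeline.fit(X_train, y_train)

# The learned scaling and attribute selection are reused on the test set.
test_accuracy = selection_pipeline.score(X_test, y_test)
print(test_accuracy)

# All steps except the last can be applied on their own by slicing the pipeline.
X_test_prepared = selection_pipeline[:-1].transform(X_test)
print(X_test_prepared.shape)
```

Because the whole chain is fitted with a single `fit` call, there is no opportunity to fit the scaler or the selector on the test set by accident.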

To demonstrate how to work with pipelines, we will load the dataset used to classify tumors as benign or malignant and split it into a training set and a test set.

In [4]:
data = datasets.load_breast_cancer()

In [5]:
X = data.data
y = data.target

In [16]:
print(X.shape)
print(y.shape)

(569, 30)
(569,)


In [6]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.33, stratify = y, random_state = 10)

We will further prepare the data by standardizing it.

In [7]:
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

Using this data, we can train a classification model of our choice. Let it be a logistic regression model with regularization strength $2$ (the parameter `C` is the reciprocal of the regularization strength).

In [8]:
model = linear_model.LogisticRegression(C = 1/2)

In [9]:
model.fit(X_train_transformed, y_train)

We evaluate the model either by calling the `score` function, which returns the accuracy for classifiers, or by calling the `predict` function and passing the predictions to an appropriate measure from the `sklearn.metrics` module.

In [10]:
model.score(X_test_transformed, y_test)

0.973404255319149
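
The second evaluation route mentioned above, `predict` followed by a measure from `sklearn.metrics`, could look roughly like the following sketch. It repeats the manual preparation steps so that it runs on its own; reporting the F1 score next to accuracy is an assumption made for illustration.

```python
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

# Manual preparation: fit the scaler on the training set, transform both sets.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

model = linear_model.LogisticRegression(C=1/2)
model.fit(X_train_transformed, y_train)

# Predict the labels first, then choose the measures explicitly.
y_pred = model.predict(X_test_transformed)
accuracy = metrics.accuracy_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print(accuracy, f1)
```

For a classifier, `accuracy_score` applied to the predictions gives the same value as `score`.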

The code that follows uses pipelines and is equivalent to the manual scaling and fitting steps above. The `make_pipeline` function creates a simple pipeline, which in our case consists of a standard scaler and the model itself. When `fit` is called on a pipeline, each intermediate step is fitted and used to transform the data in turn, and the last step is then fitted on the transformed data. Calling `score` on a pipeline applies the `transform` function of every step except the last, and then calls the `score` function of the last step on the result. The `transform` calls use the values learned during `fit`. This makes the code more compact and removes the risk of omitting a step or applying it to the wrong set.

In [17]:
linreg_pipeline = pipeline.make_pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression(C=1/2))

In [18]:
linreg_pipeline.fit(X_train, y_train)

In [19]:
linreg_pipeline.score(X_test, y_test)

0.973404255319149
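
One way to check that the pipeline really does the same work as the manual steps is to compare the two side by side. The sketch below is self-contained; the variable names `manual_model`, `pipe`, and `step_by_step_score` are chosen for this example only.

```python
import numpy as np
from sklearn import datasets
from sklearn import linear_model
from sklearn import model_selection
from sklearn import pipeline
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

# Manual route: scale, then fit and predict with the model.
scaler = preprocessing.StandardScaler().fit(X_train)
manual_model = linear_model.LogisticRegression(C=1/2)
manual_model.fit(scaler.transform(X_train), y_train)
manual_predictions = manual_model.predict(scaler.transform(X_test))

# Pipeline route: the same two steps wrapped in one object.
pipe = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(C=1/2),
)
pipe.fit(X_train, y_train)
pipeline_predictions = pipe.predict(X_test)

# Compare the predictions of the two routes.
print(np.array_equal(manual_predictions, pipeline_predictions))

# Spell out what score does: transform with all but the last step, score with the last.
step_by_step_score = pipe[-1].score(pipe[:-1].transform(X_test), y_test)
pipeline_score = pipe.score(X_test, y_test)
print(step_by_step_score, pipeline_score)
```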

In addition to the `make_pipeline` function, the `Pipeline` class can be used directly, which gives slightly more control over the pipeline. It takes a list of (name, step) pairs when the pipeline is created, which is convenient for accessing individual steps in order to adjust their hyperparameters or read the learned values.

In [20]:
linreg_pipeline = pipeline.Pipeline(steps=[('scaler', preprocessing.StandardScaler()), ('linreg', linear_model.LogisticRegression())])

In [21]:
linreg_pipeline.named_steps['scaler']
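
Pipelines built with `make_pipeline` have names as well: each step is named after its class, in lowercase. The short sketch below contrasts the two ways of naming steps; the variable names `auto_named` and `explicit` are illustrative.

```python
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

# make_pipeline derives the step names from the class names.
auto_named = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(),
)
auto_names = list(auto_named.named_steps)
print(auto_names)

# Pipeline lets us choose the names ourselves.
explicit = pipeline.Pipeline(steps=[
    ('scaler', preprocessing.StandardScaler()),
    ('linreg', linear_model.LogisticRegression()),
])
explicit_names = list(explicit.named_steps)
print(explicit_names)

# A step can be reached through named_steps, by name, or by position.
same_object = explicit.named_steps['scaler'] is explicit['scaler'] is explicit[0]
print(same_object)
```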

With the `set_params` method we can adjust the parameters of any step of the pipeline. A parameter is addressed by the name of the pipeline step, followed by a double underscore and then the parameter name itself. To set several parameters at once, pass them as separate keyword arguments.

In [22]:
linreg_pipeline.set_params(linreg__C=2)
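
Setting several parameters in one call, possibly on different steps, might look like the sketch below. The specific values (`with_mean=False`, `max_iter=500`) are arbitrary choices for illustration, not recommendations.

```python
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

pipe = pipeline.Pipeline(steps=[
    ('scaler', preprocessing.StandardScaler()),
    ('linreg', linear_model.LogisticRegression()),
])

# Each keyword has the form <step name>__<parameter name>.
pipe.set_params(scaler__with_mean=False, linreg__C=2, linreg__max_iter=500)

# The values are stored on the steps themselves.
print(pipe['scaler'].with_mean)
print(pipe['linreg'].C, pipe['linreg'].max_iter)

# get_params uses the same naming scheme for reading values back.
params = pipe.get_params()
print(params['linreg__C'])
```

The same `step__parameter` naming is what `get_params` reports, so it can be used to look up which names are available.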

The `fit` function works as described.

In [23]:
linreg_pipeline.fit(X_train, y_train)

Individual steps of the pipeline can be retrieved by the names given when the pipeline was created. For example, the next block of code obtains the learned coefficients of the logistic regression model.

In [24]:
linreg_pipeline['linreg'].coef_

array([[-0.79251673, -0.46364269, -0.74855091, -0.86833764, -0.30646894,
         0.58060554, -1.1264342 , -0.97423125,  0.46512597,  0.38258632,
        -1.44598332,  0.51297567, -0.94635518, -1.12432812, -0.41213252,
         0.81943926,  0.03968061, -0.0242161 , -0.29270501,  0.88414575,
        -1.29967288, -1.49391187, -1.09956167, -1.23298801, -1.36898131,
         0.26682155, -1.01853841, -0.47128969, -0.80195573, -0.80468404]])

In [25]:
linreg_pipeline['linreg'].intercept_

array([-0.18620725])

The `score` method works as described.

In [26]:
linreg_pipeline.score(X_test, y_test)

0.973404255319149

The same holds for the `predict` method.

In [27]:
linreg_pipeline.predict([X_test[0,:]])

array([1])

When a pipeline consists of a large number of steps, it is convenient to display it graphically. This can be achieved by setting the library-wide `display` option of `scikit-learn` to `'diagram'`.

In [28]:
from sklearn import set_config

In [29]:
set_config(display='diagram')

In [30]:
linreg_pipeline