The `sklearn.pipeline` module provides utilities for building a composite estimator as a chain of transformers and estimators.

### What is a Pipeline in scikit-learn?

- At its core, a `Pipeline` in sklearn is a sequential list of named steps. Every step except the last must be a "transformer" (an object that implements `fit` and `transform`), and the last step can be either a transformer or an "estimator" (like a model).
- When you call `fit` on a Pipeline, it calls `fit_transform` on each transformer in order, passing each output on to the next step, and then calls `fit` on the final estimator.
- When you call `predict`, it calls `transform` on each transformer in order and then `predict` on the final estimator.
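
To make the `fit` and `predict` behaviour described above concrete, here is a minimal sketch that runs the same three steps once through a `Pipeline` and once by hand. The synthetic data and the variable names (`pipe`, `manual_predictions`, and so on) are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (illustrative values, not from the notebook)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::7, 2] = np.nan  # add a few missing values so the imputer has work to do

# Version 1: let the Pipeline chain the steps
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X, y)
pipeline_predictions = pipe.predict(X)

# Version 2: the same steps written out by hand
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
classifier = LogisticRegression()

# fit: fit_transform on each transformer, then fit on the final estimator
X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)
classifier.fit(X_scaled, y)

# predict: transform with each fitted transformer, then predict
manual_predictions = classifier.predict(scaler.transform(imputer.transform(X)))

# Both versions should agree on every prediction
print(np.array_equal(pipeline_predictions, manual_predictions))
```

The hand-written version needs three objects and the correct call order at both fit and predict time; the pipeline keeps that bookkeeping in one place.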

### Why are Pipelines useful?

- Convenience and Conciseness: They allow you to define a complete machine learning workflow in a single object. This makes your code cleaner and easier to read.
- Data Leakage Prevention: This is perhaps the most crucial benefit. When you fit operations like scaling or imputation on the full dataset before splitting it into training and testing sets, information from the test set can "leak" into the training set, leading to overly optimistic performance estimates. A pipeline that is fit only on the training data learns its transformation parameters from that data alone and then applies them unchanged to new data (like the test set or future predictions). Inside cross-validation, the whole pipeline is refit on each training fold, which prevents this leakage.
- Reproducibility: A well-defined pipeline makes it easy to reproduce your entire workflow, from preprocessing to model training, with new data.
- Hyperparameter Tuning: You can easily tune hyperparameters for any step within the pipeline (e.g., the imputation strategy, the type of encoder, or the model's parameters) using GridSearchCV or RandomizedSearchCV. Each parameter is addressed as `<step name>__<parameter name>`.
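
As a rough illustration of the tuning and leakage points above, the sketch below searches over one preprocessing parameter and one model parameter at the same time. The synthetic data, the step names, and the grid values are assumptions chosen for the example, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): the label depends on age
rng = np.random.default_rng(42)
n_samples = 60
age = rng.uniform(20, 60, size=n_samples)
salary = rng.normal(55000, 10000, size=n_samples)
y = (age > np.median(age)).astype(int)  # balanced classes by construction

age_with_gaps = age.copy()
age_with_gaps[::10] = np.nan  # simulate missing values
X = pd.DataFrame({"age": age_with_gaps, "salary": salary})

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "impute__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0],
}

# In every CV split the whole pipeline is refit on the training folds only,
# so the imputer and scaler never see the held-out fold while fitting.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the preprocessing lives inside the estimator being searched, no separate "fit the scaler first" step exists that could accidentally see validation data.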

```python
import pandas as pd
import numpy as np
```


```python
# Toy dataset with missing values in both numerical columns
data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'salary': [50000, 60000, 45000, 70000, np.nan],
    'city': ['A', 'B', 'A', 'C', 'B'],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'target': [0, 1, 0, 1, 0]
})

X = data.drop("target", axis=1)
y = data["target"]
```

# Basic Pipeline

```python
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# numerical columns: fill missing values with the mean, then standardize
numerical_transformers = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# categorical columns: fill missing values with the mode, then encode as integers
categorical_transformers = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder()),
])

# combine transformers using column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformers, ["age", "salary"]),
        ("categorical", categorical_transformers, ["city", "gender"]),
    ],
    remainder="drop",  # this drops all the remaining columns
)

# then create a full pipeline
model_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("classifier", LogisticRegression(random_state=42)),
])

# fitting the pipeline to our data
model_pipeline.fit(X, y)

# make predictions
predictions = model_pipeline.predict(X)
predictions
```

array([0, 1, 0, 1, 0])

### Types of Pipelines
- While sklearn primarily offers one Pipeline class, we can conceptualize "types" based on their purpose or the stage of the workflow they represent:

### Preprocessing Pipelines (or Feature Engineering Pipelines):

- Purpose: These pipelines focus solely on data transformation. They typically chain together various transformers like imputers, scalers, encoders, feature selectors, dimensionality reduction techniques (e.g., PCA), or custom transformers.
- Last Step: The last step is usually a transformer, not an estimator.
- Use Case: Often used as a step within a larger ColumnTransformer or as the first step in a complete machine learning pipeline, as shown in the `numerical_transformers` and `categorical_transformers` examples above.
- You might fit such a pipeline on training data and then transform both training and test data.

```python
# Example of a preprocessing pipeline
feature_engineering_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    # ('pca', PCA(n_components=2)) # Could add dimensionality reduction
])
```
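
The "fit on training data, then transform both training and test data" idea from the list above might look like the following sketch. The synthetic array and the split parameters are assumptions for illustration; the pipeline mirrors the preprocessing example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numerical data with some missing values (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X[::9, 0] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

feature_engineering_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Learn imputation means and scaling statistics from the training split only
X_train_transformed = feature_engineering_pipeline.fit_transform(X_train)

# Reuse those learned statistics on the test split (no refitting)
X_test_transformed = feature_engineering_pipeline.transform(X_test)

# The imputer's learned means match the training-split means, not the full data
train_means = np.nanmean(X_train, axis=0)
learned_means = feature_engineering_pipeline.named_steps["imputer"].statistics_
print(np.allclose(train_means, learned_means))
```

Calling `fit_transform` on the test split instead of `transform` would relearn the statistics from test data, which is exactly the leakage a pipeline is meant to help avoid.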
### Full Machine Learning Pipelines:
- Purpose: These are comprehensive pipelines that include all steps from raw data transformation to the final model prediction.
- Last Step: The last step is always an estimator (a machine learning model like LogisticRegression, RandomForestClassifier, etc.).
- Use Case: This is the most common and recommended way to build your end-to-end machine learning solution. It's what you'll primarily use for training, evaluation, and deployment. The model_pipeline example above is a perfect illustration of this type.

```python
# Example of a full machine learning pipeline (similar to the one above)
from sklearn.ensemble import RandomForestClassifier

full_ml_pipeline = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(
        transformers=[
            # reuse the sub-pipelines defined in the Basic Pipeline section
            ('num', numerical_transformers, ['age', 'salary']),
            ('cat', categorical_transformers, ['city', 'gender'])
        ]
    )),
    ('model', RandomForestClassifier(random_state=42)) # Or any other model
])
```
### Nested Pipelines (Pipeline within ColumnTransformer):

- Purpose: While not a "type" of pipeline itself, it's a very common and powerful pattern where smaller pipelines are embedded as steps within a ColumnTransformer. This allows you to apply distinct sequences of transformations to different subsets of your features.
- Use Case: This is what you already saw in the main example, where `numerical_transformers` and `categorical_transformers` (which are themselves Pipeline objects) are used inside the ColumnTransformer.
- This is extremely useful for handling mixed data types.
- You're already familiar with ColumnTransformer, and understanding how to combine it with Pipeline to create robust and clean machine learning workflows is a crucial step in your machine learning journey.
- Keep exploring the sklearn documentation, and don't hesitate to experiment with different transformers within your pipelines!
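
As one way to experiment with the nested pattern described in this section, the sketch below fits a ColumnTransformer that contains two small pipelines and then looks inside the fitted sub-pipelines. It rebuilds the toy data so that it runs on its own; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Same toy features as the main example, without the target column
features = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 35],
    "salary": [50000, 60000, 45000, 70000, np.nan],
    "city": ["A", "B", "A", "C", "B"],
    "gender": ["M", "F", "M", "F", "M"],
})

preprocessor = ColumnTransformer(transformers=[
    ("numerical", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "salary"]),
    ("categorical", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder()),
    ]), ["city", "gender"]),
])

transformed = preprocessor.fit_transform(features)
print(transformed.shape)  # two numerical + two encoded columns

# Fitted sub-pipelines are reachable by the names given in `transformers`
numerical_pipeline = preprocessor.named_transformers_["numerical"]
categorical_pipeline = preprocessor.named_transformers_["categorical"]

# Means the imputer learned for age and salary
imputer_means = numerical_pipeline.named_steps["impute"].statistics_
print(imputer_means)

# Categories the encoder learned for city and gender
encoder_categories = categorical_pipeline.named_steps["encoder"].categories_
print(encoder_categories)
```

Looking at the fitted steps this way is a quick check that each column subset went through the sequence of transformations you intended.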