# Part 5.4: Model Selection - Pipelines

A scikit-learn `Pipeline` chains multiple data processing steps (transformers) and a final estimator (such as a classifier or regressor) into a single object that behaves like one estimator.

### Why Use Pipelines?
1.  **Convenience**: A single call to `fit` trains the whole sequence of steps, and a single call to `predict` runs new data through all of them.
2.  **Preventing Data Leakage**: This is the most important reason. When a pipeline is passed to a tool like `GridSearchCV`, preprocessing steps (like scaling) are re-fit on the training portion of *each fold* of the cross-validation and only applied to the validation portion. Statistics from the validation fold therefore never 'leak' into the training process, which gives you a more reliable evaluation.
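
To make the leakage point concrete, here is a minimal sketch. It assumes the same iris data and `SVC` model used in the rest of this section; the names `leaky_scaler` and `fold_scaler` are illustrative only. It compares a scaler fit on all rows with a scaler fit on the training rows of one fold, then lets a pipeline handle the per-fold fitting inside `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky approach: the scaler is fit on every row, including rows that
# will later be used as validation data.
leaky_scaler = StandardScaler().fit(X)

# What a pipeline does inside one fold: fit the scaler on training rows only.
train_idx, val_idx = next(cv.split(X, y))
fold_scaler = StandardScaler().fit(X[train_idx])

# The learned means differ, so the leaky scaler has absorbed
# information from the validation rows.
means_differ = not np.allclose(leaky_scaler.mean_, fold_scaler.mean_)
print("Scaler means differ:", means_differ)

# Passing a pipeline to cross_val_score repeats the leak-free fit in every fold.
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
scores = cross_val_score(pipe, X, y, cv=cv)
print("Leak-free CV accuracy per fold:", scores)
```

On a small, clean dataset like iris the effect on the final score is likely to be minor; the habit matters more on datasets where preprocessing statistics vary a lot between folds.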

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Building a Simple Pipeline
A pipeline is built from a list of `(name, transformer/estimator)` tuples. Every step except the last must be a transformer; the last step can be any estimator.

In [2]:
pipe = Pipeline([
    ('scaler', StandardScaler()),   # Step 1: Scale the data
    ('svc', SVC(random_state=42))   # Step 2: Apply the classifier
])

# Now we can treat the whole pipeline as a single estimator
pipe.fit(X_train, y_train)

print(f"Pipeline score on test data: {pipe.score(X_test, y_test):.4f}")

Pipeline score on test data: 1.0000
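
As a rough illustration of what the pipeline does on your behalf, the sketch below repeats the same two steps by hand and checks that the predictions match. It rebuilds the data split and pipeline so that it runs on its own; the names `scaler`, `svc`, and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline version: one fit, one predict.
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# The manual version: fit the scaler on training data, transform,
# fit the classifier, then reuse the fitted scaler on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
svc = SVC(random_state=42).fit(X_train_scaled, y_train)
manual_pred = svc.predict(scaler.transform(X_test))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Pipeline and manual predictions match:", same_predictions)

# Fitted steps remain accessible by name.
print("Scaler means learned inside the pipeline:", pipe.named_steps['scaler'].mean_)
```

The manual version is easy to get wrong (for example, by calling `fit_transform` on the test data), which is part of the case for the pipeline.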


### Using Pipelines with GridSearchCV
This is where pipelines truly shine. We can tune the hyperparameters of any step in the pipeline. Each parameter is addressed as `<step name>__<parameter name>`: the step name and the parameter name joined by a double underscore `__`. For example, the `C` parameter of the step named `svc` becomes `svc__C`.

In [3]:
# We can tune the SVM's C and gamma parameters
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1]
}

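# cv=5 uses 5-fold stratified cross-validation for classifiers.
# With the default refit=True, the best pipeline is refit on all of X_train.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)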

print("Best parameters for the pipeline:", search.best_params_)
print(f"Best cross-validation score: {search.best_score_:.4f}")
print(f"\nFinal score on test set: {search.score(X_test, y_test):.4f}")

Best parameters for the pipeline: {'svc__C': 100, 'svc__gamma': 0.01}
Best cross-validation score: 0.9667

Final score on test set: 0.9667
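
The double-underscore naming applies to every step, not just the final estimator, and a whole step can itself be treated as a parameter. The sketch below is one possible way to explore this: it lists the addressable parameter names with `get_params` and then searches over the choice of scaler alongside `C`. The use of `MinMaxScaler` and the smaller `C` grid are assumptions made for illustration, not part of the example above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))])

# Every name containing '__' is a nested parameter that a grid can target.
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# A step name on its own swaps the whole step; 'passthrough' skips it.
param_grid = {
    'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
    'svc__C': [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation score: {search.best_score_:.4f}")
```

Because the scaler choice is part of the grid, each candidate scaler is still fit only on the training portion of each fold.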
