<a href="https://colab.research.google.com/github/awsdevguru/PearsonMLFoundations/blob/main/2_3_05_Sklearn_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scikit-learn Pipelines

## 1. Objectives
* Understand how to chain preprocessing and modeling steps in one workflow
* Learn to prevent data leakage by fitting transformations only on training data
* Build, tune, and save a reproducible ML pipeline

## 2. Setup

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

## 3. Load and Inspect Data

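This notebook uses the Titanic dataset from OpenML, keeping four features plus the target and dropping rows with missing values. Any small tabular dataset with both numeric and categorical columns (including a synthetic one) would work as well: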

In [None]:
from sklearn.datasets import fetch_openml
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame[['pclass', 'sex', 'age', 'fare', 'survived']].dropna()
X = df[['pclass', 'sex', 'age', 'fare']]
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=73)
X_train
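
If OpenML is unreachable, a small synthetic dataset can stand in for the Titanic data. The sketch below is only an illustration: `make_synthetic_passengers` is a hypothetical helper, and the rule used to generate `survived` is made up so that the label is learnable. It does not reflect the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical helper: builds a toy frame with the same columns as the Titanic subset.
def make_synthetic_passengers(n_rows=200, seed=73):
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "pclass": rng.integers(1, 4, size=n_rows),            # values 1, 2, or 3
        "sex": rng.choice(["female", "male"], size=n_rows),
        "age": rng.uniform(1, 80, size=n_rows).round(1),
        "fare": rng.uniform(5, 250, size=n_rows).round(2),
    })
    # Made-up scoring rule plus noise, so a classifier has something to learn.
    score = (frame["sex"] == "female") * 2.0 - 0.5 * frame["pclass"] + rng.normal(0, 1, n_rows)
    frame["survived"] = (score > 0).astype(int)
    return frame

toy = make_synthetic_passengers()
X_toy = toy[["pclass", "sex", "age", "fare"]]
y_toy = toy["survived"]
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=73
)
print(X_toy_train.shape, X_toy_test.shape)
```

The `_toy` suffix keeps these names separate from the `X_train` and `X_test` variables used in the rest of the notebook.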

## 4. Define Preprocessing

* Numerical features (`age`, `fare`) -> standardized with `StandardScaler`
* Categorical features (`pclass`, `sex`) -> one-hot encoded with `OneHotEncoder`
* All transformations are fit only on the training data

In [None]:
num_features = ['age', 'fare']
cat_features = ['pclass', 'sex']

num_transformer = Pipeline([('scaler', StandardScaler())])
cat_transformer = Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])
preprocessor
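
To make "fit only on training data" concrete, here is a minimal sketch on made-up rows. The column values and the names `train`, `test`, and `pre` are assumptions for illustration. The scaler learns its mean from the training rows, and the test row is then transformed with those same statistics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up for illustration), split by hand into train and test.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "sex": ["male", "female", "male"]})
test = pd.DataFrame({"age": [60.0], "sex": ["female"]})

pre = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array so it prints cleanly
)

pre.fit(train)  # statistics are learned from the training rows only
scaler = pre.named_transformers_["num"]
print("Learned mean:", scaler.mean_[0])  # mean of 20, 30, 40

test_out = pre.transform(test)  # reuses the training mean and scale
print(test_out)
```

The test age of 60 is scaled relative to the training mean of 30, not to a mean that includes the test row. That is the behavior a pipeline relies on to avoid leakage.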

## 5. Build the Pipeline

In [None]:
clf = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])
clf

In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

## 6. Grid Search with Pipelines

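Because the whole workflow is a single estimator, `GridSearchCV` refits the preprocessing and the model together on each cross-validation fold, so tuning works end-to-end.

Parameters are addressed as `<step name>__<parameter>`, so `model__C` sets `C` on the `model` step. The grid below searches over the regularization strength `C` of the logistic regression and keeps the best-scoring setting.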

In [None]:
param_grid = {
    'model__C': [0.1, 1.0, 10],
    'model__penalty': ['l2']
}

grid = GridSearchCV(clf, param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Test accuracy:", accuracy_score(y_test, grid.predict(X_test)))
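
The same double-underscore naming reaches into nested preprocessing steps, so preprocessing choices can be searched alongside model settings. The sketch below is an assumption-laden illustration: it uses made-up data with injected missing ages and adds a `SimpleImputer` step (not used elsewhere in this notebook, which drops missing rows instead) so that there is a preprocessing parameter to tune.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: the label follows `sex` with roughly 10% of rows flipped.
rng = np.random.default_rng(0)
n = 120
X_demo = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "fare": rng.uniform(5, 250, n),
    "sex": rng.choice(["female", "male"], n),
})
X_demo.loc[::10, "age"] = np.nan  # inject some missing ages
y_demo = ((X_demo["sex"] == "female") ^ (rng.random(n) < 0.1)).astype(int)

num_pipe = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", num_pipe, ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Path: pipeline step -> transformer name -> nested step -> parameter.
search = GridSearchCV(
    pipe,
    {
        "preprocess__num__imputer__strategy": ["mean", "median"],
        "model__C": [0.1, 1.0, 10],
    },
    cv=3,
)
search.fit(X_demo, y_demo)
print("Best params:", search.best_params_)
```

Two imputation strategies times three values of `C` gives six candidate pipelines, each evaluated with 3-fold cross-validation.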

## 7. Saving and Loading Pipelines

In [None]:
joblib.dump(grid.best_estimator_, 'titanic_pipeline.pkl')
loaded_model = joblib.load('titanic_pipeline.pkl')
print("Reloaded Accuracy:", accuracy_score(y_test, loaded_model.predict(X_test)))

## 8. Single Prediction

In [None]:
# Build a single, realistic passenger row
one = pd.DataFrame([{
    "pclass": 2,
    "sex": "female",
    "age": 28,
    "fare": 20.0
}])

# predict from loaded model
print("Survival probability:", loaded_model.predict_proba(one)[0,1])
print("Predicted class     :", int(loaded_model.predict(one)[0]))
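
Indexing `predict_proba` with `[0, 1]` assumes the positive class sits in the second column. The columns follow the order of `classes_`, so a slightly more defensive approach is to look the column up. The sketch below uses made-up rows and string labels `"0"` and `"1"`, on the assumption that the OpenML `survived` column is categorical with string labels. Check `loaded_model.classes_` to confirm for your data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training rows; string labels mimic a categorical target.
X_demo = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
})
y_demo = pd.Series(["0", "1", "1", "1", "0", "0"])

demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

one_demo = pd.DataFrame([{"age": 28.0, "fare": 20.0}])
proba = demo.predict_proba(one_demo)[0]

# A Pipeline exposes classes_ from its final step; use it to find the column.
positive_col = list(demo.classes_).index("1")
print("Classes:", demo.classes_)
print("P(class '1'):", proba[positive_col])
```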

## 9. Key Takeaways

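* Pipelines apply the same preprocessing at training and prediction time, which makes training reproducible.
* They simplify code and prevent leakage as long as `fit` is called only on training data (or inside cross-validation), because every transformer is refit on that data alone.
* A fitted pipeline can be saved and reloaded with joblib as a single object; reload it with the same scikit-learn version used to save it, and only load files you trust.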
* You can tune preprocessing and model hyperparameters together.