# Day-16 Model Tuning with Pipelines & Practical Hyperparameter Optimization

Hyperparameter tuning is a critical step in building a high-performing machine learning model. Today, we will explore two popular techniques:

- Grid Search – exhaustive and systematic

- Random Search – randomized and efficient

We will also implement both techniques using scikit-learn and understand when to use which.

## Topics Covered

- What is a Pipeline?

- Why Use a Pipeline?

- Benefits of Using a Pipeline

- Grid Search (GridSearchCV) [Revision]

- Random Search (RandomizedSearchCV) [Revision]

- When to Use Grid vs Random Search? 

- Hands-on Examples

- Summary

- Reference Links

## What is a Pipeline?

A pipeline in scikit-learn is a tool that chains multiple preprocessing steps (like scaling or encoding) and a model into a single object.

It ensures that every step in the workflow (preprocessing → training → prediction) is applied in the correct sequence, every time.

## Why Use a Pipeline?

In a real-world data science workflow, your model is just one part of the pipeline. You often need to:

- Preprocess data (e.g., scale, encode)

- Split the data into training and test sets

- Train models

- Tune hyperparameters

Manually managing this sequence can lead to data leakage, messy code, or inconsistent transformations.
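
The leakage point can be checked directly. The sketch below is illustrative only: the synthetic data and the names `X_demo`, `y_demo`, and `leak_pipe` are assumptions, not part of the lesson. It fits a pipeline on the training part of one cross-validation fold and confirms that the scaler learned its statistics from that part alone, not from the full dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=50)

leak_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])

# Take the first of 5 folds and fit the pipeline on its training part only
train_idx, val_idx = next(KFold(n_splits=5).split(X_demo))
leak_pipe.fit(X_demo[train_idx], y_demo[train_idx])

# The scaler's mean comes from the training rows, not from all rows
fold_mean = leak_pipe.named_steps['scaler'].mean_
uses_train_only = np.allclose(fold_mean, X_demo[train_idx].mean(axis=0))
uses_all_rows = np.allclose(fold_mean, X_demo.mean(axis=0))
print(uses_train_only, uses_all_rows)
```

Scaling the whole dataset before splitting would instead let the validation rows influence the scaler, which is the leakage a pipeline helps avoid.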

## Benefits of Using a Pipeline:

- Ensures consistent preprocessing during cross-validation

- Reduces risk of data leakage

- Makes your code modular and cleaner

- Compatible with GridSearchCV and RandomizedSearchCV
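
Compatibility with the search classes rests on a naming convention: each step's parameters are exposed as step name, two underscores, then the parameter name. A minimal sketch of how to list those names follows; `name_pipe` is a hypothetical variable used only here.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

name_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])

# Keep only the parameters that belong to a step (their names contain '__')
step_params = sorted(p for p in name_pipe.get_params() if '__' in p)
print(step_params)

# set_params accepts the same names that a param_grid would use
name_pipe.set_params(ridge__alpha=10)
print(name_pipe.named_steps['ridge'].alpha)  # 10
```

Any name printed by this listing, such as `ridge__alpha`, can be used as a key in a parameter grid or parameter distribution.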

## Example of a Simple Pipeline

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])
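
To see what the chained object does, the sketch below fits a scaler-plus-Ridge pipeline and repeats the same steps by hand. The synthetic data and the names `demo_pipe`, `X_demo`, and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (illustrative data, not from the lesson)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

# Pipeline: fit() scales the data, then trains Ridge on the scaled values
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])
demo_pipe.fit(X_demo, y_demo)
pipe_pred = demo_pipe.predict(X_demo[:5])

# The same steps done by hand
scaler = StandardScaler().fit(X_demo)
ridge = Ridge().fit(scaler.transform(X_demo), y_demo)
manual_pred = ridge.predict(scaler.transform(X_demo[:5]))

print(np.allclose(pipe_pred, manual_pred))  # True
```

Both routes give the same predictions; the pipeline simply removes the chance of forgetting a transform at prediction time.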



## Revising GridSearchCV from Day 11

### Key Features:

- Performs exhaustive search over specified hyperparameter values.

- Works well for smaller search spaces.

- Becomes slow when the grid has many combinations, because every combination is cross-validated; RandomizedSearchCV is usually faster in that case.
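
The cost of an exhaustive search can be estimated before running it. The sketch below counts combinations with `ParameterGrid`; the larger grid is a hypothetical example, chosen only to show how the count multiplies.

```python
from sklearn.model_selection import ParameterGrid

small_grid = {'ridge__alpha': [0.1, 1, 10, 100]}

# Hypothetical larger grid: each added parameter multiplies the count
larger_grid = {
    'ridge__alpha': [0.1, 1, 10, 100],
    'ridge__fit_intercept': [True, False],
    'scaler__with_mean': [True, False],
}

n_small = len(ParameterGrid(small_grid))    # 4
n_larger = len(ParameterGrid(larger_grid))  # 4 * 2 * 2 = 16

# Each combination is fitted once per cross-validation fold
cv_folds = 5
print(n_small * cv_folds, n_larger * cv_folds)  # 20 80
```

With the default refit behaviour, one more fit on the full training set follows the search.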

In [None]:
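from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the California housing data so that X_train and y_train are defined
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'ridge__alpha' targets the alpha parameter of the 'ridge' step in pipe
param_grid = {
    'ridge__alpha': [0.1, 1, 10, 100]
}

# Try all 4 alpha values with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
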

## Revising RandomizedSearchCV from Day 11

### Key Features:

- Randomly samples from a parameter distribution.

- Useful when you have a large or continuous search space.

- Usually faster than GridSearchCV, because only n_iter sampled combinations are evaluated.
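
To get a feel for what random sampling produces, the sketch below draws candidates with `ParameterSampler`, the scikit-learn helper for sampling from parameter distributions. The `loguniform` distribution is an addition not used in this lesson; it is shown as one option when a parameter such as alpha spans several orders of magnitude.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) samples from the interval [loc, loc + scale]
param_dist_demo = {'ridge__alpha': uniform(loc=0.01, scale=10)}
samples = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))
for s in samples:
    print(s)

# Assumption for illustration: a log-uniform range from 0.001 to 100
log_dist_demo = {'ridge__alpha': loguniform(1e-3, 1e2)}
log_samples = list(ParameterSampler(log_dist_demo, n_iter=5, random_state=0))
for s in log_samples:
    print(s)
```

Fixing `random_state` makes the sampled candidates repeatable between runs.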

In [2]:
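from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import uniform

# Load the California housing data so that X_train and y_train are defined
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# uniform(loc, scale) samples alpha from the interval [0.01, 10.01]
param_dist = {
    'ridge__alpha': uniform(loc=0.01, scale=10)
}

# Evaluate 10 sampled alpha values; random_state makes the sampling repeatable
random_search = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train)
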

## Example of Using a Pipeline in Regression

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load data and hold out 20% as a test set
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline: scale the features, then fit Lasso
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])

# Grid search over the Lasso regularization strength
param_grid = {
    'lasso__alpha': [0.01, 0.1, 1, 10]
}
# No scoring argument is given, so GridSearchCV uses the estimator's default score, which is R^2 for regressors
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_)
print("Best score (mean CV R^2):", grid.best_score_)


Best alpha: {'lasso__alpha': 0.01}
Best score (mean CV R^2): 0.6077113518445894
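
After a search has been fitted, the per-candidate scores and a held-out evaluation are usually the next things to check. The sketch below uses synthetic data from `make_regression` so that it runs without a download; the data and the names `syn_pipe` and `syn_grid` are assumptions, and the numbers it prints will not match the California housing results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

syn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso()),
])
syn_grid = GridSearchCV(
    syn_pipe,
    {'lasso__alpha': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring='neg_mean_squared_error',
)
syn_grid.fit(Xs_train, ys_train)

# cv_results_ holds one entry per candidate; scores are negated MSE values
alphas = list(syn_grid.cv_results_['param_lasso__alpha'])
mean_scores = syn_grid.cv_results_['mean_test_score']
for alpha, mean_score in zip(alphas, mean_scores):
    print(f"alpha={alpha}: mean CV MSE = {-mean_score:.2f}")

# score() applies the same scorer to held-out data, so negate it to get MSE
best_cv_mse = -syn_grid.best_score_
test_mse = -syn_grid.score(Xs_test, ys_test)
print("Best CV MSE:", round(best_cv_mse, 2))
print("Test MSE:", round(test_mse, 2))
```

Passing `scoring='neg_mean_squared_error'` is what makes the reported score an MSE; the sign is flipped because scikit-learn always maximizes the score.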


## Example of Using a Pipeline in Classification

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Sample DataFrame
data = {
    'age': [25, 40, 35, 23],
    'salary': [50000, 80000, 65000, 40000],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'region': ['North', 'South', 'East', 'West'],
    'churn': [0, 1, 0, 1]  # Target: 0 = No Churn, 1 = Churn
}

# Create DataFrame
df = pd.DataFrame(data)


X = df.drop('churn', axis=1)
y = df['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)   

# define preprocessing for numerical and categorical features
numerical_features = ['age', 'salary']
categorical_features = ['gender', 'region']

# numerical pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# combine pipelines
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

# final pipeline
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

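# Train the pipeline (preprocessing and classifier are fitted together)
final_pipeline.fit(X_train, y_train)

# Predict on the held-out data
y_pred = final_pipeline.predict(X_test)

# Evaluate. With only 4 rows, test_size=0.2 leaves a single test sample,
# so these metrics illustrate the API rather than real model quality.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))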


Accuracy: 0.0
Confusion Matrix:
 [[0 0]
 [1 0]]
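
A nested pipeline like this one can be tuned in the same way as the regression pipelines, with parameter names that follow the nesting. The sketch below is an assumption-laden illustration: the 200-row synthetic DataFrame, the rule that generates the churn labels, and all `demo_` names are hypothetical and not part of the lesson's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data with the same columns as the sample DataFrame
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    'age': rng.integers(20, 60, size=n),
    'salary': rng.normal(60000, 15000, size=n),
    'gender': rng.choice(['Male', 'Female'], size=n),
    'region': rng.choice(['North', 'South', 'East', 'West'], size=n),
})
# Hypothetical label rule: higher salary (plus noise) means churn = 1
demo_y = (demo_df['salary'] + rng.normal(0, 5000, size=n) > 60000).astype(int)

demo_preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['age', 'salary']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['gender', 'region']),
])
demo_clf = Pipeline([
    ('preprocessor', demo_preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Names follow the nesting: pipeline step, transformer name, inner step, parameter
demo_param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1, 10],
}
demo_search = GridSearchCV(demo_clf, demo_param_grid, cv=5, scoring='accuracy')
demo_search.fit(demo_df, demo_y)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

Because preprocessing sits inside the pipeline, the imputer, scaler, and encoder are refitted on the training part of every fold during the search.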


In [9]:
! pip install scikit-learn pandas scipy

