<a href="https://colab.research.google.com/github/Venkatpotla33/Machine-Learning-Lab/blob/main/ML_Lab_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hyperparameter tuning**

# Understanding the Core Concepts

Hyperparameter tuning is the process of finding the settings of a machine learning model that give the best performance on a specific dataset. Think of it like tuning a radio: the model is the radio, the data is the broadcast signal, and the hyperparameters are the knobs (like volume and frequency). You don't create the signal (the data), but you adjust the knobs to get the clearest sound (the best performance).


These settings, called hyperparameters, are not learned from the data during training. Instead, they are set before the training process begins. Examples include:

*   The penalty strength C in a Support Vector Machine (SVM).
*   The number of trees n_estimators in a Random Forest.
*   The number of neighbors k in K-Nearest Neighbors (KNN).

Choosing the right combination of these is crucial for building a high-performing model. We use automated methods like GridSearchCV and RandomizedSearchCV to find these optimal settings.
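The difference between a hyperparameter and a learned parameter can be seen directly on an estimator. The following is a minimal sketch on a small synthetic dataset; the names `X_demo`, `y_demo`, and `model` are illustrative and not part of the lab code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C is a hyperparameter: it is chosen before training starts.
model = LogisticRegression(C=0.5, max_iter=1000)
print("Hyperparameter C:", model.get_params()["C"])

# coef_ is a learned parameter: it only exists after fit() has seen the data.
had_coef_before_fit = hasattr(model, "coef_")
model.fit(X_demo, y_demo)
print("Has coef_ before fit:", had_coef_before_fit)
print("Has coef_ after fit:", hasattr(model, "coef_"))
print("Shape of learned coefficients:", model.coef_.shape)
```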

# How Do We Find the Best Hyperparameters?


Two common methods are used:

1.   GridSearchCV
2.   RandomizedSearchCV



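Both use cross-validation: each candidate setting is trained and scored on several different train/validation splits of the training data, and the scores are averaged. This gives a more reliable performance estimate than a single split and lowers the risk of choosing hyperparameters that only look good by chance on one split.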
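As a rough illustration of what the search does for each candidate setting, the sketch below scores one fixed model with 5-fold stratified cross-validation. The dataset and the names `X_demo`, `y_demo`, `cv_demo`, and `scores` are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Five stratified folds: each fold keeps roughly the same class proportions.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; the search methods average these for every candidate.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv_demo, scoring="f1_macro"
)
print("Per-fold macro-F1:", scores.round(3))
print("Mean:", round(scores.mean(), 3), "Std:", round(scores.std(), 3))
```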

# GridSearchCV (The Methodical Search)

GridSearchCV systematically tests every single possible combination of the hyperparameter values you provide.

Imagine you have two knobs to tune:

Knob A has settings: [1, 2, 3]

Knob B has settings: ['low', 'high']

Grid Search will try all 3 x 2 = 6 combinations: (1, 'low'), (1, 'high'), (2, 'low'), (2, 'high'), (3, 'low'), (3, 'high'). It's very thorough, but the number of combinations is the product of the number of values for each hyperparameter, so it becomes very slow when you have many hyperparameters or many values to test.
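The enumeration can be reproduced with scikit-learn's `ParameterGrid`, which expands a grid dictionary into its combinations. This is a small sketch; `knob_a` and `knob_b` are hypothetical parameter names standing in for the two knobs.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid matching the two knobs described in the text.
knob_grid = {"knob_a": [1, 2, 3], "knob_b": ["low", "high"]}

# ParameterGrid expands the dictionary into every combination.
combinations = list(ParameterGrid(knob_grid))
print("Number of combinations:", len(combinations))
for combo in combinations:
    print(combo)
```

With 5-fold cross-validation, each of these 6 combinations is fitted 5 times, so the search performs 30 fits before the final refit on the full training set.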

# RandomizedSearchCV (The Efficient Search)

RandomizedSearchCV doesn't try every combination. Instead, it randomly samples a fixed number of combinations (n_iter) from the ranges of values you provide.

For example, you might tell it:

Knob A can be any integer from 1 to 100.

Knob B can be any value from 0.1 to 1.0.


It will then randomly pick, say, 10 combinations from these ranges and test them. The cost is fixed by n_iter rather than by the size of the search space, so this is usually much cheaper than an exhaustive grid and lets you explore wider ranges when you have a large number of hyperparameters. The trade-off is that the best combination is not guaranteed to be sampled.
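The sampling step can be sketched with `ParameterSampler`, which draws candidates from distributions. Note that `scipy.stats.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The names `knob_a` and `knob_b` are again hypothetical.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical distributions matching the two knobs described in the text.
knob_dists = {
    "knob_a": randint(1, 101),    # integers 1..100 (upper bound is exclusive)
    "knob_b": uniform(0.1, 0.9),  # uniform(loc, scale): values in [0.1, 1.0]
}

# Draw 10 random combinations, as RandomizedSearchCV would with n_iter=10.
samples = list(ParameterSampler(knob_dists, n_iter=10, random_state=0))
for sample in samples:
    print(sample)
```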

# **Task 01**

GridSearchCV and RandomizedSearchCV

In [None]:
# Step 01
# Import The Required Libraries

import numpy as np
import joblib # Import joblib for saving models
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, make_scorer
from scipy.stats import randint, uniform

In [None]:
# Step 02
# Generate dataset with make_classification including imbalance and noise
X, y = make_classification(
    n_samples=5000, n_features=100, n_informative=20, n_redundant=20,
    n_classes=3, weights=[0.6, 0.3, 0.1], flip_y=0.02, random_state=42
)

# Display the shape of the generated data
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

# Display the first 5 rows of X and y
print("\nFirst 5 rows of X:")
print(X[:5])
print("\nFirst 5 rows of y:")
print(y[:5])

Shape of X: (5000, 100)
Shape of y: (5000,)

First 5 rows of X:
[[ 8.34271913e-01 -3.28847874e+00 -8.13079381e+00 -8.35605609e+00
   2.10542085e+00  4.24891794e+00 -1.63685511e+00 -8.71885082e-01
  -5.89458874e-01 -2.47633395e+00  3.83553923e-01  1.01494089e+00
   1.55225095e-01  1.36044502e+00  1.91536537e+00 -7.47820821e-01
   1.19882116e+00  9.39710240e-01 -7.33716642e+00  6.73649599e+00
  -6.71113944e-01 -1.59227959e+00 -3.48273478e-02 -1.99458526e+00
   1.83572928e+00 -1.05765854e+01  4.77371309e+00 -1.04925574e+00
   5.64709324e-01 -3.05792924e+00 -1.36122265e+00 -7.27754385e-02
  -2.66865535e-02  1.67586769e+00 -4.35909683e-01  6.19294892e-02
   1.05829815e+00  9.54166784e-01 -4.72593138e-01  6.41473588e+00
  -4.72407771e+00 -4.54819476e+00 -3.88676880e-01 -3.03231424e-01
   3.95557867e-01  8.39393860e-01 -4.49092833e-01  3.50637948e+00
  -1.02536094e+01  2.07980639e+00  1.80716386e+00  1.00752387e-01
  -1.34936410e+00 -3.98806037e+00  2.19106170e-01 -7.13154120e-01
  -1.0614904

In [None]:
# Step 03
# Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


In [None]:
# Step 04
# Define models and pipelines

pipelines = {
    "SVM": Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))]),
    "RandomForest": Pipeline([('rf', RandomForestClassifier(random_state=42))]),
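    # multi_class is omitted: the default lbfgs solver already fits a multinomial model
    # for multiclass targets, and the argument is deprecated in recent scikit-learn releases.
    "LogisticRegression": Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression(max_iter=2000, random_state=42))]),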
    "KNN": Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
}
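The parameter grids used with these pipelines address each hyperparameter as `<step name>__<parameter name>`. A minimal sketch of that naming convention, using a stand-in pipeline called `demo_pipe` (an illustrative name, not part of the lab code):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in pipeline with the same step names as the SVM pipeline above.
demo_pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# "svc__C" means: parameter C of the step named "svc".
demo_pipe.set_params(svc__C=10, svc__gamma=0.01)
print("C on the SVC step:", demo_pipe.named_steps["svc"].C)

# get_params() lists every name that a search grid is allowed to use.
svc_param_names = [name for name in demo_pipe.get_params() if name.startswith("svc__")]
print(svc_param_names)
```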

In [None]:
# Step 05
# GridSearchCV with a small, focused hyperparameter grid (RandomizedSearchCV follows below)
param_grids = {
    "SVM": {
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ['scale', 0.01, 0.1]
    },
    "RandomForest": {
        "rf__n_estimators": [100, 300],
        "rf__max_depth": [None, 20, 50],
        "rf__min_samples_split": [2, 5]
    },
    "LogisticRegression": {
        "logreg__C": [0.1, 1, 10],
        "logreg__penalty": ['l2']
    },
    "KNN": {
        "knn__n_neighbors": [3, 5, 11],
        "knn__weights": ['uniform', 'distance']
    }
}

grid_results = []
print("=== GridSearchCV Results ===")
for name, pipe in pipelines.items():
    gs = GridSearchCV(pipe, param_grids[name], cv=cv, scoring='f1_macro', n_jobs=-1)
    gs.fit(X_train, y_train)
    grid_results.append({
        "Model": name,
        "Best Params": gs.best_params_,
        "Best CV F1": gs.best_score_
    })

import pandas as pd  # used only to tabulate the results
display(pd.DataFrame(grid_results))  # display() is available in Jupyter/Colab


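# RandomizedSearchCV with broader hyperparameter distributions
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale]; randint(low, high) excludes high.
param_dists = {
    "SVM": {
        "svc__C": uniform(0.01, 100),      # C in [0.01, 100.01]
        "svc__gamma": uniform(0.001, 1)    # gamma in [0.001, 1.001]
    },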
    "RandomForest": {
        "rf__n_estimators": randint(100, 500),
        "rf__max_depth": randint(10, 100),
        "rf__min_samples_split": randint(2, 10)
    },
    "LogisticRegression": {
        "logreg__C": uniform(0.01, 10)
    },
    "KNN": {
        "knn__n_neighbors": randint(3, 50),
        "knn__weights": ['uniform', 'distance']
    }
}

random_results = []
print("\n=== RandomizedSearchCV Results ===")
for name, pipe in pipelines.items():
    rs = RandomizedSearchCV(pipe, param_distributions=param_dists[name], n_iter=10, cv=cv, scoring='f1_macro', random_state=0, n_jobs=-1)
    rs.fit(X_train, y_train)
    random_results.append({
        "Model": name,
        "Best Params": rs.best_params_,
        "Best CV F1": rs.best_score_
    })

display(pd.DataFrame(random_results))

=== GridSearchCV Results ===

| | Model | Best Params | Best CV F1 |
|---|---|---|---|
| 0 | SVM | {'svc__C': 10, 'svc__gamma': 'scale'} | 0.817446 |
| 1 | RandomForest | {'rf__max_depth': None, 'rf__min_samples_split... | 0.669286 |
| 2 | LogisticRegression | {'logreg__C': 10, 'logreg__penalty': 'l2'} | 0.673808 |
| 3 | KNN | {'knn__n_neighbors': 5, 'knn__weights': 'dista... | 0.669187 |



=== RandomizedSearchCV Results ===

| | Model | Best Params | Best CV F1 |
|---|---|---|---|
| 0 | SVM | {'svc__C': 54.89135039273247, 'svc__gamma': 0.... | 0.248497 |
| 1 | RandomForest | {'rf__max_depth': 87, 'rf__min_samples_split':... | 0.663881 |
| 2 | LogisticRegression | {'logreg__C': 7.161893663724195} | 0.673808 |
| 3 | KNN | {'knn__n_neighbors': 6, 'knn__weights': 'dista... | 0.659632 |
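The SVM score under randomized search (0.248) is far below its grid search score (0.817). One plausible explanation, offered here as a hypothesis rather than a verified diagnosis, is the gamma distribution: `uniform(0.001, 1)` spreads its draws evenly over [0.001, 1.001], so very few of the 10 samples land near small values such as 0.01, which is roughly what `gamma='scale'` evaluates to for 100 standardized features. A log-uniform distribution spreads draws evenly across orders of magnitude instead. The sketch below only compares where the two distributions place their samples; the bounds `1e-4` and `1` and the threshold `0.05` are illustrative assumptions.

```python
from scipy.stats import loguniform, uniform

n_draws = 10_000

# Distribution used in the notebook: uniform on [0.001, 1.001].
uniform_gamma = uniform(0.001, 1).rvs(size=n_draws, random_state=0)

# Alternative (assumed bounds): log-uniform on [1e-4, 1].
log_gamma = loguniform(1e-4, 1).rvs(size=n_draws, random_state=0)

# Share of draws that fall in the small-gamma region below 0.05.
share_uniform = (uniform_gamma < 0.05).mean()
share_log = (log_gamma < 0.05).mean()
print(f"uniform:    {share_uniform:.1%} of draws below 0.05")
print(f"loguniform: {share_log:.1%} of draws below 0.05")
```

Swapping `loguniform` into `param_dists` for `svc__C` and `svc__gamma` would change the results reported above, so treat it as an experiment to try rather than a drop-in fix.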


In [None]:
# Step 06
# Modify and Expand the Grids and Save Models

print("\n=== EXPANDED GridSearchCV Results (with model saving) ===")

# EXPAND the parameter grid for a more thorough search
expanded_param_grids = {
    "SVM": {
        "svc__C": [0.001, 0.01, 0.1, 1, 10, 50, 100, 200],        # Added more C values
        "svc__gamma": ['scale', 'auto', 0.0001, 0.001, 0.01, 0.1, 1, 2] # Added more gamma values
    },
    "RandomForest": {
        "rf__n_estimators": [100, 300, 500],
        "rf__max_depth": [10, 20, 50, 70],
        "rf__min_samples_split": [2, 5, 10]
    },
    "LogisticRegression": {
        "logreg__C": [0.01, 0.1, 1, 10, 100],
        "logreg__penalty": ['l2']
    },
    "KNN": {
        "knn__n_neighbors": [3, 5, 11, 19, 25],
        "knn__weights": ['uniform', 'distance']
    }
}

for name, pipe in pipelines.items():
    gs = GridSearchCV(pipe, expanded_param_grids[name], cv=cv, scoring='f1_macro', n_jobs=-1)
    gs.fit(X_train, y_train)
    print(f"{name}: Best Params={gs.best_params_}, Best CV F1={gs.best_score_:.4f}")

    # Save the best model found by GridSearchCV
    model_filename = f"gridsearch_best_{name.lower()}_model.pkl"
    joblib.dump(gs.best_estimator_, model_filename)
    print(f"   -> Saved best {name} model to {model_filename}")

# ---

print("\n=== EXPANDED RandomizedSearchCV Results (with model saving) ===")

# EXPAND the distributions for a wider search
# (uniform(loc, scale) covers [loc, loc + scale]; randint(low, high) excludes high)
expanded_param_dists = {
    "SVM": {
        "svc__C": uniform(0.001, 300),       # C in [0.001, 300.001]
        "svc__gamma": uniform(0.0001, 3)     # gamma in [0.0001, 3.0001]
    },
    "RandomForest": {
        "rf__n_estimators": randint(100, 1000), # 100 to 999 trees
        "rf__max_depth": randint(10, 150),
        "rf__min_samples_split": randint(2, 20)
    },
    "LogisticRegression": {
        "logreg__C": uniform(0.01, 100)      # C in [0.01, 100.01]
    },
    "KNN": {
        "knn__n_neighbors": randint(3, 100),    # 3 to 99 neighbors
        "knn__weights": ['uniform', 'distance']
    }
}

for name, pipe in pipelines.items():
    # Increase n_iter to explore the larger space more effectively
    rs = RandomizedSearchCV(pipe, param_distributions=expanded_param_dists[name], n_iter=100, cv=cv, scoring='f1_macro', random_state=0, n_jobs=-1)
    rs.fit(X_train, y_train)
    print(f"{name}: Best Params={rs.best_params_}, Best CV F1={rs.best_score_:.4f}")

    # Save the best model found by RandomizedSearchCV
    model_filename = f"randomsearch_best_{name.lower()}_model.pkl"
    joblib.dump(rs.best_estimator_, model_filename)
    print(f"   -> Saved best {name} model to {model_filename}")


=== EXPANDED GridSearchCV Results (with model saving) ===
SVM: Best Params={'svc__C': 10, 'svc__gamma': 'scale'}, Best CV F1=0.8174
   -> Saved best SVM model to gridsearch_best_svm_model.pkl
RandomForest: Best Params={'rf__max_depth': 50, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}, Best CV F1=0.6693
   -> Saved best RandomForest model to gridsearch_best_randomforest_model.pkl
LogisticRegression: Best Params={'logreg__C': 10, 'logreg__penalty': 'l2'}, Best CV F1=0.6738
   -> Saved best LogisticRegression model to gridsearch_best_logisticregression_model.pkl
KNN: Best Params={'knn__n_neighbors': 5, 'knn__weights': 'distance'}, Best CV F1=0.6692
   -> Saved best KNN model to gridsearch_best_knn_model.pkl

=== EXPANDED RandomizedSearchCV Results (with model saving) ===
SVM: Best Params={'svc__C': np.float64(248.68300876520894), 'svc__gamma': np.float64(0.014186428577641196)}, Best CV F1=0.8057
   -> Saved best SVM model to randomsearch_best_svm_model.pkl
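A saved pipeline can be reloaded with `joblib.load` and used for prediction without refitting. The sketch below shows the round trip with a small stand-in pipeline and a temporary file; `demo_pipe`, `demo_knn_model.pkl`, and the synthetic data are hypothetical and are not the notebook's actual models or files.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset and a held-out test split, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=0
)

# Stand-in for a tuned pipeline such as gs.best_estimator_.
demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=5))])
demo_pipe.fit(X_tr, y_tr)

# Save and reload the fitted pipeline (scaler and classifier together).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_knn_model.pkl")
    joblib.dump(demo_pipe, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded pipeline should predict exactly like the original.
same_predictions = bool((reloaded.predict(X_te) == demo_pipe.predict(X_te)).all())
test_f1 = f1_score(y_te, reloaded.predict(X_te), average="macro")
print("Identical predictions:", same_predictions)
print(f"Held-out macro-F1: {test_f1:.4f}")
```

Scoring the reloaded model on a held-out split, as in the last lines, gives an estimate that is independent of the cross-validation scores used to pick the hyperparameters.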


# **Task 02**

# Accuracy vs. Macro-F1 Score


Before diving into the code, it's crucial to understand why these metrics lead to different choices.

Accuracy: This is the most straightforward metric. It calculates the overall percentage of correct predictions. While simple, it can be very misleading for imbalanced datasets. For instance, in this notebook's dataset, where one class makes up 60% of the data, a lazy model that only predicts this majority class would still achieve about 60% accuracy without learning anything about the minority classes. A search optimizing for accuracy might pick hyperparameters that favor this lazy but high-accuracy behavior.

Macro-F1 Score: The F1-score for a class is the harmonic mean of its precision and recall. The Macro-F1 score calculates the F1-score for each class independently and then takes their unweighted average. This gives equal importance to every class, regardless of how many samples it has. Therefore, to get a high Macro-F1 score, the model must perform well on the rare classes too.

Conclusion: On an imbalanced dataset, Macro-F1 will often lead to different hyperparameter choices because it forces the model to find a balance in performance across all classes, while accuracy can be dominated by the model's performance on the majority class.
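The gap between the two metrics can be checked on the lazy majority-class predictor described above. This is a small sketch with hand-built labels in 60/30/10 proportions; `y_true` and `y_lazy` are illustrative names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1000 labels in 60% / 30% / 10% proportions.
y_true = np.array([0] * 600 + [1] * 300 + [2] * 100)

# A "lazy" model that always predicts the majority class.
y_lazy = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_lazy)
# zero_division=0 scores the never-predicted classes as F1 = 0 without warnings.
macro_f1 = f1_score(y_true, y_lazy, average="macro", zero_division=0)

print(f"Accuracy: {acc:.2f}")      # 0.60
print(f"Macro-F1: {macro_f1:.2f}")  # 0.25
```

Class 0 has precision 0.6 and recall 1.0, so its F1 is 0.75; the other two classes score 0, and the unweighted average is 0.25.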

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# 1) Generate the dataset with imbalance
X, y = make_classification(
    n_samples=5000, n_features=100, n_informative=20, n_redundant=20,
    n_classes=3, weights=[0.6, 0.3, 0.1], flip_y=0.02, random_state=42
)

# 2) Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# 3) Define models, parameters, and cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pipelines = {
    "SVM": Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))]),
    "RandomForest": Pipeline([('rf', RandomForestClassifier(random_state=42))]),
}

param_grids = {
    "SVM": {
        "svc__C": [0.1, 1, 10, 50],
        "svc__gamma": ['scale', 0.01, 0.1]
    },
    "RandomForest": {
        "rf__n_estimators": [100, 300],
        "rf__max_depth": [20, 50],
        "rf__min_samples_split": [2, 5]
    },
}

# Define both accuracy and macro-F1 as scoring metrics
scoring = {'acc': make_scorer(accuracy_score), 'f1': make_scorer(f1_score, average='macro')}

# --- Run search optimizing for ACCURACY ---
print("="*20)
print("  Optimizing for ACCURACY")
print("="*20)

for name, pipe in pipelines.items():
    # Set refit='acc' to choose the best model based on accuracy
    gs_acc = GridSearchCV(pipe, param_grids[name], cv=cv, scoring=scoring, refit='acc', n_jobs=-1)
    gs_acc.fit(X_train, y_train)
    print(f"{name} (Refit on Accuracy):")
    print(f"   Best Params: {gs_acc.best_params_}")
    print(f"   Best Accuracy Score: {gs_acc.best_score_:.4f}")
    # Find the corresponding F1 score for the best accuracy model
    best_index = gs_acc.best_index_
    f1_for_best_acc = gs_acc.cv_results_['mean_test_f1'][best_index]
    print(f"   F1 Score at this setting: {f1_for_best_acc:.4f}\n")

# --- Run search optimizing for MACRO-F1 ---
print("="*20)
print("  Optimizing for MACRO-F1")
print("="*20)

for name, pipe in pipelines.items():
    # Set refit='f1' to choose the best model based on the f1-score
    gs_f1 = GridSearchCV(pipe, param_grids[name], cv=cv, scoring=scoring, refit='f1', n_jobs=-1)
    gs_f1.fit(X_train, y_train)
    print(f"{name} (Refit on Macro-F1):")
    print(f"   Best Params: {gs_f1.best_params_}")
    print(f"   Best F1 Score: {gs_f1.best_score_:.4f}")
    # Find the corresponding accuracy for the best F1 model
    best_index = gs_f1.best_index_
    acc_for_best_f1 = gs_f1.cv_results_['mean_test_acc'][best_index]
    print(f"   Accuracy at this setting: {acc_for_best_f1:.4f}\n")

  Optimizing for ACCURACY
SVM (Refit on Accuracy):
   Best Params: {'svc__C': 10, 'svc__gamma': 'scale'}
   Best Accuracy Score: 0.8925
   F1 Score at this setting: 0.8174

RandomForest (Refit on Accuracy):
   Best Params: {'rf__max_depth': 50, 'rf__min_samples_split': 2, 'rf__n_estimators': 300}
   Best Accuracy Score: 0.8322
   F1 Score at this setting: 0.6656

  Optimizing for MACRO-F1
SVM (Refit on Macro-F1):
   Best Params: {'svc__C': 10, 'svc__gamma': 'scale'}
   Best F1 Score: 0.8174
   Accuracy at this setting: 0.8925

RandomForest (Refit on Macro-F1):
   Best Params: {'rf__max_depth': 50, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}
   Best F1 Score: 0.6693
   Accuracy at this setting: 0.8270

