# Hyperparameter Tuning with `scikit-learn`


## 1. Introduction

In this notebook, we explore hyperparameter tuning methods, specifically `GridSearchCV` and `RandomizedSearchCV`, to optimize machine learning models in `scikit-learn`.


## 2. An Introduction to Grid Search

Grid Search exhaustively searches over a specified parameter grid: it evaluates every possible combination with cross-validation and keeps the combination with the best mean score.
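
To make "every possible combination" concrete, the following sketch enumerates the grid used in this section with plain Python. It only illustrates the idea and is not how `GridSearchCV` is implemented; the names `combinations` and `n_folds` are introduced here for illustration, and the fold count assumes the default `cv=5` of recent `scikit-learn` versions.

```python
from itertools import product

# Same grid as in the Grid Search cell: 2 penalties x 3 values of C.
parameters = {"penalty": ["l1", "l2"], "C": [1, 10, 100]}

# Cartesian product of the value lists, one dict per candidate setting.
names = list(parameters)
combinations = [
    dict(zip(names, values)) for values in product(*parameters.values())
]

for combo in combinations:
    print(combo)

# Assumption: 5-fold cross-validation (the GridSearchCV default since 0.22).
n_folds = 5
print("Candidates:", len(combinations))
print("Model fits during the search:", len(combinations) * n_folds)
```

The cost grows multiplicatively with every added parameter or value, which is the main motivation for the random search discussed later.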


In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the data
cancer = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target)

# Initialize the model
lr = LogisticRegression(solver="liblinear", max_iter=1000)

# Define the parameter grid
parameters = {"penalty": ["l1", "l2"], "C": [1, 10, 100]}

# Create the GridSearchCV model
clf = GridSearchCV(lr, parameters)

# Fit the model to find the best parameters
clf.fit(X_train, y_train)
best_model = clf.best_estimator_

# Output the best model and parameters
print("Best Model:", best_model)
print("Best Parameters:", clf.best_params_)

Best Model: LogisticRegression(C=100, max_iter=1000, penalty='l1', solver='liblinear')
Best Parameters: {'C': 100, 'penalty': 'l1'}


## 3. Evaluating the Results of `GridSearchCV`

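After fitting `GridSearchCV`, `best_score_` gives the mean cross-validated score of the best parameter combination, and `score` evaluates the refitted best estimator on the held-out test set.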


In [6]:
best_score = clf.best_score_
test_score = clf.score(X_test, y_test)
print("Best Score (Cross-Validation):", best_score)
print("Test Score:", test_score)

Best Score (Cross-Validation): 0.9554856361149111
Test Score: 0.9790209790209791
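
Beyond the single best score, it can help to see how every grid point performed. The sketch below refits the same grid search in a self-contained way and tabulates `cv_results_`. The variable names `grid` and `results` and the `random_state=0` seed are assumptions added here so the example is repeatable; the scores it prints will differ from the unseeded run above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

cancer = load_breast_cancer()

# Assumption: a fixed seed, so the split is the same on every run.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

lr = LogisticRegression(solver="liblinear", max_iter=1000)
parameters = {"penalty": ["l1", "l2"], "C": [1, 10, 100]}
grid = GridSearchCV(lr, parameters).fit(X_train, y_train)

# cv_results_ holds one row per candidate setting.
results = pd.DataFrame(grid.cv_results_)
columns = [
    "param_penalty",
    "param_C",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results[columns].sort_values("rank_test_score"))
```

Comparing `mean_test_score` with `std_test_score` shows whether the winning combination is clearly ahead or within fold-to-fold noise of its neighbours.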


## 4. An Introduction to Random Search

Unlike Grid Search, Random Search evaluates a fixed number of parameter settings (`n_iter`), each sampled from the specified lists or distributions, which can be more efficient for large parameter spaces.
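
The sampling step can be looked at in isolation with `ParameterSampler` from `sklearn.model_selection`, which draws settings from the same kind of dictionary that `RandomizedSearchCV` accepts. This is a sketch for illustration only; the `random_state=0` seed and the name `samples` are assumptions added here.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; scipy distributions are sampled via rvs().
# uniform(loc=0, scale=100) covers the interval [0, 100].
distributions = {"penalty": ["l1", "l2"], "C": uniform(loc=0, scale=100)}

# Assumption: a fixed seed so the same 8 settings are drawn on every run.
samples = list(ParameterSampler(distributions, n_iter=8, random_state=0))

for setting in samples:
    print(setting)
```

Each printed dictionary is one candidate; the number of model fits is controlled by `n_iter` rather than by the size of the search space.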


In [7]:
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Load the data set
cancer = load_breast_cancer()

# Split the data into training and testing sets
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create distributions to draw hyperparameters from
distributions = {"penalty": ["l1", "l2"], "C": uniform(loc=0, scale=100)}

# The logistic regression model
lr = LogisticRegression(solver="liblinear", max_iter=1000)

# Create a RandomizedSearchCV model
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

clf.fit(X_train, y_train)
best_model = clf.best_estimator_
print(best_model)
print(clf.best_params_)

LogisticRegression(C=76.29336604445899, max_iter=1000, penalty='l1',
                   solver='liblinear')
{'C': 76.29336604445899, 'penalty': 'l1'}
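
Because neither `train_test_split` nor `RandomizedSearchCV` is seeded in the cell above, the selected values change from run to run. A minimal sketch of a repeatable variant follows, assuming a seed is passed to the split, the search, and the model (the `liblinear` solver shuffles the data internally). The helper `run_search` and the constant `SEED` are hypothetical names introduced for this example.

```python
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

SEED = 0  # hypothetical seed value; any integer works


def run_search(seed):
    """Run the random search with every source of randomness seeded."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed
    )
    lr = LogisticRegression(
        solver="liblinear", max_iter=1000, random_state=seed
    )
    distributions = {"penalty": ["l1", "l2"], "C": uniform(loc=0, scale=100)}
    search = RandomizedSearchCV(
        lr, distributions, n_iter=8, random_state=seed
    )
    search.fit(X_train, y_train)
    return search.best_params_


first = run_search(SEED)
second = run_search(SEED)
print(first)
print("Identical across runs:", first == second)
```

Seeding makes a result repeatable, not better; trying a few different seeds gives a sense of how sensitive the chosen parameters are to the split and to the sampled candidates.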


## 5. Evaluating the Results of `RandomizedSearchCV`

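After fitting `RandomizedSearchCV`, the same `best_score_` and `score` calls apply. In addition, `cv_results_` lists every sampled parameter setting together with its mean cross-validated score.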


In [8]:
best_score = clf.best_score_
test_score = clf.score(X_test, y_test)
print(best_score)
print(test_score)

import pandas as pd

# Tabulate each sampled setting next to its mean cross-validation score
hyperparameter_values = pd.DataFrame(clf.cv_results_["params"])
randomsearch_scores = pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["score"]
)
df = pd.concat([hyperparameter_values, randomsearch_scores], axis=1)
print(df)

0.9601367989056089
0.9790209790209791
           C penalty     score
0  26.100009      l2  0.948399
1  76.293366      l1  0.960137
2  10.042309      l2  0.943721
3  42.513174      l1  0.950780
4   3.738885      l2  0.950752
5  26.036038      l1  0.955458
6  73.942583      l1  0.960137
7   8.745556      l1  0.950752
