<a href="https://www.kaggle.com/code/faressayah/hyperparameter-optimization-for-machine-learning?scriptVersionId=118252459" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🎯Hyperparameter Optimization for Machine Learning

The aim of this notebook:
> - Discuss multiple ways to optimize hyperparameters.
> - Understand the logic of each technique.
> - Considerations when utilizing each technique.
> - Master the use of Python open-source for hyperparameter tuning.

---

## Parameters in ML models
> - The objective of a typical learning algorithm is to find a function `f` that minimizes a certain `loss` over a `dataset`.
> - The learning algorithm produces `f` through the optimization of a training criteron with respect to a set of `parameters`.

---

## Hyperparameters in ML models
> - Hyperparameters are parameters that are not directly learnt by the learning algorithm.
> - Hyperparameters are specified outside of the training procedure.
> - Hyperparameters control the capacity of the model, i.e., how flexible the model is to fit the data.
> - Prevent over-fitting.
> - Hyperparameters could have a big impact on the performance of the learning algorithm.
> - Optimal hyperparameter settings often differ for different datasets. Therefore they should be optimized for each dataset.
---

## Hyperparameter Nature
>- Some hyperparameters are discrete: Number of estimators in ensemble models.
>- Some hyperparameters are continuous: Penalization coefficient, Number of samples per split.
>- Some hyperparameters are categorical: Loss (deviance, exponential), Regularization (Lasso, Ridge)

---

## Parameters vs Hyperparameters

|Parameters                  |   Hyperparameters |
|:-------------------------|----------------------:|
| - Intrinsic to model equation     | - Defined before training |
| - Optimized during training | - Constrain the algorithm|

> - The process of finding the best Hyperparameters for a given dataset is called `Hyperparameter Tuning` or `Hyperparameter Optimization`.

---

## Challenges
>- We can't define a formula to find the hyperparameters.
>- Try different combinations of hyperparameters and evaluate model performance. The critical step is to choose how many different combinations we are going to test.

The number of hyperparameter combination ---> the chance to get a better model ---> Computational cost

>- How do we find the hyperparameter combinations to maximize performance while diminishing computational costs?

---

## Methods
Different hyperparamete optimization strategies:
>- Manual Search
>- Grid Search
>- Random Search
>- Bayesian Optimization

---

## Generalization vs Over-fitting
> Generalization is the ability of an algorithm to be effective across various inputs. The performance of the machine learning model is constant across different datasets (with the same distribution on the training data). When the model performs well on the train set, but not on new / naive data, the model over-fits to the training data.

---

## Training a Machine Learning Model
> To prevent over-fitting, it is common practice to:
> - Separate the data into a train and a test set.
> - Train the model in the train set.
> - Evaluate in the test set.

# Hyperparameter Optimization

## Grid Search

>- Exhaustive search through a specified subset of hyperparameters of a learning algorithm.
>- Examines all possible combinations of the specified hyperparameters (Cartesian product of hyperparameters).

### Limitations
>- Curse of dimentionality: possible combinations grow exponentially with the number of hyperparameters.
>- Computationally expensive.
>- Hyperparameter values are determined manually.
>- Not ideal for continuous hyperparameters.
>- Does not explore the entire hyperparameter space (not feasible).
>- It performs worse than other searches (for models with complex hyperparameter spaces).

### Advantages
>- For models with simpler hyperparameter spaces works well.
>- It can be parallelized.

Grid Search is the most expensive method in terms of total computation time. However, if run in parallel, it is fast in terms of wall clock time. Sometimes, we run a small grid, determine where the optimum lies, and then expand the grid in that direction.

# Tuning XGBoost with Optuna

In this example, the hyperparameters `learning_rate`, `max_depth`, `n_estimators`, and `min_child_weight` are tuned using the Optuna library. The objective function is defined to return the negative accuracy on the test set, as Optuna minimizes the objective function. The `study.optimize` function is used to run the hyperparameter tuning, with `n_trials` specifying the number of trials to run. The final performance of the tuned classifier is evaluated on the test set.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import optuna

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    # Define the hyperparameters to tune
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1)
    max_depth = trial.suggest_int("max_depth", 3, 7)
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    min_child_weight = trial.suggest_int("min_child_weight", 1, 5)
    
    # Create an XGBoost classifier
    clf = XGBClassifier(
        learning_rate=learning_rate, 
        max_depth=max_depth,
        n_estimators=n_estimators, 
        min_child_weight=min_child_weight
    )
    
    # Train the classifier and calculate the accuracy on the validation set
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    
    return 1.0 - score

# Use Optuna to tune the hyperparameters
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and the best score
print("Best hyperparameters: ", study.best_params)
print("Best score: ", 1.0 - study.best_value)

# Train the classifier with the best hyperparameters on the full training set
best_params = study.best_params
clf = XGBClassifier(
    learning_rate=best_params["learning_rate"], 
    max_depth=best_params["max_depth"],
    n_estimators=best_params["n_estimators"], 
    min_child_weight=best_params["min_child_weight"]
)
clf.fit(X, y)

# Evaluate the tuned classifier on the test set
score = clf.score(X_test, y_test)
print("Test set accuracy: ", score)

[32m[I 2023-02-05 05:59:41,508][0m A new study created in memory with name: no-name-19fb766a-50db-4e3f-be80-cb6db7ca8911[0m
[32m[I 2023-02-05 05:59:43,974][0m Trial 0 finished with value: 0.01754385964912286 and parameters: {'learning_rate': 0.04751609063782706, 'max_depth': 3, 'n_estimators': 615, 'min_child_weight': 1}. Best is trial 0 with value: 0.01754385964912286.[0m
[32m[I 2023-02-05 05:59:45,901][0m Trial 1 finished with value: 0.02631578947368418 and parameters: {'learning_rate': 0.005501317485289229, 'max_depth': 5, 'n_estimators': 394, 'min_child_weight': 1}. Best is trial 0 with value: 0.01754385964912286.[0m
[32m[I 2023-02-05 05:59:49,232][0m Trial 2 finished with value: 0.04385964912280704 and parameters: {'learning_rate': 0.04092574159565932, 'max_depth': 4, 'n_estimators': 825, 'min_child_weight': 1}. Best is trial 0 with value: 0.01754385964912286.[0m
[32m[I 2023-02-05 05:59:52,792][0m Trial 3 finished with value: 0.02631578947368418 and parameters: {'lea

Best hyperparameters:  {'learning_rate': 0.015849264733013, 'max_depth': 5, 'n_estimators': 743, 'min_child_weight': 2}
Best score:  0.9912280701754386
Test set accuracy:  1.0


# Tuning Random Forest with Optuna

In this example, the hyperparameters `n_estimators`, `max_depth`, `min_samples_split`, and `min_samples_leaf` are tuned using the Optuna library. The objective function is defined to return the negative accuracy on the test set, as Optuna minimizes the objective function. The `study.optimize` function is used to run the hyperparameter tuning, with `n_trials` specifying the number of trials to run. The final performance of the tuned classifier is evaluated on

In [2]:
from sklearn.ensemble import RandomForestClassifier

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    # Define the hyperparameters to tune
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    max_depth = trial.suggest_int("max_depth", 3, 7)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 5)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 5)
    
    # Create a random forest classifier
    clf = RandomForestClassifier(
        n_estimators=n_estimators, 
        max_depth=max_depth,
        min_samples_split=min_samples_split, 
        min_samples_leaf=min_samples_leaf
    )
    
    # Train the classifier and calculate the accuracy on the validation set
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    
    return 1.0 - score

# Use Optuna to tune the hyperparameters
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and the best score
print("Best hyperparameters: ", study.best_params)
print("Best score: ", 1.0 - study.best_value)

# Train the classifier with the best hyperparameters on the full training set
best_params = study.best_params
clf = RandomForestClassifier(
    n_estimators=best_params["n_estimators"], 
    max_depth=best_params["max_depth"],
    min_samples_split=best_params["min_samples_split"], 
    min_samples_leaf=best_params["min_samples_leaf"]
)
clf.fit(X, y)

# Evaluate the tuned classifier on the test set
score = clf.score(X_test, y_test)
print("Test set accuracy: ", score)

[32m[I 2023-02-05 06:03:54,996][0m A new study created in memory with name: no-name-f260eda6-5c3e-4e9a-9219-befd239a1efc[0m
[32m[I 2023-02-05 06:03:56,808][0m Trial 0 finished with value: 0.03508771929824561 and parameters: {'n_estimators': 940, 'max_depth': 3, 'min_samples_split': 3, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.03508771929824561.[0m
[32m[I 2023-02-05 06:03:57,341][0m Trial 1 finished with value: 0.04385964912280704 and parameters: {'n_estimators': 261, 'max_depth': 4, 'min_samples_split': 4, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.03508771929824561.[0m
[32m[I 2023-02-05 06:03:57,904][0m Trial 2 finished with value: 0.04385964912280704 and parameters: {'n_estimators': 269, 'max_depth': 6, 'min_samples_split': 3, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.03508771929824561.[0m
[32m[I 2023-02-05 06:03:58,807][0m Trial 3 finished with value: 0.04385964912280704 and parameters: {'n_estimators': 413, 'max_depth': 6, 'min_sampl

Best hyperparameters:  {'n_estimators': 790, 'max_depth': 5, 'min_samples_split': 4, 'min_samples_leaf': 2}
Best score:  0.9736842105263158
Test set accuracy:  1.0


## Random Search

>- Hyperparameter values are selected by independent (random) draws from uniform distribution of the hyperparameter space. Random Search selects the combinations of hyperparameter values at random from all the possible combinations given a hyperparameter space.

---

## Random Search vs Grid Search
>- Some parameters affect performance a lot and some others don't (Low Effective Dimension). 

|      Random Search                                                |   Grid Search                            |
|:------------------------------------------------------------------|-----------------------------------------:|
| Allows the exploration of more dimensions of important parameters | Waste time exploring non-important dimensions |
| Select values from a distribution of parameter values             | Parameters are defined manually |
| Good for continuous parameters                                    | Good for discrete parameters |

---

## Considerations
>- We choose a (computational) budget independently of the number of parameters and possible values.
>- Adding parameters that do not influence the performance does not decrease efficiency of the search (if enough iterations are allowed).
>- Important to specify a continuous distribution of the hyperparameter to take full advantage of the randomization.