# Hyperparameter Tuning with Optuna

In this hands-on demo, you will learn how to use **Optuna**, a hyperparameter optimization library, to tune a model. We walk through **hyperparameter optimization** step by step: defining the search space, writing the objective function, and selecting the search algorithm. Along the way, you will use _MLflow_ to track the tuning process, recording the hyperparameters, metrics, and model for each trial. By the end of the session, you will understand the principles of hyperparameter optimization and be able to find the best-tuned model through both the **MLflow API** and the **MLflow UI**.

By integrating Optuna and MLflow, you can efficiently optimize hyperparameters and maintain comprehensive records of your machine learning experiments, facilitating reproducibility and collaborative research.

## Learning Objectives

**By the end of this demo, you will be able to:**

- Perform hyperparameter optimization using Optuna.
- Track the model tuning process with MLflow.
- Query previous runs from an experiment using the `MlflowClient`.
- Review an MLflow Experiment for visualizing results and selecting the best run.
- Read in the best model, make a prediction, and register the model to Unity Catalog.

### Installing the Library

In [0]:
%pip install -U -qq optuna
dbutils.library.restartPython()

In [0]:
import optuna
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from mlflow.models.signature import infer_signature
from sklearn.model_selection import train_test_split

In [0]:
table_name = "sdd_dev.sohag_test.diabetes_binary_health_indicators_brfss_2015"
diabetes_dataset = spark.read.table(table_name)
diabetes_pd = diabetes_dataset.toPandas()
diabetes_pd.head()

### Splitting Train/Test

In [0]:
print(f"We have {diabetes_pd.shape[0]} records in our source dataset")

# split target variable into its own dataset
target_col = "Diabetes_binary"
X_all = diabetes_pd.drop(labels=target_col, axis=1)
y_all = diabetes_pd[target_col]

# test / train split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.95, random_state=42)

y_train = y_train.astype(float)
y_test = y_test.astype(float)

print(f"We have {X_train.shape[0]} records in our training dataset")
print(f"We have {X_test.shape[0]} records in our test dataset")

# Hyperparameter Tuning with Optuna and MLflow

This project demonstrates hyperparameter tuning for a scikit-learn `DecisionTreeClassifier` using [Optuna](https://optuna.org/) and experiment tracking with [MLflow](https://mlflow.org/).

## Objective Function

The objective function in Optuna:
1. Defines the hyperparameter search space.
2. Trains the model with suggested hyperparameters.
3. Evaluates the model's performance.
4. Returns a scalar value for Optuna to optimize (minimize or maximize).

## Hyperparameters Tuned

For the `DecisionTreeClassifier`, the following hyperparameters are tuned:
- **criterion**: Chooses between `gini` and `entropy`. This determines the function used to measure the quality of a split.
- **max_depth**: Integer between 5 and 50.
- **min_samples_split**: Integer between 2 and 40.
- **min_samples_leaf**: Integer between 1 and 20.

## Optimization Process

- The search algorithm is set by the study's sampler (e.g., `TPESampler`, which is Optuna's default, or `GPSampler`).
- Each Optuna trial starts a new MLflow run for experiment tracking.
- Model performance is evaluated using 5-fold cross-validation. The objective returns the negative mean F1 score across the folds, so minimizing the objective maximizes F1.
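
To make the sign convention concrete, here is a small sketch using only the standard library. The per-fold F1 scores are invented for illustration and do not come from this dataset.

```python
from statistics import mean

# Hypothetical per-fold F1 scores for three trials (illustrative numbers only)
cv_f1_by_trial = {
    "trial_0": [0.30, 0.28, 0.32, 0.29, 0.31],
    "trial_1": [0.35, 0.33, 0.36, 0.34, 0.37],
    "trial_2": [0.25, 0.27, 0.26, 0.24, 0.28],
}

# The objective value of each trial is the negated mean F1 across the five folds
objective_values = {name: -mean(scores) for name, scores in cv_f1_by_trial.items()}

# When minimizing, the smallest objective value wins ...
best_by_minimizing = min(objective_values, key=objective_values.get)
# ... and that is the same trial that has the largest mean F1
best_by_f1 = max(cv_f1_by_trial, key=lambda name: mean(cv_f1_by_trial[name]))

print(best_by_minimizing, best_by_f1)  # trial_1 trial_1
```

An equivalent design would be to return the plain mean F1 and create the study with `direction="maximize"`.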

## Impurity Measures

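- **Gini impurity**: The probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node.
- **Entropy**: The information-theoretic disorder of the class distribution in the node; it is zero when all elements belong to one class.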
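
Both measures can be computed by hand from the class proportions in a node. The following is a minimal sketch with hypothetical helper names; scikit-learn computes these internally, so the code is only meant to show the formulas (Gini: one minus the sum of squared proportions; entropy: the sum of p times log2(1/p)).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * math.log2(1.0 / p) for p in class_proportions(labels))


balanced_node = [0, 0, 1, 1]  # evenly mixed classes: maximally impure
pure_node = [1, 1, 1, 1]      # a single class: no impurity

print(gini_impurity(balanced_node), entropy(balanced_node))  # 0.5 1.0
print(gini_impurity(pure_node), entropy(pure_node))          # 0.0 0.0
```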

## Usage

1. Install dependencies:
   ```bash
   pip install optuna mlflow scikit-learn
   ```

In [0]:
# Define the objective function
def optuna_objective_function(trial):
    # Define hyperparameter search space
    params = {
        'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy']),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 40),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20)
    }

    # Start an MLflow run for logging
    with mlflow.start_run(nested=True, run_name=f"Model Tuning with Optuna - Trial {trial.number}"):
        # Log parameters with MLflow
        mlflow.log_params(params)

        dtc = DecisionTreeClassifier(**params)
        scoring_metrics = ['accuracy', 'precision', 'recall', 'f1']
        cv_results = cross_validate(dtc, X_train, y_train, cv=5, scoring=scoring_metrics, return_estimator=True)

        # Log cross-validation metrics to MLflow
        for metric in scoring_metrics:
            mlflow.log_metric(f'{metric}', cv_results[f'test_{metric}'].mean())

        # Train the model on the full training set
        final_model = DecisionTreeClassifier(**params)
        final_model.fit(X_train, y_train)

        # Create input signature using the first row of X_train
        input_example = X_train.iloc[[0]]
        signature = infer_signature(input_example, final_model.predict(input_example))

        # Log the model with input signature
        mlflow.sklearn.log_model(final_model, "decision_tree_model", signature=signature, input_example=input_example)

        # Compute the mean from cross-validation
        f1_score_mean = cv_results['test_f1'].mean()

        # Metric to be minimized
        return -f1_score_mean

## Optimize the Scikit-Learn Model on Single-Machine Optuna and Log Results with MLflow

Before running the optimization, we need to perform two key steps:

1. **Initialize an Optuna Study using `optuna.create_study()`.**
   - A study represents an optimization process consisting of multiple trials.
   - A trial is a single execution of the objective function with a specific set of hyperparameters.

2. **Run the Optimization using `study.optimize()`.**
   - This tells Optuna how many trials to perform and allows it to explore the search space.

Each trial will be logged to MLflow, including the hyperparameters tested and their corresponding cross-validation results. Optuna uses the results of completed trials to choose the hyperparameters for the next one.

### Steps:

- **Set up an Optuna study with `optuna.create_study()`.**
- **Start an MLflow run with `mlflow.start_run()` to log experiments.**
- **Optimize hyperparameters using `study.optimize()` within the MLflow context.**

---

### Note on `n_jobs` in `study.optimize()`

The `n_jobs` argument controls the **number of trials running in parallel** using multi-threading **on a single machine**.

- If `n_jobs=-1`, Optuna sets the number of threads to the machine's CPU count (e.g., 4 threads on a 4-core machine).
- If `n_jobs` is left at its default of `1`, trials run **sequentially (single-threaded)**.
- **Important:** `n_jobs` does **not** distribute trials across multiple nodes in a Spark cluster. Note that `SparkTrials` is a Hyperopt class and does not apply to Optuna. To spread Optuna trials across machines, have each worker attach to the same study through a shared storage backend (for example, a relational database).

---

### Why We Don't Use `MLflowCallback`

Optuna provides an `MLflowCallback` for automatic logging. In this demo, however, we call the MLflow API directly from the objective function instead of using `MLflowCallback`, to show how the two libraries fit together.

In [0]:
# Set the MLflow experiment name and get the id
experiment_name = "/Users/sohagahammed.siyam@kone.com/Databricks Training/MLOps/Optuna Experiment"

print(f"Experiment Name: {experiment_name}")
mlflow.set_experiment(experiment_name)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
print(f"Experiment ID: {experiment_id}")

print("Clearing out old runs (to run more trials, change the n_trials parameter in the next cell) ...")
# Get all runs
runs = mlflow.search_runs(experiment_ids=[experiment_id], output_format="pandas")

if runs.empty:
    print("No runs found in the experiment.")
else:
    # Iterate and delete each run
    for run_id in runs["run_id"]:
        mlflow.delete_run(run_id)
        print(f"Deleted run: {run_id}")
    print("All runs have been deleted.")

In [0]:
study = optuna.create_study(
    study_name="optuna-hyper-param-optimization",
    direction="minimize"
)

with mlflow.start_run(run_name='demo_optuna_hpo') as parent_run:
    # Run optimization
    study.optimize(
        optuna_objective_function,
        n_trials=10
    )

# Review Tuning Results
We can use the MLflow API to review the trial results.

In [0]:
import mlflow
import pandas as pd

# Reuse the experiment ID of the parent run created in the previous cell
experiment_id = parent_run.info.experiment_id

# Fetch all runs from the experiment
df_runs = mlflow.search_runs(
    experiment_ids=[experiment_id]
)

# Optionally drop the parent run so that only the trial runs remain
# df_runs = df_runs[df_runs['tags.mlflow.runName'] != 'demo_optuna_hpo']

display(df_runs)
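
Once the runs are in a DataFrame, selecting the best trial is a sort on the metric column. The sketch below uses a small hand-built DataFrame that only mimics the column layout returned by `mlflow.search_runs` (`run_id`, `tags.mlflow.runName`, `metrics.f1`); the run IDs and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the output of mlflow.search_runs
runs = pd.DataFrame(
    {
        "run_id": ["parent", "a", "b", "c"],
        "tags.mlflow.runName": [
            "demo_optuna_hpo",
            "Model Tuning with Optuna - Trial 0",
            "Model Tuning with Optuna - Trial 1",
            "Model Tuning with Optuna - Trial 2",
        ],
        # The parent run logs no f1 metric, so its value is missing
        "metrics.f1": [float("nan"), 0.28, 0.31, 0.30],
    }
)

# Keep only the trial runs, then take the one with the highest mean F1
trial_runs = runs[runs["tags.mlflow.runName"] != "demo_optuna_hpo"]
best_run = trial_runs.sort_values("metrics.f1", ascending=False).iloc[0]

print(best_run["run_id"])  # b
```

The same selection can be pushed to the tracking server by passing `order_by=["metrics.f1 DESC"]` and `max_results=1` to `mlflow.search_runs`.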