In [0]:
print('Optimize Hyperparameters for machine learning in Azure Databricks')

Optimize Hyperparameters for machine learning in Azure Databricks


In [0]:
%sh
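# -f and -p keep this cell re-runnable whether or not the directory already exists
rm -rf /Workspace/MicrosoftLearnings/dbfs/hyperparam_tune_lab
mkdir -p /Workspace/MicrosoftLearnings/dbfs/hyperparam_tune_lab
# The data file is saved to the mlflow_lab directory, which must already exist
wget -O /Workspace/MicrosoftLearnings/dbfs/mlflow_lab/penguins.csv https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv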



--2025-09-25 06:50:06--  https://raw.githubusercontent.com/MicrosoftLearning/mslearn-databricks/main/data/penguins.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9533 (9.3K) [text/plain]
Saving to: ‘/Workspace/MicrosoftLearnings/dbfs/mlflow_lab/penguins.csv’

     0K .........                                             100% 4.04M=0.002s

2025-09-25 06:50:07 (4.04 MB/s) - ‘/Workspace/MicrosoftLearnings/dbfs/mlflow_lab/penguins.csv’ saved [9533/9533]



# Ingest data
### The following cell performs these tasks
- Remove any incomplete rows
- Apply appropriate data types
- View a random sample of the data
- Split the data into two datasets: one for training, and another for testing.

In [0]:
# Import only what is used; a wildcard import of pyspark.sql.functions would shadow Python built-ins such as min, max and sum
from pyspark.sql.functions import col

data = spark.read.format("csv").option("header", "true").load("file:/Workspace/MicrosoftLearnings/dbfs/mlflow_lab/penguins.csv")
data = data.dropna().select(col("Island").astype("string"),
                          col("CulmenLength").astype("float"),
                          col("CulmenDepth").astype("float"),
                          col("FlipperLength").astype("float"),
                          col("BodyMass").astype("float"),
                          col("Species").astype("int")
                          )
display(data.sample(0.2).limit(9))
   
# Split the data roughly 70/30 into training and testing sets
train, test = data.randomSplit([0.7, 0.3])
print("Training Rows:", train.count(), " Testing Rows:", test.count())

Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
Torgersen,38.6,21.2,191.0,3800.0,0
Torgersen,38.7,19.0,195.0,3450.0,0
Torgersen,42.5,20.7,197.0,4500.0,0
Biscoe,37.7,18.7,180.0,3600.0,0
Biscoe,38.2,18.1,185.0,3950.0,0
Dream,38.8,20.0,190.0,3950.0,0
Dream,37.6,19.3,181.0,3300.0,0
Dream,37.0,16.9,185.0,3000.0,0
Biscoe,40.1,18.9,188.0,4300.0,0


Training Rows: 234  Testing Rows: 108
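The split above produced 234 and 108 rows, which is about 68% and 32% rather than exactly 70% and 30%. That is consistent with `randomSplit` assigning each row at random according to the weights, so the resulting sizes only approximate the requested proportions. The following is a plain-Python sketch of that idea, not Spark code; the `random_split` helper and the seed value are illustrative assumptions.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row independently to train or test (illustrative helper, not a Spark API)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row gets its own random draw, so the split sizes vary with the seed.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(342))  # 342 = 234 + 108, the row counts printed above
train_rows, test_rows = random_split(rows, seed=42)
print("Training Rows:", len(train_rows), " Testing Rows:", len(test_rows))
```

In Spark itself, `randomSplit` accepts an optional `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`, which helps make the split repeatable between runs.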


# Optimize hyperparameter values for training a model

In [0]:
import optuna
import mlflow # if you wish to log your experiments
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
def objective(trial):
    # Suggest hyperparameter values (maxDepth and maxBins):
    max_depth = trial.suggest_int("MaxDepth", 0, 9)
    max_bins = trial.suggest_categorical("MaxBins", [10, 20, 30])

    # Define pipeline components
    cat_feature = "Island"
    num_features = ["CulmenLength", "CulmenDepth", "FlipperLength", "BodyMass"]
    catIndexer = StringIndexer(inputCol=cat_feature, outputCol=cat_feature + "Idx")
    numVector = VectorAssembler(inputCols=num_features, outputCol="numericFeatures")
    numScaler = MinMaxScaler(inputCol=numVector.getOutputCol(), outputCol="normalizedFeatures")
    featureVector = VectorAssembler(inputCols=[cat_feature + "Idx", "normalizedFeatures"], outputCol="Features")

    dt = DecisionTreeClassifier(
        labelCol="Species",
        featuresCol="Features",
        maxDepth=max_depth,
        maxBins=max_bins
    )

    pipeline = Pipeline(stages=[catIndexer, numVector, numScaler, featureVector, dt])
    model = pipeline.fit(train)

    # Evaluate the model using accuracy.
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="Species",
        predictionCol="prediction",
        metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)

    # optuna.create_study() minimizes by default, so return negative accuracy.
    return -accuracy
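The objective above returns negative accuracy so that a minimizing study ends up favoring the most accurate model. As a minimal sketch of that sign convention, the snippet below computes accuracy by hand and negates it. The labels and predictions are made up for illustration, and the `accuracy` helper is a hypothetical stand-in for the Spark evaluator.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Made-up integer class labels and predictions, for illustration only.
labels      = [0, 0, 1, 2, 2, 1, 0, 2]
predictions = [0, 0, 1, 2, 1, 1, 0, 2]

acc = accuracy(labels, predictions)  # 7 of 8 predictions are correct
objective_value = -acc               # the value a minimizing study would see

print(acc, objective_value)
```

A higher accuracy gives a lower (more negative) objective value, so the trial with the smallest objective is the one with the best accuracy.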

In [0]:
# Run the optimization experiment

# Optimization run with 5 trials:
study = optuna.create_study()
study.optimize(objective, n_trials=5)

print("Best param values from the optimization run:")
print(study.best_params)

[I 2025-09-25 07:54:30,967] A new study created in memory with name: no-name-c3e7b7c7-21b9-4c43-8151-55ae417f6cd0



[I 2025-09-25 07:55:53,660] Trial 0 finished with value: -0.9722222222222222 and parameters: {'MaxDepth': 5, 'MaxBins': 30}. Best is trial 0 with value: -0.9722222222222222.



[I 2025-09-25 07:57:00,019] Trial 1 finished with value: -0.9722222222222222 and parameters: {'MaxDepth': 7, 'MaxBins': 10}. Best is trial 0 with value: -0.9722222222222222.



[I 2025-09-25 07:58:04,430] Trial 2 finished with value: -0.9814814814814815 and parameters: {'MaxDepth': 4, 'MaxBins': 20}. Best is trial 2 with value: -0.9814814814814815.



[I 2025-09-25 07:59:06,372] Trial 3 finished with value: -0.9629629629629629 and parameters: {'MaxDepth': 3, 'MaxBins': 10}. Best is trial 2 with value: -0.9814814814814815.



[I 2025-09-25 08:00:11,628] Trial 4 finished with value: -0.9814814814814815 and parameters: {'MaxDepth': 6, 'MaxBins': 20}. Best is trial 2 with value: -0.9814814814814815.


Best param values from the optimization run:
{'MaxDepth': 4, 'MaxBins': 20}


Observe that the code runs the training function five times (as set by the n_trials argument) while trying to minimize the objective, which here is the negative accuracy. Each trial is recorded by MLflow; use the ▸ toggle to expand the MLflow run output under the code cell, then select the experiment hyperlink to view the runs. Each run is assigned a random name, and you can open any of them in the MLflow run viewer to see the parameters and metrics that were recorded.
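The trial loop described above can be sketched in plain Python. This is a simplified illustration, not how Optuna works internally: Optuna's default sampler (TPE) uses earlier results to guide later suggestions, whereas this sketch samples purely at random. The `stand_in_score` function is a hypothetical placeholder for training and evaluating a model, and its formula is arbitrary.

```python
import random

def stand_in_score(max_depth, max_bins):
    """Hypothetical score standing in for training and evaluating a model."""
    # Arbitrary formula, chosen only so that different settings give different scores.
    return 1.0 - abs(max_depth - 4) * 0.01 - abs(max_bins - 20) * 0.001

def run_trials(n_trials, seed=0):
    """Sample parameters n_trials times and record the objective value of each trial."""
    rng = random.Random(seed)
    trials = []
    for number in range(n_trials):
        # Same search space as the objective function in this notebook.
        params = {
            "MaxDepth": rng.randint(0, 9),        # integer, inclusive on both ends
            "MaxBins": rng.choice([10, 20, 30]),  # one value from the list
        }
        value = -stand_in_score(params["MaxDepth"], params["MaxBins"])  # negate to minimize
        trials.append({"number": number, "params": params, "value": value})
    return trials

trials = run_trials(n_trials=5)
best = min(trials, key=lambda t: t["value"])  # lowest objective value wins

for t in trials:
    print(t["number"], t["params"], t["value"])
print("Best:", best["params"])
```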

When all of the runs have finished, observe that the code displays the best hyperparameter values that were found (the combination that produced the lowest objective value, that is, the highest accuracy).
In this case,
- the MaxBins parameter is defined as a choice from a list of three possible values (10, 20, and 30), and the best value is reported as the chosen value itself (20 in the run above), not as an index into the list.
- the MaxDepth parameter is defined as an integer between 0 and 9 (inclusive), and the value that gave the best result (4 in the run above) is displayed.
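To make the selection of the best trial concrete, the sketch below copies the five trial results from the optimization log above and picks the one with the lowest objective value. It also converts each value back into a count of correct predictions, assuming the test set still held the 108 rows reported earlier. Trials 2 and 4 tie; the log kept trial 2 as the best, and Python's `min` likewise returns the first of the tied entries.

```python
# Trial results copied from the optimization log above.
trials = [
    {"number": 0, "value": -0.9722222222222222, "params": {"MaxDepth": 5, "MaxBins": 30}},
    {"number": 1, "value": -0.9722222222222222, "params": {"MaxDepth": 7, "MaxBins": 10}},
    {"number": 2, "value": -0.9814814814814815, "params": {"MaxDepth": 4, "MaxBins": 20}},
    {"number": 3, "value": -0.9629629629629629, "params": {"MaxDepth": 3, "MaxBins": 10}},
    {"number": 4, "value": -0.9814814814814815, "params": {"MaxDepth": 6, "MaxBins": 20}},
]

# The lowest objective value (most negative accuracy) is the best trial.
best = min(trials, key=lambda t: t["value"])
print(best["number"], best["params"])  # 2 {'MaxDepth': 4, 'MaxBins': 20}

# Assumption: 108 test rows, as printed after the train/test split.
test_row_count = 108
correct_counts = [round(-t["value"] * test_row_count) for t in trials]
print(correct_counts)  # [105, 105, 106, 104, 106]
```

Under that assumption, the best trial misclassified only 2 of the 108 test rows, and the spread between the best and worst trial is 2 rows.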

### Closing Summary

Optuna helps data scientists build optimal machine learning models by providing an efficient, easy-to-use framework for running hyperparameter optimization trials. With its integration into Azure Databricks and MLflow, it offers a powerful solution for distributed hyperparameter tuning.