🎞️ Movie Analogy: Hyperparameter Tuning
🎬 Imagine a Movie Director (ML Engineer) wants to make a blockbuster film (Machine Learning Model).
But there’s a problem — the director doesn’t know:

What genre will connect best (🤔 classifier vs regressor),

What lead actor will work (🤵 random forest, xgboost, etc.),

What camera angle, lighting, costume, or dialogues (🎥 hyperparameters like max_depth, learning_rate, etc.) will make the movie a hit.

So what does the director do?

🎯 Step-by-Step Comparison
| ML Term | Movie World Example 🎥 |
| --- | --- |
| Model | The Movie Script |
| Hyperparameters | Camera settings, actor choice, costume design |
| Training | Shooting and rehearsing |
| Evaluation (Accuracy, RMSE) | Audience review and box office performance |
| Hyperparameter Tuning | Trying multiple variations of scenes and scripts to see which one gets the best test audience reaction |

🎥 Hyperparameter Tuning = Test Screening
The director makes many versions of the movie (with small changes):

One with more action (higher learning rate),

One with more emotion (deeper tree),

One with two lead actors instead of one (more estimators),

One with fewer songs (less regularization).

Each version is shown to a test audience (cross-validation), and their reaction (accuracy, F1-score) helps decide which version is best.
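
To make the test-screening idea concrete, here is a minimal pure-Python sketch of cross-validation picking a hyperparameter. Everything in it is an assumption made for illustration: the toy `DATA`, the simple nearest-neighbour "model", and the candidate values for `k` are hypothetical and are not part of the diabetes lab that follows.

```python
import random

# Hypothetical toy data: (feature, label) pairs. The label is 1 when the
# feature is >= 50, except for two deliberately noisy points.
DATA = [(x, int(x >= 50)) for x in range(0, 100, 5)]
DATA[3] = (15, 1)
DATA[16] = (80, 0)

def knn_predict(train_rows, x, k):
    """Majority vote among the k training rows whose feature is closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > len(nearest))

def fold_accuracy(train_rows, val_rows, k):
    """Fraction of validation rows predicted correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in val_rows)
    return hits / len(val_rows)

def cross_val_accuracy(rows, k, n_folds=4, seed=0):
    """Average accuracy over n_folds train/validation splits (the test audiences)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, val_rows in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(fold_accuracy(train_rows, val_rows, k))
    return sum(scores) / n_folds

# Each candidate value of k is one "version of the movie".
results = {k: cross_val_accuracy(DATA, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

Every version is judged on rows it was not trained on, and the version with the best average reaction wins.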

🎬 Final Output
After tuning and testing many combinations, the director finds the perfect blend of scenes — that’s your best model configuration, ready for full release (production deployment)!

# ML Model Hyperparameter Tuning
## Loading the CSV dataset into DBFS (Databricks File System)

In [0]:
%sh
 rm -rf /dbfs/mlflow_lab
 mkdir -p /dbfs/mlflow_lab
 wget -O /dbfs/mlflow_lab/diabetes.csv https://raw.githubusercontent.com/kuljotSB/DatabricksUdemyCourse/refs/heads/main/MachineLearningModel/diabetes.csv
     

## Splitting the dataset into training and test sets

In [0]:

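# Import only what this cell uses; a wildcard import of pyspark.sql.functions
# would shadow Python built-ins such as sum, min and max.
from pyspark.sql.functions import col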
   
data = spark.read.format("csv").option("header", "true").load("/mlflow_lab/diabetes.csv")
data = data.dropna().select(col("Pregnancies").astype("int"),
                           col("Glucose").astype("int"),
                          col("BloodPressure").astype("int"),
                          col("SkinThickness").astype("int"),
                          col("Insulin").astype("int"),
                          col("BMI").astype("float"),
                          col("DiabetesPedigreeFunction").astype("float"),
                          col("Age").astype("int"),
                          col("Outcome").astype("int")
                          )

   
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1]
print ("Training Rows:", train.count(), " Testing Rows:", test.count())
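
The printed counts will usually not be an exact 70/30 split, because the weights act as per-row probabilities rather than fixed quotas. The sketch below illustrates that behaviour in plain Python. It is an assumption-laden illustration, not Spark's actual implementation: the helper `random_split` and the row count of 1000 are hypothetical.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent weighted random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # hypothetical row count
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print("Training Rows:", len(train_rows), " Testing Rows:", len(test_rows))
```

In PySpark, `randomSplit` also accepts a seed, for example `data.randomSplit([0.7, 0.3], seed=42)`, which should make the split repeatable across runs on the same input data.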
     

## Optimizing Hyperparameter values for our ML model

In [0]:
from hyperopt import STATUS_OK
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
   
def objective(params):
    # Train a model using the provided hyperparameter values
    numFeatures = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
    numVector = VectorAssembler(inputCols=numFeatures, outputCol="numericFeatures")
    numScaler = MinMaxScaler(inputCol = numVector.getOutputCol(), outputCol="normalizedFeatures")
    featureVector = VectorAssembler(inputCols=["normalizedFeatures"], outputCol="Features")
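    # Cast to int defensively: some search-space expressions return floats or
    # NumPy integers, while maxDepth and maxBins expect plain Python ints.
    mlAlgo = DecisionTreeClassifier(labelCol="Outcome",
                                    featuresCol="Features",
                                    maxDepth=int(params['MaxDepth']),
                                    maxBins=int(params['MaxBins']))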
    pipeline = Pipeline(stages=[numVector, numScaler, featureVector, mlAlgo])
    model = pipeline.fit(train)
       
    # Evaluate the model to get the target metric
    prediction = model.transform(test)
    # Named "evaluator" rather than "eval" to avoid shadowing the Python built-in
    evaluator = MulticlassClassificationEvaluator(labelCol="Outcome", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(prediction)
       
    # Hyperopt tries to minimize the objective function, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
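
To see why the objective returns the negative accuracy, here is a small self-contained sketch. It makes several assumptions: `toy_objective` is a hypothetical stand-in for real training, the parameter ranges are invented for illustration, and plain random search is swapped in for Hyperopt's own search algorithm so the example runs without any installs.

```python
import random

STATUS_OK = "ok"  # local stand-in for hyperopt's STATUS_OK

def toy_objective(params):
    """Hypothetical objective: accuracy peaks at MaxDepth=5, MaxBins=20."""
    accuracy = (0.9
                - 0.02 * abs(params["MaxDepth"] - 5)
                - 0.001 * abs(params["MaxBins"] - 20))
    # A minimizer looks for the smallest loss, so report accuracy negated.
    return {"loss": -accuracy, "status": STATUS_OK}

def random_search(objective, n_trials, seed=0):
    """Try n_trials random parameter sets and return the lowest-loss trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        # Hypothetical ranges, chosen only for this illustration.
        params = {"MaxDepth": rng.randint(1, 10),
                  "MaxBins": rng.choice([10, 20, 30])}
        result = objective(params)
        trials.append((params, result["loss"]))
    # The smallest loss corresponds to the largest accuracy.
    return min(trials, key=lambda t: t[1]), trials

(best_params, best_loss), all_trials = random_search(toy_objective, n_trials=25)
print("best params:", best_params, "accuracy:", -best_loss)
```

Flipping the sign turns "maximize accuracy" into "minimize loss", which is the form a minimizer expects.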




## Defining the search space for our hyperparameters
## We will also log each hyperparameter run using a Trials() object