This project demonstrates how to perform hyperparameter tuning for a logistic regression model using the Iris dataset. The code leverages MLflow for experiment tracking and Optuna for hyperparameter optimization.
Project files:
- iris_classification.py: Main script for training and tuning the logistic regression model.
- mlflow_utils.py: Utility functions for MLflow.
- optuna_utils.py: Utility functions for Optuna.
- blob_storage_deploy.py: Script for uploading files to Azure Blob Storage.
- start_mlflow_server.sh: Script to start the MLflow server.
- mlartifacts/: Directory containing MLflow artifacts.
- mlruns/: Directory containing MLflow run data.
- models/: Directory containing registered models.

Prerequisites:
- Python 3.x
- MLflow
- Optuna
- scikit-learn
- Azure Storage Blob

- Install the required Python packages:
    pip install mlflow optuna scikit-learn azure-storage-blob
- Start the MLflow server:
    ./start_mlflow_server.sh

To run the iris_classification.py script, execute the following command:
    python iris_classification.py
To run the blob_storage_deploy.py script, execute the following command:
    python blob_storage_deploy.py

The iris_classification.py script performs the following steps:
- Import Libraries: Imports necessary libraries including MLflow, Optuna, and scikit-learn.
- Load Dataset: Loads the Iris dataset using scikit-learn's load_iris function.
- Split Data: Splits the dataset into training and validation sets.
    X_train, X_valid, y_train, y_valid = train_test_split(
        iris.data, iris.target, test_size=0.3
    )
- Set MLflow Experiment: Sets the current active MLflow experiment using the get_or_create_experiment function from mlflow_utils.py.
    experiment_id = get_or_create_experiment("Iris Classification")
    mlflow.set_experiment(experiment_id=experiment_id)
- Start MLflow Run: Initiates an MLflow run and creates an Optuna study for hyperparameter tuning.
    with mlflow.start_run(nested=True):
        study = optuna.create_study(
            direction="maximize",
            study_name="Iris Classification",
            load_if_exists=True,
        )
- Optimize Hyperparameters: Uses Optuna to optimize the hyperparameters of a logistic regression model. The logistic_regression_error function from optuna_utils.py is used as the objective function.
    study.optimize(
        lambda trial: logistic_regression_error(
            trial, X_train, X_valid, y_train, y_valid
        ),
        n_trials=10,
    )
- Load Best Model: Loads the best model of the study from a file if it exists.
    if os.path.exists("best_model.pkl"):
        best_model = joblib.load("best_model.pkl")
- Log Parameters and Metrics: Logs the best hyperparameters and the best accuracy to MLflow.
- Set Tags: Logs tags related to the project, optimizer engine, model family, and feature set version.
- Train Model: Trains a logistic regression model using the best hyperparameters found by Optuna.
- Log Model: Logs the trained model as an artifact in MLflow.
- Print Model URI: Prints the URI of the logged model.
- Evaluate Model: Evaluates the model on the validation set and logs the evaluation metrics to MLflow.
The blob_storage_deploy.py script performs the following steps:
- Import Libraries: Imports necessary libraries including os and azure.storage.blob.
- Define Function: Defines the upload_directory_to_blob function to upload a directory to Azure Blob Storage.
- Connect to Azure Blob Storage: Connects to Azure Blob Storage using the connection string from the environment variable AZURE_STORAGE_CONNECTION_STRING.
- Recursively Upload Files: Recursively uploads each file in the specified local directory to Azure Blob Storage, excluding certain files like requirements.txt, python_env.yaml, and conda.yaml.
- Get Blob Client: Gets the blob client for each file and uploads the file to the specified container and blob name.
- Handle Exceptions: Handles any exceptions that occur during the upload process and prints an error message.
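
The upload steps above might look something like the sketch below. The signature of upload_directory_to_blob, the helper iter_upload_targets, and the EXCLUDED_FILES constant are assumptions for illustration; the actual script may be organized differently. Only the file-selection helper runs without Azure credentials.

```python
import os

# File names skipped during upload, as listed in the steps above.
EXCLUDED_FILES = {"requirements.txt", "python_env.yaml", "conda.yaml"}


def iter_upload_targets(local_dir, excluded=EXCLUDED_FILES):
    """Yield (local_path, blob_name) pairs for every non-excluded file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            if name in excluded:
                continue
            local_path = os.path.join(root, name)
            # Blob names use forward slashes regardless of the local OS.
            blob_name = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            yield local_path, blob_name


def upload_directory_to_blob(local_dir, container_name):
    """Upload every non-excluded file under local_dir to the given container."""
    # Imported here so the helper above works without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    service = BlobServiceClient.from_connection_string(connection_string)
    for local_path, blob_name in iter_upload_targets(local_dir):
        try:
            blob_client = service.get_blob_client(
                container=container_name, blob=blob_name
            )
            with open(local_path, "rb") as data:
                # overwrite=True is an assumption; it replaces existing blobs.
                blob_client.upload_blob(data, overwrite=True)
        except Exception as exc:
            # Report the failure and continue with the remaining files.
            print(f"Failed to upload {local_path}: {exc}")
```

Separating file selection from the upload call makes the exclusion logic easy to check locally before any data is sent to Azure.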

The utility modules provide the following functions:
- mlflow_utils.py: Contains the get_or_create_experiment function to manage MLflow experiments.
- optuna_utils.py: Contains the champion_callback and logistic_regression_error functions for Optuna optimization.
When you run the iris_classification.py script, the following will be logged and displayed in the MLflow dashboard:
- Experiment Creation/Loading: An experiment will be created or loaded in MLflow. This experiment will contain all the runs related to the Iris classification project.
- Nested Runs: A parent run will be logged within the experiment. It encapsulates several child runs, each corresponding to one trial of the Optuna hyperparameter tuning process.
- Training Runs: At the lowest level, there will be multiple training runs, one per Optuna trial. Each run represents a different set of hyperparameters evaluated during the tuning process. Metrics such as accuracy and loss will be logged for each run.
- Best Parameters Logging: Once Optuna identifies the best hyperparameters, they will be logged in the parent run, which takes the name of the best trial. The parent run also records the corresponding performance metrics and the model itself, so there is no need to retrain.
- Model Registration: The best model will be loaded from the best trial and registered as a new artifact in MLflow. The model will have a default validation status set to "pending".
- Tags and Metrics: Various tags and metrics will be logged for both the parent and child runs, providing detailed information about the experiment, optimizer engine, model family, and feature set version.
By navigating to the MLflow dashboard, you can visualize and analyze the experiment, nested runs, and the performance of different hyperparameter configurations. The registered model can be accessed and managed through the MLflow model registry.
This project is licensed under the MIT License. See the LICENSE file for details.