## Evaluating LLMs with MLflow 📏

MLflow provides a framework for tracking and managing machine learning experiments. In this notebook, we explore how to evaluate large language models (LLMs) using MLflow, focusing on key metrics such as BLEU, ROUGE, and LLM-as-a-judge metrics.
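
To make the overlap-based metrics concrete, here is a minimal sketch of a ROUGE-1 F1 score computed by hand. It assumes lowercase whitespace tokenization, which is a simplification; library implementations add proper tokenization and optional stemming, so their scores can differ. The function name `rouge1_f1` and the example sentences are illustrative and do not come from the evaluation script.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pair, for illustration only.
score = rouge1_f1("the bond yield rose sharply", "the bond yield rose")
print(f"ROUGE-1 F1: {score:.3f}")
```

Here four of the five predicted tokens appear in the reference, so precision is 0.8 and recall is 1.0. BLEU follows a similar idea but is precision-oriented and uses longer n-grams.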


Install the required libraries


In [None]:
!pip install --upgrade -r requirements-fine-tuning.txt -q

Import the necessary libraries and modules


In [None]:
# from dotenv import load_dotenv
# load_dotenv()
import os
import boto3
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode, StoppingCondition

SageMaker AI provides containers for popular ML frameworks, including PyTorch, TensorFlow, and Hugging Face. These containers are optimized for performance and security, allowing you to focus on building and deploying your models without worrying about the underlying infrastructure.

We use the PyTorch 2.8.0 training container with Python 3.12 and set up the ModelTrainer API from the SageMaker SDK to run the evaluation as a SageMaker training job. The cell below defines the container image, the source code location and entry script (`eval.sh`), the environment variables, and the hyperparameters. The instance type, instance count, and stopping condition are configured in a later cell.


In [None]:
pytorch_image = '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.8.0-gpu-py312-cu129-ubuntu22.04-sagemaker'
# define the script to be run
source_code = SourceCode(
    source_dir="scripts",
    entry_script="eval.sh",
)

from huggingface_hub import HfFolder

model_id = 'Qwen/Qwen3-0.6B'
dataset_name = 'Josephgflowers/Finance-Instruct-500k'

environment = {
    'HF_TOKEN': HfFolder.get_token(),
    "HF_DATASET": dataset_name,
    "MODEL_ID": model_id,
    "MLFLOW_TRACKING_URI": os.getenv(
        'MLFLOW_TRACKING_URI',
        'arn:aws:sagemaker:us-east-1:198346569064:mlflow-tracking-server/vlm-finetuning-server',
    ),
    "MLFLOW_EXPERIMENT_NAME": "qwen3-06b-lora-ft-finance",
    "NUM_SAMPLES": '10',
}

# experiment_name = 'qwen3-06b-lora-ft-finance'
# run_name = 'qwen3-06b-finance'

hyperparameters = {
    "model_id": model_id,
    "adapter_path": 's3://sagemaker-us-east-1-198346569064/qwen3-06b-fine-tuned/',
    'dataset_name': dataset_name,
    'experiment_name': 'qwen3-06b-lora-ft-finance',
    'run_name': 'fine-tuning-run-1',
}

assert (
    environment["MLFLOW_TRACKING_URI"] != "XXX"
), "Please set your MLFLOW_TRACKING_URI in the environment variable"

assert (
    environment["HF_TOKEN"] is not None
), "Please set your HF_TOKEN in the environment variable"
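
The two assertions above cover only the tracking URI and the token. A broader check could look like the sketch below, which assumes that every value in `environment` should be a non-empty string, since the values are passed to the container as environment variables. `find_invalid_env` is a hypothetical helper name, and the example dictionary is made up for illustration.

```python
def find_invalid_env(env: dict) -> list[str]:
    """Return the keys whose values are not non-empty strings."""
    return [
        key
        for key, value in env.items()
        if not isinstance(value, str) or not value.strip()
    ]


# Made-up example: a missing token and an empty model ID are both flagged.
example_env = {"HF_TOKEN": None, "NUM_SAMPLES": "10", "MODEL_ID": ""}
print(find_invalid_env(example_env))
```

Running this against the real `environment` dictionary before creating the trainer would catch a missing value locally rather than inside the container.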

We use an `ml.g5.2xlarge` instance for this job. G5 instances are powered by NVIDIA A10G Tensor Core GPUs and offer a good balance of compute, memory, and networking resources for machine learning workloads.

The `ml.g5.2xlarge` has a single NVIDIA A10G GPU, 8 vCPUs, and 32 GB of memory. This is enough compute for a small model such as Qwen3-0.6B while remaining cost-effective.

If you are running multiple experiments, you can reduce job startup time by using warm pools. Warm pools are configured with the `keep_alive_period_in_seconds` parameter, which specifies how long the provisioned infrastructure is kept alive after a training job has completed. A subsequent job with a matching configuration can then start faster, because the instance is already provisioned and the container is already downloaded.


In [None]:
stopping_condition = StoppingCondition(
    max_runtime_in_seconds=60 * 60 * 10,  # seconds * minutes * hours
)

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    # volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

We tie together all the previous objects and configurations to create a ModelTrainer instance. This instance is responsible for orchestrating the job on SageMaker. We pass in the training image, source code configuration, compute configuration, stopping condition, environment variables, and hyperparameters.


In [None]:
base_job_name = "mlflow-eval-llmaaj"

# define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    stopping_condition=stopping_condition,
    base_job_name=base_job_name,
    compute=compute,
    environment=environment,
    hyperparameters=hyperparameters,
)

`model_trainer.train()` is the command that initiates the job on SageMaker. This method packages the source code, uploads it to S3, and starts the job using the specified compute resources and configurations. The job runs the `eval.sh` entry script from the `scripts` directory, which installs some Python libraries and then runs the evaluation and logs the results to MLflow. Because we pass `wait=False`, the call returns as soon as the job is submitted instead of blocking until it finishes.


In [None]:
model_trainer.train(wait=False)
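
Since the job is submitted without waiting, you may want to check on it later. The sketch below polls the job status through the `describe_training_job` call of the boto3 SageMaker client. The helper name `wait_for_training_job`, the polling interval, and the poll limit are assumptions chosen for illustration, and the job name has to be supplied by you (for example, copied from the SageMaker console).

```python
import time

# Statuses after which a SageMaker training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Usage (requires AWS credentials; the job name is a placeholder):
# import boto3
# sm_client = boto3.client("sagemaker")
# print(wait_for_training_job(sm_client, "<your-training-job-name>"))
```

The `sleep` parameter is injectable so the helper can be exercised with a stub client without actually waiting.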