## Fine-Tuning and Evaluating LLMs with SageMaker Pipelines and MLflow

Running hundreds of experiments, comparing their results, and keeping track of the ML lifecycle quickly becomes complex. MLflow helps streamline that lifecycle, from data preparation to model deployment. By integrating MLflow into your LLM workflow, you can manage experiment tracking, model versioning, and deployment in one place, which makes runs reproducible. You can track and compare the performance of multiple LLM experiments, identify the best-performing models, and deploy them to production with confidence.

You can create workflows with SageMaker Pipelines that enable you to prepare data, fine-tune models, and evaluate model performance with simple Python code for each step. 

Now you can use SageMaker managed MLflow to run LLM fine-tuning and evaluation experiments at scale. Specifically:

- MLflow can manage tracking of fine-tuning experiments, comparing evaluation results of different runs, model versioning, deployment, and configuration (such as data and hyperparameters)
- SageMaker Pipelines can orchestrate multiple experiments based on the experiment configuration 
  

The following figure shows an overview of the solution.
![](./ml-16670-arch-with-mlflow.png)

## Prerequisites 
Before you begin, make sure you have the following prerequisites in place:

- [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens) – You need a HuggingFace login token to access the gated Llama 3.2 model and datasets used in this post.

- Once you have your HuggingFace access token, navigate to the **steps/finetune_llama3b_hf.py** and update the **'hf_token'** parameter with your access token to download the Llama model for fine-tuning.

### 1. Setup and Dependencies
Restart the kernel after executing the cells below.

In [1]:
%pip install -r ./scripts/requirements.txt --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
jupyter-ai 2.30.0 requires faiss-cpu!=1.8.0.post0,<2.0.0,>=1.8.0, which is not installed.
autogluon-common 1.2 requires psutil<7.0.0,>=5.7.3, but you have psutil 7.0.0 which is incompatible.
autogluon-core 1.2 requires scikit-learn<1.5.3,>=1.4.0, but you have scikit-learn 1.6.1 which is incompatible.
autogluon-features 1.2 requires scikit-learn<1.5.3,>=1.4.0, but you have scikit-learn 1.6.1 which is incompatible.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.

In [2]:
from IPython import get_ipython
get_ipython().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

**Importing Libraries and Setting Up Environment**

This part imports all necessary Python modules. It includes SageMaker-specific imports for pipeline creation and execution, as well as user-defined functions for the pipeline steps like finetune_llama3b_hf and preprocess_llama3.

In [1]:
import os
import sagemaker
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.function_step import step
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.steps import CacheConfig

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


### 2. SageMaker Session and IAM Role

`get_execution_role()`: Retrieves the IAM role that SageMaker will use to access AWS resources. This role needs appropriate permissions for tasks like accessing S3 buckets and creating SageMaker resources.

In [2]:
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
instance_type = "ml.m5.xlarge"
processing_instance_type = "ml.m5.xlarge"
training_instance_type = "ml.m5.xlarge"

sagemaker.config INFO - Fetched defaults config from location: /home/sagemaker-user/generative-ai-on-amazon-sagemaker/workshops/fine-tuning-with-sagemakerai-and-bedrock/task_05_fmops


In [3]:
bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
if default_prefix:
    input_path = f'{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft'
else:
    input_path = f'datasets/llm-fine-tuning-modeltrainer-sft'

train_data_path = f"s3://{bucket_name}/{input_path}/train/dataset.json"
test_dataset_path = f"s3://{bucket_name}/{input_path}/test/dataset.json"

pipeline_name = "deepseek-finetune-pipeline"
    
tracking_server_arn = "arn:aws:sagemaker:us-east-1:905418257479:mlflow-tracking-server/genai-mlflow-tracker"
experiment_name = "deepseek-finetune-pipeline"
os.environ["mlflow_uri"] = ""  # written into args.yaml as mlflow_uri; set to tracking_server_arn to use the tracking server
os.environ["mlflow_experiment_name"] = "deepseek-finetune-pipeline"

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_id_filesafe = model_id.replace("/","_")
model_s3_destination="s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B"
use_local_model = True  # True: download the model locally; False: let the training job download it from Hugging Face

### 3. Configuration

**Training Configuration**

The `train_config` dictionary is comprehensive, including:

- Experiment naming for tracking purposes
- Model specifications (ID, version, name)
- Infrastructure details (instance types and counts for fine-tuning and deployment)
- Training hyperparameters (epochs, batch size)

This configuration allows for easy adjustment of the training process without changing the core pipeline code.
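As a rough illustration, a configuration of this shape could be grouped as a nested dictionary. The notebook does not show `train_config` itself, so the key names and the `with_overrides` helper below are hypothetical; the values are taken from elsewhere in this walkthrough.

```python
# Hypothetical layout for a train_config dictionary. The key names are
# illustrative assumptions; the values come from this notebook.
train_config = {
    "experiment_name": "deepseek-finetune-pipeline",
    "model": {
        "model_id": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    },
    "infrastructure": {
        "finetune_instance_type": "ml.p3.2xlarge",
        "finetune_instance_count": 1,
    },
    "hyperparameters": {
        "num_train_epochs": 1,
        "per_device_train_batch_size": 1,
    },
}


def with_overrides(config: dict, **hyperparameters) -> dict:
    """Return a copy of config with some hyperparameters replaced."""
    merged = {**config["hyperparameters"], **hyperparameters}
    return {**config, "hyperparameters": merged}


# Configure a second run with more epochs without touching the original.
run_b = with_overrides(train_config, num_train_epochs=3)
print(run_b["hyperparameters"])
```

Keeping each run's settings in a separate copy makes it straightforward to log the exact configuration alongside the run.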

In [4]:
from huggingface_hub import snapshot_download
from sagemaker.s3 import S3Uploader
import os
import subprocess


model_local_location = f"../models/{model_id_filesafe}"
# print("Downloading model ", model_id)
# os.makedirs(model_local_location, exist_ok=True)
# snapshot_download(repo_id=model_id, local_dir=model_local_location)
# print(f"Model {model_id} downloaded under {model_local_location}")

# if default_prefix:
#     model_s3_destination = f"s3://{bucket_name}/{default_prefix}/models/{model_id_filesafe}"
# else:
#     model_s3_destination = f"s3://{bucket_name}/models/{model_id_filesafe}"

# print(f"Beginning Model Upload...")

# subprocess.run(['aws', 's3', 'cp', model_local_location, model_s3_destination, '--recursive', '--exclude', '.cache/*', '--exclude', '.gitattributes'])

# print(f"Model Uploaded to: \n {model_s3_destination}")

os.environ["model_location"] = model_s3_destination  # read by the args.yaml heredoc as ${model_location}

print(model_s3_destination)

Downloading model  deepseek-ai/DeepSeek-R1-Distill-Llama-8B


Model deepseek-ai/DeepSeek-R1-Distill-Llama-8B downloaded under ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B
Beginning Model Upload...
upload: ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/config.json to s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/config.json
upload: ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/README.md to s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/README.md
upload: ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/generation_config.json to s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/generation_config.json
upload: ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/LICENSE to s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/LICENSE
upload: ../models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/tokenizer_config.json to s3://sagemaker-us-east-1-891377369387/models/deepseek-ai_DeepSeek-R1-Distill-

**LoRA Parameters**

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique for large language models. The parameters here (lora_r, lora_alpha, lora_dropout) control the behavior of LoRA during fine-tuning, affecting the trade-off between model performance and computational efficiency.
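To make the trade-off concrete, the sketch below counts the trainable values LoRA adds for a single weight matrix. It relies on the standard LoRA formulation: a frozen `d_out x d_in` weight receives an update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`, scaled by `lora_alpha / r`. The 4096 layer size is an assumption chosen for illustration, and the function names are hypothetical. `lora_dropout` is applied to the LoRA branch during training and does not change the parameter count.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by LoRA for one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in); the original weight is frozen.
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added."""
    return lora_alpha / r


d_in = d_out = 4096          # assumed size of one projection matrix
lora_r, lora_alpha = 8, 16   # values from this notebook's args.yaml

full = d_in * d_out
added = lora_trainable_params(d_in, d_out, lora_r)
print(f"full matrix: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print(f"update scaling (alpha / r): {lora_scaling(lora_alpha, lora_r)}")
```

Raising `lora_r` increases capacity and the number of trainable values linearly, while `lora_alpha` only rescales the update.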

### 4. MLflow Setup

MLflow integration is crucial for experiment tracking and management. **Update the ARN for the MLflow tracking server.**

- `tracking_server_arn`: The ARN of the MLflow tracking server, which you can copy from the SageMaker Studio UI. It lets the pipeline log metrics, parameters, and artifacts to a central location.
- `experiment_name`: A descriptive name for the experiment, used to group related runs in MLflow.
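A minimal sketch of how these two values might be checked and applied is shown below. The ARN pattern is inferred from the example ARN in this notebook and is an assumption, not an official grammar (it only covers the `aws` partition). The helper names are hypothetical, and `configure_mlflow` assumes the `mlflow` and `sagemaker-mlflow` packages are installed.

```python
import re

# Pattern inferred from the example ARN in this notebook (an assumption).
ARN_PATTERN = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12}):"
    r"mlflow-tracking-server/(?P<name>[A-Za-z0-9-]+)$"
)


def parse_tracking_server_arn(arn: str) -> dict:
    """Split a tracking server ARN into region, account, and server name."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a tracking server ARN: {arn!r}")
    return match.groupdict()


def configure_mlflow(arn: str, experiment_name: str) -> None:
    """Point the MLflow client at the tracking server and select an experiment."""
    import mlflow  # imported lazily so the parser works without mlflow installed

    parse_tracking_server_arn(arn)  # fail early on a malformed ARN
    mlflow.set_tracking_uri(arn)
    mlflow.set_experiment(experiment_name)


# Placeholder ARN for illustration only.
parts = parse_tracking_server_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/example-tracker"
)
print(parts["region"], parts["name"])
```

Validating the ARN up front surfaces copy-and-paste mistakes before any pipeline step starts.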

### 5. Dataset Configuration

For fine-tuning and evaluation, we use the English (`en`) configuration of the `FreedomIntelligence/medical-o1-reasoning-SFT` dataset, which the preprocessing step loads from the Hugging Face Hub.

### 6. Pipeline Steps

This section defines the core components of the SageMaker pipeline.

**Preprocessing Step**

This step handles data preparation: it loads the dataset, splits it into training and evaluation sets, applies the prompt template to each record, and uploads both sets to S3.

In [12]:
@step(
    name="DataPreprocessing",
    instance_type=processing_instance_type,
    display_name="Data Preprocessing",
    keep_alive_period_in_seconds=3600
)
def preprocess(
    input_path: str,
    experiment_name: str,
    run_id: str,
) -> tuple:
    import boto3
    import shutil
    import sagemaker
    from sagemaker.config import load_sagemaker_config
    
    sagemaker_session = sagemaker.Session()
    s3_client = boto3.client('s3')
    
    bucket_name = sagemaker_session.default_bucket()
    default_prefix = sagemaker_session.default_bucket_prefix
    configs = load_sagemaker_config()
    
    
    from datasets import load_dataset
    import pandas as pd
    
    dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
    
    df = pd.DataFrame(dataset['train'])
    df = df[:100]
    
    # df.head()
    
    from sklearn.model_selection import train_test_split
    
    train, test = train_test_split(df, test_size=0.1, random_state=42, shuffle=True)
    
    print("Number of train elements: ", len(train))
    print("Number of test elements: ", len(test))
    
    # custom instruct prompt start
    prompt_template = f"""
    <|begin_of_text|>
    <|start_header_id|>system<|end_header_id|>
    You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
    Below is an instruction that describes a task, paired with an input that provides further context. 
    Write a response that appropriately completes the request.
    Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    {{question}}<|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
    {{complex_cot}}
    
    {{answer}}
    <|eot_id|>
    """
    
    # template dataset to add prompt to each sample
    def template_dataset(sample):
        sample["text"] = prompt_template.format(question=sample["Question"],
                                                complex_cot=sample["Complex_CoT"],
                                                answer=sample["Response"])
        return sample
    
    from datasets import Dataset, DatasetDict
    from random import randint
    
    train_dataset = Dataset.from_pandas(train)
    test_dataset = Dataset.from_pandas(test)
    
    dataset = DatasetDict({"train": train_dataset, "test": test_dataset})
    
    train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))
    
    # Print one random training sample (randint is inclusive on both ends)
    print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])
    
    test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features))
    
    # Build the S3 key prefix for the datasets
    if default_prefix:
        input_path = f'{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft'
    else:
        input_path = 'datasets/llm-fine-tuning-modeltrainer-sft'

    # Save datasets locally before uploading them to S3
    # Only the first 100 records (split 90/10) are used due to limited compute resources for the workshop
    import os
    os.makedirs("./data/train", exist_ok=True)
    os.makedirs("./data/test", exist_ok=True)
    train_dataset.to_json("./data/train/dataset.json", orient="records")
    test_dataset.to_json("./data/test/dataset.json", orient="records")
    train_data_path = f"s3://{bucket_name}/{input_path}/train/dataset.json"
    test_dataset_path = f"s3://{bucket_name}/{input_path}/test/dataset.json"
    s3_client.upload_file("./data/train/dataset.json", bucket_name, f"{input_path}/train/dataset.json")
    s3_client.upload_file("./data/test/dataset.json", bucket_name, f"{input_path}/test/dataset.json")

    print(train_data_path)
    print(test_dataset_path)

    shutil.rmtree("./data")

    return experiment_name, run_id, train_data_path, test_dataset_path
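The prompt template in the preprocessing step is an f-string that uses doubled braces, which can be easy to misread. The short sketch below, using a cut-down hypothetical template, shows the two stages: the f-string collapses `{{question}}` to `{question}`, and the later `.format(...)` call fills it in.

```python
# Stage 1: in an f-string, doubled braces are escapes and become single braces.
template = f"""user: {{question}}
assistant: {{answer}}"""
print(template)

# Stage 2: str.format fills the single-brace placeholders per sample.
sample = {"Question": "What is 2 + 2?", "Response": "4"}
filled = template.format(question=sample["Question"], answer=sample["Response"])
print(filled)

# Substituted values are not re-parsed, so braces inside a sample are safe.
with_braces = template.format(question="What does {x} mean?", answer="A placeholder")
print(with_braces)
```

Because no variables are interpolated at stage 1, a plain string with single braces would behave the same way here.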

In [13]:
%%bash

cat > ./args.yaml <<EOF

# MLflow Config
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"


model_id: "${model_location}"       # Hugging Face model id, or S3 location

# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model 
train_dataset_path: "/opt/ml/input/data/train/"   # path to where FSx saves train dataset
test_dataset_path: "/opt/ml/input/data/test/"     # path to where FSx saves test dataset
# training parameters
max_seq_length: 1500  #512 # 2048
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1                 
learning_rate: 2e-4                    # learning rate
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
fp16: true
bf16: false                            # use bfloat16 precision, also enables FlashAttention2 (requires Ampere/Hopper GPU+ ex:A10, A100, H100)
tf32: false                            # use tf32 precision

#uncomment here for fsdp - start
# fsdp: "full_shard auto_wrap offload"
# fsdp_config: 
#     backward_prefetch: "backward_pre"
#     cpu_ram_efficient_loading: true
#     offload_params: true
#     forward_prefetch: false
#     use_orig_params: true
#uncomment here for fsdp - end

merge_weights: true                    # merge weights in the base model
EOF
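As a quick sanity check on the batch settings above, the sketch below works out the effective batch size and an approximate number of optimizer steps per epoch. It assumes a single GPU, one training example per record (no sequence packing), and the 90 training records produced by the preprocessing split; the function name is hypothetical, and the exact step count can differ slightly depending on how the trainer handles a final partial batch.

```python
import math


def optimizer_steps_per_epoch(
    num_samples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_devices: int = 1,
) -> tuple:
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)


effective, steps = optimizer_steps_per_epoch(
    num_samples=90,            # assumed: 100 records with a 10% test split
    per_device_batch_size=1,   # per_device_train_batch_size
    grad_accum_steps=2,        # gradient_accumulation_steps
)
print(f"effective batch size: {effective}, optimizer steps per epoch: {steps}")
```

With these settings, each weight update averages gradients over two samples.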

In [14]:
from sagemaker.s3 import S3Uploader

if default_prefix:
    input_path = f"s3://{bucket_name}/{default_prefix}/training_config/{model_id_filesafe}"
else:
    input_path = f"s3://{bucket_name}/training_config/{model_id_filesafe}"

# upload the model yaml file to s3
model_yaml = "args.yaml"
train_config_s3_path = S3Uploader.upload(local_path=model_yaml, desired_s3_uri=f"{input_path}/config")

print(f"Training config uploaded to:")
print(train_config_s3_path)

Training config uploaded to:
s3://sagemaker-us-east-1-891377369387/training_config/deepseek-ai_DeepSeek-R1-Distill-Llama-8B/config/args.yaml


**Fine-tuning Step**

This is where the actual model adaptation occurs. The step takes the preprocessed data and uses it to fine-tune the base LLM (in this case, `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`, a Llama-based model) by launching a SageMaker training job. It uses the LoRA technique for efficient adaptation.

In [15]:
@step(
    name="ModelFineTuning",
    instance_type=training_instance_type,
    display_name="Model Fine Tuning",
    keep_alive_period_in_seconds=3600
)
def train(
    train_dataset_s3_path: str,
    test_dataset_s3_path: str,
    train_config_s3_path: str,
    experiment_name: str,
    model_id: str,
    run_id: str,
):
    import sagemaker
    import boto3
    sagemaker_client = boto3.client('sagemaker')
    job_name = "deepseek-finetune-pipeline"
    from sagemaker.pytorch import PyTorch
    sagemaker_session = sagemaker.Session()
    pytorch_estimator = PyTorch(
        entry_point='launch_fsdp_qlora.py',
        source_dir="./scripts",
        job_name=job_name,
        base_job_name=job_name,
        max_run=50000,
        role=role,
        framework_version="2.2.0",
        py_version="py310",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        sagemaker_session=sagemaker_session,
        volume_size=50,
        disable_output_compression=True,
        keep_alive_period_in_seconds=1800,
        distribution={"torch_distributed": {"enabled": True}},
        hyperparameters={
            "config": "/opt/ml/input/data/config/args.yaml"
        }
    )

    

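    # define a data input dictionary with our uploaded S3 URIs;
    # each key is a channel name, mounted at /opt/ml/input/data/<key>,
    # so the keys must match the dataset paths in args.yaml
    data = {
      'train': train_dataset_s3_path,
      'test': test_dataset_s3_path,
      'config': train_config_s3_path
      }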

    print(data)

    pytorch_estimator.fit(data, wait=True)

    return experiment_name, run_id
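SageMaker training jobs mount each input channel passed to `fit` under `/opt/ml/input/data/<channel_name>`, so the channel names have to line up with the dataset paths the training script reads. The sketch below, using hypothetical helper names and placeholder S3 URIs, flags expected paths that no channel would provide; for example, a channel named `eval` does not satisfy a script that reads from `/opt/ml/input/data/test/`.

```python
from pathlib import PurePosixPath

# Root directory under which SageMaker mounts training input channels.
INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_mount_paths(channels: dict) -> dict:
    """Map each channel name to the directory it is mounted at in the container."""
    return {name: str(INPUT_ROOT / name) for name in channels}


def missing_paths(channels: dict, expected_paths: list) -> list:
    """Return the expected directories that no channel would provide."""
    mounted = set(channel_mount_paths(channels).values())
    return [path for path in expected_paths if path.rstrip("/") not in mounted]


# Placeholder channels for illustration only.
channels = {
    "train": "s3://example-bucket/train/dataset.json",
    "eval": "s3://example-bucket/test/dataset.json",
}
expected = ["/opt/ml/input/data/train/", "/opt/ml/input/data/test/"]
print(missing_paths(channels, expected))
```

Running a check like this before calling `fit` catches a mismatch without waiting for a training job to start and fail.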

**Evaluation Step**

After fine-tuning, this step assesses the model's performance. It uses MLflow's built-in evaluation function to compute metrics such as toxicity and exact_match, then logs the results in MLflow.
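To give a sense of what a metric such as exact_match measures, here is a hand-written stand-in. It is not MLflow's implementation, which may normalize text differently; the function below is a hypothetical illustration that only strips surrounding whitespace before comparing.

```python
def exact_match(predictions: list, targets: list) -> float:
    """Fraction of predictions identical to their target after stripping whitespace."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Two of the three predictions match their targets.
score = exact_match(["Paris", "4 ", "blue"], ["Paris", "4", "green"])
print(f"exact_match: {score:.2f}")
```

Exact match is strict for free-form LLM output, so it is usually read alongside softer metrics rather than on its own.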

In [None]:
@step(
    name="ModelEvaluation",
    instance_type=training_instance_type,
    display_name="Model Evaluation",
    keep_alive_period_in_seconds=3600
)
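def evaluate(
    experiment_name: str,
    model_id: str,
    run_id: str,
    model_s3_path: str,
):
    # The body of this step is not shown in this section
    raise NotImplementedError("evaluation step body is not included here")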

### 7. Pipeline Creation and Execution

This final section brings all the components together into an executable pipeline.

**Creating the Pipeline**

The pipeline object is created from the preprocessing and fine-tuning steps defined above. LoRA settings are read from the `args.yaml` training configuration, so they can be changed between runs without modifying the pipeline code.

In [19]:
preprocessing_step = preprocess(
    experiment_name=experiment_name,
    run_id=ExecutionVariables.PIPELINE_EXECUTION_ID,
    input_path=input_path,
)

training_step = train(
    train_dataset_s3_path=preprocessing_step[2],
    test_dataset_s3_path=preprocessing_step[3],
    train_config_s3_path=train_config_s3_path,
    experiment_name=preprocessing_step[0],
    run_id=preprocessing_step[1],
    model_id=model_s3_destination,
)

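pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        # Pipeline parameters must be Parameter objects, not plain strings
        ParameterString(name="InstanceType", default_value=instance_type),
    ],
    steps=[preprocessing_step, training_step],
)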

**Upserting the Pipeline**

This step either creates a new pipeline in SageMaker or updates an existing one with the same name. It's a key part of the MLOps process, allowing for iterative refinement of the pipeline.

In [17]:
pipeline.upsert(role)

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns


2025-05-21 06:57:34,116 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/DataPreprocessing/2025-05-21-06-57-30-811/function
2025-05-21 06:57:34,211 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/DataPreprocessing/2025-05-21-06-57-30-811/arguments
2025-05-21 06:57:34,443 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpzoewtjvs/requirements.txt'
2025-05-21 06:57:34,474 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/DataPreprocessing/2025-05-21-06-57-30-811/pre_exec_script_and_dependencies'
2025-05-21 06:57:34,492 sagemaker.remote_function INFO     Copied user workspace to '/tmp/tmp2g46b2f6/temp_workspace/sagemaker_remote_function_workspace'
20

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.ImageUri
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns


2025-05-21 06:57:37,489 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/ModelFineTuning/2025-05-21-06-57-30-811/function
2025-05-21 06:57:37,594 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/ModelFineTuning/2025-05-21-06-57-30-811/arguments
2025-05-21 06:57:37,698 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpqgj0nkzu/requirements.txt'
2025-05-21 06:57:37,731 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/ModelFineTuning/2025-05-21-06-57-30-811/pre_exec_script_and_dependencies'
2025-05-21 06:57:38,343 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-891377369387/deepseek-finetune-pipeline/

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:891377369387:pipeline/deepseek-finetune-pipeline',
 'ResponseMetadata': {'RequestId': '51b2b2b0-fc3b-4c9c-84b2-2e82d10af4c8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '51b2b2b0-fc3b-4c9c-84b2-2e82d10af4c8',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '94',
   'date': 'Wed, 21 May 2025 06:57:39 GMT'},
  'RetryAttempts': 0}}

**Starting the Pipeline Execution**

This command kicks off the actual execution of the pipeline in SageMaker. From this point, SageMaker will orchestrate the execution of each step, managing resources and data flow between steps.

In [18]:
execution1 = pipeline.start()

# Clean up