<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>NeMo Framework v25.02.01</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# 04 - Llama 3.1 Fine-tuning: Training on a Synthetic Medical Q&A Dataset

## Introduction

In this tutorial, our goal is to **fine-tune the Llama 3.1 Instruct model** for medical Q&A tasks.

We will:
- **Use a synthetic medical Q&A dataset** generated in [`03-ls-create-medal-qa-dataset.ipynb`](03-ls-create-medal-qa-dataset.ipynb).

- **Start from a pre-converted NeMo checkpoint** of the Llama 3.1 Instruct model, prepared in [`01-nemo-medchat-qa-peft.ipynb`](01-nemo-medchat-qa-peft.ipynb).

- **Apply LoRA (Low-Rank Adaptation)** for parameter-efficient fine-tuning, also following the setup from [`01-nemo-medchat-qa-peft.ipynb`](01-nemo-medchat-qa-peft.ipynb).

- **Demonstrate the full workflow**, including loading the dataset and model, configuring the training, and evaluating the fine-tuned model's Q&A performance.

**Environment Requirements:**
- UCloud [NVIDIA NeMo Framework](https://docs.cloud.sdu.dk/Apps/nemo.html) app, Version `v25.02.01`
- GPU-enabled session (e.g., A100, V100)

> 🛠️ **Important Environment Note:**
> This notebook is designed to run on **UCloud**, using the **NVIDIA NeMo Framework app, version `v25.02.01`**.
> If you encounter unexpected errors, **double-check you are using the correct app version**, and that your session includes **GPU resources**.

## 🛠️ **Step 1: Environment Check**

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_name()
    print(f"✅ GPU detected: {device}")
else:
    raise RuntimeError("❌ No GPU detected! Ensure your UCloud session uses a GPU node.")

## 🛠️ **Step 2: Prepare the Dataset**

### Load the dataset

In [None]:
import json

# Read all data into a list of dictionaries
dataset = []
with open("datasets/medal-qa_synthetic_dataset_v1.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        dataset.append(json.loads(line))

# Preview the first 3 examples
for idx, item in enumerate(dataset[:3]):
    print(f"Example {idx+1}:\n{item}\n")

In [None]:
dataset[10]

### Perform train/validation/test split

In [None]:
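from datasets import load_dataset

# Reload the file as a Hugging Face Dataset (replaces the list loaded above)
dataset = load_dataset('json', data_files='datasets/medal-qa_synthetic_dataset_v1.jsonl', split='train')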
dataset = dataset.shuffle(seed=42)

# Only keep "question" and "answer" fields
dataset = dataset.remove_columns([col for col in dataset.column_names if col not in ["question", "answer"]])

# Step 1: Split into 80% train and 20% temp (temp will be further split into validation and test)
# A fixed seed keeps the splits reproducible across runs
train_temp = dataset.train_test_split(test_size=0.2, seed=42)

# Step 2: Split temp into 50% validation and 50% test (each gets 10% of total data)
val_test = train_temp["test"].train_test_split(test_size=0.5, seed=42)

# Final splits
train_dataset = train_temp["train"]
validation_dataset = val_test["train"]
test_dataset = val_test["test"]

# Print dataset sizes
print(f"Train size: {len(train_dataset)}, Validation size: {len(validation_dataset)}, Test size: {len(test_dataset)}")

### Convert to NeMo JSONL format

We leverage the `save_jsonl_preprocessed` utility function to prefix each example's `input` with a detailed instruction.
This instruction sets the model's role and response style before presenting the actual question, improving task understanding and guiding generation.

Each JSONL record will follow NeMo's expected `{input} {output}` prompt format.
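
As a rough illustration of that record layout, the sketch below builds a single record by hand and serializes it the way a JSONL writer would. The instruction, question, and answer strings are made-up placeholders, not rows from the dataset.

```python
import json

# Hypothetical values for illustration only; not taken from the dataset
instruction = "You are a medical professional."
question = "What does BP stand for?"
answer = "Blood pressure."

record = {
    "input": f"{instruction} Question: {question}\n\n### Response:\n",
    "output": answer,
}

# One JSON object per line: json.dumps escapes the newlines inside the string,
# so the serialized record never spans more than one physical line
line = json.dumps(record, ensure_ascii=False)
print(line)

# Parsing the line restores the real newlines in the prompt
restored = json.loads(line)
print(repr(restored["input"]))
```

Keeping the `### Response:` marker at the very end of `input` means the model's generation starts exactly where the answer is expected.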

In [None]:
import re, json
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError

# Customized conversion with instruction prefix
def save_jsonl_preprocessed(
    dataset,
    filename,
    instruction=(
        "You are a board-certified medical professional and "
        "a skilled communicator. Provide accurate, evidence‑based answers "
        "to medical questions in clear, concise language, suitable for both "
        "healthcare providers and patients."
    ),
):
    """
    Writes a JSONL file where each line is:
      {"input": "<instruction> Question: …\n\n### Response:\n", "output": "…"}
    with *actual* newlines and one JSON record per line.
    """
    with open(filename, "w", encoding="utf-8") as f:
        for example in dataset:
            q = example["question"]
            a = example["answer"]
            # Use real newlines (\n), not literal backslashes
            inp = f"{instruction} Question: {q}\n\n### Response:\n"
            json.dump({"input": inp, "output": a}, f, ensure_ascii=False)
            f.write("\n") 

# after your splits:
save_jsonl_preprocessed(train_dataset,      "datasets/medal_train.jsonl")
save_jsonl_preprocessed(validation_dataset, "datasets/medal_validation.jsonl")
save_jsonl_preprocessed(test_dataset,       "datasets/medal_test.jsonl")

In [None]:
!head -n3 datasets/medal_train.jsonl | jq .

In [None]:
!head -n3 datasets/medal_validation.jsonl | jq .

In [None]:
!head -n3 datasets/medal_test.jsonl | jq .

## 🛠️ **Step 3: LoRA Fine-Tuning**

In this step, we set up and launch the LoRA fine-tuning run using NeMo's high-level fine-tuning script. Only a few essential parameters need to be specified: dataset paths, PEFT scheme, optimizer settings, parallelism degrees, and batch sizes.
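
Before launching, it can help to check that the batch sizes and parallelism degrees fit together. The sketch below uses the values from the launch command in this step and the usual Megatron-style relation between global batch size, micro batch size, and data-parallel size. It is a sanity check under that assumption, not part of the NeMo script.

```python
# Values copied from the launch command in this step
gpus = 4
tp_size = 1            # tensor-parallel degree
pp_size = 1            # pipeline-parallel degree
global_batch_size = 64
micro_batch_size = 16

# GPUs not consumed by tensor/pipeline parallelism hold data-parallel replicas
data_parallel_size = gpus // (tp_size * pp_size)

# Samples processed in one forward/backward pass across all replicas
samples_per_pass = micro_batch_size * data_parallel_size

# The global batch must be a whole multiple of that amount
if global_batch_size % samples_per_pass != 0:
    raise ValueError("global_batch_size must be divisible by micro_batch_size * data_parallel_size")

grad_accum_steps = global_batch_size // samples_per_pass

print(f"data-parallel size: {data_parallel_size}")
print(f"gradient accumulation steps: {grad_accum_steps}")
```

If you lower `GPUS`, the same global batch size is kept by accumulating gradients over more micro-batches, as long as the divisibility check still passes.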

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Set paths to the model, train, validation and test sets.
PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"

OUTPUT_DIR="lora/llama-3.1-instruct-medal/8B/$PRECISION"
rm -rf "$OUTPUT_DIR"

TRAIN_DS="['datasets/medal_train.jsonl']"
VALID_DS="['datasets/medal_validation.jsonl']"

SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# Monitor training using WandB
export WANDB_API_KEY="" # Use your WandB API key
WANDB_LOGGER=False # Set equal to True to instantiate a WandB logger
WANDB_PROJECT="Medal-QA"

export PYTHONWARNINGS="ignore"

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    trainer.val_check_interval=213 \
    trainer.max_steps=2000 \
    exp_manager.early_stopping_callback_params.patience=3 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    ++model.dist_ckpt_load_strictness=log_all \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME} \
    model.optim.name=fused_adam \
    model.optim.lr=5e-6 \
    exp_manager.create_wandb_logger=${WANDB_LOGGER} \
    exp_manager.wandb_logger_kwargs.project=${WANDB_PROJECT} \
    exp_manager.resume_if_exists=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.resume_ignore_no_checkpoint=True

The run produces a LoRA adapter, saved as `megatron_gpt_peft_lora_tuning.nemo` in `./lora/llama-3.1-instruct-medal/.../checkpoints/`.
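
To make the idea of an adapter more concrete, here is a minimal sketch of the LoRA arithmetic in plain Python. The matrix size (4096 x 4096) and the rank (32) are illustrative assumptions, not values read from this run's configuration, and the scaling factor that LoRA implementations usually apply is left out for brevity.

```python
# Assumed, illustrative dimensions: one square projection and a small rank
d_in, d_out, rank = 4096, 4096, 32

full_params = d_in * d_out            # frozen base weight W (d_out x d_in)
lora_params = rank * (d_in + d_out)   # trainable A (rank x d_in) and B (d_out x rank)

print(f"full matrix:  {full_params:,} parameters")
print(f"LoRA adapter: {lora_params:,} parameters")
print(f"trainable share: {100 * lora_params / full_params:.2f}%")


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Tiny forward pass: y = W x + B (A x), with W frozen and only A, B trained
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2 x 2 base weight
A = [[0.5, -0.5]]              # rank-1 down-projection (1 x 2)
B = [[0.0], [0.0]]             # up-projection (2 x 1), zero at initialization
x = [1.0, 1.0]

base = matvec(W, x)
delta = matvec(B, matvec(A, x))
y = [b + d for b, d in zip(base, delta)]

# With B initialized to zero, the adapted layer starts out identical to the base layer
print(base, y)
```

The adapter file only needs to store the small `A` and `B` matrices, which is why it is much smaller than the base checkpoint.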

## 🛠️ **Step 4: Model Evaluation**

After fine-tuning, we evaluate the model by generating answers on the test set, with the LoRA adapter loaded on top of the base checkpoint, using NeMo's high-level model evaluation script:
[megatron_gpt_generate.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py).

We'll compute two metrics:
- **Exact Match (EM)**: whether the prediction exactly matches the label.
- **Token-level F1**: overlap between prediction and label tokens.
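
As a small worked example of the two metrics, the sketch below scores one made-up prediction against one made-up label. The helper name `token_f1` and both strings are hypothetical. Tokenization is plain whitespace splitting after lowercasing.

```python
from collections import Counter


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration only
pred = "Hypertension means high blood pressure"
label = "Hypertension is high blood pressure"

exact_match = float(pred.strip() == label.strip())
f1 = token_f1(pred, label)

# 4 of 5 tokens overlap on each side, so precision = recall = 0.8
print(f"EM: {exact_match:.1f}, F1: {f1:.3f}")
```

Exact Match is strict and will often be near zero for free-form answers, so the F1 score is usually the more informative of the two here.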

In [None]:
%%bash
# Check that the LORA model file exists

python -c "import torch; torch.cuda.empty_cache()"

PRECISION=bf16
OUTPUT_DIR="lora/llama-3.1-instruct-medal/8B/$PRECISION"
ls -l $OUTPUT_DIR/checkpoints

In the code cell below, note the following settings:

1. `model.restore_from_path` points to the `Llama-3_1-Instruct-8B.nemo` base checkpoint.
2. `model.peft.restore_from_path` points to the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names` points to the `medal_test.jsonl` file.

If you have made any changes in model or experiment paths, please ensure they are configured correctly below.
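
One way to check the paths before starting a multi-GPU job is a short existence test. The sketch below assumes the default paths used in this notebook. The helper `missing_paths` is a hypothetical convenience function, not part of NeMo.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Default paths from this notebook; adjust them if you changed anything
precision = "bf16"
expected = [
    f"models/llama-3.1-instruct/8B/nemo/{precision}/Llama-3_1-Instruct-8B.nemo",
    f"lora/llama-3.1-instruct-medal/8B/{precision}/checkpoints/megatron_gpt_peft_lora_tuning.nemo",
    "datasets/medal_test.jsonl",
]

missing = missing_paths(expected)
if missing:
    print("Missing paths:")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected paths exist.")
```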

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"
OUTPUT_DIR="lora/llama-3.1-instruct-medal/8B/$PRECISION"
TEST_DS="[datasets/medal_test.jsonl]"
TEST_NAMES="[medal]"
SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="$OUTPUT_DIR/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="results/medalQA_result_lora_tuning_"

export PYTHONWARNINGS="ignore"
export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    model.megatron_amp_O2=True \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=64 \
    model.data.test_ds.tokens_to_generate=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True

In [None]:
!head -n 10 results/medalQA_result_lora_tuning__test_medal_inputs_preds_labels.jsonl | jq .

In [None]:
import json

def compute_f1(pred: str, ref: str) -> float:
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    num_common = sum(min(pred_tokens.count(tok), ref_tokens.count(tok)) for tok in common)
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * (precision * recall) / (precision + recall)

# Evaluate results from combined JSONL
file_path = 'results/medalQA_result_lora_tuning__test_medal_inputs_preds_labels.jsonl'

exacts = []
f1s = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        pred = obj.get('pred', '').strip()
        label = obj.get('label', '').strip()
        exacts.append(float(pred == label))
        f1s.append(compute_f1(pred, label))

# Aggregate and display
avg_em = sum(exacts) / len(exacts)
avg_f1 = sum(f1s) / len(f1s)
print(f"Average Exact Match (EM): {avg_em:.3f}")
print(f"Average F1 Score: {avg_f1:.3f}")

In [None]:
%%bash -s "$token"

# OPTIONAL: Assess performance on the original model

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"
TEST_DS="[datasets/medal_test.jsonl]"
TEST_NAMES="[medal]"
SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="results/medalQA_result_no_tuning_"

export PYTHONWARNINGS="ignore"
export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    model.megatron_amp_O2=True \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=64 \
    model.data.test_ds.tokens_to_generate=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.label_key='output' \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.truncation_field="input" \
    model.data.test_ds.prompt_template="\{input\} \{output\}"

In [None]:
!head -n 10 results/medalQA_result_no_tuning__test_medal_inputs_preds_labels.jsonl | jq .

In [None]:
import json

def compute_f1(pred: str, ref: str) -> float:
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    num_common = sum(min(pred_tokens.count(tok), ref_tokens.count(tok)) for tok in common)
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * (precision * recall) / (precision + recall)

# Evaluate results from combined JSONL
file_path = 'results/medalQA_result_no_tuning__test_medal_inputs_preds_labels.jsonl'

exacts = []
f1s = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        pred = obj.get('pred', '').strip()
        label = obj.get('label', '').strip()
        exacts.append(float(pred == label))
        f1s.append(compute_f1(pred, label))

# Aggregate and display
avg_em = sum(exacts) / len(exacts)
avg_f1 = sum(f1s) / len(f1s)
print(f"Average Exact Match (EM): {avg_em:.3f}")
print(f"Average F1 Score: {avg_f1:.3f}")