<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>NeMo Framework v25.02.01</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# 01 - Llama 3.1 Fine-tuning: Building a Medical Q&A Model

## Introduction
In this hands-on session, we'll learn **how to fine-tune a large language model** using **Parameter-Efficient Fine-Tuning (PEFT)** techniques inside the **NVIDIA NeMo Framework**, running on the **UCloud** - Interactive HPC platform.

**Environment Requirements:**
- UCloud [NVIDIA NeMo Framework](https://docs.cloud.sdu.dk/Apps/nemo.html) app, Version `v25.02.01`
- GPU-enabled session (e.g., A100, V100)

> 🛠️ **Important Environment Note:**
> This notebook is designed to run on **UCloud**, using the **NVIDIA NeMo Framework app, version `v25.02.01`**.
> If you encounter unexpected errors, **double-check you are using the correct app version**, and that your session includes **GPU resources**.

**By the end of this notebook, you will be able to:**
1. Understand PEFT and why it matters.
2. Load a pre-trained open-source model.
3. Apply LoRA (Low-Rank Adaptation) fine-tuning.
4. Evaluate improvements.
5. Save and reuse fine-tuned adapters.

📚 **What is PEFT and Why Should We Care?**

Large language models (like Llama, GPT) are **very expensive to fully fine-tune**:
- Billions of parameters.
- Gigabytes of optimizer states.
- Massive GPU memory needs.
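
To put rough numbers on these costs, the sketch below estimates the memory needed just for weights, gradients, and optimizer states when fully fine-tuning with Adam. The 16 bytes per parameter figure is an assumption based on a common mixed-precision setup (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments), and `8e9` is an approximate parameter count; activations are not included.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights + gradients + Adam states, excluding activations."""
    return num_params * bytes_per_param / 1e9

# Assumed byte budget per parameter for mixed-precision Adam training:
#   2 (bf16 weights) + 2 (bf16 gradients)
#   + 4 (fp32 master weights) + 4 (Adam first moment) + 4 (Adam second moment) = 16
NUM_PARAMS = 8e9  # approximate size of an 8B-parameter model

print(f"{full_finetune_memory_gb(NUM_PARAMS):.0f} GB before activations")  # 128 GB before activations
```

Under these assumptions the training state alone would exceed the memory of a single 80 GB GPU, which is the motivation for the approach described next.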

**PEFT** solves this by **freezing most model weights** and only training **small additional parameters**.

🔥 **Key Advantages:**
- **Faster training** (minutes to hours, not days).
- **Lower memory** usage (fits on a single A100 or even smaller GPUs).
- **Adaptable**: train different adapters for different tasks cheaply.

🧩 **Popular PEFT Techniques:**
- **LoRA**: Train low-rank matrices injected into the model layers.
- **QLoRA**: Quantize + LoRA = even smaller memory footprint.
- **Prefix-Tuning / P-Tuning**: Learn a small prefix instead of full weights.

**This tutorial focuses on LoRA**, the most widely used PEFT method.
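
As a back-of-the-envelope illustration of why LoRA is cheap: for a frozen weight matrix `W` of shape `d_out × d_in`, LoRA trains two small matrices `B` (`d_out × r`) and `A` (`r × d_in`) and uses `W + B·A` in the forward pass. The sketch below only counts parameters; the 4096 × 4096 shape and the rank `r = 8` are illustrative assumptions, not values read from this notebook's configuration.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (parameters in the full matrix, parameters in its LoRA factors)."""
    full = d_out * d_in              # frozen weight matrix W
    lora = rank * (d_out + d_in)     # trainable factors B (d_out x r) and A (r x d_in)
    return full, lora

# Illustrative shapes: one 4096 x 4096 projection with an assumed rank of 8
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  LoRA: 65,536  ratio: 0.39%
```

Only the LoRA factors receive gradients and optimizer states, so the training-time overhead scales with the second number rather than the first.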

## 🛠️ **Step 1: Environment Check**

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_name()
    print(f"✅ GPU detected: {device}")
else:
    raise RuntimeError("❌ No GPU detected! Ensure your UCloud session uses a GPU node.")

## 🛠️ **Step 2: Download and Convert Pre-trained Model from Hugging Face**

We use the `huggingface_hub` library to fetch an open-source checkpoint from Hugging Face, and a NeMo conversion script to convert it into NeMo `.nemo` format.

**Why convert?**
- NeMo's inference and training pipelines expect models in `.nemo` format, which bundles both the model weights and configuration in a single file.
- Ensures compatibility with NeMo's `restore_from` and `save_to` methods for seamless loading.
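
If you are curious what is bundled inside a `.nemo` file, the sketch below lists its first few members. It assumes the file is a tar archive, which is how NeMo has packaged checkpoints in the versions I am aware of; `list_nemo_contents` is a hypothetical helper, not part of NeMo.

```python
import tarfile

def list_nemo_contents(nemo_path: str, limit: int = 20) -> list[str]:
    """Return up to `limit` member names from a .nemo archive (assumed to be a tar file)."""
    names = []
    with tarfile.open(nemo_path, "r:*") as archive:
        for member in archive:          # iterates over headers, does not extract data
            names.append(member.name)
            if len(names) >= limit:
                break
    return names

# Example usage once the conversion below has finished (path taken from this notebook):
# for name in list_nemo_contents(
#     "models/llama-3.1-instruct/8B/nemo/bf16/Llama-3_1-Instruct-8B.nemo"
# ):
#     print(name)
```

Expect to see a configuration file alongside the weight files; the exact layout depends on the NeMo version.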

**Hugging Face repo:**  [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value
hf_model = "meta-llama/Llama-3.1-8B-Instruct"
hf_model_path = "models/llama-3.1-instruct/8B/hf"
snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

In [None]:
%%bash -s "$hf_model_path"

ls $1
du -sh $1

In [None]:
%%bash

# Convert the Hugging Face checkpoint to NeMo format
HF_MODEL="models/llama-3.1-instruct/8B/hf"
PRECISION=bf16
NEMO_MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"

export TOKENIZERS_PARALLELISM=true
export NUMEXPR_MAX_THREADS=$(nproc)

# Convert model to .nemo 
python3 -W ignore /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
        --input_name_or_path "$HF_MODEL" \
        --output_path "$NEMO_MODEL" \
        --precision "$PRECISION"

## 🛠️ **Step 3: Prepare the Dataset**

In this tutorial we fine-tune on the [**MedChat-QA** dataset](https://huggingface.co/datasets/ngram/medchat-qa), a Medical Question Answering corpus.
It is available on Hugging Face under the name `medchat-qa`. 

This dataset consists of approximately 30,000 questions covering about 1,000 FDA-approved human prescription drugs.

We download the full **MedChat-QA** dataset and **split** it into **train**, **validation**, and **test** sets in code.
This ensures we have:
- **Train set:** to fit model adapters.
- **Validation set:** to tune hyperparameters and monitor for overfitting.
- **Test set:** to assess final model performance on unseen data.

Proper splitting prevents information leakage and ensures unbiased evaluation.
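
The two-stage 90% / 5% / 5% split can be sketched on plain indices to make the "no leakage" property concrete. This is a standalone illustration using only the standard library, not the `datasets` API; `three_way_split` is a hypothetical helper and the 1000-example size is arbitrary.

```python
import random

def three_way_split(n_examples: int, holdout_frac: float = 0.1, seed: int = 42):
    """Shuffle indices, hold out `holdout_frac`, then halve the holdout into val/test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)       # seeded for reproducibility
    n_holdout = int(n_examples * holdout_frac)
    n_val = n_holdout // 2
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    return train, holdout[:n_val], holdout[n_val:]

train_idx, val_idx, test_idx = three_way_split(1000)

# No example may appear in more than one split
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)

print(len(train_idx), len(val_idx), len(test_idx))  # 900 50 50
```

Fixing the seed matters: without it, re-running the notebook would move examples between splits and make test scores incomparable across runs.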

### Download the dataset from Hugging Face

We use the `datasets` library to fetch the full dataset. It ships as a single `train` split, which we divide ourselves in the next section:

In [None]:
from datasets import load_dataset

import warnings
warnings.filterwarnings("ignore")

# Load full dataset (no predefined splits)
dataset = load_dataset("ngram/medchat-qa", split = "train", cache_dir="datasets")
print(f"✅ Downloaded MedChat-QA with {len(dataset)} examples")

In [None]:
dataset

### Perform train/validation/test split

In [None]:
# Reload and shuffle with a fixed seed so the splits are reproducible
dataset = load_dataset("ngram/medchat-qa", split="train", cache_dir="datasets")
dataset = dataset.shuffle(seed=42)

# Step 1: Split into 90% train and 10% temp (temp will be further split into validation and test)
train_temp = dataset.train_test_split(test_size=0.1, seed=42)

# Step 2: Split temp into 50% validation and 50% test (each gets 5% of total data)
val_test = train_temp["test"].train_test_split(test_size=0.5, seed=42)

# Final splits
train_dataset = train_temp["train"]
validation_dataset = val_test["train"]
test_dataset = val_test["test"]

# Print dataset sizes
print(f"Train size: {len(train_dataset)}, Validation size: {len(validation_dataset)}, Test size: {len(test_dataset)}")

### Convert to NeMo JSONL format

We leverage the `save_jsonl_preprocessed` utility function to prefix each example's `input` with a detailed instruction.
This instruction sets the model's role and response style before presenting the actual question, improving task understanding and guiding generation.

Each JSONL record will follow NeMo's expected `{input} {output}` prompt format.
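
As a small illustration of the target format, the sketch below builds one record and round-trips it through JSON. The instruction, question, and answer strings are made-up placeholders, and `build_record` is a hypothetical helper that mirrors the prompt layout described above.

```python
import json

def build_record(instruction: str, question: str, answer: str) -> dict:
    """Build one {"input", "output"} record with the instruction-prefixed prompt."""
    prompt = f"{instruction} Question: {question}\n\n### Response:\n"
    return {"input": prompt, "output": answer}

# Placeholder strings, not taken from the dataset
record = build_record(
    "You are a helpful medical assistant.",
    "What is drug X used for?",
    "Drug X is a placeholder used for illustration.",
)

line = json.dumps(record, ensure_ascii=False)
assert "\n" not in line                  # newlines are escaped, so one record = one line
restored = json.loads(line)
assert restored == record                # real newlines come back after parsing
assert restored["input"].endswith("### Response:\n")
print(line)
```

The serialized line contains the two-character escape `\n`, while the parsed value contains real newline characters. This is why one record always fits on one line of the JSONL file.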

In [None]:
import json

# Customized conversion with instruction prefix
def save_jsonl_preprocessed(
    dataset,
    filename,
    instruction=(
        "You are a board-certified medical professional and "
        "a skilled communicator. Provide accurate, evidence‑based answers "
        "to medical questions in clear, concise language, suitable for both "
        "healthcare providers and patients."
    ),
):
    """
    Writes a JSONL file where each line is:
      {"input": "<instruction> Question: …\n\n### Response:\n", "output": "…"}
    with *actual* newlines and one JSON record per line.
    """
    with open(filename, "w", encoding="utf-8") as f:
        for example in dataset:
            q = example["question"]
            a = example["answer"]
            # Use real newlines (\n), not literal backslashes
            inp = f"{instruction} Question: {q}\n\n### Response:\n"
            json.dump({"input": inp, "output": a}, f, ensure_ascii=False)
            f.write("\n") 

# after your splits:
save_jsonl_preprocessed(train_dataset,      "datasets/medchat_train.jsonl")
save_jsonl_preprocessed(validation_dataset, "datasets/medchat_validation.jsonl")
save_jsonl_preprocessed(test_dataset,       "datasets/medchat_test.jsonl")

In [None]:
!head -n3 datasets/medchat_train.jsonl | jq .

In [None]:
!head -n3 datasets/medchat_validation.jsonl | jq .

In [None]:
!head -n3 datasets/medchat_test.jsonl | jq .

## 🛠️ **Step 4: LoRA Fine-Tuning**

In this step, we’ll set up and launch the LoRA fine-tuning run using NeMo’s high-level model fine-tuning script. We only need to specify a few essential parameters—dataset paths, PEFT scheme, optimizer settings, parallelism degrees, and batch sizes—to get started.


### 🤔 Understanding Fine-Tuning Objectives for LLMs

Fine-tuning a large language model (LLM) like Llama 3.1 does not mean memorizing every possible answer verbatim—it means:

1. **Adapting to domain-specific language and style**: The model learns terminology, phrasing, and response formats relevant to medical QA.
2. **Improving factual consistency**: By training on medical question–answer pairs, the model reinforces evidence-based associations.
3. **Enhancing reasoning patterns**: Exposure to step-by-step medical explanations helps the model generalize reasoning to new questions.

Since test questions may never have appeared in training, we do not expect exact matches. Instead, we measure:
- **Content relevance**: Does the generated answer address the question accurately?
- **Factual correctness**: Are medical facts presented correctly, even if worded differently?
- **Clarity and completeness**: Is the response concise yet informative, following the instruction prompt?

In summary, fine-tuning refines the LLM’s ability to **generalize** medical Q&A skills to unseen queries, not to retrieve memorized answers.

### ⚙️ Data Parallelism & Gradient Accumulation

In large-scale training, we often want to process more samples than a single GPU can handle at once. Two techniques help:

1. **Data Parallelism:** Each GPU holds a copy of the model and processes different micro batches in parallel. After each backward pass, gradients are synchronized (all-reduced) across GPUs so that every copy applies the same update.
2. **Gradient Accumulation:** Instead of updating weights after every micro batch, we accumulate gradients over multiple micro batches to simulate a larger **global batch** while keeping memory usage constant.

#### Visual Illustration of Accumulation

```text
┌─ micro‑batch #1 ─┬─ micro‑batch #2 ─┬─ micro‑batch #3 ─┬─ micro‑batch #4 ─┐
│                  │                  │                  │                  │
│   forward pass   │   forward pass   │   forward pass   │   forward pass   │
│     (loss₁)      │     (loss₂)      │     (loss₃)      │     (loss₄)      │
│        ↓         │        ↓         │        ↓         │        ↓         │
│  backward pass   │  backward pass   │  backward pass   │  backward pass   │
│ (∂l₁/∂θ)  +=     │ (∂l₂/∂θ)  +=     │ (∂l₃/∂θ)  +=     │ (∂l₄/∂θ)  =      │
│ accumulate grads │ accumulate grads │ accumulate grads │ accumulate grads │
└──────────────────┴──────────────────┴──────────────────┴──────────────────┘
                                            │
                        ┌───────────────────┴──────────────────┐
                        │            optimizer step            │
                        │      θ ← θ − lr · Σ grads / 64       │
                        └──────────────────────────────────────┘
```

- **Micro batch:** Number of samples processed per GPU per forward/backward pass (e.g., 16).
- **Gradient accumulation steps:** Number of consecutive micro batches (forward/backward passes) each GPU processes—accumulating (i.e. summing) their gradients—before performing a single optimizer update. 
- **Global batch:** total samples whose gradients contribute to one weight update across all GPUs; equals: `micro-batch size × gradient accumulation steps × number of GPUs`.

#### Example of Gradient Accumulation

```text
# Settings:
micro_batch_size       = 16    # samples per GPU per forward/backward pass
gradient_accumulation  = 1     # micro‑batches per weight update on each GPU
num_GPUs               = 4
global_batch_size      = micro_batch_size × gradient_accumulation × num_GPUs
                       = 16 × 1 × 4
                       = 64    # samples per weight update

# Then each global_step processes one global batch (64 samples):
global_step = 1   → processed 1 × 64   =   64 samples → 1st weight update  
global_step = 2   → processed 2 × 64   =  128 samples → 2nd weight update  
…  
global_step = 426 → processed 426 × 64 = 27264 samples → 426th weight update (≈1 epoch)
```
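
The same arithmetic can be written as two small helpers, which is handy when you change the number of GPUs. The function names are hypothetical, and 27264 is simply the sample count from the example above, not a measured train-set size.

```python
def accumulation_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Micro-batches each GPU processes before one optimizer step."""
    samples_per_pass = micro_batch * num_gpus
    if global_batch % samples_per_pass != 0:
        raise ValueError("global_batch must be a multiple of micro_batch * num_gpus")
    return global_batch // samples_per_pass

def steps_per_epoch(num_samples: int, global_batch: int) -> int:
    """Full optimizer steps in one pass over the data (a partial last batch is ignored)."""
    return num_samples // global_batch

print(accumulation_steps(64, 16, 4))   # 1  (the 4-GPU setting used in this notebook)
print(accumulation_steps(64, 16, 1))   # 4  (same global batch on a single GPU)
print(steps_per_epoch(27264, 64))      # 426
```

With 426 steps per epoch, validating every 213 steps corresponds to two validation checks per epoch.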

#### Detailed Workflow 
| Stage                 | Description                                                                                                                                                                                   | Purpose                                                                                                  |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| Forward pass          | The network runs on input data, produces logits, and computes a scalar loss                                                                                                                   | Needed to know how wrong the current weights are                                                           |
| Backward pass         | PyTorch autograd walks the graph in reverse, computing gradients (∂loss / ∂θ) for every parameter θ                                                                                            | Gives the direction to adjust each weight                                                                  |
| Gradient accumulation | Instead of calling `optimizer.step()` immediately, we add these gradients to a running buffer                                                                                                 | Lets us mimic a larger batch without fitting all samples in memory                                         |
| Optimizer step        | After we have accumulated gradients from enough micro‑batches to equal the global batch size, we update the weights once (SGD, Adam, etc.), then zero the grad buffers                          | This is the true training step seen by the learning‑rate scheduler and appears in training logs/metrics    |

This approach provides flexibility:
- **Larger effective batch sizes** for stable training and better convergence.
- **Memory efficiency** by keeping per-step memory constant.
- **Scalability** across multiple GPUs with straightforward gradient synchronization.
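
To see why accumulation reproduces a larger batch, here is a minimal numeric sketch with a one-parameter model `y = w * x` and a mean-squared-error loss. It uses plain Python rather than PyTorch, and the data values are arbitrary. Averaging the gradients of equally sized micro-batches gives the same result as computing the gradient over the full batch at once.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]      # targets generated by the "true" weight 2.0
w = 0.5                         # current weight

# Gradient over the full batch of 8 samples in one pass
full_grad = mse_grad(w, xs, ys)

# Same batch processed as 4 micro-batches of 2 samples, accumulating the gradients
micro_batch = 2
num_micro = len(xs) // micro_batch
accumulated = 0.0
for start in range(0, len(xs), micro_batch):
    stop = start + micro_batch
    accumulated += mse_grad(w, xs[start:stop], ys[start:stop]) / num_micro

assert abs(full_grad - accumulated) < 1e-9
print(full_grad, accumulated)
```

Only one micro-batch needs to be held in memory at a time, yet the optimizer step sees the gradient of the whole batch.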

### 🚀 Launching the LoRA Fine-Tuning Script

NeMo Framework includes a high-level Python script for fine-tuning ([megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py)) that abstracts away some of the lower-level API calls. Once the model is downloaded and the dataset is ready, LoRA fine-tuning with NeMo essentially comes down to running this script.

Some of the relevant settings are:

#### Training dataset JSONL file(s)
```bash
model.data.train_ds.file_names='datasets/medchat_train.jsonl'
```
#### Validation dataset JSONL file(s)
```bash
model.data.validation_ds.file_names='datasets/medchat_validation.jsonl'
```
#### PEFT method: LoRA scheme
```bash
model.peft.peft_scheme=lora
```
#### O2-level automatic mixed precision
```bash
model.megatron_amp_O2=True
```
#### Optimizer and learning rate configuration
```bash
model.optim.name=fused_adam
model.optim.lr=5e-6
```
#### Tensor model parallelism (splits each layer's weights across GPUs)
```bash
model.tensor_model_parallel_size=1
```
#### Pipeline model parallelism (splits consecutive layers into stages across GPUs)
```bash
model.pipeline_model_parallel_size=1  
```
#### Effective batch size across all GPUs and gradient accumulation steps
```bash
model.global_batch_size=64 
```
#### Number of samples per GPU per forward/backward pass
```bash
model.micro_batch_size=16
```
For this demonstration, this training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
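
The stopping rule can be sketched as follows. This is a simplified model of patience-based early stopping, not the actual callback implementation (which has further options such as a minimum improvement threshold). The default values mirror the `trainer.val_check_interval`, `trainer.max_steps`, and `patience` settings of this notebook's launch command, and the loss values are invented for illustration.

```python
def stop_step(val_losses, patience=3, val_check_interval=213, max_steps=2000):
    """Return the training step at which a patience-based rule would halt training."""
    best = float("inf")
    bad_checks = 0
    for check, loss in enumerate(val_losses, start=1):
        step = check * val_check_interval
        if step > max_steps:
            break
        if loss < best:
            best, bad_checks = loss, 0       # improvement resets the counter
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return step                  # halted by early stopping
    return max_steps                         # otherwise the run is capped by max_steps

# Invented validation losses: best value at the 2nd check, then three checks without improvement
print(stop_step([1.0, 0.9, 0.95, 0.92, 0.91]))  # 1065
print(stop_step([1.0, 0.9, 0.8]))               # 2000
```

In the first case training halts at the fifth validation check (5 × 213 steps); in the second the loss keeps improving, so the step cap applies.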

> `NOTE:` In the block of code below, pass the paths to your train and validation data files as well as path to the `.nemo` model.

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Set paths to the model, train, validation and test sets.
PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"

OUTPUT_DIR="lora/llama-3.1-instruct-medchat/8B/$PRECISION"
rm -rf "$OUTPUT_DIR"

TRAIN_DS="['datasets/medchat_train.jsonl']"
VALID_DS="['datasets/medchat_validation.jsonl']"

SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# Monitor training using WandB
export WANDB_API_KEY=""  # Use your WandB API key
WANDB_LOGGER=False # Set equal to True to instantiate a WandB logger
WANDB_PROJECT="MedChat-QA"

export PYTHONWARNINGS="ignore"

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    trainer.val_check_interval=213 \
    trainer.max_steps=2000 \
    exp_manager.early_stopping_callback_params.patience=3 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    ++model.dist_ckpt_load_strictness=log_all \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME} \
    model.optim.name=fused_adam \
    model.optim.lr=5e-6 \
    exp_manager.create_wandb_logger=${WANDB_LOGGER} \
    exp_manager.wandb_logger_kwargs.project=${WANDB_PROJECT} \
    exp_manager.resume_if_exists=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.resume_ignore_no_checkpoint=True

This will create a LoRA adapter - a file named `megatron_gpt_peft_lora_tuning.nemo` in `./lora/llama-3.1-instruct-medchat/.../checkpoints/`. We'll use this later.

To further configure the run above, try the following:

* **A different PEFT technique**: The `model.peft.peft_scheme` parameter determines the technique being used. This run uses LoRA, but NeMo Framework also supports other techniques such as P-tuning, Adapters, and IA3. For more information, refer to the [Supported PEFT Methods](https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/supported_methods.html). For example, to use P-tuning, set:
    ```bash
    model.peft.peft_scheme="ptuning" # instead of "lora"
    ```

* **Tuning Llama 3.3 70B Instruct**: You will need 4 x H100 GPUs. Provide the path to its `.nemo` checkpoint (obtained with the same download and conversion steps as earlier), and change the model parallelism settings so that the model is distributed across the GPUs.
    ```bash
    model.tensor_model_parallel_size=4
    model.pipeline_model_parallel_size=1
    ```

You can override many such configurations while running the script. A full set of possible configurations is located in [NeMo Framework Github](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml).
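
One consequence of changing the parallelism settings is worth spelling out: GPUs that jointly hold one model copy no longer count as separate data-parallel replicas. The sketch below assumes the usual Megatron-style relation `data-parallel size = GPUs / (tensor-parallel size × pipeline-parallel size)`. The helper names are hypothetical, and whether a micro-batch of 16 actually fits in memory for the 70B model is not something this notebook verifies.

```python
def data_parallel_size(num_gpus: int, tp: int = 1, pp: int = 1) -> int:
    """Number of model replicas when each replica spans tp * pp GPUs."""
    gpus_per_replica = tp * pp
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("num_gpus must be a multiple of tp * pp")
    return num_gpus // gpus_per_replica

def micro_batches_per_step(global_batch, micro_batch, num_gpus, tp=1, pp=1):
    """Micro-batches each replica accumulates before one optimizer step."""
    dp = data_parallel_size(num_gpus, tp, pp)
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global_batch must be a multiple of micro_batch * data-parallel size")
    return global_batch // (micro_batch * dp)

# 8B run in this notebook: 4 GPUs, TP=1, PP=1 -> 4 replicas, no accumulation
print(data_parallel_size(4, tp=1, pp=1), micro_batches_per_step(64, 16, 4, tp=1, pp=1))  # 4 1

# 70B settings above: 4 GPUs, TP=4, PP=1 -> 1 replica, 4 accumulated micro-batches
print(data_parallel_size(4, tp=4, pp=1), micro_batches_per_step(64, 16, 4, tp=4, pp=1))  # 1 4
```

With the same global batch size, moving to `tensor_model_parallel_size=4` therefore means each optimizer step takes roughly four sequential forward/backward passes instead of one.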

## 🛠️ **Step 5: Model Evaluation**

After fine-tuning, we evaluate the model by loading the LoRA adapter on top of the base checkpoint and generating answers on the test set using NeMo's high-level model evaluation script:
[megatron_gpt_generate.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py).

We'll compute two metrics:
- **Exact Match (EM)**: whether the prediction exactly matches the label.
- **Token-level F1**: overlap between prediction and label tokens.
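
A worked example makes the two metrics concrete. The sketch below uses whitespace tokenization and a `Counter` intersection to count overlapping tokens; the prediction and reference strings are invented for illustration.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == ref.strip())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (lowercased, whitespace-split)."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    # Multiset intersection: each token counts at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Invented strings: the prediction is correct but shorter than the reference
pred = "Take the tablet with food"
ref = "Take the tablet with food once daily"

print(exact_match(pred, ref))          # 0.0
print(round(token_f1(pred, ref), 3))   # 0.833
```

Here all 5 predicted tokens appear in the 7-token reference, so precision is 1.0 and recall is 5/7. Exact Match is 0 even though the answer is largely right, which is why EM is expected to stay low for free-form answers and F1 is the more informative of the two.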

In [None]:
%%bash
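# NOTE: this runs in a separate Python process, so it cannot release GPU memory held by the notebook kernel
python -c "import torch; torch.cuda.empty_cache()"

# Check that the LoRA adapter file exists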

PRECISION=bf16
OUTPUT_DIR="lora/llama-3.1-instruct-medchat/8B/$PRECISION"
ls -l $OUTPUT_DIR/checkpoints

In the code snippet below, the following configurations are worth noting: 

1. `model.restore_from_path` points to the `Llama-3_1-Instruct-8B.nemo` file.
2. `model.peft.restore_from_path` points to the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names` points to the `medchat_test.jsonl` file.

If you have made any changes in model or experiment paths, please ensure they are configured correctly below.

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"
OUTPUT_DIR="lora/llama-3.1-instruct-medchat/8B/$PRECISION"
TEST_DS="[datasets/medchat_test.jsonl]"
TEST_NAMES="[medchat]"
SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="$OUTPUT_DIR/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="results/medchatQA_result_lora_tuning_"

export PYTHONWARNINGS="ignore"
export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    model.megatron_amp_O2=True \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=64 \
    model.data.test_ds.tokens_to_generate=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True

In [None]:
!head -n 10 results/medchatQA_result_lora_tuning__test_medchat_inputs_preds_labels.jsonl | jq .

In [None]:
import json

def compute_f1(pred: str, ref: str) -> float:
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    num_common = sum(min(pred_tokens.count(tok), ref_tokens.count(tok)) for tok in common)
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * (precision * recall) / (precision + recall)

# Evaluate results from combined JSONL
file_path = 'results/medchatQA_result_lora_tuning__test_medchat_inputs_preds_labels.jsonl'

exacts = []
f1s = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        pred = obj.get('pred', '').strip()
        label = obj.get('label', '').strip()
        exacts.append(float(pred == label))
        f1s.append(compute_f1(pred, label))

# Aggregate and display
avg_em = sum(exacts) / len(exacts)
avg_f1 = sum(f1s) / len(f1s)
print(f"Average Exact Match (EM): {avg_em:.3f}")
print(f"Average F1 Score: {avg_f1:.3f}")

In [None]:
%%bash -s "$token"

# OPTIONAL: Assess performance on the original model

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1-instruct/8B/nemo/$PRECISION/Llama-3_1-Instruct-8B.nemo"
TEST_DS="[datasets/medchat_test.jsonl]"
TEST_NAMES="[medchat]"
SCHEME="lora"
GPUS=4 # Adjust if necessary
TP_SIZE=1
PP_SIZE=1

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="results/medchatQA_result_no_tuning_"

export PYTHONWARNINGS="ignore"
export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    model.megatron_amp_O2=True \
    model.global_batch_size=64 \
    model.micro_batch_size=16 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=64 \
    model.data.test_ds.tokens_to_generate=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.label_key='output' \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.truncation_field="input" \
    model.data.test_ds.prompt_template="\{input\} \{output\}"

In [None]:
!head -n 10 results/medchatQA_result_no_tuning__test_medchat_inputs_preds_labels.jsonl | jq .

In [None]:
import json

def compute_f1(pred: str, ref: str) -> float:
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    num_common = sum(min(pred_tokens.count(tok), ref_tokens.count(tok)) for tok in common)
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * (precision * recall) / (precision + recall)

# Evaluate results from combined JSONL
file_path = 'results/medchatQA_result_no_tuning__test_medchat_inputs_preds_labels.jsonl'

exacts = []
f1s = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        pred = obj.get('pred', '').strip()
        label = obj.get('label', '').strip()
        exacts.append(float(pred == label))
        f1s.append(compute_f1(pred, label))

# Aggregate and display
avg_em = sum(exacts) / len(exacts)
avg_f1 = sum(f1s) / len(f1s)
print(f"Average Exact Match (EM): {avg_em:.3f}")
print(f"Average F1 Score: {avg_f1:.3f}")