# Learning Goals

## Optimizing Foundation Models with Parameter-Efficient Fine-Tuning (PEFT)

This notebook demonstrates how to adapt foundation models to improve performance on specific tasks using NeMo 2.0.

This optimization process is known as fine-tuning: adjusting the weights of a pre-trained foundation model with custom data.

Because foundation models can be very large, a variant of fine-tuning known as PEFT has gained traction. PEFT encompasses several methods, including P-Tuning, LoRA, Adapters, and IA3. NeMo 2.0 currently supports the [Low-Rank Adaptation (LoRA)](https://arxiv.org/pdf/2106.09685) method.
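To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is an illustration of the technique from the LoRA paper, not NeMo's implementation; the helper names (`matvec`, `lora_forward`) and the toy matrices are assumptions made for this example. The frozen weight `W` is left untouched, and only the small matrices `A` and `B` would be trained.

```python
# Illustrative LoRA forward pass: y = W x + (alpha / rank) * B (A x).
# All names and values here are hypothetical toy choices, not NeMo code.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Apply the frozen weight W plus the scaled low-rank update B @ A."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: W is 2x3 (frozen); A is 1x3 and B is 2x1, so rank = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0]]      # B starts at zero, so the adapted layer equals the base layer
B_trained = [[1.0], [-1.0]]  # hypothetical values after some training
x = [1.0, 2.0, 3.0]

y_base = lora_forward(W, A, B_zero, x, alpha=1, rank=1)
y_adapted = lora_forward(W, A, B_trained, x, alpha=1, rank=1)
print(y_base)
print(y_adapted)
```

With `B` at zero the output matches the frozen layer exactly, which is why LoRA training starts from the behavior of the pre-trained model.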





## Step 1. Import the Hugging Face Checkpoint
We use the `llm.import_ckpt` API to download the specified model using the "hf://<huggingface_model_id>" URL format and convert it into NeMo 2.0 format. For all models supported in NeMo 2.0, refer to the [Large Language Models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/llms/index.html#large-language-models) section of the NeMo Framework User Guide.

In [1]:
import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
from nemo.collections.llm.peft.lora import LoRA
import torch
import pytorch_lightning as pl
from pathlib import Path
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed


# llm.import_ckpt is the nemo2 API for converting Hugging Face checkpoint to NeMo format
# example usage:
# llm.import_ckpt(model=llm.llama3_8b.model(), source="hf://meta-llama/Meta-Llama-3-8B")
#
# We use run.Partial to configure this function
def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )

# configure your function
import_ckpt = configure_checkpoint_conversion()
# define your executor
local_executor = run.LocalExecutor()

# run your experiment
run.run(import_ckpt, executor=local_executor)




Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.import_ckpt/nemo.collections.llm.api.import_ckpt_1744281251/nemo.collections.llm.api.import_ckpt


Launched app: local_persistent://nemo_run/nemo.collections.llm.api.import_ckpt-xxpgdp94x93v2c


Waiting for job nemo.collections.llm.api.import_ckpt-xxpgdp94x93v2c to finish [log=True]...


mport_ckpt/0 Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.94it/s]
mport_ckpt/0 [INFO     | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: False
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] Rank 0 has data parallel group : [0]
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] Ranks 0 has data parallel rank: 0
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] Rank 0 has context parallel group: [0]
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_logging:393] All context parallel group ranks: [[0]]
mport_ckpt/0 [NeMo I 2025-04-10 10:34:35 nemo_l

Job nemo.collections.llm.api.import_ckpt-xxpgdp94x93v2c finished: SUCCEEDED


## Step 2. Prepare the Data

We will be using SQuAD for this notebook. NeMo 2.0 already provides a `SquadDataModule`. Example usage:

In [10]:
def squad() -> run.Config[pl.LightningDataModule]:
    return run.Config(llm.SquadDataModule, seq_length=2048, micro_batch_size=1, global_batch_size=8, num_workers=0)
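The `micro_batch_size` and `global_batch_size` arguments interact: in Megatron-style training, the global batch is assembled from micro-batches across data-parallel ranks and gradient-accumulation steps. The sketch below works through that arithmetic for the values used here. The helper name and the single-GPU `data_parallel_size=1` are assumptions for illustration.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size=1):
    """Number of micro-batches each rank accumulates before one optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass


# Values from the data module config above, assuming a single GPU.
accumulation = gradient_accumulation_steps(global_batch_size=8, micro_batch_size=1)
print(accumulation)
```

Under these assumptions each optimizer step accumulates gradients over 8 micro-batches of one sample each.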

In [33]:
import json
import shutil
from typing import TYPE_CHECKING, Any, Dict, List, Optional

import numpy as np
from datasets import load_dataset

from nemo.collections.llm.gpt.data.core import get_dataset_root
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.lightning.io.mixin import IOMixin
from nemo.utils import logging

if TYPE_CHECKING:
    from nemo.collections.common.tokenizers import TokenizerSpec
    from nemo.collections.llm.gpt.data.packed_sequence import PackedSequenceSpecs


class DollyDataModule(FineTuningDataModule, IOMixin):
    """A data module for fine-tuning on the Dolly dataset.

    This class inherits from the `FineTuningDataModule` class and is specifically designed for fine-tuning models on the
    "databricks/databricks-dolly-15k" dataset. It handles data download, preprocessing, splitting, and preparing the data
    in a format suitable for training, validation, and testing.

    Args:
        force_redownload (bool, optional): Whether to force re-download the dataset even if it exists locally. Defaults to False.
        delete_raw (bool, optional): Whether to delete the raw downloaded dataset after preprocessing. Defaults to False.
        See FineTuningDataModule for the other args
    """

    def __init__(
        self,
        seq_length: int = 2048,
        tokenizer: Optional["TokenizerSpec"] = None,
        micro_batch_size: int = 4,
        global_batch_size: int = 8,
        rampup_batch_size: Optional[List[int]] = None,
        force_redownload: bool = False,
        delete_raw: bool = False,
        seed: int = 1234,
        memmap_workers: int = 1,
        num_workers: int = 8,
        pin_memory: bool = True,
        persistent_workers: bool = False,
        packed_sequence_specs: Optional["PackedSequenceSpecs"] = None,
        dataset_kwargs: Optional[Dict[str, Any]] = None,
    ):
        self.force_redownload = force_redownload
        self.delete_raw = delete_raw
        print(get_dataset_root("dolly"))
        super().__init__(
            dataset_root=get_dataset_root("dolly"),
            seq_length=seq_length,
            tokenizer=tokenizer,
            micro_batch_size=micro_batch_size,
            global_batch_size=global_batch_size,
            rampup_batch_size=rampup_batch_size,
            seed=seed,
            memmap_workers=memmap_workers,
            num_workers=num_workers,
            pin_memory=pin_memory,
            persistent_workers=persistent_workers,
            packed_sequence_specs=packed_sequence_specs,
            dataset_kwargs=dataset_kwargs,
        )

    def prepare_data(self) -> None:
        # if train file is specified, no need to do anything
        if not self.train_path.exists() or self.force_redownload:
            dset = self._download_data()
            self._preprocess_and_split_data(dset)
        super().prepare_data()

    def _download_data(self):
        logging.info(f"Downloading {self.__class__.__name__}...")
        return load_dataset(
            "databricks/databricks-dolly-15k",
            cache_dir=str(self.dataset_root),
            download_mode="force_redownload" if self.force_redownload else None,
        )

    def _preprocess_and_split_data(self, dset, train_ratio: float = 0.80, val_ratio: float = 0.15):
        logging.info(f"Preprocessing {self.__class__.__name__} to jsonl format and splitting...")

        test_ratio = 1 - train_ratio - val_ratio
        save_splits = {}
        dataset = dset.get('train')
        split_dataset = dataset.train_test_split(test_size=val_ratio + test_ratio, seed=self.seed)
        split_dataset2 = split_dataset['test'].train_test_split(
            test_size=test_ratio / (val_ratio + test_ratio), seed=self.seed
        )
        save_splits['training'] = split_dataset['train']
        save_splits['validation'] = split_dataset2['train']
        save_splits['test'] = split_dataset2['test']

        for split_name, dataset in save_splits.items():
            output_file = self.dataset_root / f"{split_name}.jsonl"
            with output_file.open("w", encoding="utf-8") as f:
                for example in dataset:
                    context = example["context"].strip()
                    if context != "":
                        # Randomize context and instruction order.
                        context_first = np.random.randint(0, 2) == 0
                        if context_first:
                            instruction = example["instruction"].strip()
                            assert instruction != ""
                            _input = f"{context}\n\n{instruction}"
                            _output = example["response"]
                        else:
                            instruction = example["instruction"].strip()
                            assert instruction != ""
                            _input = f"{instruction}\n\n{context}"
                            _output = example["response"]
                    else:
                        _input = example["instruction"]
                        _output = example["response"]

                    f.write(json.dumps({"input": _input, "output": _output, "category": example["category"]}) + "\n")

            logging.info(f"{split_name} split saved to {output_file}")

        if self.delete_raw:
            for p in self.dataset_root.iterdir():
                if p.is_dir():
                    shutil.rmtree(p)
                elif '.jsonl' not in str(p.name):
                    p.unlink()
    
def dolly() -> run.Config[pl.LightningDataModule]:
    return run.Config(DollyDataModule, seq_length=2048, micro_batch_size=1, global_batch_size=8, num_workers=0)
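The two nested `train_test_split` calls in `_preprocess_and_split_data` can look opaque, so the arithmetic is worked through below using the method's default ratios. This is only a sketch of the fractions involved; it does not call the `datasets` library, and the exact row counts depend on how that library rounds.

```python
# Default ratios from _preprocess_and_split_data.
train_ratio, val_ratio = 0.80, 0.15
test_ratio = 1 - train_ratio - val_ratio

# First split: hold out validation and test together.
first_test_size = val_ratio + test_ratio
# Second split: carve the test set out of the held-out portion.
second_test_size = test_ratio / (val_ratio + test_ratio)

fractions = {
    "training": 1 - first_test_size,
    "validation": first_test_size * (1 - second_test_size),
    "test": first_test_size * second_test_size,
}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2f}")
```

The second split uses a `test_size` of roughly 0.25 because the test set is a quarter of the 20% that was held out, which recovers the intended 80/15/5 division.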

In [35]:
ds = DollyDataModule(force_redownload=True)
ds.prepare_data()

/root/.cache/nemo/datasets/dolly
[NeMo I 2025-04-10 12:09:33 nemo_logging:393] Downloading DollyDataModule...


Generating train split: 100%|██████████████████████████████████████████| 15011/15011 [00:00<00:00, 345325.04 examples/s]

[NeMo I 2025-04-10 12:09:34 nemo_logging:393] Preprocessing DollyDataModule to jsonl format and splitting...





[NeMo I 2025-04-10 12:09:35 nemo_logging:393] training split saved to /root/.cache/nemo/datasets/dolly/training.jsonl
[NeMo I 2025-04-10 12:09:35 nemo_logging:393] validation split saved to /root/.cache/nemo/datasets/dolly/validation.jsonl
[NeMo I 2025-04-10 12:09:35 nemo_logging:393] test split saved to /root/.cache/nemo/datasets/dolly/test.jsonl


In [37]:
%%bash
head -n 100 /root/.cache/nemo/datasets/dolly/training.jsonl > hehe.jsonl
head -n 3 hehe.jsonl

{"input": "Which is a species of fish? Poacher or Hunter", "output": "Poacher", "category": "classification"}
{"input": "The genre has existed since the early years of silent cinema, when Georges Melies' A Trip to the Moon (1902) employed trick photography effects. The next major example (first in feature length in the genre) was the film Metropolis (1927). From the 1930s to the 1950s, the genre consisted mainly of low-budget B movies. After Stanley Kubrick's landmark 2001: A Space Odyssey (1968), the science fiction film genre was taken more seriously. In the late 1970s, big-budget science fiction films filled with special effects became popular with audiences after the success of Star Wars (1977) and paved the way for the blockbuster hits of subsequent decades.\n\nExtract all the movies from this passage and the year they were released out. Write each movie as a separate sentence", "output": "A Trip to the Moon was released in 1902. Metropolis came out in 1927. 2001: A Space Odyssey 

In [55]:
from pathlib import Path
import json
import shutil
import numpy as np
from datasets import load_dataset
from nemo.lightning.io.mixin import IOMixin
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.utils import logging
from typing import Optional, List, Dict, Any, TYPE_CHECKING

if TYPE_CHECKING:
    from nemo.collections.common.tokenizers import TokenizerSpec
    from nemo.collections.llm.gpt.data.packed_sequence import PackedSequenceSpecs

class DollyDataModule(FineTuningDataModule, IOMixin):
    """
    A custom data module that uses pre-processed JSONL files as the data source.
    
    The expected files (in JSONL format) are:
      - training.jsonl
      - testing.jsonl
    Optionally, if a validation split is available, use validation.jsonl.
    
    These files should reside under the provided dataset_root directory.
    
    Args:
        seq_length (int): The maximum sequence length.
        tokenizer (Optional[TokenizerSpec]): An initialized tokenizer.
        micro_batch_size (int): The size for micro-batches.
        global_batch_size (int): The overall batch size.
        rampup_batch_size (Optional[List[int]]): Ramp-up schedule for the batch size.
        delete_raw (bool): Not used here since the dataset is pre-processed.
        seed (int): Random seed.
        memmap_workers (int): Number of memmap workers.
        num_workers (int): Number of workers to use for data loading.
        pin_memory (bool): Whether to pin memory.
        persistent_workers (bool): Use persistent workers.
        packed_sequence_specs (Optional[PackedSequenceSpecs]): Specifications for packed sequences.
        dataset_root (Optional[Path]): Root directory containing the JSONL files. Defaults to "/workspace/datasets/hey2"; "./custom_dataset" is used if None is passed.
        dataset_kwargs (Optional[Dict[str, Any]]): Additional keyword arguments.
    """
    def __init__(
        self,
        seq_length: int = 2048,
        tokenizer: Optional["TokenizerSpec"] = None,
        micro_batch_size: int = 4,
        global_batch_size: int = 8,
        rampup_batch_size: Optional[List[int]] = None,
        delete_raw: bool = False,
        seed: int = 1234,
        memmap_workers: int = 1,
        num_workers: int = 8,
        pin_memory: bool = True,
        persistent_workers: bool = False,
        packed_sequence_specs: Optional["PackedSequenceSpecs"] = None,
        dataset_root: Optional[Path] = Path("/workspace/datasets/hey2"),
        dataset_kwargs: Optional[Dict[str, Any]] = None,
    ):
        if dataset_root is None:
            dataset_root = Path("./custom_dataset")
        self.dataset_root = dataset_root
        # Define file paths for training, testing, and optionally validation.
        self.train_path = self.dataset_root / "training.jsonl"
        self.test_path = self.dataset_root / "testing.jsonl"
        # If a validation file exists, it will be used.
        self.validation_path = self.dataset_root / "validation.jsonl"

        super().__init__(
            dataset_root=dataset_root,
            seq_length=seq_length,
            tokenizer=tokenizer,
            micro_batch_size=micro_batch_size,
            global_batch_size=global_batch_size,
            rampup_batch_size=rampup_batch_size,
            seed=seed,
            memmap_workers=memmap_workers,
            num_workers=num_workers,
            pin_memory=pin_memory,
            persistent_workers=persistent_workers,
            packed_sequence_specs=packed_sequence_specs,
            dataset_kwargs=dataset_kwargs,
        )

    def prepare_data(self) -> None:
        """
        Check for the existence of the pre-processed JSONL files.
        If necessary files are missing, a FileNotFoundError is raised.
        """
        if not self.train_path.exists():
            raise FileNotFoundError(f"Training file not found at: {self.train_path}")
        if not self.test_path.exists():
            raise FileNotFoundError(f"Testing file not found at: {self.test_path}")
        logging.info("Custom dataset files found at %s", self.dataset_root)
        # Call super() to allow further processing (e.g., tokenization and splitting into shards)
        super().prepare_data()

    def _setup_datasets(self):
        """
        Load the training, testing (and optionally validation) datasets from the JSONL files.
        The Hugging Face 'load_dataset' function is used to create a DatasetDict.
        """
        logging.info("Loading custom JSONL datasets...")
        data_files = {"train": str(self.train_path), "test": str(self.test_path)}
        # Optionally add the validation split if the file exists.
        if self.validation_path.exists():
            data_files["validation"] = str(self.validation_path)
        dataset = load_dataset("json", data_files=data_files)
        logging.info("Custom datasets loaded successfully.")
        return dataset

    # If needed, you can override other methods (e.g., setup, train_dataloader) 
    # to further control data preparation and batching.


def customDS() -> run.Config[pl.LightningDataModule]:
    return run.Config(DollyDataModule, seq_length=2048, micro_batch_size=1, global_batch_size=8, num_workers=0)

To learn how to use your own data to create a custom `DataModule` for performing PEFT, refer to the [NeMo 2.0 SFT notebook](./nemo2-sft.ipynb).

## Step 3.1: Configure PEFT with NeMo 2.0 API and NeMo-Run

The following Python script uses the NeMo 2.0 API to perform PEFT. It configures the components below for training. These components are largely the same for SFT and PEFT, since both use the `llm.finetune` API. To switch from SFT to PEFT, pass the LoRA adapter as the `peft` argument.

### Configure the Trainer
The NeMo 2.0 Trainer works similarly to the PyTorch Lightning trainer.


In [56]:
def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=1
    )
    trainer = run.Config(
        nl.Trainer,
        devices=1,
        max_steps=20,
        accelerator="gpu",
        strategy=strategy,
        plugins=bf16_mixed(),
        log_every_n_steps=1,
        limit_val_batches=2,
        val_check_interval=2,
        num_sanity_val_steps=0,
    )
    return trainer


def logger() -> run.Config[nl.NeMoLogger]:
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return run.Config(
        nl.NeMoLogger,
        name="nemo2_peft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None
    )

def adam_with_cosine_annealing() -> run.Config[nl.OptimizerModule]:
    opt_cfg = run.Config(
        OptimizerConfig,
        optimizer="adam",
        lr=0.0001,
        adam_beta2=0.98,
        use_distributed_optimizer=True,
        clip_grad=1.0,
        bf16=True,
    )
    return run.Config(
        nl.MegatronOptimizerModule,
        config=opt_cfg
    )


### Pass in the LoRA Adapter
To perform LoRA fine-tuning, we pass the LoRA adapter to the fine-tuning API. The adapter can be configured as follows. The supported target modules are `linear_qkv`, `linear_proj`, `linear_fc1`, and `linear_fc2`. In the final script, we use the default LoRA configuration (`llm.peft.LoRA()`), which targets the full list with `dim=32`.
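To get a feel for what `dim=32` means for the four target modules, the sketch below counts adapter parameters per transformer layer. The layer shapes are assumptions based on the published Llama-3-8B architecture (hidden size 4096, FFN size 14336, grouped-query attention with 8 KV heads, fused QKV and fused gate/up projections); they are not read from NeMo, and the actual adapter layout in NeMo may differ.

```python
def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# ASSUMED (in_features, out_features) per module for one Llama-3-8B layer.
assumed_shapes = {
    "linear_qkv": (4096, 4096 + 2 * 1024),  # fused Q, K, V with 8 KV heads of size 128
    "linear_proj": (4096, 4096),
    "linear_fc1": (4096, 2 * 14336),        # fused gate and up projections
    "linear_fc2": (14336, 4096),
}

rank = 32
adapter_total = 0
frozen_total = 0
for name, (n_in, n_out) in assumed_shapes.items():
    adapter = lora_params(n_in, n_out, rank)
    adapter_total += adapter
    frozen_total += n_in * n_out
    print(f"{name}: {adapter:,} adapter parameters")

print(f"adapter share of these weights: {adapter_total / frozen_total:.2%}")
```

Under these assumed shapes the adapters add roughly 1% on top of the frozen linear weights they wrap, which is the main reason LoRA fits on far less memory than full fine-tuning.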

In [57]:
def lora() -> run.Config[nl.pytorch.callbacks.PEFT]:
    return run.Config(LoRA)

### Configure the Base Model
We will perform PEFT on top of Llama-3-8b, so we create a `LlamaModel` to pass to the NeMo 2.0 finetune API.
### Auto Resume
In NeMo 2.0, we can directly pass in the Llama3-8b Hugging Face ID to start PEFT without manually converting it into the NeMo checkpoint, as required in NeMo 1.0.

In [58]:
def llama3_8b() -> run.Config[pl.LightningModule]:
    return run.Config(llm.LlamaModel, config=run.Config(llm.Llama3Config8B))

def resume() -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig,
            path="nemo://meta-llama/Meta-Llama-3-8B"
        ),
        resume_if_exists=True,
    )


### Configure the NeMo 2.0 finetune API
Using all the components we created above, we can call the NeMo 2.0 finetune API. Example usage in Python:
```
llm.finetune(
    model=llama3_8b(),
    data=squad(),
    trainer=trainer(),
    peft=lora(),
    log=logger(),
    optim=adam_with_cosine_annealing(),
    resume=resume(),
)
```
We configure the `llm.finetune` API as below:

In [59]:
def configure_finetuning_recipe():
    return run.Partial(
        llm.finetune,
        model=llama3_8b(),
        trainer=trainer(),
        data=customDS(),
        log=logger(),
        peft=lora(),
        optim=adam_with_cosine_annealing(),
        resume=resume(),
    )

## Step 3.2: Run PEFT with NeMo 2.0 API and NeMo-Run

We use `LocalExecutor` to execute our configured finetune function. For more details on NeMo-Run executors, refer to the [Execute NeMo Run](https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md) guide in the NeMo-Run documentation.

In [62]:
def local_executor_torchrun(nodes: int = 1, devices: int = 1) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

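if __name__ == '__main__':
    # DollyDataModule is defined in a cell above, so no import is needed here.
    # `import DollyDataModule` looks for a module file of that name and raises
    # ModuleNotFoundError. If the launched process cannot resolve the class,
    # move it into an importable .py module and import it from there instead.
    run.run(configure_finetuning_recipe(), executor=local_executor_torchrun())
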

## Step 4. Generate Results from Trained PEFT Checkpoints 

We use the `llm.generate` API in NeMo 2.0 to generate results from the trained PEFT checkpoint. Find your last saved checkpoint from your experiment dir: `results/nemo2_peft/checkpoints`. 

In [11]:
peft_ckpt_path=str(next((d for d in Path("./results/nemo2_peft/checkpoints/").iterdir() if d.is_dir() and d.name.endswith("-last")), None))
print("We will load PEFT checkpoint from:", peft_ckpt_path)

We will load PEFT checkpoint from: results/nemo2_peft/checkpoints/nemo2_peft--reduced_train_loss=0.2265-epoch=0-consumed_samples=160.0-last
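The checkpoint directory name above encodes the monitored loss, the epoch, and the number of consumed samples. Here is a minimal sketch for pulling those values out, assuming the `key=value` pattern seen in that path holds in general; the `parse_ckpt_name` helper and its regular expression are illustrative and not part of NeMo. It also checks the sample count against the trainer settings used in this notebook (20 steps at a global batch size of 8).

```python
import re


def parse_ckpt_name(name):
    """Extract numeric key=value fields from a checkpoint directory name."""
    return {key: float(value) for key, value in re.findall(r"(\w+)=([0-9.]+)", name)}


ckpt_name = "nemo2_peft--reduced_train_loss=0.2265-epoch=0-consumed_samples=160.0-last"
fields = parse_ckpt_name(ckpt_name)
print(fields)

# Settings from the trainer and data module configured earlier.
max_steps, global_batch_size = 20, 8
print(fields["consumed_samples"] == max_steps * global_batch_size)
```

The consumed sample count in the name is consistent with 20 optimizer steps of 8 samples each.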


The SQuAD test set contains over 10,000 samples. For a quick demonstration, we will use the first 100 lines as an example input. 

In [12]:
%%bash
head -n 100 /root/.cache/nemo/datasets/squad/test.jsonl > toy_testset.jsonl
head -n 3 /root/.cache/nemo/datasets/squad/test.jsonl

{"input": "Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50. Question: Which NFL team represented the AFC at Super Bowl 50? Answer:", "output": "Denver Broncos", "original_answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"]}
{"input": "Context

We will pass the string `toy_testset.jsonl` to the `input_dataset` parameter of `llm.generate`. To evaluate the entire test set, you can instead pass the SQuAD data module directly, using `input_dataset=squad()`. The input JSONL file should follow the format shown above, containing `input` and `output` fields (additional keys are optional).
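Before running inference on your own file, it can help to confirm that every line has the expected fields. The following is a small sketch of such a check using only the standard library; the `validate_jsonl_lines` helper is hypothetical and is not something `llm.generate` provides.

```python
import json


def validate_jsonl_lines(lines, required=("input", "output")):
    """Return (line_number, problem) pairs for lines that are not usable records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((number, f"missing keys: {missing}"))
    return problems


sample_lines = [
    '{"input": "Question?", "output": "Answer", "category": "qa"}',
    '{"input": "no output field"}',
    "not json at all",
]
problems = validate_jsonl_lines(sample_lines)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle, for example `validate_jsonl_lines(open("toy_testset.jsonl", encoding="utf-8"))`.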

In [None]:
from megatron.core.inference.common_inference_params import CommonInferenceParams


def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=1
    )
    trainer = run.Config(
        nl.Trainer,
        accelerator="gpu",
        devices=1,
        num_nodes=1,
        strategy=strategy,
        plugins=bf16_mixed(),
    )
    return trainer

def configure_inference():
    return run.Partial(
        llm.generate,
        path=str(peft_ckpt_path),
        trainer=trainer(),
        input_dataset="toy_testset.jsonl",
        inference_params=CommonInferenceParams(num_tokens_to_generate=20, top_k=1),
        output_path="peft_prediction.jsonl",
    )


def local_executor_torchrun(nodes: int = 1, devices: int = 1) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

if __name__ == '__main__':
    run.run(configure_inference(), executor=local_executor_torchrun())


After the inference is complete, you will see results similar to the following:

In [14]:
%%bash
head -n 3 peft_prediction.jsonl

{"input": "Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50. Question: Which NFL team represented the AFC at Super Bowl 50? Answer:", "original_answers": ["Denver Broncos", "Denver Broncos", "Denver Broncos"], "label": "Denver Broncos", "prediction": " Den

## Step 5. Calculate Evaluation Metrics

We can evaluate the model's predictions by calculating the Exact Match (EM) and F1 scores.
- Exact Match is a binary measure (0 or 1) that checks whether the model output exactly matches one of the ground-truth answers.
- F1 score is the harmonic mean of precision and recall for the answer words.

Below is a script that computes these metrics. The sample scores can be improved by training the model further and performing hyperparameter tuning. In this notebook, we only train for 20 steps.


In [15]:
!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file peft_prediction.jsonl --label_field "original_answers" --pred_field "prediction"

exact_match 0.000	f1 29.133	rougeL 34.474	total 100.000
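For intuition, the two metrics described above can be sketched in a few lines. This is a simplified, SQuAD-style illustration and not the logic of `peft_metric_calc.py`; the normalization rules (lowercasing, stripping punctuation and English articles) and the function names are assumptions made for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize(prediction)
    return int(any(pred == normalize(ref) for ref in references))


def f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


references = ["Denver Broncos", "Denver Broncos", "Denver Broncos"]
print(exact_match(" Denver Broncos", references))
print(exact_match("Denver", references))
print(round(max(f1("Denver", ref) for ref in references), 3))
```

A partially correct prediction such as "Denver" scores 0 on exact match but still earns partial credit on F1, which is why the F1 score can be well above zero even when exact match is 0.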
