<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>NeMo Framework v24.07</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# Building a Llama-3.3 LoRA Adapter with the NeMo Framework

In this notebook, we demonstrate how to apply Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique, to a Llama model using the NeMo Framework. The code cells fine-tune [**Llama 3.1 8B**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main) by default; the 70B variant follows the same steps with the parallelism settings noted in Step 2. We use [PubMedQA](https://pubmedqa.github.io/), a question-answering dataset derived from biomedical literature, to illustrate how a LoRA adapter can improve model performance in a domain-specific context.

**Disclaimer**: This notebook is adapted from the [NVIDIA NeMo tutorial on biomedical QA with Llama-3](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/biomedical-qa/llama3-lora-nemofw.ipynb).

## Estimating GPU Memory Requirements for Serving LLMs


### **1. Model Size**
Before you begin, it’s essential to understand how much GPU memory you’ll need to serve a large language model (LLM). A commonly used formula is:

$$
M_{\text{model}} = \frac{(P \times 4B)}{(32 / Q)}
$$

**Where:**

- **M**: The GPU memory required (the formula yields bytes; the example below converts to Gigabytes)  
- **P**: The number of parameters in the model (e.g., 7 billion parameters for a 7B model)  
- **4B**: 4 bytes, representing the size of each parameter at full precision (32 bits)  
- **32**: The number of bits in 4 bytes (32 bits)  
- **Q**: The model precision in bits used during serving (e.g., 16 bits, 8 bits, or 4 bits)  

**Explanation:**

- Start with $P \times 4B$ to get the base memory needed for all parameters at full precision (FP32).
- Divide by $(32/Q)$, which scales the memory requirement according to the lower-precision format you’re using. For example, loading a model in 16-bit precision effectively halves the memory usage compared to 32-bit.

#### **Example:**

For a 70B parameter model loaded in 8-bit precision:

- $P = 70 \times 10^9$ ($70$ billion)
- $Q = 8$

Plugging these in:

$$
M_{\text{model}} = \frac{(70 \times 10^9 \times 4B)}{(32 / 8)} 
= \frac{(280 \times 10^9 B)}{4} 
= 70 \times 10^9 B
$$

Convert bytes to gigabytes (1 GB = $10^9$ bytes):

$$
M = 70 \text{ GB}
$$

This rough calculation helps estimate the GPU memory needed for serving large models, ensuring you have the right hardware configuration before starting fine-tuning or inference steps.
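As a quick check of the formula above, the following minimal sketch evaluates it for a 70B-parameter model at several precisions. The function name `model_memory_gb` is made up for this illustration, and it covers the model weights only.

```python
def model_memory_gb(num_params: float, bits: int) -> float:
    """Weight memory in GB: (P * 4 bytes) / (32 / Q), with 1 GB = 1e9 bytes."""
    bytes_fp32 = num_params * 4            # 4 bytes per parameter at FP32
    return bytes_fp32 / (32 / bits) / 1e9  # scale by precision, then bytes -> GB


# Illustrative loop: the same 70B model at four serving precisions
for bits in (32, 16, 8, 4):
    print(f"70B model at {bits:>2}-bit: {model_memory_gb(70e9, bits):.0f} GB")
```

Halving the number of bits halves the estimate, which is why quantized serving formats are attractive when GPU memory is the limiting factor.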

### **2. Context Window**

The **context window** refers to the maximum number of tokens (words or subwords) the model can process in a single inference pass. During inference, the model needs to store activations for each token in the input sequence. This storage requirement scales linearly with the length of the context window.

#### **Memory Calculation for Context Window**

$$
M_{\text{context}} = L \times H \times D \times N
$$

- **$M_{\text{context}}$**: Memory required for the context window (in Gigabytes)
- **$L$**: Length of the context window (number of tokens)
- **$H$**: Hidden size (dimensionality of the model's hidden layers)
- **$D$**: Data type size (bytes per element, e.g., 2 for FP16)
- **$N$**: Number of transformer layers

#### **Example:**

Assume:
- **$L = 1024$** tokens
- **$H = 8192$** dimensions
- **$D = 1$** bytes (for INT8 precision)
- **$N = 80$** number of hidden layers

$$
M_{\text{context}} = 1024 \times 8192 \times 1 \times 80 = 671,088,640 \text{ bytes} \approx 671 \text{ MB}
$$

### **3. Batch Size**

**Batch size** determines how many input sequences the model processes simultaneously. Increasing the batch size can lead to higher GPU memory usage because the model needs to store activations for each sequence in the batch.

#### **Memory Calculation for Batch Size**

$$
M_{\text{batch}} = B \times M_{\text{context}}
$$

- **$M_{\text{batch}}$**: Additional memory required for batching (in Gigabytes)
- **$B$**: Batch size (number of sequences)
- **$M_{\text{context}}$**: Memory per sequence (from context window calculation)

#### **Example:**

Using the previous **$M_{\text{context}} =  671 \text{ MB}$** and a **batch size $B = 8$**:

$$
M_{\text{batch}} = 8 \times  671 \text{ MB} = 5.4 \text{ GB}
$$

### **4. Total Inference Memory Estimation**

Combining all these factors gives a more comprehensive estimate of the GPU memory required for inference:

$$
M_{\text{total}} = M_{\text{model}} + M_{\text{context}} \times B + M_{\text{overhead}}
$$

- **$M_{\text{total}}$**: Total GPU memory required (in Gigabytes)
- **$M_{\text{model}}$**: Memory for the model
- **$M_{\text{context}}$**: Memory per token sequence
- **$B$**: Batch size
- **$M_{\text{overhead}}$**: Additional overhead for operations like caching, temporary buffers, etc. (typically 10-20%)

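#### Example

Using the previous results, and taking the overhead as 20% of the model and batch memory:

$$
M_{\text{total}} \approx 70 \text{ GB} + 5.4 \text{ GB} + 0.2 \times (70 + 5.4) \text{ GB} \approx 90 \text{ GB}
$$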
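The formulas in this section can be combined into one function. The sketch below is an illustration only: the name `estimate_total_memory_gb` is hypothetical, the activation data type size is assumed to be $D = Q/8$ bytes, and the overhead is assumed to be a fixed fraction (20% by default) of the model and batch memory. It is not necessarily how any existing helper computes its estimate.

```python
def estimate_total_memory_gb(P, Q, L, H, B, N, overhead=0.20):
    """Rough inference memory estimate in GB (1 GB = 1e9 bytes).

    P: parameters, Q: precision in bits, L: context length in tokens,
    H: hidden size, B: batch size, N: number of transformer layers.
    """
    m_model = (P * 4) / (32 / Q)       # model weights, in bytes
    m_context = L * H * (Q / 8) * N    # per-sequence memory; assumes D = Q/8 bytes
    m_batch = B * m_context            # all sequences in the batch
    subtotal = m_model + m_batch
    return subtotal * (1 + overhead) / 1e9  # assumed overhead as a fraction of subtotal


# Same numbers as the worked examples: 70B parameters, 8-bit, 1024 tokens, batch of 8
print(f"{estimate_total_memory_gb(70e9, 8, 1024, 8192, 8, 80):.1f} GB")
```

With these inputs the result is close to the 90 GB quoted above; lowering `overhead` to 0.10 gives the lower end of the typical range.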

In [None]:
from utils import estimate_gpu_memory

Q = 16  # 16-bit precision (bfloat16)
L = 1024  # Context window
B = 8  # Batch size

# Example usage for Llama-3.1 8B
P_8B = 8_000_000_000  # 8B parameters
H_8B = 4096  # Hidden size
N_8B = 32  # Number of transformer layers

estimated_memory_8B = estimate_gpu_memory(P_8B, Q, L, H_8B, B, N_8B)
print(f"Estimated GPU Memory Required for Llama-3.1 8B: {estimated_memory_8B:.2f} GB")

# Example usage for Llama-3.1 70B
P_70B = 70_000_000_000  # 70B parameters
H_70B = 8192  # Hidden size
N_70B = 80  # Number of transformer layers

estimated_memory_70B = estimate_gpu_memory(P_70B, Q, L, H_70B, B, N_70B)
print(f"Estimated GPU Memory Required for Llama-3.1 70B: {estimated_memory_70B:.2f} GB")

## Download the Model
Before you begin, ensure you have a local copy of the model. The cells below download Meta Llama 3.1 8B Instruct from Hugging Face using your access token. To work with the Meta Llama 3.3 70B Instruct model instead, obtain it from the official [Hugging Face repository](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/tree/main) and adjust `hf_model` and `hf_model_path` accordingly. All subsequent steps in the notebook depend on this local copy.

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value
hf_model="meta-llama/Llama-3.1-8B-Instruct"
hf_model_path="models/llama-3.1/8B/hf"
snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

In [None]:
%%bash -s "$hf_model_path"

ls $1
du -sh $1

## Convert the Model to NeMo Format

To fully leverage the NeMo toolkit and its ecosystem of training, inference, and deployment tools, it’s often necessary to convert your model into NeMo’s native `.nemo` format. For detailed, step-by-step instructions on performing such conversions, refer to the [NeMo user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/checkpoints/user_guide.html) on checkpoint conversion.

This conversion will help ensure compatibility and streamline the process of fine-tuning, evaluating, and deploying your NeMo-based LLM workflows.

In this case, we will use the `convert_llama_hf_to_nemo.py` script provided by NeMo:

```
$ python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --help
```

```text
    usage: convert_llama_hf_to_nemo.py [-h] --input_name_or_path INPUT_NAME_OR_PATH --output_path OUTPUT_PATH [--hparams_file HPARAMS_FILE] [--precision PRECISION]

    options:
      -h, --help            show this help message and exit
      --input_name_or_path INPUT_NAME_OR_PATH
                            Path to Huggingface LLaMA checkpoints
      --output_path OUTPUT_PATH
                            Path to output .nemo file.
      --hparams_file HPARAMS_FILE
                            Path config for restoring (hparams.yaml).
      --precision PRECISION
                            Model precision
```

Below is a summary of different model precision choices, along with their key trade-offs:
- **FP32 (32-bit Float):** Maximum precision, but slower and uses more memory.
- **FP16 (16-bit Float):** Reduces memory usage and speeds up training, but can be numerically unstable if used alone.
- **BF16 (BFloat16):** Offers similar speed and memory benefits to FP16, but with greater numerical stability due to a larger exponent range, making it more robust than pure FP16.
- **FP16 Mixed Precision:** Employs FP16 for most operations and FP32 for critical ones, striking a balance between performance and stability.
- **BF16 Mixed Precision:** Similar to FP16 mixed, but even more stable, leveraging BF16 for most operations and FP32 where necessary for optimal stability, performance, and memory usage.
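To make the stability argument concrete, the sketch below derives the largest finite value and the machine epsilon of each format from its bit layout (FP32: 8 exponent and 23 mantissa bits, FP16: 5 and 10, BF16: 8 and 7). The helper name `float_format_stats` is made up for this illustration.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Largest finite value and machine epsilon of an IEEE-style binary float."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias  # all-ones mantissa, largest exponent
    epsilon = 2.0 ** -mantissa_bits                        # spacing between 1.0 and the next value
    return max_value, epsilon


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    max_value, eps = float_format_stats(e_bits, m_bits)
    print(f"{name}: max ~ {max_value:.3e}, epsilon = {eps:.3e}")
```

FP16 overflows above 65504, whereas BF16 keeps nearly the same range as FP32 at the cost of a coarser epsilon. This is the trade-off behind BF16 being the more robust choice for training.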

In [None]:
%%bash

HF_MODEL="models/llama-3.1/8B/hf"
PRECISION=bf16
NeMo_MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"

# Modify rope_scaling properties
[ ! -f "$HF_MODEL/config.json.bak" ] && cp "$HF_MODEL/config.json" "$HF_MODEL/config.json.bak"
jq '.rope_scaling = {"factor": 8.000000001, "type": "linear"}' "$HF_MODEL/config.json" > /tmp/config.tmp && mv /tmp/config.tmp "$HF_MODEL/config.json"

export TOKENIZERS_PARALLELISM=false

# Convert model to .nemo 
python3 /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
        --input_name_or_path "$HF_MODEL" \
        --output_path "$NeMo_MODEL" \
        --precision "$PRECISION"

In [None]:
%%bash

PRECISION=bf16
NeMo_MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"

file "$NeMo_MODEL"
du -sh "$NeMo_MODEL"

## Step-by-Step Instructions

This notebook is organized into five main steps:

1. **Prepare the Dataset:**
   Load and preprocess the PubMedQA dataset, ensuring that it’s correctly formatted and ready for fine-tuning.

2. **Run the PEFT Fine-Tuning Script:**
   Apply Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning method, to tailor the Llama model to the PubMedQA domain.

3. **Perform Inference with the NeMo Framework:**
   Use the trained model to generate answers to biomedical questions and observe how it performs on real queries.

4. **Evaluate Model Accuracy:**
   Assess the quality and correctness of the model’s responses to measure improvements gained through the fine-tuning process.
   
5. **Export Model to TensorRT-LLM Format for Inference:**
   Use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM.

### Step 1: Prepare the dataset

Download the PubMedQA dataset and run the pre-processing script in the cloned directory.

In [None]:
%%bash

# Download the dataset and prep. scripts
git clone https://github.com/pubmedqa/pubmedqa.git

# split it into train/val/test datasets
cd pubmedqa/preprocess
python split_dataset.py pqal

The following example shows what a single row looks like in the PubMedQA train, validation and test splits.

```json
"18251357": {
    "QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?",
    "CONTEXTS": [
        "To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother.",
        "A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection.",
        "Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)."
    ],
    "reasoning_required_pred": "yes",
    "reasoning_free_pred": "yes",
    "final_decision": "yes",
    "LONG_ANSWER": "Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent."
},
```

Use the following code to convert the train, validation, and test PubMedQA data into the `JSONL` format that NeMo needs for PEFT.

In [None]:
import json

def read_jsonl(fname):
    obj = []
    with open(fname, 'rt') as f:
        st = f.readline()
        while st:
            obj.append(json.loads(st))
            st = f.readline()
    return obj

def write_jsonl(fname, json_objs):
    with open(fname, 'wt') as f:
        for o in json_objs:
            f.write(json.dumps(o)+"\n")
            
def form_question(obj):
    st = ""    
    for i, label in enumerate(obj['LABELS']):
        st += f"{label}: {obj['CONTEXTS'][i]}\n"
    st += f"QUESTION: {obj['QUESTION']}\n"
    st += f" ### ANSWER (yes|no|maybe): "
    return st

def convert_to_jsonl(data_path, output_path):
    # Use a context manager so the input file is closed after loading
    with open(data_path, 'rt') as f:
        data = json.load(f)
    json_objs = []
    for k in data.keys():
        obj = data[k]
        prompt = form_question(obj)
        completion = obj['final_decision']
        json_objs.append({"input": prompt, "output": f"<<< {completion} >>>"})
    write_jsonl(output_path, json_objs)
    return json_objs


test_json_objs = convert_to_jsonl("pubmedqa/data/test_set.json", "pubmedqa/data/pubmedqa_test.jsonl")
train_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/train_set.json", "pubmedqa/data/pubmedqa_train.jsonl")
dev_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/dev_set.json", "pubmedqa/data/pubmedqa_val.jsonl")

> `Note:` In the output, we enforce the inclusion of the “<<<” and “>>>” markers, which makes it possible to verify during inference that the LoRA-tuned model is being used. This is because the base model can also produce “yes” / “no” responses from zero-shot templates.

After running the above script, you will see the `pubmedqa_train.jsonl`, `pubmedqa_val.jsonl`, and `pubmedqa_test.jsonl` files appear in the data directory.

This is how an example is formatted after the script has converted the PubMedQA data into `JSONL`:

```json
{"input": "QUESTION: Failed IUD insertions in community practice: an under-recognized problem?\nCONTEXT: The data analysis was conducted to describe the rate of unsuccessful copper T380A intrauterine device (IUD) insertions among women using the IUD for emergency contraception (EC) at community family planning clinics in Utah.\n ...  ### ANSWER (yes|no|maybe): ",
"output": "<<< yes >>>"}
```
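Before training, it can be worth checking that every converted record has this shape. The following is a minimal sketch of such a check; the function name `validate_jsonl_lines` and the strict output pattern are assumptions made for illustration, not part of NeMo or the PubMedQA repository.

```python
import json
import re

# Expected form of the "output" field: one of the three labels between the markers
ANSWER_PATTERN = re.compile(r"^<<< (yes|no|maybe) >>>$")


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means every record looks fine."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict) or set(record) != {"input", "output"}:
            problems.append((number, "expected exactly the keys 'input' and 'output'"))
        elif not ANSWER_PATTERN.match(record["output"]):
            problems.append((number, f"unexpected output {record['output']!r}"))
    return problems


# Two hand-written sample lines: the first is well-formed, the second lacks the markers
sample = [
    '{"input": "QUESTION: ...\\n ### ANSWER (yes|no|maybe): ", "output": "<<< yes >>>"}',
    '{"input": "QUESTION: ...", "output": "yes"}',
]
print(validate_jsonl_lines(sample))
```

To check a real file, pass the open file object instead of the sample list, for example `validate_jsonl_lines(open("pubmedqa/data/pubmedqa_train.jsonl"))`.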


In [None]:
%%bash

# Clear any cached mem-map index files (-f: no error if none exist yet)
rm -f pubmedqa/data/*idx*

wc -l pubmedqa/data/pubmedqa_train.jsonl
wc -l pubmedqa/data/pubmedqa_val.jsonl
wc -l pubmedqa/data/pubmedqa_test.jsonl


### Step 2: Run PEFT finetuning script for LoRA

The NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo essentially comes down to running this script.

For this demonstration, this training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
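The stopping rule described above can be pictured with a small patience-based check. This is a generic sketch, not NeMo's actual implementation: the function name `should_stop`, the patience of 3 checks, and the loss values are all made up for illustration.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True if the last `patience` validation checks did not beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False                              # not enough history yet
    best_before = min(val_losses[:-patience])     # best loss before the recent window
    recent_best = min(val_losses[-patience:])     # best loss inside the recent window
    return recent_best > best_before - min_delta


# Hypothetical validation losses, one per validation check
history = []
for step, loss in [(20, 1.10), (40, 0.80), (60, 0.82), (80, 0.81), (100, 0.83)]:
    history.append(loss)
    if should_stop(history):
        print(f"stopping at step {step}")
        break
```

In this made-up trace the loss bottoms out at the second check and training stops three checks later, well before a cap such as `max_steps` would be reached.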

> `NOTE:` In the block of code below, pass the paths to your train and validation data files as well as the path to the .nemo model.

#### Understanding Global Batch Size (GBS) in Multi-GPU Training


##### **1. Global Batch Size (GBS)**
- **Definition:**
  - The **total number of training samples** processed in **one training step** across **all GPUs** involved.

##### **2. Data Parallelism (DP)**
- **Definition:**
  - The **number of GPUs** that each hold a **replica** of the entire model.
  - **Function:** Distributes different data batches to each GPU simultaneously.
  - **GAS (Gradient Accumulation Steps):** The number of micro-batches whose gradients each replica accumulates before performing a parameter update.
  - **DP formula:**
      $$
      \text{Data Parallelism (DP)} = \frac{\text{Total GPUs}}{\text{Tensor Parallelism (TP)} \times \text{Pipeline Parallelism (PP)}}
      $$


##### **3. Micro Batch Size (MB)**
- **Definition:**
  - The **number of samples** processed **per model replica** in a single forward/backward pass.

##### **4. GBS Formula**
$$
\text{Global Batch Size (GBS)} = \text{Data Parallelism (DP)} \times \text{Micro Batch Size (MB)} \times \text{Gradient Accumulation Steps (GAS)}
$$

##### **5. How to Set GBS**
1. **Determine Available GPUs:**
   - Total GPUs (e.g., 4 GPUs).
2. **Choose Data Parallelism (DP):**
   - Divide the total GPUs by TP × PP (e.g., with TP = PP = 1, DP = 4).
3. **Set Micro Batch Size (MB):**
   - Based on GPU memory capacity (e.g., MB = 8).
4. **Calculate GBS:**
   - Use the formula to find GBS (e.g., with GAS = 1, GBS = 4 × 8 × 1 = 32).

##### **Best Practices**
- **Align GBS with DP and MB:**
  - Ensure $\text{GBS} = \text{DP} \times \text{MB} \times \text{GAS}$, so GBS must be a multiple of $\text{DP} \times \text{MB}$.
- **Monitor GPU Utilization:**
  - Use tools like `nvidia-smi` to ensure all GPUs are effectively utilized.
- **Adjust Batch Sizes as Needed:**
  - Optimize **MB** based on memory constraints and **GBS** to balance load.
- **Utilize Gradient Accumulation:**
  - When larger **GBS** is desired but constrained by memory.
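In terms of the quantities above, one parameter update consumes GBS samples, while a single forward/backward pass consumes MB samples on each of the Total GPUs / (TP × PP) model replicas. The sketch below computes how many accumulation steps a given configuration implies; the function name `gradient_accumulation_steps` is hypothetical and the divisibility checks are an assumption about what constitutes a sensible configuration.

```python
def gradient_accumulation_steps(total_gpus, tp, pp, micro_batch, global_batch):
    """Accumulation steps implied by a GPU count, TP/PP sizes, MB and GBS."""
    model_parallel = tp * pp
    if total_gpus % model_parallel:
        raise ValueError("total_gpus must be divisible by TP x PP")
    replicas = total_gpus // model_parallel      # model replicas running in parallel
    samples_per_pass = replicas * micro_batch    # samples consumed by one forward/backward pass
    if global_batch % samples_per_pass:
        raise ValueError("global batch size must be divisible by replicas x micro batch size")
    return global_batch // samples_per_pass


# 1 GPU, TP = PP = 1, MB = 1, GBS = 8
print(gradient_accumulation_steps(1, 1, 1, 1, 8))
# 4 GPUs used as 4 replicas, MB = 8, GBS = 32
print(gradient_accumulation_steps(4, 1, 1, 8, 32))
```

The first call needs 8 accumulation steps per update, the second needs only 1. A configuration where GBS is not a multiple of replicas × MB raises an error instead of silently rounding.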


In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Set paths to the model, train, validation and test sets.
PRECISION=bf16
MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
rm -rf "$OUTPUT_DIR"

TRAIN_DS="[pubmedqa/data/pubmedqa_train.jsonl]"
VALID_DS="[pubmedqa/data/pubmedqa_val.jsonl]"

SCHEME="lora"
GPUS=1       # set equal to 4 for 70B model
TP_SIZE=1    # set equal to 4 for 70B model
PP_SIZE=1

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    trainer.val_check_interval=20 \
    trainer.max_steps=1000 \
    model.megatron_amp_O2=False \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.global_batch_size=8 \
    model.micro_batch_size=1 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=10 \
    model.data.validation_ds.num_workers=10 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

This will create a LoRA adapter - a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/.../checkpoints/`. We'll use this later.

To further configure the run above:

* **A different PEFT technique**: The `peft.peft_scheme` parameter determines the technique being used. In this case, we used LoRA, but NeMo Framework supports other techniques as well, such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, for P-tuning, set:

```bash
model.peft.peft_scheme="ptuning" # instead of "lora"
```

* **Tuning Llama-3.1 70B**: You will need 4xH100 GPUs. Provide the path to its .nemo checkpoint (obtained with the same download and conversion steps as earlier), and change the model parallelization settings so that Llama-3.1 70B PEFT is distributed across the GPUs. It is also recommended to run the fine-tuning script directly from a terminal instead of Jupyter when using more than 1 GPU.
```bash
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=1
```

You can override many such configurations while running the script. The full set of possible configurations is located in the [NeMo Framework GitHub repository](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml).

### Step 3: Inference with NeMo Framework

Text generation within the framework is also possible by running a Python script. Note that this is intended for testing and validation rather than as a full-fledged deployment solution like NVIDIA NIM.

In [None]:
%%bash
# Check that the LORA model file exists

python -c "import torch; torch.cuda.empty_cache()"

PRECISION=bf16
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
ls -l $OUTPUT_DIR/checkpoints

In the code snippet below, the following configurations are worth noting:

1. `model.restore_from_path` is set to the path of the `Llama-3_1-8B-Instruct.nemo` file.
2. `model.peft.restore_from_path` is set to the path of the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names` is set to the path of the `pubmedqa_test.jsonl` file.

If you have made any changes in model or experiment paths, please ensure they are configured correctly below.

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
TEST_DS="[pubmedqa/data/pubmedqa_test.jsonl]"
TEST_NAMES="[pubmedqa]"
SCHEME="lora"
GPUS=1
TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="$OUTPUT_DIR/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="pubmedQA_result_"

export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=1 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=3 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True

### Step 4: Check the model accuracy

Now that the results are in, let's read them and calculate the accuracy on the PubMedQA task. You can compare your accuracy results with the public leaderboard at https://pubmedqa.github.io/.

Let's take a look at one of the predictions in the generated output file. The `pred` key indicates what was generated.

In [None]:
%%bash

tail -n 1 pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl

Note that the model produces output in the specified format, such as `<<< no >>>`.
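Because the answer is wrapped in markers, it can be extracted with a regular expression rather than a substring test. The following is a minimal sketch under that assumption; the helper name `parse_prediction` is made up, and it deliberately returns `None` for any text that lacks both markers so the caller can decide how to treat malformed predictions.

```python
import re

# One of the three labels, wrapped in the <<< ... >>> markers used during fine-tuning
MARKED_ANSWER = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>")


def parse_prediction(pred: str):
    """Return 'yes', 'no' or 'maybe' if the markers are present, otherwise None."""
    match = MARKED_ANSWER.search(pred)
    return match.group(1) if match else None


print(parse_prediction("<<< no >>>"))    # marker-wrapped answer
print(parse_prediction("no, because"))   # no markers, so None
```

A stricter parser like this also makes it visible when a response comes from the base model rather than the LoRA-tuned one, since unmarked answers are not silently accepted.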

The following snippet loads the generated output and calculates accuracy in comparison to the test set using the `evaluation.py` script included in the PubMedQA repo.

In [None]:
import json

answers = []
with open("pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl",'rt') as f:
    st = f.readline()
    while st:
        answers.append(json.loads(st))
        st = f.readline()

In [None]:
with open("./pubmedqa/data/test_set.json", 'rt') as f:
    data_test = json.load(f)

In [None]:
results = {}
sample_id = list(data_test.keys())

for i, key in enumerate(sample_id):
    answer = answers[i]['pred']
    if 'yes' in answer:
        results[key] = 'yes'
    elif 'no' in answer:
        results[key] = 'no'
    elif 'maybe' in answer:
        results[key] = 'maybe'
    else:
        print("Malformed answer: ", answer)
        results[key] = 'maybe'

In [None]:
# Dump results in a format that can be ingested by PubMedQA evaluation file
FILENAME="pubmedqa-llama-3-8b-lora.json"
with open(FILENAME, "w") as f:
    json.dump(results, f)

# Evaluation
!cp $FILENAME ./pubmedqa/
!cd ./pubmedqa/ && python evaluation.py $FILENAME

For the Llama-3-8B-Instruct model, you should see accuracy comparable to the below:
```
Accuracy 0.792000
Macro-F1 0.594778
```
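For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and macro-F1 are defined for a three-class task. It is a plain-Python illustration with a made-up function name and toy labels, not the code used by the PubMedQA `evaluation.py` script. It always averages F1 over the three fixed classes, which is an assumption that can differ from library defaults when a class never occurs.

```python
def accuracy_and_macro_f1(gold, pred, classes=("yes", "no", "maybe")):
    """Accuracy and unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    f1_scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in pairs)   # correctly predicted as c
        fp = sum(g != c and p == c for g, p in pairs)   # wrongly predicted as c
        fn = sum(g == c and p != c for g, p in pairs)   # missed instances of c
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only
gold = ["yes", "yes", "no", "maybe"]
pred = ["yes", "no", "no", "yes"]
acc, macro_f1 = accuracy_and_macro_f1(gold, pred)
print(f"Accuracy {acc:.6f}")
print(f"Macro-F1 {macro_f1:.6f}")
```

Macro-F1 weights every class equally, so a model that rarely predicts the less frequent `maybe` label can score well on accuracy while its macro-F1 stays noticeably lower.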

## Export Model to TensorRT-LLM Format for Inference

In [None]:
from nemo.export.tensorrt_llm import TensorRTLLM

MODEL_DIR="models/llama-3.1/8B/trt_llm/bf16/tp_1"
MODEL_CKPT="models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo"
LORA_CKPT="results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

trt_llm_exporter = TensorRTLLM(
    model_dir=MODEL_DIR,
    lora_ckpt_list=[LORA_CKPT],
)

trt_llm_exporter.export(
    nemo_checkpoint_path=MODEL_CKPT,
    model_type="llama",
    n_gpus=1,
)

In [None]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL_DIR="models/llama-3.1/8B/trt_llm/$PRECISION/tp_1"
mkdir -p "$MODEL_DIR"
MODEL_CKPT="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
LORA_CKPT="results/llama-3.1/8B/$PRECISION/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
    --nemo_checkpoint "$MODEL_CKPT" \
    --lora_ckpt "$LORA_CKPT" \
    --use_lora_plugin \
    --model_type llama \
    --triton_model_name llama3-pubmedqa \
    --triton_model_repository "$MODEL_DIR" \
    --num_gpus 1 \
    --tensor_parallelism_size 1 \
    --pipeline_parallelism_size 1

Open a terminal to query the model:

```shell
QUERY="Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?"

python /opt/NeMo/scripts/deploy/nlp/query.py \
    -mn llama3-pubmedqa \
    -p "$QUERY" \
    -mol 5
```