<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>NeMo Framework v24.07</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# Building a Llama-3.3 LoRA Adapter with the NeMo Framework

In this notebook, we demonstrate how to apply Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique, to a Llama model using the NeMo Framework. The cells below were run with [**Llama 3.1 8B Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main); the same workflow applies to Llama 3.3 70B Instruct with the multi-GPU settings noted in the training cell. We use [PubMedQA](https://pubmedqa.github.io/), a question-answering dataset derived from biomedical literature, to illustrate how a LoRA adapter can adapt a model to a specific domain.

**Disclaimer**: This notebook is adapted from the [NVIDIA NeMo tutorial on biomedical QA with Llama-3](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/biomedical-qa/llama3-lora-nemofw.ipynb).

## Estimating GPU Memory Requirements for Serving LLMs


### **1. Model Size**
Before you begin, it’s essential to understand how much GPU memory you’ll need to serve a large language model (LLM). A commonly used formula is:

$$
M_{\text{model}} = \frac{(P \times 4B)}{(32 / Q)}
$$

**Where:**

- **$M_{\text{model}}$**: The GPU memory required for the model weights (in bytes, since $P \times 4B$ is in bytes; divide by $10^9$ for gigabytes)
- **P**: The number of parameters in the model (e.g., 7 billion parameters for a 7B model)
- **4B**: 4 bytes, the size of each parameter at full precision (32 bits)
- **32**: The number of bits in 4 bytes
- **Q**: The model precision in bits used during serving (e.g., 16 bits, 8 bits, or 4 bits)

**Explanation:**

- Start with $P \times 4B$ to get the base memory needed for all parameters at full precision (FP32).
- Divide by $(32/Q)$, which scales the memory requirement according to the lower-precision format you’re using. For example, loading a model in 16-bit precision halves the memory usage compared to 32-bit.
- Equivalently, $M_{\text{model}} = P \times Q / 8$ bytes, because each parameter occupies $Q/8$ bytes.

#### **Example:**

For a 70B parameter model loaded in 8-bit precision:

- $P = 70 \times 10^9$ ($70$ billion)
- $Q = 8$

Plugging these in:

$$
M_{\text{model}} = \frac{(70 \times 10^9 \times 4B)}{(32 / 8)} 
= \frac{(280 \times 10^9 B)}{4} 
= 70 \times 10^9 B
$$

Convert bytes to gigabytes (1 GB = $10^9$ bytes):

$$
M = 70 \text{ GB}
$$

This rough calculation helps estimate the GPU memory needed for serving large models, ensuring you have the right hardware configuration before starting fine-tuning or inference steps.
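
As a quick check of the formula, the sketch below evaluates it for several precisions. The function name `model_memory_gb` is ours for illustration and is not part of NeMo; it assumes 1 GB = $10^9$ bytes, as in the example above.

```python
def model_memory_gb(num_params: float, bits: int) -> float:
    """Weight memory in GB: (P * 4 bytes) / (32 / Q), with 1 GB = 1e9 bytes."""
    bytes_full_precision = num_params * 4   # FP32 baseline, 4 bytes per parameter
    scale = 32 / bits                       # reduction factor relative to FP32
    return bytes_full_precision / scale / 1e9


# Weight memory of a 70B-parameter model at common serving precisions
for bits in (32, 16, 8, 4):
    print(f"70B at {bits:>2}-bit: {model_memory_gb(70e9, bits):6.1f} GB")
```

The estimate covers the weights only; activations, batching, and runtime overhead are added in the next sections.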

### **2. Context Window**

The **context window** refers to the maximum number of tokens (words or subwords) the model can process in a single inference pass. During inference, the model needs to store activations for each token in the input sequence. This storage requirement scales linearly with the length of the context window.

#### **Memory Calculation for Context Window**

$$
M_{\text{context}} = L \times H \times D \times N
$$

- **$M_{\text{context}}$**: Memory required for the context window (in bytes when $D$ is given in bytes per element)
- **$L$**: Length of the context window (number of tokens)
- **$H$**: Hidden size (dimensionality of the model's hidden layers)
- **$D$**: Data type size (bytes per element, e.g., 2 for FP16)
- **$N$**: Number of transformer layers

#### **Example:**

Assume:
- **$L = 1024$** tokens
- **$H = 8192$** dimensions
- **$D = 1$** byte (for INT8 precision)
- **$N = 80$** transformer layers

$$
M_{\text{context}} = 1024 \times 8192 \times 1 \times 80 = 671,088,640 \text{ bytes} \approx 671 \text{ MB}
$$

### **3. Batch Size**

**Batch size** determines how many input sequences the model processes simultaneously. Increasing the batch size can lead to higher GPU memory usage because the model needs to store activations for each sequence in the batch.

#### **Memory Calculation for Batch Size**

$$
M_{\text{batch}} = B \times M_{\text{context}}
$$

- **$M_{\text{batch}}$**: Additional memory required for batching (in Gigabytes)
- **$B$**: Batch size (number of sequences)
- **$M_{\text{context}}$**: Memory per sequence (from context window calculation)

#### **Example:**

Using the previous **$M_{\text{context}} =  671 \text{ MB}$** and a **batch size $B = 8$**:

$$
M_{\text{batch}} = 8 \times  671 \text{ MB} = 5.4 \text{ GB}
$$
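
The two examples above can be reproduced with a few lines of Python. This is a minimal sketch of the rough estimate used in this notebook; the helper name `context_memory_bytes` is hypothetical.

```python
def context_memory_bytes(seq_len: int, hidden: int, dtype_bytes: int, layers: int) -> int:
    """Per-sequence activation memory: L * H * D * N, in bytes."""
    return seq_len * hidden * dtype_bytes * layers


# Values from the examples above: 1024 tokens, hidden size 8192, INT8, 80 layers
per_sequence = context_memory_bytes(seq_len=1024, hidden=8192, dtype_bytes=1, layers=80)
batch = 8 * per_sequence  # batch size B = 8

print(f"per sequence: {per_sequence:,} bytes ({per_sequence / 1e6:.0f} MB)")
print(f"batch of 8:   {batch / 1e9:.2f} GB")
```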

### **4. Total Inference Memory Estimation**

Combining all these factors gives a more comprehensive estimate of the GPU memory required for inference:

$$
M_{\text{total}} = M_{\text{model}} + M_{\text{context}} \times B + M_{\text{overhead}}
$$

- **$M_{\text{total}}$**: Total GPU memory required (in Gigabytes)
- **$M_{\text{model}}$**: Memory for the model
- **$M_{\text{context}}$**: Memory per token sequence
- **$B$**: Batch size
- **$M_{\text{overhead}}$**: Additional overhead for operations like caching, temporary buffers, etc. (typically 10-20%)

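#### **Example:**

Using the previous results ($M_{\text{model}} = 70$ GB and $M_{\text{context}} \times B \approx 5.4$ GB) and taking the overhead as 20% of their sum:

$$
M_{\text{total}} \approx 70 + 5.4 + 0.2 \times (70 + 5.4) \approx 90 \text{ GB}
$$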

In [2]:
from utils import estimate_gpu_memory

Q = 16    # Precision in bits (bfloat16)
L = 1024  # Context window (tokens)
B = 8     # Batch size

# Example usage for Llama 3.1 8B
P_8B = 8_000_000_000  # 8B parameters
H_8B = 4096           # Hidden size
N_8B = 32             # Number of transformer layers

estimated_memory_8B = estimate_gpu_memory(P_8B, Q, L, H_8B, B, N_8B)
print(f"Estimated GPU Memory Required for LLama-3 8B: {estimated_memory_8B:.2f} GB")

# Example usage for a Llama 70B model
P_70B = 70_000_000_000  # 70B parameters
H_70B = 8192            # Hidden size
N_70B = 80              # Number of transformer layers

estimated_memory_70B = estimate_gpu_memory(P_70B, Q, L, H_70B, B, N_70B)
print(f"Estimated GPU Memory Required for LLama-3 70B: {estimated_memory_70B:.2f} GB")

Estimated GPU Memory Required for LLama-3 8B: 21.78 GB
Estimated GPU Memory Required for LLama-3 70B: 180.88 GB
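
`estimate_gpu_memory` comes from the notebook's local `utils` module, whose source is not shown here. The sketch below is a hypothetical stand-in that follows the formulas of this section and assumes a fixed 20% overhead and the same precision for weights and activations. Under those assumptions it reproduces the two figures printed above, but it should be read as an illustration rather than the actual helper.

```python
def estimate_gpu_memory_sketch(P, Q, L, H, B, N, overhead=0.20):
    """Hypothetical stand-in for utils.estimate_gpu_memory; returns GB (1 GB = 1e9 bytes)."""
    bytes_per_value = Q / 8                      # e.g. 2 bytes at 16-bit precision
    model_bytes = P * bytes_per_value            # same as (P * 4) / (32 / Q)
    context_bytes = L * H * bytes_per_value * N  # per-sequence activations
    subtotal = model_bytes + B * context_bytes
    return subtotal * (1 + overhead) / 1e9       # assumed overhead: 20% of the subtotal


print(f"{estimate_gpu_memory_sketch(8e9, 16, 1024, 4096, 8, 32):.2f} GB")   # 21.78 GB
print(f"{estimate_gpu_memory_sketch(70e9, 16, 1024, 8192, 8, 80):.2f} GB")  # 180.88 GB
```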


## Download the Model
Before you begin, ensure you have a local copy of the model weights. The cells below download Meta Llama 3.1 8B Instruct from Hugging Face; to work with Llama 3.3 70B Instruct instead, point `hf_model` at the official [Hugging Face repository](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/tree/main). Both repositories are gated, so you need a Hugging Face access token for an account that has accepted the model license.

In [3]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

Password(description='Hugging Face Token:')

In [4]:
token = pwd.value
hf_model="meta-llama/Llama-3.1-8B-Instruct"
hf_model_path="models/llama-3.1/8B/hf"
snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

'/work/ucloud-workshop-11-12-2024/models/llama-3.1/8B/hf'

In [5]:
%%bash -s "$hf_model_path"

ls $1
du -sh $1

LICENSE
README.md
USE_POLICY.md
config.json
config.json.bak
generation_config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
original
special_tokens_map.json
tokenizer.json
tokenizer_config.json
30G	models/llama-3.1/8B/hf
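
The `config.json` listed above carries the hidden size and layer count that the memory estimate earlier in this notebook takes as `H` and `N`. As a small sketch, they can be read from the file instead of being hard-coded. The helper name `read_model_dims` is ours; the keys `hidden_size` and `num_hidden_layers` are the ones printed in the conversion log further down.

```python
import json
from pathlib import Path


def read_model_dims(model_dir):
    """Return (hidden_size, num_hidden_layers) from a Hugging Face config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config["hidden_size"], config["num_hidden_layers"]


config_dir = Path("models/llama-3.1/8B/hf")
if (config_dir / "config.json").exists():  # only runs once the model is downloaded
    H, N = read_model_dims(config_dir)
    print(f"hidden_size={H}, num_hidden_layers={N}")
```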


## Convert the Model to NeMo Format

To fully leverage the NeMo toolkit and its ecosystem of training, inference, and deployment tools, it’s often necessary to convert your model into NeMo’s native `.nemo` format. For detailed, step-by-step instructions on performing such conversions, refer to the [NeMo user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/checkpoints/user_guide.html) on checkpoint conversion.

This conversion will help ensure compatibility and streamline the process of fine-tuning, evaluating, and deploying your NeMo-based LLM workflows.

In this case, we will use the `convert_llama_hf_to_nemo.py` script provided by NeMo:

```
$ python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --help
```

```text
    usage: convert_llama_hf_to_nemo.py [-h] --input_name_or_path INPUT_NAME_OR_PATH --output_path OUTPUT_PATH [--hparams_file HPARAMS_FILE] [--precision PRECISION]

    options:
      -h, --help            show this help message and exit
      --input_name_or_path INPUT_NAME_OR_PATH
                            Path to Huggingface LLaMA checkpoints
      --output_path OUTPUT_PATH
                            Path to output .nemo file.
      --hparams_file HPARAMS_FILE
                            Path config for restoring (hparams.yaml).
      --precision PRECISION
                            Model precision
```

Below is a summary of different model precision choices, along with their key trade-offs:
- **FP32 (32-bit Float):** Maximum precision, but slower and uses more memory.
- **FP16 (16-bit Float):** Reduces memory usage and speeds up training, but can be numerically unstable if used alone.
- **BF16 (BFloat16):** Offers similar speed and memory benefits to FP16, but with greater numerical stability due to a larger exponent range, making it more robust than pure FP16.
- **FP16 Mixed Precision:** Employs FP16 for most operations and FP32 for critical ones, striking a balance between performance and stability.
- **BF16 Mixed Precision:** Similar to FP16 mixed, but even more stable, leveraging BF16 for most operations and FP32 where necessary for optimal stability, performance, and memory usage.
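
To make the range-versus-precision trade-off concrete, the sketch below derives the largest finite value and the machine epsilon of each format from its bit layout (FP32: 8 exponent and 23 mantissa bits; FP16: 5 and 10; BF16: 8 and 7). It uses plain Python arithmetic rather than any NeMo or PyTorch API, and the function name is ours.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int):
    """Largest finite value and machine epsilon of an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return max_value, epsilon


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    max_value, eps = float_format_stats(e_bits, m_bits)
    print(f"{name}: max ~ {max_value:.5g}, epsilon = {eps:.3g}")
```

BF16 shares the exponent width of FP32, so it covers nearly the same range, while FP16 tops out at 65504 and therefore overflows far sooner. The price is a coarser epsilon for BF16, which is why mixed-precision schemes keep sensitive operations in FP32.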

In [7]:
%%bash

HF_MODEL="models/llama-3.1/8B/hf"
PRECISION=bf16
NeMo_MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"

# Modify rope_scaling properties
[ ! -f "$HF_MODEL/config.json.bak" ] && cp "$HF_MODEL/config.json" "$HF_MODEL/config.json.bak"
jq '.rope_scaling = {"factor": 8.000000001, "type": "linear"}' "$HF_MODEL/config.json" > /tmp/config.tmp && mv /tmp/config.tmp "$HF_MODEL/config.json"

export TOKENIZERS_PARALLELISM=false

# Convert model to .nemo 
python3 /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
        --input_name_or_path "$HF_MODEL" \
        --output_path "$NeMo_MODEL" \
        --precision "$PRECISION"

`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).


[NeMo I 2024-12-10 15:40:41 convert_llama_hf_to_nemo:111] loading checkpoint models/llama-3.1/8B/hf


Loading checkpoint shards: 100%|██████████| 4/4 [00:26<00:00,  6.55s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


hf_config: {'vocab_size': 128256, 'max_position_embeddings': 131072, 'hidden_size': 4096, 'intermediate_size': 14336, 'num_hidden_layers': 32, 'num_attention_heads': 32, 'num_key_value_heads': 8, 'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-05, 'pretraining_tp': 1, 'use_cache': True, 'rope_theta': 500000, 'rope_scaling': {'factor': 8.000000001, 'type': 'linear'}, 'attention_bias': False, 'attention_dropout': 0, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': torch.bfloat16, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': False, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperatu

[NeMo W 2024-12-10 15:41:07 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    


- model.layers.16.mlp.up_proj.weight
- model.layers.16.mlp.down_proj.weight
- model.layers.16.input_layernorm.weight
- model.layers.16.post_attention_layernorm.weight
- model.layers.17.self_attn.q_proj.weight
- model.layers.17.self_attn.k_proj.weight
- model.layers.17.self_attn.v_proj.weight
- model.layers.17.self_attn.o_proj.weight
- model.layers.17.mlp.gate_proj.weight
- model.layers.17.mlp.up_proj.weight
- model.layers.17.mlp.down_proj.weight
- model.layers.17.input_layernorm.weight
- model.layers.17.post_attention_layernorm.weight
- model.layers.18.self_attn.q_proj.weight
- model.layers.18.self_attn.k_proj.weight
- model.layers.18.self_attn.v_proj.weight
- model.layers.18.self_attn.o_proj.weight
- model.layers.18.mlp.gate_proj.weight
- model.layers.18.mlp.up_proj.weight
- model.layers.18.mlp.down_proj.weight
- model.layers.18.input_layernorm.weight
- model.layers.18.post_attention_layernorm.weight
- model.layers.19.self_attn.q_proj.weight
- model.layers.19.self_attn.k_proj.weight
-

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-12-10 15:41:08 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
    
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:1

nemo_config: {'mcore_gpt': True, 'micro_batch_size': 4, 'global_batch_size': 8, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'encoder_seq_length': 131072, 'max_position_embeddings': 131072, 'num_layers': 32, 'hidden_size': 4096, 'ffn_hidden_size': 14336, 'num_attention_heads': 32, 'init_method_std': 0.02, 'use_scaled_init_method': True, 'hidden_dropout': 0.0, 'attention_dropout': 0.0, 'ffn_dropout': 0.0, 'kv_channels': None, 'apply_query_key_layer_scaling': True, 'normalization': 'rmsnorm', 'layernorm_epsilon': 1e-05, 'do_layer_norm_weight_decay': False, 'make_vocab_size_divisible_by': 128, 'pre_process': True, 'post_process': True, 'persist_layer_norm': True, 'bias': False, 'activation': 'fast-swiglu', 'headscale': False, 'transformer_block_type': 'pre_ln', 'openai_gelu': False, 'normalize_attention_scores': True, 'position_embedding_type': 'rope', 'rotary_percentage': 1.0, 'attention_type': 'multihead', 'share_embed

[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:41:11 megatron_base_model:1182] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its c

[NeMo I 2024-12-10 15:41:11 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: models/llama-3.1/8B/hf


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-12-10 15:41:11 megatron_base_model:595] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo I 2024-12-10 15:41:44 dist_ckpt_io:421] Using TorchDistSaveShardedStrategy(torch_dist, 1) dist-ckpt save strategy.


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).


[NeMo I 2024-12-10 15:42:35 convert_llama_hf_to_nemo:307] NeMo model saved to: models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo


In [6]:
%%bash

PRECISION=bf16
NeMo_MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"

file "$NeMo_MODEL"
du -sh "$NeMo_MODEL"

models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo: POSIX tar archive
22G	models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo
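
Because `file` reports the `.nemo` checkpoint as a POSIX tar archive, its contents can be listed with Python's standard `tarfile` module without extracting it. This is a minimal sketch; `list_archive` is a hypothetical helper, and the member names inside a real checkpoint are not shown in this notebook.

```python
import os
import tarfile


def list_archive(path, limit=10):
    """Return up to `limit` (name, size_in_bytes) pairs from a tar archive."""
    entries = []
    with tarfile.open(path, "r") as tar:  # mode "r" auto-detects compression
        for member in tar:
            entries.append((member.name, member.size))
            if len(entries) >= limit:
                break
    return entries


nemo_path = "models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo"
if os.path.exists(nemo_path):  # only runs once the conversion has finished
    for name, size in list_archive(nemo_path):
        print(f"{size:>14,d}  {name}")
```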


## Step-by-Step Instructions

This notebook is organized into five main steps:

1. **Prepare the Dataset:**
   Load and preprocess the PubMedQA dataset, ensuring that it’s correctly formatted and ready for fine-tuning.

2. **Run the PEFT Fine-Tuning Script:**
   Apply Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning to tailor the Llama model to the PubMedQA domain.

3. **Perform Inference with the NeMo Framework:**
   Use the trained model to generate answers to biomedical questions and observe how it performs on real queries.

4. **Evaluate Model Accuracy:**
   Assess the quality and correctness of the model’s responses to measure improvements gained through the fine-tuning process.
   
5. **Export Model to TensorRT-LLM Format for Inference:**
   Use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM.

### Step 1: Prepare the dataset

Download the PubMedQA dataset and run the pre-processing script in the cloned directory.

In [9]:
%%bash

# Download the dataset and preprocessing scripts (skipped if already cloned)
[ -d pubmedqa ] || git clone https://github.com/pubmedqa/pubmedqa.git

# Split it into train/val/test datasets
cd pubmedqa/preprocess
python split_dataset.py pqal


The following example shows what a single row looks like in the PubMedQA train, validation, and test splits.

```json
"18251357": {
    "QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?",
    "CONTEXTS": [
        "To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother.",
        "A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection.",
        "Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)."
    ],
    "reasoning_required_pred": "yes",
    "reasoning_free_pred": "yes",
    "final_decision": "yes",
    "LONG_ANSWER": "Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent."
},
```

Use the following code to convert the train, validation, and test PubMedQA data into the `JSONL` format that NeMo needs for PEFT.

In [10]:
import json

def read_jsonl(fname):
    """Read a JSONL file into a list of objects (one JSON document per line)."""
    with open(fname, 'rt') as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(fname, json_objs):
    """Write each object as one JSON document per line."""
    with open(fname, 'wt') as f:
        for o in json_objs:
            f.write(json.dumps(o) + "\n")

def form_question(obj):
    """Build the prompt: labelled contexts, then the question and the answer cue."""
    st = ""
    for label, context in zip(obj['LABELS'], obj['CONTEXTS']):
        st += f"{label}: {context}\n"
    st += f"QUESTION: {obj['QUESTION']}\n"
    st += " ### ANSWER (yes|no|maybe): "
    return st

def convert_to_jsonl(data_path, output_path):
    """Convert a PubMedQA JSON split into NeMo-style input/output JSONL records."""
    with open(data_path, 'rt') as f:
        data = json.load(f)
    json_objs = []
    for obj in data.values():
        prompt = form_question(obj)
        completion = obj['final_decision']
        # Wrap the label in <<< >>> so tuned-model answers can be recognized later
        json_objs.append({"input": prompt, "output": f"<<< {completion} >>>"})
    write_jsonl(output_path, json_objs)
    return json_objs


test_json_objs = convert_to_jsonl("pubmedqa/data/test_set.json", "pubmedqa/data/pubmedqa_test.jsonl")
train_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/train_set.json", "pubmedqa/data/pubmedqa_train.jsonl")
dev_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/dev_set.json", "pubmedqa/data/pubmedqa_val.jsonl")

> `Note:` In the output, we wrap the answer in “<<<” and “>>>” markers. The markers make it possible to verify during inference that a response comes from the LoRA-tuned model, because the base model can also produce “yes” / “no” responses from zero-shot templates.
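
One way to use the markers at inference time is to extract the label with a regular expression and treat a response without markers as unparsed. This is a sketch of that idea, not the evaluation code of this notebook; `extract_answer` is a hypothetical helper.

```python
import re

# Matches "<<< yes >>>", "<<<no>>>", and similar, capturing the label
ANSWER_PATTERN = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>")


def extract_answer(generated_text):
    """Return the label between the markers, or None if the markers are missing."""
    match = ANSWER_PATTERN.search(generated_text)
    return match.group(1) if match else None


print(extract_answer(" ### ANSWER (yes|no|maybe): <<< yes >>>"))  # yes
print(extract_answer("Yes, the evidence suggests so."))            # None
```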

After running the above script, the files `pubmedqa_train.jsonl`, `pubmedqa_val.jsonl`, and `pubmedqa_test.jsonl` appear in the `pubmedqa/data` directory.

Each line of these files is one JSON record. An abridged example is shown below; the exact label names and field order inside `input` are those produced by `form_question`:

```json
{"input": "QUESTION: Failed IUD insertions in community practice: an under-recognized problem?\nCONTEXT: The data analysis was conducted to describe the rate of unsuccessful copper T380A intrauterine device (IUD) insertions among women using the IUD for emergency contraception (EC) at community family planning clinics in Utah.\n ...  ### ANSWER (yes|no|maybe): ",
"output": "<<< yes >>>"}
```
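
Before training, it can be worth checking that every record has the two expected keys and seeing how the labels are distributed. The helper below is a hypothetical sketch of such a check and relies only on the `input`/`output` record layout shown above.

```python
import json
import os
from collections import Counter


def summarize_jsonl(path):
    """Check each record has 'input' and 'output'; return a count of output labels."""
    counts = Counter()
    with open(path, "rt") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = {"input", "output"} - set(record)
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            counts[record["output"]] += 1
    return counts


train_path = "pubmedqa/data/pubmedqa_train.jsonl"
if os.path.exists(train_path):  # only runs once the conversion cell has been executed
    print(summarize_jsonl(train_path))
```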


In [7]:
%%bash

# Remove cached memory-map index files, if any
rm -f pubmedqa/data/*idx*

wc -l pubmedqa/data/pubmedqa_train.jsonl
wc -l pubmedqa/data/pubmedqa_val.jsonl
wc -l pubmedqa/data/pubmedqa_test.jsonl

450 pubmedqa/data/pubmedqa_train.jsonl
50 pubmedqa/data/pubmedqa_val.jsonl
500 pubmedqa/data/pubmedqa_test.jsonl



### Step 2: Run PEFT finetuning script for LoRA

The NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo is essentially just running this script!

For this demonstration, this training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
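
To get a feel for what the cap means, the arithmetic below uses the values from this notebook: 450 training samples (the `wc -l` output above) and, from the training command further down, a global batch size of 8, `max_steps=1000`, and `val_check_interval=20`. It is a rough sketch that ignores how the data loader handles the last partial batch.

```python
train_samples = 450        # lines in pubmedqa_train.jsonl
global_batch_size = 8      # model.global_batch_size
max_steps = 1000           # trainer.max_steps
val_check_interval = 20    # trainer.val_check_interval

steps_per_epoch = train_samples / global_batch_size
epochs_at_cap = max_steps / steps_per_epoch
validation_runs = max_steps // val_check_interval

print(f"steps per epoch:      {steps_per_epoch:.2f}")
print(f"epochs at max_steps:  {epochs_at_cap:.1f}")
print(f"validation runs:      {validation_runs}")
```

With these settings the cap corresponds to many passes over a small dataset, which is why stopping on a stalled validation loss matters here.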

> `NOTE:` In the block of code below, set the paths to your training and validation data files and to the `.nemo` model.

#### Understanding Global Batch Size (GBS) in Multi-GPU Training


##### **1. Global Batch Size (GBS)**
- **Definition:**
  - The **total number of training samples** processed in **one training step** (one parameter update) across **all GPUs** involved.

##### **2. Data Parallelism (DP)**
- **Definition:**
  - The **number of model replicas** that train in parallel. Each replica spans $\text{TP} \times \text{PP}$ GPUs.
  - **Function:** Distributes different data batches to each replica simultaneously.
  - **DP formula:**
      $$
      \text{Data Parallelism (DP)} = \frac{\text{Total GPUs}}{\text{Tensor Parallelism (TP)} \times \text{Pipeline Parallelism (PP)}}
      $$


##### **3. Micro Batch Size (MB)**
- **Definition:**
  - The **number of samples** processed **per model replica** in a single forward/backward pass (per GPU when TP = PP = 1).

##### **4. GBS Formula**
$$
\text{Global Batch Size (GBS)} = \text{Data Parallelism (DP)} \times \text{Micro Batch Size (MB)} \times \text{Gradient Accumulation Steps (GAS)}
$$

- **GAS (Gradient Accumulation Steps):** The number of micro-batches over which each replica accumulates gradients before performing a parameter update.

##### **5. How to Set GBS**
1. **Determine Available GPUs:**
   - Total GPUs (e.g., 4 GPUs).
2. **Choose Data Parallelism (DP):**
   - DP follows from the model-parallel layout (e.g., TP = PP = 1 gives DP = 4).
3. **Set Micro Batch Size (MB):**
   - Based on GPU memory capacity (e.g., MB = 8).
4. **Calculate GBS:**
   - Use the formula to find GBS (e.g., with GAS = 1, GBS = 4 × 8 × 1 = 32).

##### **Best Practices**
- **Align GBS with DP and MB:**
  - Ensure GBS is a multiple of $\text{DP} \times \text{MB}$; the ratio is the number of gradient accumulation steps.
- **Monitor GPU Utilization:**
  - Use tools like `nvidia-smi` to ensure all GPUs are effectively utilized.
- **Adjust Batch Sizes as Needed:**
  - Optimize **MB** based on memory constraints and **GBS** to balance load.
- **Utilize Gradient Accumulation:**
  - When larger **GBS** is desired but constrained by memory.
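
The relationship between these quantities can be checked with a small helper. The sketch below uses the convention that DP counts model replicas, Total GPUs / (TP × PP), and that gradient accumulation is a separate factor, so that GBS = DP × MB × GAS. The function name `batch_layout` is ours; the divisibility checks reflect the usual requirement that the numbers divide evenly, stated here as an assumption rather than as NeMo's exact validation logic.

```python
def batch_layout(total_gpus, tp, pp, global_batch_size, micro_batch_size):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    model_parallel = tp * pp
    if total_gpus % model_parallel:
        raise ValueError("total_gpus must be divisible by TP * PP")
    dp = total_gpus // model_parallel
    per_step = dp * micro_batch_size  # samples per forward/backward pass, all replicas
    if global_batch_size % per_step:
        raise ValueError("global_batch_size must be divisible by DP * MB")
    return dp, global_batch_size // per_step


print(batch_layout(1, 1, 1, 8, 1))   # 8B settings used below: (1, 8)
print(batch_layout(4, 4, 1, 8, 1))   # 70B settings (4 GPUs, TP=4): (1, 8)
print(batch_layout(4, 1, 1, 32, 8))  # the "How to Set GBS" example: (4, 1)
```

With a global batch size of 8 and a micro batch size of 1 on a single replica, each parameter update accumulates gradients over 8 micro-batches.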


In [4]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Set paths to the model, train, validation and test sets.
PRECISION=bf16
MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
rm -rf "$OUTPUT_DIR"

TRAIN_DS="[pubmedqa/data/pubmedqa_train.jsonl]"
VALID_DS="[pubmedqa/data/pubmedqa_val.jsonl]"

SCHEME="lora"
GPUS=1       # set equal to 4 for 70B model
TP_SIZE=1    # set equal to 4 for 70B model
PP_SIZE=1

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    trainer.val_check_interval=20 \
    trainer.max_steps=1000 \
    model.megatron_amp_O2=False \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.global_batch_size=8 \
    model.micro_batch_size=1 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=10 \
    model.data.validation_ds.num_workers=10 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-12-10 15:46:42 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-10 15:46:42 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 1000
      log_every_n_steps: 10
      val_check_interval: 20
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: results/llama-3.1/8B/bf16
      exp_dir: results/llama-3.1/8B/bf16
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        s

[NeMo W 2024-12-10 15:46:42 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


[NeMo I 2024-12-10 15:46:42 exp_manager:396] ExpManager schema
[NeMo I 2024-12-10 15:46:42 exp_manager:397] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo E 2024-12-10 15:46:42 exp_manager:830] exp_manager received explicit_log_dir: results/llama-3.1/8B/bf16 and at least one of exp_dir: results/llama-3.1/8B/bf16, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-12-10 15:46:42 exp_manager:757] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :results/llama-3.1/8B/bf16/checkpoints. Training from scratch.


[NeMo I 2024-12-10 15:46:42 exp_manager:455] Experiments will be logged at results/llama-3.1/8B/bf16
[NeMo I 2024-12-10 15:46:42 exp_manager:983] TensorboardLogger has been set up


[NeMo W 2024-12-10 15:46:42 exp_manager:1111] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() d

[NeMo I 2024-12-10 15:46:59 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-10 15:46:59 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-10 15:46:59 megatron_init:280] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-12-10 15:46:59 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-10 15:46:59 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-10 15:46:59 megatron_init:294] All context parallel group ranks: [[0]]
[NeMo I 2024-12-10 15:46:59 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-10 15:46:59 megatron_init:302] Rank 0 has model parallel group: [0]
[NeMo I 2024-12-10 15:46:59 megatron_init:303] All model parallel group ranks: [[0]]
[NeMo I 2024-12-10 15:46:59 megatron_init:312] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-12-10 15:46:59 megatron_init:316] All tensor model parallel group ranks: 

[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:46:59 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 15:46:59 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-12-10 15:47:01 megatron_base_model:595] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-12-10 15:47:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:47:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:47:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:47:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 15:47:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 15:47:34 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /work/ucloud-workshop-11-12-2024/models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo.
[NeMo I 2024-12-10 15:47:34 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-12-10 15:47:34 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-12-10 15:47:35 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    10.5 M    Trainable params
    8.0 B

[NeMo W 2024-12-10 15:47:35 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-10 15:47:35 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-10 15:47:35 megatron_gpt_sft_model:801] Building GPT SFT validation datasets.
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.067935
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.068146
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_val.jsonl
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001661
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-10 15:47:35 megatron_gpt_sft_model:805] Length of val dataset: 50
[NeMo I 2024-12-10 15:47:35 megatron_gpt_sft_model:812] Building GPT SFT traing datasets.
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.067487
[NeMo I 2024-12-10 15:47:35 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 15:47:36 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.067611
[NeMo I 2024-12-10 15:47:36 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 15:47:36 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_train.jsonl
[NeMo I 2024-12-10 15:47:36 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001484
[NeMo I 2024-12-10 15:47:36 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-12-10 15:47:38 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.06 (sec)
[NeMo I 2024-12-10 15:47:38 megatron_gpt_sft_model:814] Length of train dataset: 8040
[NeMo I 2024-12-10 15:47:38 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0
[NeMo I 2024-12-10 15:47:38 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]


[NeMo I 2024-12-10 15:47:38 nlp_overrides:268] Configuring DDP for model parallelism.


[NeMo W 2024-12-10 15:47:38 megatron_base_model:1223] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 1000.


[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 15:47:38 adapter_mixins:495] Unfrozen adapter : lora_kqv_


  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 8.0 B  | train
-------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2024-12-10 15:47:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch 

Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-12-10 15:49:06 num_microbatches_calculator:119] setting number of micro-batches to constant 8
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:03<00:00,  0.55it/s][NeMo I 2024-12-10 15:49:09 num_microbatches_calculator:119] setting number of micro-batches to constant 8


[NeMo W 2024-12-10 15:49:09 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 15:49:09 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 15:49:09 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    

Epoch 0: :   2%|▏         | 20/1000 [00:20<16:27, reduced_train_loss=5.410, global_step=19.00, consumed_samples=160.0, train_step_timing in s=0.741]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:50:57 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.28it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A
Validation DataLoader 0:  43%|████▎     | 3/7 [00:01<00:01,  2.05it/s][A
Validation DataLoader 0:  57%|█████▋    | 4/7 [00:01<00:01,  2.16it/s][A
Validation DataLoader 0:  71%|███████▏  | 5/7 [00:02<00:00,  2.23it/s][A
Validation DataLoader 0:  86%|████████▌ | 6/7 [00:02<00:00,  2.26it/s][A
Validation DataLoader 0: 100%|██████████| 7/7 [00:03<00:00,  2.30it/s][A[NeMo I 2024-12-10 15:51:00 num_microbatches_calculator:119

Metric val_loss improved. New best score: 5.121
Epoch 0, global step 20: 'validation_loss' reached 5.12138 (best 5.12138), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=5.121-step=20-consumed_samples=160.0.ckpt' as top 1
[NeMo W 2024-12-10 15:51:01 nlp_overrides:609] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :   4%|▍         | 40/1000 [00:39<15:37, reduced_train_loss=0.371, global_step=39.00, consumed_samples=320.0, train_step_timing in s=0.754, val_loss=5.120]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:51:16 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.28it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A
Validation DataLoader 0:  43%|████▎     | 3/7 [00:01<00:01,  2.43it/s][A
Validation DataLoader 0:  57%|█████▋    | 4/7 [00:01<00:01,  2.46it/s][A
Validation DataLoader 0:  71%|███████▏  | 5/7 [00:02<00:00,  2.48it/s][A
Validation DataLoader 0:  86%|████████▌ | 6/7 [00:02<00:00,  2.47it/s][A
Validation DataLoader 0: 100%|██████████| 7/7 [00:02<00:00,  2.48it/s][A[NeMo I 2024-12-10 15:51:19 num_microbatche

Metric val_loss improved by 4.808 >= min_delta = 0.001. New best score: 0.313
Epoch 0, global step 40: 'validation_loss' reached 0.31326 (best 0.31326), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=40-consumed_samples=320.0.ckpt' as top 1


Epoch 0: :   4%|▍         | 40/1000 [00:41<16:45, reduced_train_loss=0.371, global_step=39.00, consumed_samples=320.0, train_step_timing in s=0.754, val_loss=0.313][NeMo I 2024-12-10 15:51:19 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=5.121-step=20-consumed_samples=160.0.ckpt
[NeMo I 2024-12-10 15:51:20 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=5.121-step=20-consumed_samples=160.0-last.ckpt
Epoch 0: :   6%|▌         | 60/1000 [00:57<14:53, reduced_train_loss=0.243, global_step=59.00, consumed_samples=480.0, train_step_timing in s=0.752, val_loss=0.313]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:51:34 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A

Metric val_loss improved by 0.024 >= min_delta = 0.001. New best score: 0.289
Epoch 0, global step 60: 'validation_loss' reached 0.28923 (best 0.28923), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.289-step=60-consumed_samples=480.0.ckpt' as top 1


Epoch 0: :   6%|▌         | 60/1000 [00:59<15:39, reduced_train_loss=0.243, global_step=59.00, consumed_samples=480.0, train_step_timing in s=0.752, val_loss=0.289][NeMo I 2024-12-10 15:51:37 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=40-consumed_samples=320.0.ckpt
[NeMo I 2024-12-10 15:51:38 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.313-step=40-consumed_samples=320.0-last.ckpt
Epoch 0: :   8%|▊         | 80/1000 [01:15<14:27, reduced_train_loss=0.225, global_step=79.00, consumed_samples=640.0, train_step_timing in s=0.737, val_loss=0.289]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:51:53 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A

Metric val_loss improved by 0.027 >= min_delta = 0.001. New best score: 0.262
Epoch 0, global step 80: 'validation_loss' reached 0.26176 (best 0.26176), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0.ckpt' as top 1


Epoch 0: :   8%|▊         | 80/1000 [01:18<15:00, reduced_train_loss=0.225, global_step=79.00, consumed_samples=640.0, train_step_timing in s=0.737, val_loss=0.262][NeMo I 2024-12-10 15:51:56 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.289-step=60-consumed_samples=480.0.ckpt
[NeMo I 2024-12-10 15:51:56 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.289-step=60-consumed_samples=480.0-last.ckpt
Epoch 0: :  10%|█         | 100/1000 [01:33<14:01, reduced_train_loss=0.217, global_step=99.00, consumed_samples=800.0, train_step_timing in s=0.735, val_loss=0.262]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:52:11 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A


Epoch 0, global step 100: 'validation_loss' was not in top 1


Epoch 0: :  10%|█         | 100/1000 [01:36<14:26, reduced_train_loss=0.217, global_step=99.00, consumed_samples=800.0, train_step_timing in s=0.735, val_loss=0.277][NeMo I 2024-12-10 15:52:14 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0-last.ckpt
Epoch 0: :  12%|█▏        | 120/1000 [01:51<13:36, reduced_train_loss=0.454, global_step=119.0, consumed_samples=960.0, train_step_timing in s=0.756, val_loss=0.277]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:52:29 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.40it/s][A

Epoch 0, global step 120: 'validation_loss' was not in top 1


Epoch 0: :  12%|█▏        | 120/1000 [01:54<13:57, reduced_train_loss=0.454, global_step=119.0, consumed_samples=960.0, train_step_timing in s=0.756, val_loss=0.395][NeMo I 2024-12-10 15:52:32 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.277-step=100-consumed_samples=800.0-last.ckpt
Epoch 0: :  14%|█▍        | 140/1000 [02:09<13:13, reduced_train_loss=0.176, global_step=139.0, consumed_samples=1120.0, train_step_timing in s=0.737, val_loss=0.395]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:52:47 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.27it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.40it/s][A

Epoch 0, global step 140: 'validation_loss' was not in top 1


Epoch 0: :  14%|█▍        | 140/1000 [02:12<13:31, reduced_train_loss=0.176, global_step=139.0, consumed_samples=1120.0, train_step_timing in s=0.737, val_loss=0.333][NeMo I 2024-12-10 15:52:49 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.395-step=120-consumed_samples=960.0-last.ckpt
Epoch 0: :  16%|█▌        | 160/1000 [02:27<12:53, reduced_train_loss=0.226, global_step=159.0, consumed_samples=1280.0, train_step_timing in s=0.741, val_loss=0.333] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:53:05 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A

Epoch 0, global step 160: 'validation_loss' was not in top 1


Epoch 0: :  16%|█▌        | 160/1000 [02:30<13:08, reduced_train_loss=0.226, global_step=159.0, consumed_samples=1280.0, train_step_timing in s=0.741, val_loss=0.294][NeMo I 2024-12-10 15:53:08 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.333-step=140-consumed_samples=1120.0-last.ckpt
Epoch 0: :  18%|█▊        | 180/1000 [02:45<12:33, reduced_train_loss=0.229, global_step=179.0, consumed_samples=1440.0, train_step_timing in s=0.740, val_loss=0.294]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:53:23 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.40it/s][A

Epoch 0, global step 180: 'validation_loss' was not in top 1


Epoch 0: :  18%|█▊        | 180/1000 [02:48<12:45, reduced_train_loss=0.229, global_step=179.0, consumed_samples=1440.0, train_step_timing in s=0.740, val_loss=0.328][NeMo I 2024-12-10 15:53:26 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.294-step=160-consumed_samples=1280.0-last.ckpt
Epoch 0: :  20%|██        | 200/1000 [03:03<12:13, reduced_train_loss=0.240, global_step=199.0, consumed_samples=1600.0, train_step_timing in s=0.738, val_loss=0.328] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:53:41 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A

Epoch 0, global step 200: 'validation_loss' was not in top 1


Epoch 0: :  20%|██        | 200/1000 [03:06<12:25, reduced_train_loss=0.240, global_step=199.0, consumed_samples=1600.0, train_step_timing in s=0.738, val_loss=0.290][NeMo I 2024-12-10 15:53:44 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.328-step=180-consumed_samples=1440.0-last.ckpt
Epoch 0: :  22%|██▏       | 220/1000 [03:21<11:53, reduced_train_loss=0.170, global_step=219.0, consumed_samples=1760.0, train_step_timing in s=0.735, val_loss=0.290]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:53:59 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A

Epoch 0, global step 220: 'validation_loss' was not in top 1


Epoch 0: :  22%|██▏       | 220/1000 [03:24<12:03, reduced_train_loss=0.170, global_step=219.0, consumed_samples=1760.0, train_step_timing in s=0.735, val_loss=0.280][NeMo I 2024-12-10 15:54:02 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.290-step=200-consumed_samples=1600.0-last.ckpt
Epoch 0: :  24%|██▍       | 240/1000 [03:39<11:34, reduced_train_loss=0.0792, global_step=239.0, consumed_samples=1920.0, train_step_timing in s=0.738, val_loss=0.280]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:54:17 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.30it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.40it/s][A

Epoch 0, global step 240: 'validation_loss' was not in top 1


Epoch 0: :  24%|██▍       | 240/1000 [03:42<11:43, reduced_train_loss=0.0792, global_step=239.0, consumed_samples=1920.0, train_step_timing in s=0.738, val_loss=0.284][NeMo I 2024-12-10 15:54:20 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.280-step=220-consumed_samples=1760.0-last.ckpt
Epoch 0: :  26%|██▌       | 260/1000 [03:57<11:14, reduced_train_loss=0.0281, global_step=259.0, consumed_samples=2080.0, train_step_timing in s=0.742, val_loss=0.284]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:54:34 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.41it/s][A

Epoch 0, global step 260: 'validation_loss' was not in top 1


Epoch 0: :  26%|██▌       | 260/1000 [03:59<11:22, reduced_train_loss=0.0281, global_step=259.0, consumed_samples=2080.0, train_step_timing in s=0.742, val_loss=0.494][NeMo I 2024-12-10 15:54:37 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.284-step=240-consumed_samples=1920.0-last.ckpt
Epoch 0: :  28%|██▊       | 280/1000 [04:15<10:56, reduced_train_loss=0.0584, global_step=279.0, consumed_samples=2240.0, train_step_timing in s=0.743, val_loss=0.494]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 15:54:53 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:00<00:02,  2.29it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:00<00:02,  2.42it/s][A

Monitored metric val_loss did not improve in the last 10 records. Best score: 0.262. Signaling Trainer to stop.
Epoch 0, global step 280: 'validation_loss' was not in top 1


Epoch 0: :  28%|██▊       | 280/1000 [04:18<11:03, reduced_train_loss=0.0584, global_step=279.0, consumed_samples=2240.0, train_step_timing in s=0.743, val_loss=0.394][NeMo I 2024-12-10 15:54:56 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.494-step=260-consumed_samples=2080.0-last.ckpt
Epoch 0: :  28%|██▊       | 280/1000 [04:18<11:04, reduced_train_loss=0.0584, global_step=279.0, consumed_samples=2240.0, train_step_timing in s=0.743, val_loss=0.394]


Restoring states from the checkpoint path at /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0.ckpt
Restored all states from the checkpoint at /work/ucloud-workshop-11-12-2024/results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0.ckpt
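
The end of the log shows training stopping early at step 280 and the step-80 checkpoint being restored. As a rough illustration of why that happens, the sketch below replays the validation losses reported in the log through a minimal patience counter. The values `patience=10` and `min_delta=0.001` are read from the log messages; the function `replay_early_stopping` is a hypothetical helper written for this example, not the NeMo or PyTorch Lightning implementation.

```python
# Validation losses as reported in the log, keyed by global step
# (validation ran every 20 steps in this run).
VAL_LOSSES = {
    20: 5.121, 40: 0.313, 60: 0.289, 80: 0.262, 100: 0.277,
    120: 0.395, 140: 0.333, 160: 0.294, 180: 0.328, 200: 0.290,
    220: 0.280, 240: 0.284, 260: 0.494, 280: 0.394,
}


def replay_early_stopping(losses, patience=10, min_delta=0.001):
    """Return (best_step, best_loss, stop_step) for a simple patience rule.

    Hypothetical helper: a record counts as an improvement only if it beats
    the best loss so far by at least `min_delta`; training stops after
    `patience` consecutive records without improvement.
    """
    best_step, best_loss, wait = None, float("inf"), 0
    for step, loss in sorted(losses.items()):
        if best_loss - loss >= min_delta:
            best_step, best_loss, wait = step, loss, 0
        else:
            wait += 1
            if wait >= patience:
                return best_step, best_loss, step
    return best_step, best_loss, None


best_step, best_loss, stop_step = replay_early_stopping(VAL_LOSSES)
print(f"best step: {best_step}, best loss: {best_loss}, stopped at: {stop_step}")
```

Under these assumptions the replay agrees with the log: the best score is 0.262 at step 80, and the ten following validation records (steps 100 to 280) do not improve on it.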


The fine-tuning run above creates a LoRA adapter: a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/.../checkpoints/`. We'll use it later for inference.
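
The log reports 10.5 M trainable parameters after the adapter is added, and the unfrozen adapters are all named `lora_kqv_adapter`, which suggests LoRA is applied to the fused query/key/value projection of each layer. The sketch below estimates that parameter count. It assumes the published Llama 3.1 8B dimensions (hidden size 4096, 32 layers, 32 attention heads, 8 key/value heads) and a LoRA rank of 32; the rank is not shown in this section, so treat it as an assumption. `lora_qkv_params` is a hypothetical helper for illustration.

```python
def lora_qkv_params(hidden, n_layers, n_heads, n_kv_heads, rank):
    """Count LoRA parameters when adapting only the fused QKV projection.

    Each adapted layer adds two low-rank matrices:
    A with shape (hidden, rank) and B with shape (rank, qkv_out).
    """
    head_dim = hidden // n_heads
    # Fused QKV output width with grouped-query attention.
    qkv_out = head_dim * (n_heads + 2 * n_kv_heads)
    per_layer = rank * (hidden + qkv_out)
    return n_layers * per_layer


# Assumed model dimensions and LoRA rank (see the note above).
trainable = lora_qkv_params(hidden=4096, n_layers=32, n_heads=32, n_kv_heads=8, rank=32)
print(f"{trainable:,} trainable parameters")
print(f"{trainable / 8.0e9:.2%} of the 8.0 B base parameters")
```

With these assumed values the estimate is 10,485,760 parameters, which is consistent with the 10.5 M reported in the log and shows that only a small fraction of the model is updated during LoRA fine-tuning.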

To further configure the run above:

* **A different PEFT technique**: The `model.peft.peft_scheme` parameter determines which technique is used. This run used LoRA, but NeMo Framework also supports other techniques, such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, to use P-tuning, set:

```bash
model.peft.peft_scheme="ptuning" # instead of "lora"
```

* **Tuning Llama-3.1 70B**: You will need 4xH100 GPUs. Provide the path to its .nemo checkpoint (created with download and conversion steps similar to those used earlier), and change the model parallelization settings so that the 70B model is distributed across the GPUs. When using more than one GPU, it is also recommended to run the fine-tuning script directly from a terminal instead of from Jupyter.
```bash
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=1
```

You can override many such configurations when running the script. The full set of available options is listed in the [NeMo Framework GitHub repository](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml).
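Before changing the parallelism overrides, it can help to check that they fit the number of GPUs. The sketch below is a hypothetical helper (`data_parallel_size` and `as_overrides` are not part of NeMo). It assumes the usual Megatron-style relation `GPUs = TP x PP x DP` with no context parallelism, and formats the two sizes as the `key=value` overrides shown above.

```python
def data_parallel_size(num_gpus: int, tp_size: int, pp_size: int) -> int:
    """Return the data-parallel size implied by the GPU count and TP/PP sizes.

    Assumption: world size = tensor parallel * pipeline parallel * data parallel.
    """
    model_parallel = tp_size * pp_size
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split evenly into TP={tp_size} x PP={pp_size}"
        )
    return num_gpus // model_parallel


def as_overrides(tp_size: int, pp_size: int) -> list[str]:
    """Format the parallelism sizes as command-line overrides."""
    return [
        f"model.tensor_model_parallel_size={tp_size}",
        f"model.pipeline_model_parallel_size={pp_size}",
    ]


# The 70B setting from the text: 4 GPUs, TP=4, PP=1 leaves one data-parallel replica.
print(data_parallel_size(num_gpus=4, tp_size=4, pp_size=1))
print(" ".join(as_overrides(4, 1)))
```

With a single GPU, as in the 8B run, both sizes stay at 1.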

### Step 3: Inference with NeMo Framework

You can also run text generation within the framework using a Python script. Note that this is intended for testing and validation; it is not a full-fledged deployment solution like NVIDIA NIM.

In [7]:
%%bash
# Check that the LORA model file exists

python -c "import torch; torch.cuda.empty_cache()"

PRECISION=bf16
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
ls -l $OUTPUT_DIR/checkpoints

total 286968
-rw-r--r--. 1 ucloud ucloud 125934382 Dec 10 15:51 megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0.ckpt
-rw-r--r--. 1 ucloud ucloud 125934382 Dec 10 15:54 megatron_gpt_peft_lora_tuning--validation_loss=0.394-step=280-consumed_samples=2240.0-last.ckpt
-rw-r--r--. 1 ucloud ucloud  41984000 Dec 10 15:54 megatron_gpt_peft_lora_tuning.nemo
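
The `.ckpt` file names in the listing above encode the validation loss and the training step. As a minimal sketch, assuming the naming pattern seen in this run (it may differ in other NeMo versions or configurations), the following hypothetical helper `parse_ckpt_name` extracts those values so that checkpoints can be compared programmatically.

```python
import re
from typing import NamedTuple, Optional


class CkptInfo(NamedTuple):
    val_loss: float
    step: int
    is_last: bool


# Pattern inferred from the file names in this run; treat it as an assumption.
_CKPT_RE = re.compile(
    r"validation_loss=(?P<loss>\d+\.\d+)"
    r"-step=(?P<step>\d+)"
    r"-consumed_samples=\d+(?:\.\d+)?"
    r"(?P<last>-last)?\.ckpt$"
)


def parse_ckpt_name(name: str) -> Optional[CkptInfo]:
    """Return the loss, step and '-last' flag encoded in a checkpoint name."""
    match = _CKPT_RE.search(name)
    if match is None:
        return None  # e.g. the packaged .nemo adapter
    return CkptInfo(float(match["loss"]), int(match["step"]), match["last"] is not None)


names = [
    "megatron_gpt_peft_lora_tuning--validation_loss=0.262-step=80-consumed_samples=640.0.ckpt",
    "megatron_gpt_peft_lora_tuning--validation_loss=0.394-step=280-consumed_samples=2240.0-last.ckpt",
    "megatron_gpt_peft_lora_tuning.nemo",
]
parsed = [info for info in map(parse_ckpt_name, names) if info is not None]
best = min(parsed, key=lambda info: info.val_loss)
print(best)
```

Here the checkpoint with the lowest validation loss comes from step 80, which matches the checkpoint restored at the end of the training log.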


In the code snippet below, the following configurations are worth noting: 

1. `model.restore_from_path`: the path to the `Llama-3_1-8B-Instruct.nemo` base model file.
2. `model.peft.restore_from_path`: the path to the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names`: the path to the `pubmedqa_test.jsonl` file.

If you have made any changes in model or experiment paths, please ensure they are configured correctly below.
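
A quick way to catch a misconfigured path before launching the generation script is to check that each file exists. This is a minimal sketch: `missing_paths` is a hypothetical helper, and the paths are the ones used in this notebook, relative to the working directory.

```python
from pathlib import Path


def missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the config keys whose path does not exist on disk."""
    return [key for key, value in paths.items() if not Path(value).exists()]


# Paths as used in this notebook; adjust them if you changed the layout.
config_paths = {
    "model.restore_from_path": "models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo",
    "model.peft.restore_from_path": "results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning.nemo",
    "model.data.test_ds.file_names": "pubmedqa/data/pubmedqa_test.jsonl",
}

for key in missing_paths(config_paths):
    print(f"Missing file for {key}: {config_paths[key]}")
```

If nothing is printed, all three files were found.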

In [15]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16
MODEL="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1/8B/$PRECISION"
TEST_DS="[pubmedqa/data/pubmedqa_test.jsonl]"
TEST_NAMES="[pubmedqa]"
SCHEME="lora"
GPUS=1
TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="$OUTPUT_DIR/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="pubmedQA_result_"

export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=1 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=3 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-12-10 16:10:06 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-10 16:10:06 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-12-10 16:10:06 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-12-10 16:10:24 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-10 16:10:24 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-10 16:10:24 megatron_init:280] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-12-10 16:10:24 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-10 16:10:24 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-10 16:10:24 megatron_init:294] All context parallel group ranks: [[0]]
[NeMo I 2024-12-10 16:10:24 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-10 16:10:24 megatron_init:302] Rank 0 has model parallel group: [0]
[NeMo I 2024-12-10 16:10:24 megatron_init:303] All model parallel group ranks: [[0]]
[NeMo I 2024-12-10 16:10:24 megatron_init:312] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-12-10 16:10:24 megatron_init:316] All tensor model parallel group ranks: 

[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:24 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 16:10:24 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-12-10 16:10:25 megatron_base_model:595] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-12-10 16:10:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 16:10:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 16:10:57 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /work/ucloud-workshop-11-12-2024/models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo.
[NeMo I 2024-12-10 16:10:57 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-12-10 16:10:59 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    10.5 M    Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,162.988Total estimated model params size 

[NeMo W 2024-12-10 16:10:59 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-10 16:10:59 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-10 16:10:59 megatron_gpt_sft_model:793] Building GPT SFT test datasets.
[NeMo I 2024-12-10 16:10:59 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 16:10:59 text_memmap_dataset:525] Processing 1 data files using 96 workers
[NeMo I 2024-12-10 16:11:01 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:01.866361
[NeMo I 2024-12-10 16:11:01 text_memmap_dataset:525] Processing 1 data files using 96 workers
[NeMo I 2024-12-10 16:11:02 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:01.881394
[NeMo I 2024-12-10 16:11:02 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 16:11:02 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_test.jsonl
[NeMo I 2024-12-10 16:11:02 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001769
[NeMo I 2024-12-10 16:11:02 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-10 16:11:02 megatron_gpt_sft_model:796] Length of test dataset: 500
[NeMo I 2

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo W 2024-12-10 16:11:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=191` in the `DataLoader` to improve performance.
    
[NeMo W 2024-12-10 16:11:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of micro-batches to constant 1


      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.as_tensor(
    


Testing DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   0%|          | 1/500 [00:07<1:00:32,  0.14it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   0%|          | 2/500 [00:07<31:14,  0.27it/s]  setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 3/500 [00:07<21:27,  0.39it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 4/500 [00:08<16:40,  0.50it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 5/500 [00:08<13:43,  0.60it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          |

[NeMo W 2024-12-10 16:13:30 megatron_gpt_sft_model:642] No training data found, reconfiguring microbatches based on validation batch sizes.


setting number of micro-batches to constant 1


[NeMo W 2024-12-10 16:13:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 16:13:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_pubmedqa', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 16:13:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    


Testing DataLoader 0: 100%|██████████| 500/500 [02:27<00:00,  3.39it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │    0.2418619990348816     │
│    test_loss_pubmedqa     │    0.2418619990348816     │
│         val_loss          │    0.2418619990348816     │
└───────────────────────────┴───────────────────────────┘


### Step 4: Check the model accuracy

Now that the results are in, let's read them and calculate the accuracy on the PubMedQA task. You can compare your accuracy with the public leaderboard at https://pubmedqa.github.io/.

Let's take a look at one of the predictions in the generated output file. The `pred` key indicates what was generated.

In [16]:
%%bash

tail -n 1 pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl

{"input": "OBJECTIVES: Outcome feedback is the process of learning patient outcomes after their care within the emergency department. We conducted a national survey of Canadian Royal College emergency medicine (EM) residents and program directors to determine the extent to which active outcome feedback and follow-up occurred. We also compared the perceived educational value of outcome feedback between residents and program directors.\nMETHODS: We distributed surveys to all Royal College-accredited adult and pediatric EM training programs using a modified Dillman method. We analyzed the data using student's t-test for continuous variables and Fisher's exact test for categorical variables.\nRESULTS: We received 210 completed surveys from 260 eligible residents (80.8%) and 21 of 24 program directors (87.5%) (overall 81.3%). Mandatory active outcome feedback was not present in any EM training program for admitted or discharged patients (0/21). Follow-up was performed electively by 89.4% of

Note that the model produces output in the specified format, such as `<<< no >>>`.
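
Plain substring checks can misfire on this format: for instance, `'no' in 'unknown'` is true. As a hedged alternative, the sketch below uses a word-boundary regular expression. `extract_label` is a hypothetical helper, not part of the PubMedQA or NeMo tooling, and unlike a fixed yes/no/maybe priority check it returns the first label that appears in the text.

```python
import re

# Match yes/no/maybe only as whole words, ignoring case.
_LABEL_RE = re.compile(r"\b(yes|no|maybe)\b", re.IGNORECASE)


def extract_label(pred: str, default: str = "maybe") -> str:
    """Return the first yes/no/maybe label in a prediction, else the default."""
    match = _LABEL_RE.search(pred)
    return match.group(1).lower() if match else default


for text in ["<<< no >>>", "<<< Yes >>>", "unknown"]:
    print(repr(text), "->", extract_label(text))
```

Falling back to `maybe` for unparseable output mirrors the fallback used in the evaluation cells of this notebook.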

The following snippet loads the generated output and calculates accuracy in comparison to the test set using the `evaluation.py` script included in the PubMedQA repo.

In [17]:
import json

# Load the generation output: one JSON record per line
answers = []
with open("pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl", "rt") as f:
    for line in f:
        if line.strip():  # skip blank lines
            answers.append(json.loads(line))

In [18]:
with open("./pubmedqa/data/test_set.json", "rt") as f:
    data_test = json.load(f)

In [19]:
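results = {}
sample_id = list(data_test.keys())

# Map each free-text prediction onto one of the three PubMedQA labels.
# Predictions are matched to test IDs by position, so both files must be in the same order.
for i, key in enumerate(sample_id):
    answer = answers[i]['pred']
    if 'yes' in answer:
        results[key] = 'yes'
    elif 'no' in answer:
        results[key] = 'no'
    elif 'maybe' in answer:
        results[key] = 'maybe'
    else:
        # Fall back to 'maybe' when no label is found
        print("Malformed answer: ", answer)
        results[key] = 'maybe'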

In [21]:
# Dump results in a format that can be ingested by PubMedQA evaluation file
FILENAME="pubmedqa-llama-3-8b-lora.json"
with open(FILENAME, "w") as f:
    json.dump(results, f)

# Evaluation
!cp $FILENAME ./pubmedqa/
!cd ./pubmedqa/ && python evaluation.py $FILENAME

Accuracy 0.552000
Macro-F1 0.237113


For the Llama-3-8B-Instruct model, you should see accuracy comparable to the following:
```
Accuracy 0.792000
Macro-F1 0.594778
```
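
To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and macro-F1 using their standard definitions. The internals of `evaluation.py` are not shown in this notebook, so `accuracy_and_macro_f1` is a stand-in rather than that script's code. The example data is hypothetical: 500 questions of which 276 are labeled `yes`, scored against a predictor that always answers `yes`.

```python
def accuracy_and_macro_f1(truth, preds, labels=("yes", "no", "maybe")):
    """Compute accuracy and the unweighted mean of per-class F1 scores."""
    pairs = list(zip(truth, preds))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical label split; only the share of 'yes' matters for this predictor.
truth = ["yes"] * 276 + ["no"] * 169 + ["maybe"] * 55
preds = ["yes"] * len(truth)

acc, macro_f1 = accuracy_and_macro_f1(truth, preds)
print(f"Accuracy {acc:.6f}")
print(f"Macro-F1 {macro_f1:.6f}")
```

A constant predictor on such a split gives accuracy 0.552 and macro-F1 0.237113. If your own run reports figures like these, it is worth inspecting the `pred` fields and the label-parsing step to check whether every answer is being mapped to the same class.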

## Export Model to TensorRT-LLM Format for Inference

In [12]:
from nemo.export.tensorrt_llm import TensorRTLLM

MODEL_DIR="models/llama-3.1/8B/trt_llm/bf16/tp_1"
MODEL_CKPT="models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo"
LORA_CKPT="results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

trt_llm_exporter = TensorRTLLM(
    model_dir=MODEL_DIR,
    lora_ckpt_list=[LORA_CKPT],
)

trt_llm_exporter.export(
    nemo_checkpoint_path=MODEL_CKPT,
    model_type="llama",
    n_gpus=1,
)

      trt_llm_exporter.export(
    
    
    
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
saving weights: 100%|██████████| 193/193 [00:22<00:00,  8.43it/s]


[12/10/2024-19:33:25] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


I1210 19:33:25.021300 140016166184064 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


[12/10/2024-19:33:25] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


I1210 19:33:25.022238 140016166184064 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


[12/10/2024-19:33:25] [TRT-LLM] [I] Set multi_block_mode to False.


I1210 19:33:25.022658 140016166184064 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.


[12/10/2024-19:33:25] [TRT-LLM] [I] Set paged_kv_cache to True.


I1210 19:33:25.023044 140016166184064 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.


[12/10/2024-19:33:25] [TRT-LLM] [I] Set tokens_per_block to 128.


I1210 19:33:25.023427 140016166184064 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.


[12/10/2024-19:33:25] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


W1210 19:33:25.023816 140016166184064 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


[12/10/2024-19:33:25] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



W1210 19:33:25.024170 140016166184064 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



[12/10/2024-19:33:53] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 35139, GPU 16556 (MiB)
[12/10/2024-19:33:57] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4312, GPU +1148, now: CPU 39584, GPU 17706 (MiB)
[12/10/2024-19:33:57] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[12/10/2024-19:33:57] [TRT-LLM] [I] Set nccl_plugin to None.


I1210 19:33:57.562298 140016166184064 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to None.


[12/10/2024-19:33:57] [TRT-LLM] [I] Set use_custom_all_reduce to True.


I1210 19:33:57.563257 140016166184064 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-19:33:57] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:33:57] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:33:57] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:33:57] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 19:33:57.751935 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.753167 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.753668 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.754108 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.754545 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.754979 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.755409 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.755842 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.756255 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.757529 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.757969 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.758401 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.758834 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.759242 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.759636 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.760049 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.760453 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.760867 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.761236 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.761660 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.763089 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.763520 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.763967 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.764358 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.764776 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.765185 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.765588 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.765981 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.766387 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.766775 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.767216 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 19:33:57.767641 140016166184064 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-19:33:57] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


I1210 19:33:57.768037 140016166184064 logger.py:92] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


[12/10/2024-19:33:57] [TRT] [W] Unused Input: position_ids
[12/10/2024-19:33:57] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-19:33:57] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-19:34:03] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-19:34:03] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[12/10/2024-19:34:09] [TRT] [I] Total Host Persistent Memory: 103744
[12/10/2024-19:34:09] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-19:34:09] [TRT] [I] Total Scratch Memory: 33565056
[12/10/2024-19:34:09] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 590 steps to complete.
[12/10/2024-19:34:09] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 20.819ms to assign 17 blocks to 590 nodes requiring 402658816 bytes.
[12/10/2024-19:34:09] [TRT] [

I1210 19:34:16.961966 140016166184064 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:19


[12/10/2024-19:34:16] [TRT-LLM] [I] Serializing engine to models/llama-3.1/8B/trt_llm/bf16/tp_1/rank0.engine...


I1210 19:34:16.967163 140016166184064 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1/8B/trt_llm/bf16/tp_1/rank0.engine...


[12/10/2024-19:34:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:09


I1210 19:34:26.120570 140016166184064 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:09
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set bert_attention_plugin to float16.


I1210 19:34:35.050756 140016166184064 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


I1210 19:34:35.051626 140016166184064 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


I1210 19:34:35.052032 140016166184064 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.


I1210 19:34:35.052399 140016166184064 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set identity_plugin to None.


I1210 19:34:35.052774 140016166184064 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.


I1210 19:34:35.053117 140016166184064 logger.py:92] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.


I1210 19:34:35.053474 140016166184064 logger.py:92] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set nccl_plugin to None.


I1210 19:34:35.053820 140016166184064 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set lookup_plugin to None.


I1210 19:34:35.054188 140016166184064 logger.py:92] [TRT-LLM] [I] Set lookup_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set lora_plugin to None.


I1210 19:34:35.055349 140016166184064 logger.py:92] [TRT-LLM] [I] Set lora_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.


I1210 19:34:35.055704 140016166184064 logger.py:92] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.


I1210 19:34:35.056076 140016166184064 logger.py:92] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set quantize_per_token_plugin to False.


I1210 19:34:35.056402 140016166184064 logger.py:92] [TRT-LLM] [I] Set quantize_per_token_plugin to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set quantize_tensor_plugin to False.


I1210 19:34:35.056735 140016166184064 logger.py:92] [TRT-LLM] [I] Set quantize_tensor_plugin to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set moe_plugin to float16.


I1210 19:34:35.057059 140016166184064 logger.py:92] [TRT-LLM] [I] Set moe_plugin to float16.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.


I1210 19:34:35.057392 140016166184064 logger.py:92] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set context_fmha to True.


I1210 19:34:35.057753 140016166184064 logger.py:92] [TRT-LLM] [I] Set context_fmha to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.


I1210 19:34:35.058078 140016166184064 logger.py:92] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set paged_kv_cache to True.


I1210 19:34:35.058415 140016166184064 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set remove_input_padding to True.


I1210 19:34:35.058729 140016166184064 logger.py:92] [TRT-LLM] [I] Set remove_input_padding to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set use_custom_all_reduce to True.


I1210 19:34:35.059078 140016166184064 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set multi_block_mode to False.


I1210 19:34:35.059432 140016166184064 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set enable_xqa to True.


I1210 19:34:35.059765 140016166184064 logger.py:92] [TRT-LLM] [I] Set enable_xqa to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.


I1210 19:34:35.060095 140016166184064 logger.py:92] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set tokens_per_block to 128.


I1210 19:34:35.060425 140016166184064 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set use_paged_context_fmha to False.


I1210 19:34:35.060770 140016166184064 logger.py:92] [TRT-LLM] [I] Set use_paged_context_fmha to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set use_fp8_context_fmha to False.


I1210 19:34:35.061083 140016166184064 logger.py:92] [TRT-LLM] [I] Set use_fp8_context_fmha to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.


I1210 19:34:35.061447 140016166184064 logger.py:92] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set multiple_profiles to False.


I1210 19:34:35.061794 140016166184064 logger.py:92] [TRT-LLM] [I] Set multiple_profiles to False.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set paged_state to True.


I1210 19:34:35.062124 140016166184064 logger.py:92] [TRT-LLM] [I] Set paged_state to True.


[12/10/2024-19:34:35] [TRT-LLM] [I] Set streamingllm to False.


I1210 19:34:35.062477 140016166184064 logger.py:92] [TRT-LLM] [I] Set streamingllm to False.


[12/10/2024-19:34:35] [TRT] [I] Loaded engine size: 15383 MiB
[12/10/2024-19:34:35] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 30761 (MiB)
[12/10/2024-19:34:35] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 30761 (MiB)
[12/10/2024-19:34:35] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.


W1210 19:34:35.973186 140016166184064 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.


[12/10/2024-19:34:35] [TRT-LLM] [I] Load engine takes: 13.958819389343262 sec


I1210 19:34:35.974622 140016166184064 logger.py:92] [TRT-LLM] [I] Load engine takes: 13.958819389343262 sec


In [None]:
%%bash -s "$token"

# Log in to Hugging Face so that AutoTokenizer can download the tokenizer
# of the gated Llama model by its pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Paths: TensorRT-LLM engine directory, base NeMo checkpoint, and trained LoRA adapter
PRECISION=bf16
MODEL_DIR="models/llama-3.1/8B/trt_llm/$PRECISION/tp_1"
mkdir -p "$MODEL_DIR"
MODEL_CKPT="models/llama-3.1/8B/nemo/$PRECISION/Llama-3_1-8B-Instruct.nemo"
LORA_CKPT="results/llama-3.1/8B/$PRECISION/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# Build the TensorRT-LLM engine and serve it with Triton on a single GPU
python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
    --nemo_checkpoint "$MODEL_CKPT" \
    --lora_ckpt "$LORA_CKPT" \
    --use_lora_plugin \
    --model_type llama \
    --triton_model_name llama3-pubmedqa \
    --triton_model_repository "$MODEL_DIR" \
    --num_gpus 1 \
    --tensor_parallelism_size 1 \
    --pipeline_parallelism_size 1

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
W1210 19:55:35.402386 140579951768704 logger.py:92] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage


[TensorRT-LLM] TensorRT-LLM version: 0.10.0


I1210 19:55:36.112548 140579951768704 deploy_triton.py:344] Logging level set to 20
I1210 19:55:36.112711 140579951768704 deploy_triton.py:345] Namespace(nemo_checkpoint='models/llama-3.1/8B/nemo/bf16/Llama-3_1-8B-Instruct.nemo', ptuning_nemo_checkpoint=None, task_ids=None, model_type='llama', triton_model_name='llama3-pubmedqa', triton_model_version=1, triton_port=8000, triton_http_address='0.0.0.0', triton_request_timeout=60, triton_model_repository='models/llama-3.1/8B/trt_llm/bf16/tp_1', num_gpus=1, tensor_parallelism_size=1, pipeline_parallelism_size=1, dtype='bfloat16', max_input_len=256, max_output_len=256, max_batch_size=8, max_num_tokens=None, opt_num_tokens=None, max_prompt_embedding_table_size=None, no_paged_kv_cache=False, disable_remove_input_padding=False, use_parallel_embedding=False, multi_block_mode=False, enable_streaming=False, use_lora_plugin=None, lora_target_modules=None, max_lora_rank=64, lora_ckpt=['results/llama-3.1/8B/bf16/checkpoints/megatron_gpt_peft_lora_tu

Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully


saving weights: 100%|██████████| 193/193 [00:28<00:00,  6.81it/s]
I1210 19:56:16.408668 140579951768704 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1210 19:56:16.408972 140579951768704 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1210 19:56:16.409007 140579951768704 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.
I1210 19:56:16.409029 140579951768704 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.
I1210 19:56:16.409046 140579951768704 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.
W1210 19:56:16.409075 140579951768704 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
W1210 19:56:16.409

[12/10/2024-19:56:47] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 34991, GPU 18482 (MiB)
[12/10/2024-19:56:51] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4312, GPU +1150, now: CPU 39439, GPU 19634 (MiB)
[12/10/2024-19:56:51] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.


I1210 19:56:51.692784 140579951768704 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to None.
I1210 19:56:51.692900 140579951768704 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-19:56:51] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:56:51] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:56:51] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-19:56:51] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 19:56:51.933476 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1210 19:56:51.933691 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1210 19:56:51.933785 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1210 19:56:51.933868 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1210 19:56:51.933943 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1210 19:56:51.934018 140579951768704 logger.py:92] [TRT-LLM] [I] Parameter tran

[12/10/2024-19:56:51] [TRT] [W] Unused Input: position_ids
[12/10/2024-19:56:51] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-19:56:51] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-19:56:57] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-19:56:57] [TRT] [I] Detected 14 inputs and 1 output network tensors.
[12/10/2024-19:57:05] [TRT] [I] Total Host Persistent Memory: 103744
[12/10/2024-19:57:05] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-19:57:05] [TRT] [I] Total Scratch Memory: 33565056
[12/10/2024-19:57:05] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 590 steps to complete.
[12/10/2024-19:57:05] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 26.6302ms to assign 17 blocks to 590 nodes requiring 402658816 bytes.
[12/10/2024-19:57:05] [TRT] 

I1210 19:57:13.314188 140579951768704 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:21
I1210 19:57:13.319579 140579951768704 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1/8B/trt_llm/bf16/tp_1/rank0.engine...
I1210 19:57:24.630646 140579951768704 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:11
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1210 19:57:33.846337 140579951768704 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.
I1210 19:57:33.846453 140579951768704 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1210 19:57:33.846481 140579951768704 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1210 19:57:33.846505 140579951768704 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
I1210 19:57:33.846524 140579951768704 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.
I1210 19:57:33.

[12/10/2024-19:57:33] [TRT] [I] Loaded engine size: 15383 MiB
[12/10/2024-19:57:34] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[12/10/2024-19:57:34] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)


W1210 19:57:34.818560 140579951768704 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
I1210 19:57:34.820907 140579951768704 logger.py:92] [TRT-LLM] [I] Load engine takes: 8.207995176315308 sec
I1210 19:57:35.950140 140579951768704 deploy_triton.py:377] Triton deploy function will be called.
I1210 19:57:35.952382 140579951768704 deploy_triton.py:384] Model serving on Triton is will be started.
      return _nested.nested_tensor(
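
The log above reports `Loaded engine size: 15383 MiB`. As a rough sanity check, this can be compared with the model-size formula from the beginning of the notebook, $M = (P \times 4B) / (32 / Q)$. The sketch below assumes a nominal parameter count of 8 billion (the exact count of the checkpoint is slightly different) and 16-bit weights for `bf16`. The function name `model_memory_bytes` is only illustrative.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def model_memory_bytes(num_params: float, bits: int) -> float:
    """Weights-only estimate: (P * 4 bytes) / (32 / Q)."""
    return num_params * 4 / (32 / bits)


# Assumption: nominal 8e9 parameters for the "8B" model, served in bf16 (16 bits).
estimate_mib = model_memory_bytes(8e9, 16) / MIB

# Value copied from the "Loaded engine size" line in the log above.
logged_engine_mib = 15383

relative_gap = abs(logged_engine_mib - estimate_mib) / logged_engine_mib

print(f"Estimated weights: {estimate_mib:.0f} MiB")
print(f"Logged engine:     {logged_engine_mib} MiB")
print(f"Relative gap:      {relative_gap:.2%}")
```

Under these assumptions the estimate and the logged engine size agree to within about one percent, which suggests the serialized engine is dominated by the bf16 weights. The estimate covers weights only; the KV cache and activations need additional GPU memory at serving time.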
    


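While the deployment cell above is serving the model, open a separate terminal to query it. The `-mol 5` option caps the response at 5 output tokens, which is enough for the short yes/no/maybe answers in PubMedQA: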

```shell
QUERY="Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?"

python /opt/NeMo/scripts/deploy/nlp/query.py \
    -mn llama3-pubmedqa \
    -p "$QUERY" \
    -mol 5
```
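
Because the query above caps the output at 5 tokens, the returned text may still contain punctuation, whitespace, or a few extra words around the answer. The following is a minimal sketch of a post-processing step that maps such text to one of the three PubMedQA labels. The helper `normalize_pubmedqa_answer` is hypothetical and not part of NeMo, and it assumes the model returns free text that contains the label as a separate word.

```python
import re
from typing import Optional

# The three answer labels used by PubMedQA.
_LABEL_PATTERN = re.compile(r"\b(yes|no|maybe)\b")


def normalize_pubmedqa_answer(generated: str) -> Optional[str]:
    """Return the first PubMedQA label found in the text, or None if absent."""
    match = _LABEL_PATTERN.search(generated.lower())
    return match.group(1) if match else None


# Hypothetical raw completions, used only to illustrate the helper.
for raw in [" Yes.", "maybe\n\n", "No, statins", "unknown"]:
    print(repr(raw), "->", normalize_pubmedqa_answer(raw))
```

Matching on word boundaries avoids false positives such as the "no" inside "unknown". Completions without a recognizable label return `None`, so they can be counted separately when computing accuracy.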