<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>NeMo Framework v24.07</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# Building a Llama-3.1 LoRA Adapter with the NeMo Framework

This notebook demonstrates how to apply Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) technique, to the [**Llama 3.1 70B**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) model using the NeMo Framework. We use [PubMedQA](https://pubmedqa.github.io/), a question-answering dataset derived from biomedical literature, to illustrate how LoRA adapters can efficiently improve model performance in a domain-specific context.
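
To make the efficiency argument concrete, here is a minimal sketch of the parameter arithmetic behind LoRA. For a frozen weight matrix of shape `d_out × d_in`, LoRA trains two small matrices of rank `r`, which adds `r × (d_in + d_out)` trainable parameters. The rank of 32 and the square 8192 × 8192 projection are illustrative assumptions, not values read from this notebook's configuration (8192 is the hidden size used for the 70B model later on).

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

d = 8192  # hidden size of the 70B model, as used later in this notebook
r = 32    # hypothetical LoRA rank, chosen only for illustration

full = d * d
lora = lora_param_count(d, d, r)
print(f"full matrix: {full:,} params, LoRA adapter: {lora:,} params ({lora / full:.2%})")
```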

**Disclaimer**: This notebook is adapted from the [NVIDIA NeMo tutorial on biomedical QA with Llama-3](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/biomedical-qa/llama3-lora-nemofw.ipynb).

## Estimating GPU Memory Requirements for Serving LLMs


### **1. Model Size**
Before you begin, it’s essential to understand how much GPU memory you’ll need to serve a large language model (LLM). A commonly used formula is:

$$
M_{\text{model}} = \frac{(P \times 4B)}{(32 / Q)}
$$

**Where:**

- **$M_{\text{model}}$**: The GPU memory required for the model weights (in bytes; divide by $10^9$ for gigabytes)  
- **P**: The number of parameters in the model (e.g., 7 billion parameters for a 7B model)  
- **4B**: 4 bytes, representing the size of each parameter at full precision (32 bits)  
- **32**: The number of bits in 4 bytes (32 bits)  
- **Q**: The model precision in bits used during serving (e.g., 16 bits, 8 bits, or 4 bits)  

**Explanation:**

- Start with $P \times 4B$ to get the base memory needed for all parameters at full precision (FP32).
- Divide by $(32/Q)$, which scales the memory requirement according to the lower-precision format you’re using. For example, loading a model in 16-bit precision effectively halves the memory usage compared to 32-bit.

#### **Example:**

For a 70B parameter model loaded in 8-bit precision:

- $P = 70 \times 10^9$ ($70$ billion)
- $Q = 8$

Plugging these in:

$$
M_{\text{model}} = \frac{(70 \times 10^9 \times 4B)}{(32 / 8)} 
= \frac{(280 \times 10^9 B)}{4} 
= 70 \times 10^9 B
$$

Convert bytes to gigabytes (1 GB = $10^9$ bytes):

$$
M = 70 \text{ GB}
$$

This rough calculation helps estimate the GPU memory needed for serving large models, ensuring you have the right hardware configuration before starting fine-tuning or inference steps.
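
As a quick check of the formula, the sketch below evaluates it for a 70B-parameter model at several precisions. The helper name `model_memory_gb` is our own, and 1 GB is taken as $10^9$ bytes, as in the example above.

```python
def model_memory_gb(num_params: float, bits: int) -> float:
    """Weight memory in GB: (P * 4 bytes) / (32 / Q), with 1 GB = 1e9 bytes."""
    return (num_params * 4) / (32 / bits) / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(70e9, bits):.0f} GB")
```

Halving the precision halves the weight memory: 280 GB at 32 bits, 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits.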

### **2. Context Window**

The **context window** refers to the maximum number of tokens (words or subwords) the model can process in a single inference pass. During inference, the model needs to store activations for each token in the input sequence. This storage requirement scales linearly with the length of the context window.

#### **Memory Calculation for Context Window**

$$
M_{\text{context}} = L \times H \times D \times N
$$

- **$M_{\text{context}}$**: Memory required for the context window (in bytes)
- **$L$**: Length of the context window (number of tokens)
- **$H$**: Hidden size (dimensionality of the model's hidden layers)
- **$D$**: Data type size (bytes per element, e.g., 2 for FP16)
- **$N$**: Number of transformer layers

#### **Example:**

Assume:
- **$L = 1024$** tokens
- **$H = 8192$** dimensions
- **$D = 1$** bytes (for INT8 precision)
- **$N = 80$** number of hidden layers

$$
M_{\text{context}} = 1024 \times 8192 \times 1 \times 80 = 671,088,640 \text{ bytes} \approx 671 \text{ MB}
$$

### **3. Batch Size**

**Batch size** determines how many input sequences the model processes simultaneously. Increasing the batch size can lead to higher GPU memory usage because the model needs to store activations for each sequence in the batch.

#### **Memory Calculation for Batch Size**

$$
M_{\text{batch}} = B \times M_{\text{context}}
$$

- **$M_{\text{batch}}$**: Additional memory required for batching (in Gigabytes)
- **$B$**: Batch size (number of sequences)
- **$M_{\text{context}}$**: Memory per sequence (from context window calculation)

#### **Example:**

Using the previous **$M_{\text{context}} =  671 \text{ MB}$** and a **batch size $B = 8$**:

$$
M_{\text{batch}} = 8 \times 671 \text{ MB} \approx 5.4 \text{ GB}
$$

### **4. Total Inference Memory Estimation**

Combining all these factors gives a more comprehensive estimate of the GPU memory required for inference:

$$
M_{\text{total}} = M_{\text{model}} + M_{\text{context}} \times B + M_{\text{overhead}}
$$

- **$M_{\text{total}}$**: Total GPU memory required (in Gigabytes)
- **$M_{\text{model}}$**: Memory for the model
- **$M_{\text{context}}$**: Memory per token sequence
- **$B$**: Batch size
- **$M_{\text{overhead}}$**: Additional overhead for operations like caching, temporary buffers, etc. (typically 10-20%)

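#### **Example:**

Using the previous results ($M_{\text{model}} = 70$ GB and $M_{\text{context}} \times B \approx 5.4$ GB) and assuming a 20% overhead on top of both:

$$
M_{\text{total}} \approx (70 + 5.4) \times 1.2 \text{ GB} \approx 90 \text{ GB}
$$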

In [1]:
from utils import estimate_gpu_memory

Q = 16  # 16-bit precision (bfloat16)
L = 1024  # Context window
B = 8  # Batch size

# Example usage for LLama-3.1 8B
P_8B = 8_000_000_000  # 8B parameters
H_8B = 4096  # Hidden size
N_8B = 32

estimated_memory_8B = estimate_gpu_memory(P_8B, Q, L, H_8B, B, N_8B)
print(f"Estimated GPU Memory Required for LLama-3 8B: {estimated_memory_8B:.2f} GB")

# Example usage for LLama-3.1 70B
P_70B = 70_000_000_000  # 70B parameters
H_70B = 8192  # Hidden size
N_70B = 80

estimated_memory_70B = estimate_gpu_memory(P_70B, Q, L, H_70B, B, N_70B)
print(f"Estimated GPU Memory Required for LLama-3 70B: {estimated_memory_70B:.2f} GB")

Estimated GPU Memory Required for LLama-3 8B: 21.78 GB
Estimated GPU Memory Required for LLama-3 70B: 180.88 GB
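
The `utils` module is not included in this notebook. The following is a minimal sketch of what `estimate_gpu_memory` could look like, assuming it combines the formulas above with a flat 20% overhead and stores activations at the same precision `Q` as the weights. Under those assumptions it reproduces the two numbers printed above, but the actual helper may differ.

```python
def estimate_gpu_memory(P, Q, L, H, B, N, overhead=0.20):
    """Rough inference memory estimate in GB (1 GB = 1e9 bytes).

    P: parameters, Q: precision in bits, L: context length,
    H: hidden size, B: batch size, N: transformer layers.
    The 20% overhead default is an assumption, not a documented value.
    """
    bytes_per_element = Q / 8
    model_bytes = (P * 4) / (32 / Q)               # model weights
    context_bytes = L * H * bytes_per_element * N  # activations per sequence
    total_bytes = (model_bytes + B * context_bytes) * (1 + overhead)
    return total_bytes / 1e9

print(f"{estimate_gpu_memory(8_000_000_000, 16, 1024, 4096, 8, 32):.2f} GB")
print(f"{estimate_gpu_memory(70_000_000_000, 16, 1024, 8192, 8, 80):.2f} GB")
```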


## Download the Model
Before you begin, ensure you have a local copy of the Llama-3.1-Nemotron-70B-Instruct model. If you haven’t already downloaded it, you can obtain it from the official [Hugging Face repository](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct) with the cells below. All subsequent steps in the notebook depend on this download.

In [2]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

Password(description='Hugging Face Token:')

In [3]:
token = pwd.value
hf_model="nvidia/Llama-3.1-Nemotron-70B-Instruct"
hf_model_path="models/llama-3.1-nemotron/70B/hf"
snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

Fetching 3711 files:   0%|          | 0/3711 [00:00<?, ?it/s]

'/work/ucloud-workshop-11-12-2024/models/llama-3.1-nemotron/70B/hf'

In [4]:
%%bash -s "$hf_model_path"

ls $1
du -sh $1

README.md
model_config.yaml
model_weights
132G	models/llama-3.1-nemotron/70B/hf


## Convert the Model to NeMo Format

In [4]:
%%bash

# Define paths
HF_MODEL="models/llama-3.1-nemotron/70B/hf"
NeMo_MODEL="models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"

# List the contents of the Hugging Face model directory
echo "Listing contents of $HF_MODEL:"
ls -l "$HF_MODEL"

# Check if the NeMo_MODEL archive already exists
if [ ! -f "$NeMo_MODEL" ]; then
    echo "NeMo archive not found. Creating archive: $NeMo_MODEL"
    
    # Ensure the destination directory exists
    mkdir -p "$(dirname "$NeMo_MODEL")"
    
    # Create the .nemo archive using tar; -C stores both entries at the archive root
    tar cf "$NeMo_MODEL" -C "$HF_MODEL" model_config.yaml model_weights
    
    if [ $? -eq 0 ]; then
        echo "NeMo archive created successfully at $NeMo_MODEL."
    else
        echo "Error: Failed to create NeMo archive."
        exit 1
    fi
else
    echo "NeMo archive already exists at $NeMo_MODEL. Skipping creation."
fi

Listing contents of models/llama-3.1-nemotron/70B/hf:
total 16
-rw-r--r--. 1 ucloud ucloud 9770 Dec 10 11:21 README.md
-rw-r--r--. 1 ucloud ucloud 2936 Dec 10 11:21 model_config.yaml
drwxr-xr-x. 1 ucloud ucloud    0 Dec 10 11:32 model_weights
NeMo archive already exists at models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo. Skipping creation.


In [5]:
%%bash

NeMo_MODEL="models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"

file "$NeMo_MODEL"
du -sh "$NeMo_MODEL"

models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo: POSIX tar archive (GNU)
132G	models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo
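
A `.nemo` file is a plain tar archive, so its layout can also be checked from Python with the standard `tarfile` module. The sketch below lists the top-level entries without extracting the archive; the function name `top_level_entries` is our own, and the path is the one used in the cells above.

```python
import os
import tarfile

def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a tar archive."""
    names = set()
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            # Reduce "./model_config.yaml" and nested paths to their first component.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)

nemo_model = "models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"
if os.path.exists(nemo_model):
    print(top_level_entries(nemo_model))
```

Given the `tar` command above, the entries of interest are `model_config.yaml` and `model_weights`.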


##  Step-by-Step Instructions

This notebook is organized into five main steps:

1. **Prepare the Dataset:**
   Load and preprocess the PubMedQA dataset, ensuring that it’s correctly formatted and ready for fine-tuning.

2. **Run the PEFT Fine-Tuning Script:**
   Apply Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning methods to tailor the Llama 3.1 70B model to the PubMedQA domain.

3. **Perform Inference with the NeMo Framework:**
   Use the trained model to generate answers to biomedical questions and observe how it performs on real queries.

4. **Evaluate Model Accuracy:**
   Assess the quality and correctness of the model’s responses to measure improvements gained through the fine-tuning process.
   
5. **Export Model to TensorRT-LLM Format for Inference:**
   Use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM.

### Step 1: Prepare the dataset

Download the PubMedQA dataset and run the pre-processing script in the cloned directory.

In [5]:
%%bash

# Download the dataset and prep. scripts
git clone https://github.com/pubmedqa/pubmedqa.git

# split it into train/val/test datasets
cd pubmedqa/preprocess
python split_dataset.py pqal

fatal: destination path 'pubmedqa' already exists and is not an empty directory.


The following example shows what a single row looks like in the PubMedQA train, validation, and test splits.

```json
"18251357": {
    "QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?",
    "CONTEXTS": [
        "To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother.",
        "A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection.",
        "Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)."
    ],
    "reasoning_required_pred": "yes",
    "reasoning_free_pred": "yes",
    "final_decision": "yes",
    "LONG_ANSWER": "Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent."
},
```

Use the following code to convert the train, validation, and test PubMedQA data into the `JSONL` format that NeMo needs for PEFT.

In [6]:
import json

def read_jsonl(fname):
    """Read a JSONL file and return its records as a list."""
    with open(fname, 'rt') as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(fname, json_objs):
    """Write one JSON object per line."""
    with open(fname, 'wt') as f:
        for o in json_objs:
            f.write(json.dumps(o) + "\n")

def form_question(obj):
    """Build the prompt: labeled contexts, then the question, then the answer cue."""
    st = ""
    for label, context in zip(obj['LABELS'], obj['CONTEXTS']):
        st += f"{label}: {context}\n"
    st += f"QUESTION: {obj['QUESTION']}\n"
    st += " ### ANSWER (yes|no|maybe): "
    return st

def convert_to_jsonl(data_path, output_path):
    """Convert a PubMedQA JSON split into input/output records and write them as JSONL."""
    with open(data_path, 'rt') as f:
        data = json.load(f)
    json_objs = []
    for obj in data.values():
        prompt = form_question(obj)
        completion = obj['final_decision']
        # Wrap the label in <<< >>> so tuned outputs can be told apart from zero-shot ones
        json_objs.append({"input": prompt, "output": f"<<< {completion} >>>"})
    write_jsonl(output_path, json_objs)
    return json_objs


test_json_objs = convert_to_jsonl("pubmedqa/data/test_set.json", "pubmedqa/data/pubmedqa_test.jsonl")
train_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/train_set.json", "pubmedqa/data/pubmedqa_train.jsonl")
dev_json_objs = convert_to_jsonl("pubmedqa/data/pqal_fold0/dev_set.json", "pubmedqa/data/pubmedqa_val.jsonl")

> `Note:` In the output, we enforce the inclusion of the “<<<” and “>>>” markers, which make it possible to verify during inference that the LoRA-tuned model is being used. This is because the base model can also produce “yes” / “no” responses from zero-shot templates.
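
The markers also make the answer easy to extract programmatically. A minimal sketch, assuming the generated text contains the label between `<<<` and `>>>`; the helper name `extract_answer` and the sample strings are hypothetical:

```python
import re

ANSWER_PATTERN = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>", re.IGNORECASE)

def extract_answer(generated_text):
    """Return 'yes', 'no' or 'maybe' if a marked answer is present, else None."""
    match = ANSWER_PATTERN.search(generated_text)
    return match.group(1).lower() if match else None

print(extract_answer(" ### ANSWER (yes|no|maybe): <<< yes >>>"))  # yes
print(extract_answer("Yes, the two findings correspond."))        # None
```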

After running the above script, the files `pubmedqa_train.jsonl`, `pubmedqa_val.jsonl`, and `pubmedqa_test.jsonl` appear in the data directory.

An example record looks like this after conversion to `JSONL`. The example is abbreviated and illustrative; in the files produced by `form_question`, the labeled contexts come before the question:

```json
{"input": "QUESTION: Failed IUD insertions in community practice: an under-recognized problem?\nCONTEXT: The data analysis was conducted to describe the rate of unsuccessful copper T380A intrauterine device (IUD) insertions among women using the IUD for emergency contraception (EC) at community family planning clinics in Utah.\n ...  ### ANSWER (yes|no|maybe): ",
"output": "<<< yes >>>"}
```
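
Before training, it can be worth checking that every line of the converted files is well formed. The sketch below is one way to do that; `validate_jsonl` is a hypothetical helper, and the set of allowed labels is taken from the `yes|no|maybe` prompt above.

```python
import json
import os
from collections import Counter

ALLOWED_OUTPUTS = {f"<<< {label} >>>" for label in ("yes", "no", "maybe")}

def validate_jsonl(path):
    """Check each record has an 'input' and a marked 'output'; return label counts."""
    counts = Counter()
    with open(path, "rt") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            if not record.get("input"):
                raise ValueError(f"{path}:{line_no}: missing or empty 'input'")
            if record.get("output") not in ALLOWED_OUTPUTS:
                raise ValueError(f"{path}:{line_no}: unexpected output {record.get('output')!r}")
            counts[record["output"]] += 1
    return counts

for split in ("train", "val", "test"):
    path = f"pubmedqa/data/pubmedqa_{split}.jsonl"
    if os.path.exists(path):
        print(split, dict(validate_jsonl(path)))
```

The returned counts also show how the `yes`, `no`, and `maybe` labels are distributed in each split.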


In [7]:
%%bash

# Clear any cached mem-map index files (-f avoids an error when none exist)
rm -f pubmedqa/data/*idx*

wc -l pubmedqa/data/pubmedqa_train.jsonl
wc -l pubmedqa/data/pubmedqa_val.jsonl
wc -l pubmedqa/data/pubmedqa_test.jsonl

450 pubmedqa/data/pubmedqa_train.jsonl
50 pubmedqa/data/pubmedqa_val.jsonl
500 pubmedqa/data/pubmedqa_test.jsonl



### Step 2: Run PEFT finetuning script for LoRA

The NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls. Once your model is downloaded and the dataset is ready, LoRA fine-tuning with NeMo essentially comes down to running this script.

In this demonstration, the training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
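
The stopping rule can be pictured with a few lines of plain Python. This is only an illustration of patience-based early stopping, not the implementation used by NeMo or PyTorch Lightning; the patience of 3 checks and the loss values are made up for the example.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True once the best loss has not improved for `patience` consecutive checks."""
    best = float("inf")
    checks_without_improvement = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return True
    return False

# One entry per validation check (every `val_check_interval` steps).
print(should_stop([1.20, 0.95, 0.90, 0.91, 0.92, 0.93]))  # True: three checks without improvement
print(should_stop([1.20, 0.95, 0.90, 0.89]))              # False: still improving
```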

> `NOTE:` In the block of code below, pass the paths to your train and validation data files as well as the path to the `.nemo` model.

#### Understanding Global Batch Size (GBS) in Multi-GPU Training


##### **1. Global Batch Size (GBS)**
- **Definition:**
  - The **total number of training samples** processed in **one training step** across **all GPUs** involved.

##### **2. Data Parallelism (DP)**
- **Definition:**
  - The **number of model replicas** that train in parallel. Each replica occupies $\text{TP} \times \text{PP}$ GPUs.
  - **Function:** Distributes different data batches to each replica simultaneously.
  - **DP formula:**
      $$
      \text{Data Parallelism (DP)} = \frac{\text{Total GPUs}}{\text{Tensor Parallelism (TP)} \times \text{Pipeline Parallelism (PP)}}
      $$


##### **3. Micro Batch Size (MB)**
- **Definition:**
  - The **number of samples** processed **per model replica** in a single forward/backward pass.

##### **4. GBS Formula**
$$
\text{Global Batch Size (GBS)} = \text{Data Parallelism (DP)} \times \text{Micro Batch Size (MB)} \times \text{Gradient Accumulation Steps (GAS)}
$$

- **GAS (Gradient Accumulation Steps):** The number of micro-batches over which gradients are accumulated before performing a parameter update. When GBS and MB are set explicitly, as in the command below, GAS follows as $\text{GBS} / (\text{DP} \times \text{MB})$.

##### **5. How to Set GBS**
1. **Determine Available GPUs:**
   - Total GPUs (e.g., 4 GPUs).
2. **Determine Data Parallelism (DP):**
   - Divide the GPU count by $\text{TP} \times \text{PP}$ (e.g., with TP = 1 and PP = 1, DP = 4).
3. **Set Micro Batch Size (MB):**
   - Based on GPU memory capacity (e.g., MB = 8).
4. **Calculate GBS:**
   - Use the formula to find GBS (e.g., with GAS = 1, GBS = 4 × 8 × 1 = 32).

##### **Best Practices**
- **Align GBS with DP and MB:**
  - Ensure $\text{GBS}$ is a multiple of $\text{DP} \times \text{MB}$.
- **Monitor GPU Utilization:**
  - Use tools like `nvidia-smi` to ensure all GPUs are effectively utilized.
- **Adjust Batch Sizes as Needed:**
  - Optimize **MB** based on memory constraints and **GBS** to balance load.
- **Utilize Gradient Accumulation:**
  - When larger **GBS** is desired but constrained by memory.
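
For the run in this notebook, these quantities can be worked out explicitly. The sketch below assumes the usual Megatron-style convention, in which the data-parallel size is the GPU count divided by TP × PP and the number of gradient accumulation steps follows from the global and micro batch sizes; the function name `batch_layout` is our own.

```python
def batch_layout(total_gpus, tp, pp, global_batch_size, micro_batch_size):
    """Derive the data-parallel size and the gradient accumulation steps."""
    if total_gpus % (tp * pp) != 0:
        raise ValueError("total_gpus must be divisible by TP * PP")
    dp = total_gpus // (tp * pp)
    if global_batch_size % (dp * micro_batch_size) != 0:
        raise ValueError("global_batch_size must be divisible by DP * micro_batch_size")
    gas = global_batch_size // (dp * micro_batch_size)
    return dp, gas

# Values from the fine-tuning command below: 4 GPUs, TP=4, PP=1, GBS=8, MB=1.
dp, gas = batch_layout(total_gpus=4, tp=4, pp=1, global_batch_size=8, micro_batch_size=1)
print(f"DP = {dp}, gradient accumulation steps = {gas}")  # DP = 1, gradient accumulation steps = 8
```

With tensor parallelism spread over all four GPUs there is a single model replica, so the global batch of 8 is reached by accumulating gradients over eight micro-batches.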


In [6]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

# Set paths to the model, train, validation and test sets.
PRECISION=bf16-mixed
MODEL="models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1-nemotron/70B/$PRECISION"
rm -rf "$OUTPUT_DIR"

TRAIN_DS="[pubmedqa/data/pubmedqa_train.jsonl]"
VALID_DS="[pubmedqa/data/pubmedqa_val.jsonl]"

SCHEME="lora"
GPUS=4   
TP_SIZE=4
PP_SIZE=1

# model.megatron_amp_O2=True enforces mixed precision
# (kept here as a comment because text after a trailing backslash breaks the line continuation)
torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    trainer.precision=${PRECISION} \
    trainer.val_check_interval=20 \
    trainer.max_steps=1000 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.global_batch_size=8 \
    model.micro_batch_size=1 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=10 \
    model.data.validation_ds.num_workers=10 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

[NeMo I 2024-12-10 13:24:32 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-10 13:24:32 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 4
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 1000
      log_every_n_steps: 10
      val_check_interval: 20
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: results/llama-3.1-nemotron/70B/bf16-mixed
      exp_dir: results/llama-3.1-nemotron/70B/bf16-mixed
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.dat

[NeMo W 2024-12-10 13:24:32 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


[NeMo I 2024-12-10 13:24:32 exp_manager:396] ExpManager schema
[NeMo I 2024-12-10 13:24:32 exp_manager:397] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo E 2024-12-10 13:24:32 exp_manager:830] exp_manager received explicit_log_dir: results/llama-3.1-nemotron/70B/bf16-mixed and at least one of exp_dir: results/llama-3.1-nemotron/70B/bf16-mixed, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-12-10 13:24:32 exp_manager:757] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints. Training from scratch.


[NeMo I 2024-12-10 13:24:32 exp_manager:455] Experiments will be logged at results/llama-3.1-nemotron/70B/bf16-mixed
[NeMo I 2024-12-10 13:24:32 exp_manager:983] TensorboardLogger has been set up


[NeMo W 2024-12-10 13:24:32 exp_manager:1111] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() d

[NeMo I 2024-12-10 13:28:25 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-10 13:28:25 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-10 13:28:25 megatron_init:280] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3]]
[NeMo I 2024-12-10 13:28:25 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-10 13:28:25 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-10 13:28:25 megatron_init:294] All context parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-10 13:28:25 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-10 13:28:25 megatron_init:302] Rank 0 has model parallel group: [0, 1, 2, 3]
[NeMo I 2024-12-10 13:28:25 megatron_init:303] All model parallel group ranks: [[0, 1, 2, 3]]
[NeMo I 2024-12-10 13:28:25 megatron_init:312] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-12-10 13:28:25 m

[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:25 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 13:28:25 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3.1-70B-Instruct


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-12-10 13:28:27 megatron_base_model:595] Padded vocab_size: 128512, original vocab_size: 128256, dummy tokens: 256.


[NeMo W 2024-12-10 13:28:27 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:27 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:27 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:27 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 13:28:27 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-12-10 13:36:34 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /work/ucloud-workshop-11-12-2024/models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo.
[NeMo I 2024-12-10 13:36:34 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-12-10 13:36:34 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 17.6 B | train
    ------------------------------------------------
    0         Trainable params
    17.6 B    Non-trainable params
    17.6 B    Total params
    70,561.858Total estimated model params size (MB)
[NeMo I 2024-12-10 13:36:38 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Fl

[NeMo W 2024-12-10 13:36:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-10 13:36:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-10 13:36:41 megatron_gpt_sft_model:801] Building GPT SFT validation datasets.
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.067161
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.079545
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_val.jsonl
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001530
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-10 13:36:41 megatron_gpt_sft_model:805] Length of val dataset: 50
[NeMo I 2024-12-10 13:36:41 megatron_gpt_sft_model:812] Building GPT SFT traing datasets.
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.075593
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.075534
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_train.jsonl
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001608
[NeMo I 2024-12-10 13:36:41 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-12-10 13:36:44 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.06 (sec)
[NeMo I 2024-12-10 13:36:44 megatron_gpt_sft_model:814] Length of train dataset: 8040
[NeMo I 2024-12-10 13:36:44 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0
[NeMo I 2024-12-10 13:36:44 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0


LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo W 2024-12-10 13:36:44 megatron_base_model:1223] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 1000.


[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-10 13:36:44 adapter_mixins:495] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 17.7 B | train
------------------------------------------------
11.8 M    Trainable params
17.6 B    Non-trainable params
17.7 B    Total params
70,609.043Total estimated model params size (MB)
[NeMo W 2024-12-10 13:36:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Pleas

Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-12-10 13:38:11 num_microbatches_calculator:119] setting number of micro-batches to constant 8


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    


Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:13<00:00,  0.15it/s][NeMo I 2024-12-10 13:38:24 num_microbatches_calculator:119] setting number of micro-batches to constant 8


[NeMo W 2024-12-10 13:38:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 13:38:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 13:38:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 202

Epoch 0: :   2%|▏         | 20/1000 [00:52<42:29, reduced_train_loss=1.390, global_step=19.00, consumed_samples=160.0, train_step_timing in s=2.150] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:40:45 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.90it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:05,  0.91it/s][A
Validation DataLoader 0:  43%|████▎     | 3/7 [00:03<00:04,  0.85it/s][A
Validation DataLoader 0:  57%|█████▋    | 4/7 [00:04<00:03,  0.87it/s][A
Validation DataLoader 0:  71%|███████▏  | 5/7 [00:05<00:02,  0.88it/s][A
Validation DataLoader 0:  86%|████████▌ | 6/7 [00:06<00:01,  0.88it/s][A
Validation DataLoader 0: 100%|██████████| 7/7 [00:07<00:00,  0.89it/s][A[NeMo I 2024-12-10 13:40:53 num_microbatches_calculator:11

[rank: 0] Metric val_loss improved. New best score: 0.908
[rank: 1] Metric val_loss improved. New best score: 0.908
[rank: 2] Metric val_loss improved. New best score: 0.908
[rank: 3] Metric val_loss improved. New best score: 0.908
Epoch 0, global step 20: 'validation_loss' reached 0.90807 (best 0.90807), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.908-step=20-consumed_samples=160.0.ckpt' as top 1
[NeMo W 2024-12-10 13:40:53 nlp_overrides:609] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :   4%|▍         | 40/1000 [01:46<42:25, reduced_train_loss=0.0777, global_step=39.00, consumed_samples=320.0, train_step_timing in s=2.160, val_loss=0.908]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:41:39 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.89it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:05,  0.91it/s][A
Validation DataLoader 0:  43%|████▎     | 3/7 [00:03<00:04,  0.90it/s][A
Validation DataLoader 0:  57%|█████▋    | 4/7 [00:04<00:03,  0.91it/s][A
Validation DataLoader 0:  71%|███████▏  | 5/7 [00:05<00:02,  0.91it/s][A
Validation DataLoader 0:  86%|████████▌ | 6/7 [00:06<00:01,  0.91it/s][A
Validation DataLoader 0: 100%|██████████| 7/7 [00:07<00:00,  0.91it/s][A[NeMo I 2024-12-10 13:41:47 num_microbatch

[rank: 0] Metric val_loss improved by 0.756 >= min_delta = 0.001. New best score: 0.152
[rank: 2] Metric val_loss improved by 0.756 >= min_delta = 0.001. New best score: 0.152
[rank: 1] Metric val_loss improved by 0.756 >= min_delta = 0.001. New best score: 0.152
[rank: 3] Metric val_loss improved by 0.756 >= min_delta = 0.001. New best score: 0.152
Epoch 0, global step 40: 'validation_loss' reached 0.15225 (best 0.15225), saving model to '/work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.152-step=40-consumed_samples=320.0.ckpt' as top 1


Epoch 0: :   4%|▍         | 40/1000 [01:53<45:30, reduced_train_loss=0.0777, global_step=39.00, consumed_samples=320.0, train_step_timing in s=2.160, val_loss=0.152][NeMo I 2024-12-10 13:41:47 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.908-step=20-consumed_samples=160.0.ckpt
[NeMo I 2024-12-10 13:41:48 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.908-step=20-consumed_samples=160.0-last.ckpt
Epoch 0: :   6%|▌         | 60/1000 [02:38<41:23, reduced_train_loss=0.0944, global_step=59.00, consumed_samples=480.0, train_step_timing in s=2.150, val_loss=0.152]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:42:32 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Epoch 0, global step 60: 'validation_loss' was not in top 1


Epoch 0: :   6%|▌         | 60/1000 [02:46<43:26, reduced_train_loss=0.0944, global_step=59.00, consumed_samples=480.0, train_step_timing in s=2.150, val_loss=0.195][NeMo I 2024-12-10 13:42:40 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.152-step=40-consumed_samples=320.0-last.ckpt
Epoch 0: :   8%|▊         | 80/1000 [03:31<40:31, reduced_train_loss=0.0283, global_step=79.00, consumed_samples=640.0, train_step_timing in s=2.140, val_loss=0.195] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:43:25 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.86it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:05,

Epoch 0, global step 80: 'validation_loss' was not in top 1


Epoch 0: :   8%|▊         | 80/1000 [03:39<42:03, reduced_train_loss=0.0283, global_step=79.00, consumed_samples=640.0, train_step_timing in s=2.140, val_loss=0.157][NeMo I 2024-12-10 13:43:33 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.195-step=60-consumed_samples=480.0-last.ckpt
Epoch 0: :  10%|█         | 100/1000 [04:24<39:39, reduced_train_loss=0.123, global_step=99.00, consumed_samples=800.0, train_step_timing in s=2.120, val_loss=0.157]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:44:18 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.90it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:05, 

Epoch 0, global step 100: 'validation_loss' was not in top 1


Epoch 0: :  10%|█         | 100/1000 [04:32<40:48, reduced_train_loss=0.123, global_step=99.00, consumed_samples=800.0, train_step_timing in s=2.120, val_loss=0.168][NeMo I 2024-12-10 13:44:26 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.157-step=80-consumed_samples=640.0-last.ckpt
Epoch 0: :  12%|█▏        | 120/1000 [05:17<38:47, reduced_train_loss=0.136, global_step=119.0, consumed_samples=960.0, train_step_timing in s=2.460, val_loss=0.168]  
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:45:11 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.89it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:05

Epoch 0, global step 120: 'validation_loss' was not in top 1


Epoch 0: :  12%|█▏        | 120/1000 [05:25<39:43, reduced_train_loss=0.136, global_step=119.0, consumed_samples=960.0, train_step_timing in s=2.460, val_loss=0.237][NeMo I 2024-12-10 13:45:19 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.168-step=100-consumed_samples=800.0-last.ckpt
Epoch 0: :  14%|█▍        | 140/1000 [06:09<37:51, reduced_train_loss=0.021, global_step=139.0, consumed_samples=1120.0, train_step_timing in s=2.140, val_loss=0.237]  
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:46:03 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.87it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00:

Epoch 0, global step 140: 'validation_loss' was not in top 1


Epoch 0: :  14%|█▍        | 140/1000 [06:17<38:40, reduced_train_loss=0.021, global_step=139.0, consumed_samples=1120.0, train_step_timing in s=2.140, val_loss=0.203][NeMo I 2024-12-10 13:46:11 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.237-step=120-consumed_samples=960.0-last.ckpt
Epoch 0: :  16%|█▌        | 160/1000 [07:03<37:02, reduced_train_loss=0.0483, global_step=159.0, consumed_samples=1280.0, train_step_timing in s=2.470, val_loss=0.203] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:46:57 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.89it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<00

Epoch 0, global step 160: 'validation_loss' was not in top 1


Epoch 0: :  16%|█▌        | 160/1000 [07:11<37:42, reduced_train_loss=0.0483, global_step=159.0, consumed_samples=1280.0, train_step_timing in s=2.470, val_loss=0.210][NeMo I 2024-12-10 13:47:05 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.203-step=140-consumed_samples=1120.0-last.ckpt
Epoch 0: :  18%|█▊        | 180/1000 [07:55<36:06, reduced_train_loss=0.0557, global_step=179.0, consumed_samples=1440.0, train_step_timing in s=2.150, val_loss=0.210] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:47:49 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.89it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<

Epoch 0, global step 180: 'validation_loss' was not in top 1


Epoch 0: :  18%|█▊        | 180/1000 [08:03<36:41, reduced_train_loss=0.0557, global_step=179.0, consumed_samples=1440.0, train_step_timing in s=2.150, val_loss=0.213][NeMo I 2024-12-10 13:47:57 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.210-step=160-consumed_samples=1280.0-last.ckpt
Epoch 0: :  20%|██        | 200/1000 [08:48<35:14, reduced_train_loss=0.00644, global_step=199.0, consumed_samples=1600.0, train_step_timing in s=2.150, val_loss=0.213] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:48:42 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.90it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02

Epoch 0, global step 200: 'validation_loss' was not in top 1


Epoch 0: :  20%|██        | 200/1000 [08:56<35:45, reduced_train_loss=0.00644, global_step=199.0, consumed_samples=1600.0, train_step_timing in s=2.150, val_loss=0.272][NeMo I 2024-12-10 13:48:50 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.213-step=180-consumed_samples=1440.0-last.ckpt
Epoch 0: :  22%|██▏       | 220/1000 [09:41<34:20, reduced_train_loss=0.017, global_step=219.0, consumed_samples=1760.0, train_step_timing in s=2.120, val_loss=0.272]   
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:49:34 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.89it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:0

Epoch 0, global step 220: 'validation_loss' was not in top 1


Epoch 0: :  22%|██▏       | 220/1000 [09:48<34:48, reduced_train_loss=0.017, global_step=219.0, consumed_samples=1760.0, train_step_timing in s=2.120, val_loss=0.244][NeMo I 2024-12-10 13:49:42 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.272-step=200-consumed_samples=1600.0-last.ckpt
Epoch 0: :  24%|██▍       | 240/1000 [10:34<33:28, reduced_train_loss=0.00261, global_step=239.0, consumed_samples=1920.0, train_step_timing in s=2.130, val_loss=0.244]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-10 13:50:28 num_microbatches_calculator:119] setting number of micro-batches to constant 8

Validation:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/7 [00:00<?, ?it/s][A
Validation DataLoader 0:  14%|█▍        | 1/7 [00:01<00:06,  0.90it/s][A
Validation DataLoader 0:  29%|██▊       | 2/7 [00:02<0

[rank: 0] Monitored metric val_loss did not improve in the last 10 records. Best score: 0.152. Signaling Trainer to stop.
[rank: 3] Monitored metric val_loss did not improve in the last 10 records. Best score: 0.152. Signaling Trainer to stop.
[rank: 2] Monitored metric val_loss did not improve in the last 10 records. Best score: 0.152. Signaling Trainer to stop.
[rank: 1] Monitored metric val_loss did not improve in the last 10 records. Best score: 0.152. Signaling Trainer to stop.
Epoch 0, global step 240: 'validation_loss' was not in top 1


Epoch 0: :  24%|██▍       | 240/1000 [10:42<33:53, reduced_train_loss=0.00261, global_step=239.0, consumed_samples=1920.0, train_step_timing in s=2.130, val_loss=0.396][NeMo I 2024-12-10 13:50:36 nlp_overrides:593] Removing checkpoint: /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/mp_rank_00/megatron_gpt_peft_lora_tuning--validation_loss=0.244-step=220-consumed_samples=1760.0-last.ckpt
Epoch 0: :  24%|██▍       | 240/1000 [10:42<33:54, reduced_train_loss=0.00261, global_step=239.0, consumed_samples=1920.0, train_step_timing in s=2.130, val_loss=0.396]


Restoring states from the checkpoint path at /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.152-step=40-consumed_samples=320.0.ckpt
Restored all states from the checkpoint at /work/ucloud-workshop-11-12-2024/results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.152-step=40-consumed_samples=320.0.ckpt


The run above produces a LoRA adapter: a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/.../checkpoints/`. We'll use it in the inference step.
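
As a quick sanity check, you can peek inside the adapter file. The sketch below assumes that a `.nemo` file is a tar archive bundling the configuration and weights, and lists its members with Python's standard `tarfile` module. The helper name `list_nemo_contents` is hypothetical, and the path assumes the output directory used in this run.

```python
import tarfile
from pathlib import Path


def list_nemo_contents(nemo_path):
    """Return the member names of a .nemo archive (assumed to be a tar file)."""
    # "r:*" lets tarfile detect the compression (plain tar or gzip) on its own.
    with tarfile.open(nemo_path, "r:*") as archive:
        return archive.getnames()


# Path assumed from the OUTPUT_DIR used in this notebook; adjust if yours differs.
adapter = Path(
    "results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/"
    "megatron_gpt_peft_lora_tuning.nemo"
)
if adapter.exists():
    for name in list_nemo_contents(adapter):
        print(name)
else:
    print(f"Adapter not found at {adapter}")
```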

You can further configure the run above in the following ways:

* **A different PEFT technique**: The `peft.peft_scheme` parameter determines which technique is used. Here we used LoRA, but NeMo Framework also supports other techniques such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, to use P-tuning, set:

```bash
model.peft.peft_scheme="ptuning" # instead of "lora"
```

* **Tuning Llama-3.3 70B**: You will need 4xH100 GPUs. Provide the path to its `.nemo` checkpoint (similar to the download and conversion steps earlier), and change the model parallelization settings so that the model is distributed across the GPUs. When using more than one GPU, it is also recommended to run the fine-tuning script directly from a terminal instead of from Jupyter.
```bash
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=1
```
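
The parallelism settings above have to be consistent with the number of GPUs and the batch sizes. The following sketch illustrates the usual Megatron-style relationship: the data-parallel size is the GPU count divided by tensor-parallel size times pipeline-parallel size, and the number of micro-batches per step is the global batch size divided by micro-batch size times data-parallel size. The function name `parallel_layout` is hypothetical. The global batch size of 8 is inferred from the training log (`consumed_samples=160.0` at global step 20), and a micro-batch size of 1 is an assumption that matches the logged "number of micro-batches" of 8.

```python
def parallel_layout(num_gpus, tp_size, pp_size, global_batch_size, micro_batch_size):
    """Return (data_parallel_size, num_micro_batches) for a given layout."""
    model_parallel = tp_size * pp_size
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp_size * pp_size")
    dp_size = num_gpus // model_parallel

    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * dp_size"
        )
    return dp_size, global_batch_size // samples_per_pass


# Values for this run: 4 GPUs, TP=4, PP=1; batch sizes are inferred/assumed.
dp_size, num_micro_batches = parallel_layout(
    num_gpus=4, tp_size=4, pp_size=1, global_batch_size=8, micro_batch_size=1
)
print(f"data-parallel size: {dp_size}, micro-batches per step: {num_micro_batches}")
```

With all four GPUs used for tensor parallelism, the data-parallel size is 1, so every global batch is processed as a sequence of micro-batches on a single model replica.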

You can override many such settings when running the script. The full set of configuration options is listed in the [NeMo Framework GitHub repository](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml).
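
Overrides are passed on the command line as dotted `key=value` pairs, as in the snippets above. If you keep your settings in a Python dictionary, a small helper can flatten them into that form. This is only a convenience sketch: the helper name `to_overrides` is hypothetical and is not part of NeMo or Hydra.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into a list of dotted `key=value` override strings."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix.
            overrides.extend(to_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


settings = {
    "model": {
        "tensor_model_parallel_size": 4,
        "pipeline_model_parallel_size": 1,
        "peft": {"peft_scheme": "lora"},
    }
}

# Print the overrides one per line, ready to paste after the script name.
print(" \\\n    ".join(to_overrides(settings)))
```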

### Step 3: Inference with NeMo Framework

You can also run text generation within the framework by running a Python script. Note that this is intended for testing and validation; it is not a full-fledged deployment solution like NVIDIA NIM.

In [11]:
%%bash
# Check that the LoRA adapter file exists

python -c "import torch; torch.cuda.empty_cache()"

OUTPUT_DIR="results/llama-3.1-nemotron/70B/bf16-mixed"
ls -l "$OUTPUT_DIR/checkpoints"

total 92440
-rw-r--r--. 1 ucloud ucloud 94658560 Dec 10 13:50 megatron_gpt_peft_lora_tuning.nemo
drwxr-xr-x. 1 ucloud ucloud        0 Dec 10 13:50 mp_rank_00
drwxr-xr-x. 1 ucloud ucloud        0 Dec 10 13:50 mp_rank_01
drwxr-xr-x. 1 ucloud ucloud        0 Dec 10 13:50 mp_rank_02
drwxr-xr-x. 1 ucloud ucloud        0 Dec 10 13:50 mp_rank_03


In the code snippet below, the following configurations are worth noting:

1. `model.restore_from_path`: the path to the base model's `.nemo` file (here, `Llama-3_1-Nemotron-70B-Instruct.nemo`).
2. `model.peft.restore_from_path`: the path to the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names`: the path to the `pubmedqa_test.jsonl` file.

If you have changed any model or experiment paths, make sure they are set correctly below.
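
Because loading the 70B base model takes several minutes, it can save time to confirm that all three inputs exist before launching generation. The sketch below is a minimal pre-flight check; the helper name `missing_paths` is hypothetical, and the paths are the ones used in this notebook, so adjust them if yours differ.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Paths as used in this notebook (relative to the working directory).
required = [
    "models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo",
    "results/llama-3.1-nemotron/70B/bf16-mixed/checkpoints/"
    "megatron_gpt_peft_lora_tuning.nemo",
    "pubmedqa/data/pubmedqa_test.jsonl",
]

missing = missing_paths(required)
if missing:
    print("Missing inputs:", *missing, sep="\n  ")
else:
    print("All inputs found.")
```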

In [8]:
%%bash -s "$token"

# Log in to HuggingFace to get AutoTokenizer with pretrained_model_name
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16-mixed
MODEL="models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"
OUTPUT_DIR="results/llama-3.1-nemotron/70B/$PRECISION"
TEST_DS="[pubmedqa/data/pubmedqa_test.jsonl]"
TEST_NAMES="[pubmedqa]"
SCHEME="lora"
GPUS=4
TP_SIZE=4
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="$OUTPUT_DIR/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="pubmedQA_result_"

export TOKENIZERS_PARALLELISM=true

torchrun --nproc_per_node=${GPUS} \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=${GPUS} \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=1 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=3 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-12-10 13:56:36 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-10 13:56:36 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 4
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-12-10 13:56:36 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-12-10 14:00:01 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-10 14:00:01 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-10 14:00:01 megatron_init:280] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3]]
[NeMo I 2024-12-10 14:00:01 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-10 14:00:01 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-10 14:00:01 megatron_init:294] All context parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-10 14:00:01 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-10 14:00:01 megatron_init:302] Rank 0 has model parallel group: [0, 1, 2, 3]
[NeMo I 2024-12-10 14:00:01 megatron_init:303] All model parallel group ranks: [[0, 1, 2, 3]]
[NeMo I 2024-12-10 14:00:01 megatron_init:312] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-12-10 14:00:01 m

[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-12-10 14:00:01 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3.1-70B-Instruct


    
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-12-10 14:00:01 megatron_base_model:595] Padded vocab_size: 128512, original vocab_size: 128256, dummy tokens: 256.


[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-10 14:00:01 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-12-10 14:07:23 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /work/ucloud-workshop-11-12-2024/models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo.
[NeMo I 2024-12-10 14:07:23 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 17.6 B | train
    -------------------------------------------
    0         Trainable params
    17.6 B    Non-trainable params
    17.6 B    Total params
    70,561.858 Total estimated model params size (MB)
[NeMo I 2024-12-10 14:07:27 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 17.7 B | train
    -------------------------------------------
    11.8 M    Trainable params
    17.6 B    Non-trainable 

[NeMo W 2024-12-10 14:07:27 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-10 14:07:27 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-10 14:07:27 megatron_gpt_sft_model:793] Building GPT SFT test datasets.
[NeMo I 2024-12-10 14:07:27 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-10 14:07:27 text_memmap_dataset:525] Processing 1 data files using 96 workers
[NeMo I 2024-12-10 14:07:29 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:01.724135
[NeMo I 2024-12-10 14:07:29 text_memmap_dataset:525] Processing 1 data files using 96 workers
[NeMo I 2024-12-10 14:07:31 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:01.775241
[NeMo I 2024-12-10 14:07:31 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-10 14:07:31 text_memmap_dataset:249] Loading pubmedqa/data/pubmedqa_test.jsonl
[NeMo I 2024-12-10 14:07:31 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001674
[NeMo I 2024-12-10 14:07:31 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-10 14:07:31 megatron_gpt_sft_model:796] Length of test dataset: 500
[NeMo I 2

LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo W 2024-12-10 14:07:31 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
    
[NeMo W 2024-12-10 14:07:31 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of micro-batches to constant 1




Testing DataLoader 0:   0%|          | 0/500 [00:00<?, ?it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   0%|          | 1/500 [00:14<1:57:57,  0.07it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   0%|          | 2/500 [00:14<1:01:30,  0.13it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 3/500 [00:15<42:38,  0.19it/s]  setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 4/500 [00:16<33:12,  0.25it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|          | 5/500 [00:16<27:32,  0.30it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 1
Testing DataLoader 0:   1%|         

[NeMo W 2024-12-10 14:12:57 megatron_gpt_sft_model:642] No training data found, reconfiguring microbatches based on validation batch sizes.


setting number of micro-batches to constant 1


[NeMo W 2024-12-10 14:12:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 14:12:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_pubmedqa', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-10 14:12:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    


Testing DataLoader 0: 100%|██████████| 500/500 [05:26<00:00,  1.53it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │    0.15388180315494537    │
│    test_loss_pubmedqa     │    0.15388180315494537    │
│         val_loss          │    0.15388180315494537    │
└───────────────────────────┴───────────────────────────┘


### Step 4: Check the model accuracy

With the test run complete, we can read the generated predictions and calculate the accuracy on the PubMedQA task. You can compare your results with the public leaderboard at https://pubmedqa.github.io/.

First, let's look at one of the predictions in the generated output file. The `pred` key holds the text the model generated.

In [9]:
%%bash

tail -n 1 pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl

{"input": "OBJECTIVES: Outcome feedback is the process of learning patient outcomes after their care within the emergency department. We conducted a national survey of Canadian Royal College emergency medicine (EM) residents and program directors to determine the extent to which active outcome feedback and follow-up occurred. We also compared the perceived educational value of outcome feedback between residents and program directors.\nMETHODS: We distributed surveys to all Royal College-accredited adult and pediatric EM training programs using a modified Dillman method. We analyzed the data using student's t-test for continuous variables and Fisher's exact test for categorical variables.\nRESULTS: We received 210 completed surveys from 260 eligible residents (80.8%) and 21 of 24 program directors (87.5%) (overall 81.3%). Mandatory active outcome feedback was not present in any EM training program for admitted or discharged patients (0/21). Follow-up was performed electively by 89.4% of

Note that the model produces output in the specified format, such as `<<< no >>>`.
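
Because each record carries the full abstract in `input`, printing the whole line buries the prediction. As a minimal sketch, the helper below reads the last record and returns only the short fields. The function name `last_prediction` is hypothetical, and the `label` key is an assumption inferred from the output file name; only `input` and `pred` are confirmed by the output above.

```python
import json


def last_prediction(path):
    """Return the short fields of the last non-empty JSONL record in `path`."""
    last = None
    with open(path, "rt") as f:
        for line in f:
            if line.strip():
                last = line
    if last is None:
        raise ValueError(f"no records found in {path}")
    record = json.loads(last)
    # `label` is an assumed key; .get() returns None if it is absent
    return {"pred": record.get("pred"), "label": record.get("label")}
```

For example, `last_prediction("pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl")` would return a small dictionary instead of the full abstract.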

The following cells load the generated output, map each prediction to a `yes`, `no`, or `maybe` label, and score the result against the test set using the `evaluation.py` script included in the PubMedQA repo.

In [7]:
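import json

# Load one prediction record per line from the JSONL file written by the test run
answers = []
with open("pubmedQA_result__test_pubmedqa_inputs_preds_labels.jsonl", "rt") as f:
    for line in f:
        if line.strip():  # skip blank lines, e.g. a trailing newline
            answers.append(json.loads(line))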

In [8]:
with open("./pubmedqa/data/test_set.json", "rt") as f:
    data_test = json.load(f)

In [9]:
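results = {}
sample_id = list(data_test.keys())

# Predictions are assumed to be in the same order as the test-set keys
for i, key in enumerate(sample_id):
    answer = answers[i]['pred']
    # Substring checks, tried in order: 'yes', then 'no', then 'maybe'
    if 'yes' in answer:
        results[key] = 'yes'
    elif 'no' in answer:
        results[key] = 'no'
    elif 'maybe' in answer:
        results[key] = 'maybe'
    else:
        # Fall back to 'maybe' when no label is found in the prediction
        print("Malformed answer: ", answer)
        results[key] = 'maybe'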
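
The substring checks above work for well-formed predictions, but they would also match a label word that appears inside a longer word or elsewhere in free text. A stricter variant, sketched here under the assumption that predictions follow the `<<< label >>>` format noted earlier, extracts the label with a regular expression. The names `LABEL_PATTERN` and `parse_label` are hypothetical and not part of NeMo or the PubMedQA repo.

```python
import re

# Matches "<<< yes >>>", "<<< no >>>" or "<<< maybe >>>",
# tolerating extra whitespace and differences in case
LABEL_PATTERN = re.compile(r"<<<\s*(yes|no|maybe)\s*>>>", re.IGNORECASE)


def parse_label(pred, default="maybe"):
    """Extract the answer label from a prediction string.

    Returns `default` when no well-formed label is found, mirroring the
    'maybe' fallback used in the loop above.
    """
    match = LABEL_PATTERN.search(pred)
    return match.group(1).lower() if match else default
```

In the loop, `results[key] = parse_label(answers[i]['pred'])` could then stand in for the `if`/`elif` chain.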

In [10]:
# Dump results in a format that can be ingested by the PubMedQA evaluation script
FILENAME = "pubmedqa-llama-3-70b-lora.json"
with open(FILENAME, "w") as f:
    json.dump(results, f)

# Evaluation
!cp $FILENAME ./pubmedqa/
!cd ./pubmedqa/ && python evaluation.py $FILENAME

Accuracy 0.796000
Macro-F1 0.570244
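
The two scores differ because they are averaged differently: accuracy counts every sample equally, while macro-F1 is the unweighted mean of the per-class F1 scores, so a class the model rarely gets right pulls the score down no matter how few samples it has. The sketch below computes both metrics in plain Python on a made-up toy example. It illustrates the definitions and is not the code in `evaluation.py`; the function names and the toy labels are hypothetical.

```python
def accuracy(truth, preds):
    """Fraction of samples whose prediction equals the ground-truth label."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)


def macro_f1(truth, preds, labels=("yes", "no", "maybe")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(truth, preds))
        fp = sum(t != label and p == label for t, p in zip(truth, preds))
        fn = sum(t == label and p != label for t, p in zip(truth, preds))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class with no true or
        # predicted samples contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): the single "maybe" sample is misclassified
truth = ["yes", "yes", "yes", "no", "maybe"]
preds = ["yes", "yes", "yes", "no", "yes"]
print(f"Accuracy {accuracy(truth, preds):.6f}")  # Accuracy 0.800000
print(f"Macro-F1 {macro_f1(truth, preds):.6f}")  # Macro-F1 0.619048
```

On this toy data the accuracy is 0.8 while macro-F1 is about 0.62, because the missed `maybe` class contributes an F1 of 0.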


## Export Model to TensorRT-LLM Format for Inference

In [32]:
from nemo.export.tensorrt_llm import TensorRTLLM

PRECISION = "bf16-mixed"
MODEL_DIR = "models/llama-3.1-nemotron/70B/trt_llm"
MODEL_CKPT = "models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"
LORA_CKPT = f"results/llama-3.1-nemotron/70B/{PRECISION}/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

trt_llm_exporter = TensorRTLLM(
    model_dir=MODEL_DIR,
    lora_ckpt_list=[LORA_CKPT],
)

trt_llm_exporter.export(
    nemo_checkpoint_path=MODEL_CKPT,
    model_type="llama",
    n_gpus=4,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
saving weights: 100%|██████████| 481/481 [03:09<00:00,  2.54it/s]


[12/10/2024-14:45:24] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


I1210 14:45:24.515959 139999580046464 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


[12/10/2024-14:45:24] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


I1210 14:45:24.517820 139999580046464 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


[12/10/2024-14:45:24] [TRT-LLM] [I] Set multi_block_mode to False.


I1210 14:45:24.518685 139999580046464 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.


[12/10/2024-14:45:24] [TRT-LLM] [I] Set paged_kv_cache to True.


I1210 14:45:24.519243 139999580046464 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.


[12/10/2024-14:45:24] [TRT-LLM] [I] Set tokens_per_block to 128.


I1210 14:45:24.519765 139999580046464 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.


[12/10/2024-14:45:24] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


W1210 14:45:24.520328 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


[12/10/2024-14:45:24] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



W1210 14:45:24.520862 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



[12/10/2024-14:46:39] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 423065, GPU 528 (MiB)
[12/10/2024-14:46:45] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4312, GPU +1150, now: CPU 427513, GPU 1678 (MiB)
[12/10/2024-14:46:45] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[12/10/2024-14:46:45] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


I1210 14:46:45.100662 139999580046464 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


[12/10/2024-14:46:45] [TRT-LLM] [I] Set use_custom_all_reduce to True.


I1210 14:46:45.101771 139999580046464 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-14:46:45] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:46:45] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:46:45] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:46:45] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 14:46:45.723005 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.724067 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.724715 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.725315 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.725916 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.726485 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.727061 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.727629 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.728212 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.728749 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.729308 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.729916 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.730469 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.731508 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.732935 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.733507 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.734072 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.734632 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.735191 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.735736 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.736295 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.736810 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.738280 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.739035 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.739575 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.740158 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.740705 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.741308 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.741841 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.742399 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.742977 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.743518 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.744076 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.744623 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.745176 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.745734 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.746273 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.746827 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.747391 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.747920 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.748481 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.749029 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.749550 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.750087 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.750609 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.751127 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.751658 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.752209 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.752729 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.753320 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.753901 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.754459 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.755016 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.755560 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.756096 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.756646 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.757173 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.757739 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.758271 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.758808 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.759369 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.759902 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.760447 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.760957 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.761473 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.762000 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.762529 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.763062 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.763626 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.764171 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.764695 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.765227 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.765876 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.766418 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.766977 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.767523 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.768057 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.768594 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.774603 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:46:45.775199 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:46:45] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


I1210 14:46:45.775758 139999580046464 logger.py:92] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


[12/10/2024-14:46:45] [TRT] [W] Unused Input: position_ids
[12/10/2024-14:46:45] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-14:46:45] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-14:46:57] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-14:46:57] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/10/2024-14:47:16] [TRT] [I] Total Host Persistent Memory: 305280
[12/10/2024-14:47:16] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-14:47:16] [TRT] [I] Total Scratch Memory: 67117056
[12/10/2024-14:47:16] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/10/2024-14:47:16] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 82.6494ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/10/2024-14:47:16] [TRT

I1210 14:47:32.935170 139999580046464 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:47


[12/10/2024-14:47:32] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank0.engine...


I1210 14:47:32.942047 139999580046464 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank0.engine...


[12/10/2024-14:47:52] [TRT-LLM] [I] Engine serialized. Total time: 00:00:19


I1210 14:47:52.927093 139999580046464 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:19


[12/10/2024-14:47:55] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


I1210 14:47:55.815099 139999580046464 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


[12/10/2024-14:47:55] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


I1210 14:47:55.816414 139999580046464 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


[12/10/2024-14:47:55] [TRT-LLM] [I] Set multi_block_mode to False.


I1210 14:47:55.816962 139999580046464 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.


[12/10/2024-14:47:55] [TRT-LLM] [I] Set paged_kv_cache to True.


I1210 14:47:55.817445 139999580046464 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.


[12/10/2024-14:47:55] [TRT-LLM] [I] Set tokens_per_block to 128.


I1210 14:47:55.817957 139999580046464 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.


[12/10/2024-14:47:55] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


W1210 14:47:55.818501 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


[12/10/2024-14:47:55] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



W1210 14:47:55.818986 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



[12/10/2024-14:49:11] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 433067, GPU 1764 (MiB)
[12/10/2024-14:49:11] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


I1210 14:49:11.673665 139999580046464 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


[12/10/2024-14:49:11] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[12/10/2024-14:49:11] [TRT-LLM] [I] Set use_custom_all_reduce to True.


I1210 14:49:11.675464 139999580046464 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-14:49:11] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:49:11] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:49:11] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:49:11] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 14:49:12.214994 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.216323 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.216948 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.217521 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.218111 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.219141 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.219693 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.220276 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.220879 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.221455 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.222030 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.222585 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.223142 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.223692 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.224219 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.224749 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.225296 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.225846 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.226379 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.226922 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.227445 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.227985 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.228502 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.229031 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.229557 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.230095 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.230611 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.231179 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.231689 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.232213 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.232759 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.233258 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.233812 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.234361 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.234897 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.238439 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.239149 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.239726 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.240578 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.241140 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.241673 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.242236 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.242744 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.243291 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.243832 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.245110 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.245652 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.246177 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.246689 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.247235 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.247756 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.248285 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.248822 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.249367 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.249886 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.250402 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.250918 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.251436 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.251971 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.252484 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.253012 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.253529 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.254039 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.254561 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.255112 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.255633 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.256168 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.256691 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.257224 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.257733 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.258264 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.258784 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.259323 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.259815 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.260344 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.263870 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.264416 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.264948 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.265493 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:49:12.266012 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:49:12] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


I1210 14:49:12.266564 139999580046464 logger.py:92] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


[12/10/2024-14:49:12] [TRT] [W] Unused Input: position_ids
[12/10/2024-14:49:12] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-14:49:12] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-14:49:25] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-14:49:25] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/10/2024-14:49:42] [TRT] [I] Total Host Persistent Memory: 305280
[12/10/2024-14:49:42] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-14:49:42] [TRT] [I] Total Scratch Memory: 67117056
[12/10/2024-14:49:42] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/10/2024-14:49:42] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 82.1465ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/10/2024-14:49:42] [TRT

I1210 14:50:03.740431 139999580046464 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:51


[12/10/2024-14:50:03] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank1.engine...


I1210 14:50:03.745454 139999580046464 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank1.engine...


[12/10/2024-14:50:28] [TRT-LLM] [I] Engine serialized. Total time: 00:00:24


I1210 14:50:28.277422 139999580046464 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:24


[12/10/2024-14:50:31] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


I1210 14:50:31.216939 139999580046464 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.


[12/10/2024-14:50:31] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


I1210 14:50:31.219235 139999580046464 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.


[12/10/2024-14:50:31] [TRT-LLM] [I] Set multi_block_mode to False.


I1210 14:50:31.220340 139999580046464 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.


[12/10/2024-14:50:31] [TRT-LLM] [I] Set paged_kv_cache to True.


I1210 14:50:31.220829 139999580046464 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.


[12/10/2024-14:50:31] [TRT-LLM] [I] Set tokens_per_block to 128.


I1210 14:50:31.221545 139999580046464 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.


[12/10/2024-14:50:31] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


W1210 14:50:31.222020 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.


[12/10/2024-14:50:31] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



W1210 14:50:31.222528 139999580046464 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 



[12/10/2024-14:51:07] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 433068, GPU 1764 (MiB)
[12/10/2024-14:51:07] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


I1210 14:51:07.784151 139999580046464 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.


[12/10/2024-14:51:07] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[12/10/2024-14:51:07] [TRT-LLM] [I] Set use_custom_all_reduce to True.


I1210 14:51:07.786616 139999580046464 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/10/2024-14:51:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:51:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:51:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:51:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 14:51:08.321146 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.322551 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.323274 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.323924 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.324608 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.325243 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.325892 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.327007 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.327620 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.328361 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.329011 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.329670 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.330280 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.330902 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.331507 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.332121 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.332759 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.333380 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.333992 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.334624 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.336239 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.336885 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.337494 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.338116 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.338730 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.339352 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.339977 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.340584 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.341235 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.341819 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.342453 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.343135 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.343737 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.344348 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.344969 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.345816 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.346443 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.347060 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.347657 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.348290 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.348900 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.351215 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.351829 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:51:08.352481 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:51:08] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[12/10/2024-14:51:08] [TRT] [W] Unused Input: position_ids
[12/10/2024-14:51:08] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-14:51:08] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-14:51:18] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-14:51:18] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/10/2024-14:51:34] [TRT] [I] Total Host Persistent Memory: 305344
[12/10/2024-14:51:34] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-14:51:34] [TRT] [I] Total Scratch Memory: 67117056
[12/10/2024-14:51:34] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1619 steps to complete.
[12/10/2024-14:51:34] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 83.1084ms to assign 17 blocks to 1619 nodes requiring 251672576 bytes.
[12/10/2024-14:51:34] [TRT

I1210 14:51:39.475409 139999580046464 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:31


[12/10/2024-14:51:39] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank2.engine...
[12/10/2024-14:52:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:20
[12/10/2024-14:52:00] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[12/10/2024-14:52:00] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[12/10/2024-14:52:00] [TRT-LLM] [I] Set multi_block_mode to False.
[12/10/2024-14:52:00] [TRT-LLM] [I] Set paged_kv_cache to True.
[12/10/2024-14:52:00] [TRT-LLM] [I] Set tokens_per_block to 128.
[12/10/2024-14:52:00] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[12/10/2024-14:52:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[12/10/2024-14:52:07] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
[12/10/2024-14:52:07] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 433070, GPU 1764 (MiB)
[12/10/2024-14:52:07] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[12/10/2024-14:52:07] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[12/10/2024-14:52:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:52:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:52:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/10/2024-14:52:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1210 14:52:07.990309 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.5.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.6.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.7.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.8.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.9.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.10.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.11.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.12.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:07] [TRT-LLM] [I] Parameter transformer.layers.13.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.14.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.15.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.16.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.17.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.18.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.19.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.20.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.21.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.22.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.23.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.24.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.25.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.26.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.27.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.28.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.29.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.30.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.31.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.32.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.33.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.34.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.35.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.36.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.37.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.38.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.39.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.40.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.41.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.42.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.43.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.44.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.45.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.46.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.47.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.48.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.49.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.50.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.51.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.52.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.53.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.026561 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.54.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.030355 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.55.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.031123 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.56.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.031727 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.57.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.032334 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.58.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.032935 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.59.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.033537 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.60.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.034141 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.61.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.034725 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.62.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.035343 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.63.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.035930 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.64.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.036548 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.65.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.037152 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.66.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.038928 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.67.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.039566 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.68.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.040164 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.69.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.040759 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.70.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.041342 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.71.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.041944 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.72.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.042558 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.73.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.043127 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.74.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.043737 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.75.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.044358 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.76.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.045000 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.77.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.045611 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.78.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


I1210 14:52:08.046214 139999580046464 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.79.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network


[12/10/2024-14:52:08] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


I1210 14:52:08.046862 139999580046464 logger.py:92] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0


[12/10/2024-14:52:08] [TRT] [W] Unused Input: position_ids
[12/10/2024-14:52:08] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/10/2024-14:52:08] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/10/2024-14:52:18] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/10/2024-14:52:18] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/10/2024-14:52:33] [TRT] [I] Total Host Persistent Memory: 305280
[12/10/2024-14:52:33] [TRT] [I] Total Device Persistent Memory: 0
[12/10/2024-14:52:33] [TRT] [I] Total Scratch Memory: 67117056
[12/10/2024-14:52:33] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/10/2024-14:52:33] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 82.283ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/10/2024-14:52:33] [TRT]

I1210 14:52:49.608895 139999580046464 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:41


[12/10/2024-14:52:49] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank3.engine...


I1210 14:52:49.613179 139999580046464 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank3.engine...


[12/10/2024-14:53:13] [TRT-LLM] [I] Engine serialized. Total time: 00:00:23


I1210 14:53:13.333480 139999580046464 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:23
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[12/10/2024-14:53:22] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[12/10/2024-14:53:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[12/10/2024-14:53:34] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[12/10/2024-14:53:34] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[12/10/2024-14:53:34] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[12/10/2024-14:53:34] [TRT-LLM] [I] Set identity_plugin to None.
[12/10/2024-14:53:34

In [None]:
# Send a single prompt to the exported TensorRT-LLM engine.
trt_llm_exporter.forward(
    ["Comment about extended aortic replacement in acute type A dissection."],
    max_output_token=150,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
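
The `forward` call above sets `top_k=1`. When only one candidate survives top-k filtering, the sampler has nothing left to choose between, so the output is deterministic (greedy) regardless of the temperature. The sketch below illustrates this in plain Python. It is not how NeMo or TensorRT-LLM implement sampling; `sample_top_k` is a hypothetical helper and the logits are made-up values chosen for illustration.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample one token index from the top_k highest logits (illustrative helper)."""
    # Keep only the indices of the top_k largest logits.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Apply temperature scaling to the surviving candidates.
    scaled = [logits[i] / temperature for i in ranked]
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)

# top_k=1: every draw returns the arg-max token (index 1).
greedy = [sample_top_k(logits, top_k=1, temperature=1.0, rng=rng) for _ in range(5)]
print(greedy)  # [1, 1, 1, 1, 1]

# A higher temperature does not change the result while top_k stays at 1.
hot = [sample_top_k(logits, top_k=1, temperature=5.0, rng=rng) for _ in range(5)]
print(hot)  # [1, 1, 1, 1, 1]
```

Raising `top_k` above 1 lets the temperature influence which of the surviving candidates is drawn, which is useful when more varied answers are wanted.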

In [None]:
%%bash -s "$token"

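# Log in to Hugging Face so that AutoTokenizer can fetch the tokenizer by its pretrained model name.
# The access token is passed to this cell as the first positional argument.
HF_TOKEN="$1"
huggingface-cli login --token "$HF_TOKEN"

PRECISION=bf16-mixed
# Directory used as the Triton model repository (the TensorRT-LLM engines are written here).
MODEL_DIR="models/llama-3.1-nemotron/70B/trt_llm"
mkdir -p "$MODEL_DIR"
# Base model checkpoint and LoRA adapter checkpoint.
MODEL_CKPT="models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo"
LORA_CKPT="results/llama-3.1-nemotron/70B/$PRECISION/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# Deploy the base model together with the LoRA adapter on Triton,
# using tensor parallelism across 4 GPUs.
python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
    --nemo_checkpoint "$MODEL_CKPT" \
    --lora_ckpt "$LORA_CKPT" \
    --model_type llama \
    --triton_model_name llama3-finetuned \
    --triton_model_repository "$MODEL_DIR" \
    --num_gpus 4 \
    --tensor_parallelism_size 4 \
    --pipeline_parallelism_size 1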

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ucloud/.cache/huggingface/token
Login successful


`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
W1211 09:40:31.525615 140666415416448 logger.py:92] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage


[TensorRT-LLM] TensorRT-LLM version: 0.10.0


I1211 09:40:32.337549 140666415416448 deploy_triton.py:344] Logging level set to 20
I1211 09:40:32.337709 140666415416448 deploy_triton.py:345] Namespace(nemo_checkpoint='models/llama-3.1-nemotron/70B/nemo/Llama-3_1-Nemotron-70B-Instruct.nemo', ptuning_nemo_checkpoint=None, task_ids=None, model_type='llama', triton_model_name='llama3-finetuned', triton_model_version=1, triton_port=8000, triton_http_address='0.0.0.0', triton_request_timeout=60, triton_model_repository='models/llama-3.1-nemotron/70B/trt_llm', num_gpus=4, tensor_parallelism_size=4, pipeline_parallelism_size=1, dtype='bfloat16', max_input_len=256, max_output_len=256, max_batch_size=8, max_num_tokens=None, opt_num_tokens=None, max_prompt_embedding_table_size=None, no_paged_kv_cache=False, disable_remove_input_padding=False, use_parallel_embedding=False, multi_block_mode=False, enable_streaming=False, use_lora_plugin=None, lora_target_modules=None, max_lora_rank=64, lora_ckpt=['results/llama-3.1-nemotron/70B/bf16-mixed/check

Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully


saving weights: 100%|██████████| 481/481 [06:35<00:00,  1.22it/s]
I1211 09:50:17.085886 140666415416448 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 09:50:17.086187 140666415416448 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 09:50:17.086217 140666415416448 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.
I1211 09:50:17.086241 140666415416448 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.
I1211 09:50:17.086260 140666415416448 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.
W1211 09:50:17.086288 140666415416448 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
W1211 09:50:17.086

[12/11/2024-09:53:35] [TRT] [I] [MemUsageChange] Init CUDA: CPU +16, GPU +0, now: CPU 279691, GPU 528 (MiB)
[12/11/2024-09:53:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4312, GPU +1150, now: CPU 284139, GPU 1678 (MiB)
[12/11/2024-09:53:41] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.


I1211 09:53:41.346775 140666415416448 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 09:53:41.347060 140666415416448 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/11/2024-09:53:41] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:53:41] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:53:41] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:53:41] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1211 09:53:41.964361 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:53:41.964623 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:53:41.964730 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:53:41.964814 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:53:41.964894 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:53:41.964977 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter tran

[12/11/2024-09:53:41] [TRT] [W] Unused Input: position_ids
[12/11/2024-09:53:42] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/11/2024-09:53:42] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/11/2024-09:53:56] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/11/2024-09:53:56] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/11/2024-09:54:21] [TRT] [I] Total Host Persistent Memory: 305280
[12/11/2024-09:54:21] [TRT] [I] Total Device Persistent Memory: 0
[12/11/2024-09:54:21] [TRT] [I] Total Scratch Memory: 67117056
[12/11/2024-09:54:21] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/11/2024-09:54:21] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 83.9577ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/11/2024-09:54:21] [TRT

I1211 09:54:39.642799 140666415416448 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:57
I1211 09:54:39.649972 140666415416448 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank0.engine...
I1211 09:55:05.325882 140666415416448 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:25
I1211 09:55:08.996442 140666415416448 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 09:55:08.996592 140666415416448 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 09:55:08.996623 140666415416448 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.
I1211 09:55:08.996644 140666415416448 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.
I1211 09:55:08.996661 140666415416448 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.
W1211 09:55:08.996694 140666415416448 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_bat

[12/11/2024-09:55:36] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 289693, GPU 1764 (MiB)
[12/11/2024-09:55:36] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.


I1211 09:55:36.487796 140666415416448 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 09:55:36.488140 140666415416448 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/11/2024-09:55:36] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:55:36] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:55:36] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:55:36] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1211 09:55:36.970633 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:55:36.970934 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:55:36.971036 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:55:36.971118 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:55:36.971198 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:55:36.971277 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter tran

[12/11/2024-09:55:36] [TRT] [W] Unused Input: position_ids
[12/11/2024-09:55:37] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/11/2024-09:55:37] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/11/2024-09:55:48] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/11/2024-09:55:48] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/11/2024-09:56:06] [TRT] [I] Total Host Persistent Memory: 305280
[12/11/2024-09:56:06] [TRT] [I] Total Device Persistent Memory: 0
[12/11/2024-09:56:06] [TRT] [I] Total Scratch Memory: 67117056
[12/11/2024-09:56:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/11/2024-09:56:06] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 81.0358ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/11/2024-09:56:06] [TRT

I1211 09:56:23.044344 140666415416448 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:46
I1211 09:56:23.046135 140666415416448 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank1.engine...
I1211 09:56:45.491527 140666415416448 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:22
I1211 09:56:48.858254 140666415416448 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 09:56:48.858414 140666415416448 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 09:56:48.858443 140666415416448 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.
I1211 09:56:48.858466 140666415416448 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.
I1211 09:56:48.858484 140666415416448 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.
W1211 09:56:48.858505 140666415416448 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_bat

[12/11/2024-09:57:08] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 289695, GPU 1764 (MiB)
[12/11/2024-09:57:08] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.


I1211 09:57:08.733625 140666415416448 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 09:57:08.735268 140666415416448 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/11/2024-09:57:08] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:57:08] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:57:08] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:57:08] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1211 09:57:09.442889 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:57:09.443202 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:57:09.443302 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:57:09.443384 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:57:09.443461 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:57:09.443536 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter tran

[12/11/2024-09:57:09] [TRT] [W] Unused Input: position_ids
[12/11/2024-09:57:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/11/2024-09:57:09] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/11/2024-09:57:22] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/11/2024-09:57:22] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/11/2024-09:57:41] [TRT] [I] Total Host Persistent Memory: 305280
[12/11/2024-09:57:41] [TRT] [I] Total Device Persistent Memory: 0
[12/11/2024-09:57:41] [TRT] [I] Total Scratch Memory: 67117056
[12/11/2024-09:57:41] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/11/2024-09:57:41] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 79.2674ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/11/2024-09:57:41] [TRT

I1211 09:57:58.303178 140666415416448 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:48
I1211 09:57:58.305532 140666415416448 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank2.engine...
I1211 09:58:20.362878 140666415416448 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:22
I1211 09:58:23.286481 140666415416448 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 09:58:23.286608 140666415416448 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 09:58:23.286639 140666415416448 logger.py:92] [TRT-LLM] [I] Set multi_block_mode to False.
I1211 09:58:23.286660 140666415416448 logger.py:92] [TRT-LLM] [I] Set paged_kv_cache to True.
I1211 09:58:23.286679 140666415416448 logger.py:92] [TRT-LLM] [I] Set tokens_per_block to 128.
W1211 09:58:23.286702 140666415416448 logger.py:92] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_bat

[12/11/2024-09:58:30] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 289696, GPU 1764 (MiB)
[12/11/2024-09:58:30] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.


I1211 09:58:30.041434 140666415416448 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 09:58:30.041722 140666415416448 logger.py:92] [TRT-LLM] [I] Set use_custom_all_reduce to True.


[12/11/2024-09:58:30] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:58:30] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:58:30] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[12/11/2024-09:58:30] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/post_laye

I1211 09:58:30.526968 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.0.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:58:30.527212 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.1.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:58:30.527307 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.2.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:58:30.527388 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.3.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:58:30.527463 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter transformer.layers.4.attention.embed_positions (1, 131072, 128) float32 was not materialized to TRT network
I1211 09:58:30.527534 140666415416448 logger.py:92] [TRT-LLM] [I] Parameter tran

[12/11/2024-09:58:30] [TRT] [W] Unused Input: position_ids
[12/11/2024-09:58:30] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[12/11/2024-09:58:30] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[12/11/2024-09:58:41] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[12/11/2024-09:58:41] [TRT] [I] Detected 15 inputs and 5 output network tensors.
[12/11/2024-09:59:02] [TRT] [I] Total Host Persistent Memory: 305280
[12/11/2024-09:59:02] [TRT] [I] Total Device Persistent Memory: 0
[12/11/2024-09:59:02] [TRT] [I] Total Scratch Memory: 67117056
[12/11/2024-09:59:02] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1617 steps to complete.
[12/11/2024-09:59:02] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 80.8076ms to assign 17 blocks to 1617 nodes requiring 251672576 bytes.
[12/11/2024-09:59:02] [TRT

I1211 09:59:20.141936 140666415416448 logger.py:92] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:49
I1211 09:59:20.145720 140666415416448 logger.py:92] [TRT-LLM] [I] Serializing engine to models/llama-3.1-nemotron/70B/trt_llm/rank3.engine...
I1211 09:59:42.510133 140666415416448 logger.py:92] [TRT-LLM] [I] Engine serialized. Total time: 00:00:22
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
W1211 10:00:02.130824 139674090443904 logger.py:92] [TRT-LLM] [W]

[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
[TensorRT-LLM] TensorRT-LLM version: 0.10.0
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully
Loaded mpi lib /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so successfully


I1211 10:00:21.108588 140691274339456 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.
I1211 10:00:21.108868 140691274339456 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 10:00:21.108896 140691274339456 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 10:00:21.108930 140691274339456 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
I1211 10:00:21.108949 140691274339456 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.
I1211 10:00:21.108966 140691274339456 logger.py:92] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
I1211 10:00:21.108981 140691274339456 logger.py:92] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
I1211 10:00:21.108997 140691274339456 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 10:00:21.109013 140691274339456 logger.py:92] [TRT-LLM] [I] Set lookup_plugin to None.
I1211 10:00:21.109030 140691274339456 logger.py:92] [TRT-LLM] [I] Set lora_plugin to None.
I12

[12/11/2024-10:00:21] [TRT] [I] Loaded engine size: 35234 MiB


I1211 10:00:24.998260 140483455378560 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.
I1211 10:00:24.998497 140483455378560 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 10:00:24.998527 140483455378560 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 10:00:24.998553 140483455378560 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
I1211 10:00:24.998573 140483455378560 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.
I1211 10:00:24.998593 140483455378560 logger.py:92] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
I1211 10:00:24.998612 140483455378560 logger.py:92] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
I1211 10:00:24.998631 140483455378560 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 10:00:24.998651 140483455378560 logger.py:92] [TRT-LLM] [I] Set lookup_plugin to None.
I1211 10:00:24.998671 140483455378560 logger.py:92] [TRT-LLM] [I] Set lora_plugin to None.
I12

[12/11/2024-10:00:25] [TRT] [I] Loaded engine size: 35234 MiB


I1211 10:00:28.351375 139674090443904 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.
I1211 10:00:28.351610 139674090443904 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 10:00:28.351641 139674090443904 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 10:00:28.351668 139674090443904 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
I1211 10:00:28.351686 139674090443904 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.
I1211 10:00:28.351706 139674090443904 logger.py:92] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
I1211 10:00:28.351723 139674090443904 logger.py:92] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
I1211 10:00:28.351742 139674090443904 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 10:00:28.351760 139674090443904 logger.py:92] [TRT-LLM] [I] Set lookup_plugin to None.
I1211 10:00:28.351781 139674090443904 logger.py:92] [TRT-LLM] [I] Set lora_plugin to None.
I12

[12/11/2024-10:00:28] [TRT] [I] Loaded engine size: 35234 MiB


I1211 10:00:29.169286 140543358837888 logger.py:92] [TRT-LLM] [I] Set bert_attention_plugin to float16.
I1211 10:00:29.169537 140543358837888 logger.py:92] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
I1211 10:00:29.169566 140543358837888 logger.py:92] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
I1211 10:00:29.169593 140543358837888 logger.py:92] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
I1211 10:00:29.169615 140543358837888 logger.py:92] [TRT-LLM] [I] Set identity_plugin to None.
I1211 10:00:29.169633 140543358837888 logger.py:92] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
I1211 10:00:29.169651 140543358837888 logger.py:92] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
I1211 10:00:29.169668 140543358837888 logger.py:92] [TRT-LLM] [I] Set nccl_plugin to bfloat16.
I1211 10:00:29.169687 140543358837888 logger.py:92] [TRT-LLM] [I] Set lookup_plugin to None.
I1211 10:00:29.169706 140543358837888 logger.py:92] [TRT-LLM] [I] Set lora_plugin to None.
I12

[12/11/2024-10:00:29] [TRT] [I] Loaded engine size: 35234 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[12/11/2024-10:00:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)
[12/11/2024-10:00:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)
[12/11/2024-10:00:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)
[12/11/2024-10:00:38] [TRT] [I] [M

W1211 10:00:39.117069 139674090443904 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.


[12/11/2024-10:00:39] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)


W1211 10:00:39.123890 140483455378560 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
I1211 10:00:39.129375 140483455378560 logger.py:92] [TRT-LLM] [I] Load engine takes: 36.43295121192932 sec
I1211 10:00:39.129376 139674090443904 logger.py:92] [TRT-LLM] [I] Load engine takes: 36.56637644767761 sec


[12/11/2024-10:00:39] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)
[12/11/2024-10:00:39] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 35226 (MiB)


W1211 10:00:39.154350 140543358837888 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
W1211 10:00:39.154768 140691274339456 logger.py:92] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime.
I1211 10:00:39.155468 140543358837888 logger.py:92] [TRT-LLM] [I] Load engine takes: 36.52978515625 sec
I1211 10:00:39.156029 140691274339456 logger.py:92] [TRT-LLM] [I] Load engine takes: 36.460482120513916 sec
I1211 10:00:43.900752 140666415416448 deploy_triton.py:377] Triton deploy function will be called.
I1211 10:00:43.905046 140666415416448 deploy_triton.py:384] Model serving on Triton is will be started.
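
The `Loaded engine size: 35234 MiB` lines in the log above can be checked against the model-size formula from the start of this notebook. The following is a minimal sketch, not part of the original workflow. It assumes a nominal parameter count of 70 billion (taken from the model name) and 16-bit weights (the log sets `gemm_plugin` to `bfloat16`); `weight_bytes` is a hypothetical helper name. The two figures should agree only roughly, since the exact parameter count and any per-rank overhead are not known from the log.

```python
# Rough sanity check of the per-rank engine sizes reported in the log.
MIB_PER_GIB = 1024
BYTES_PER_GIB = 1024 ** 3

engine_mib_per_rank = 35234  # "Loaded engine size" printed by each rank
num_ranks = 4                # ranks 0..3 appear in the log


def weight_bytes(num_params: float, bits: int) -> float:
    """Model-size formula from this notebook: P * 4 bytes / (32 / Q)."""
    return num_params * 4 / (32 / bits)


total_engine_mib = engine_mib_per_rank * num_ranks
total_engine_gib = total_engine_mib / MIB_PER_GIB

# Assumption: nominal 70e9 parameters stored in 16 bits each.
estimate_gib = weight_bytes(70e9, 16) / BYTES_PER_GIB

print(f"Engines on disk/GPU: {total_engine_mib} MiB = {total_engine_gib:.1f} GiB")
print(f"Formula estimate:    {estimate_gib:.1f} GiB")
```

With these inputs the four engines total 140936 MiB (about 137.6 GiB), while the formula gives about 130.4 GiB for the weights alone, so the estimate is a lower bound rather than an exact match.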


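Once the log reports that the Triton deploy function has been called, open a terminal and query the deployed model. The `-mn` flag is the model name given at deployment, `-p` is the prompt, and `-mol` is the maximum output length: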

```bash
QUERY="Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?"

python /opt/NeMo/scripts/deploy/nlp/query.py \
    -mn llama3-finetuned \
    -p "$QUERY" \
    -mol 350
```
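
To send several questions in a row, the same command can be driven from Python. This is a minimal sketch that assumes only the script path and the three flags shown above; `build_query_command` and `run_query` are hypothetical helper names, not part of NeMo.

```python
import subprocess

# Script path taken from the command above.
QUERY_SCRIPT = "/opt/NeMo/scripts/deploy/nlp/query.py"


def build_query_command(prompt, model_name="llama3-finetuned", max_output_len=350):
    """Return the argument list for one query.py call (hypothetical helper)."""
    return [
        "python", QUERY_SCRIPT,
        "-mn", model_name,
        "-p", prompt,
        "-mol", str(max_output_len),
    ]


def run_query(prompt, **kwargs):
    """Run query.py for one prompt and return whatever it prints to stdout."""
    result = subprocess.run(
        build_query_command(prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return result.stdout
```

Calling `run_query("Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?")` reproduces the shell command above. Passing the prompt as a separate list element, rather than through a shell string, avoids quoting problems when a question contains quotes or other special characters.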