<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>Triton Inference Server (TRT-LLM) v25.02</strong> and machine type <code>uc1-l40-4</code> on UCloud.
</p>


# 01 - Deploying a Triton Inference Server for LLMs

## 🚀 Introduction

In this end‑to‑end tutorial, we’ll take a pre-trained LLM and:
1. Authenticate and download weights from Hugging Face.
2. Convert the model into a **TensorRT-LLM** engine for high‑performance inference.
3. Test the optimized engine locally.
4. Package and deploy the engine on NVIDIA Triton Inference Server with inflight batching.
5. Send sample requests to Triton and validate responses.
6. Profile performance using Triton’s `genai-perf` tool.

> 🛠️ **Important Environment Note:**
> This notebook is designed to run on **UCloud**, using the **NVIDIA Triton Inference Server (TRT-LLM) app, version `v25.02`**.
> If you encounter unexpected errors, **double-check you are using the correct app version**, and that your session includes **GPU resources**.

## 🛠️ Step 1: Hugging Face Authentication

The following code creates a secure input widget for your Hugging Face token, which is required to authenticate and download the [Llama 3.1 Nemotron Nano 4B v1.1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1) model from the Hugging Face Hub.

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value
hf_model="nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
hf_model_path="models/llama-3.1-nemotron-nano/4B/hf"

snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)
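
Before converting, it can help to confirm that the snapshot actually landed on disk. The sketch below is a sanity check that is not part of the original workflow: the helper name `summarize_snapshot` is hypothetical, and it only assumes that a Hugging Face snapshot contains a `config.json` plus one or more `*.safetensors` weight files.

```python
from pathlib import Path


def summarize_snapshot(model_dir):
    """Return basic facts about a downloaded Hugging Face snapshot directory."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "has_config": (root / "config.json").is_file(),
        "num_weight_files": sum(p.suffix == ".safetensors" for p in files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


# Path taken from the download cell above.
print(summarize_snapshot("models/llama-3.1-nemotron-nano/4B/hf"))
```

A missing `config.json` or a weight-file count of zero usually points to an interrupted download or a rejected token.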

## 🛠️ Step 2: Convert the Model Checkpoint to TensorRT-LLM Format

The cells below first report the size of the downloaded Hugging Face checkpoint, then create the output directory and convert the checkpoint to the **TensorRT-LLM** checkpoint format, the intermediate step before building an optimized engine.

In [None]:
%%bash

HF_MODEL="models/llama-3.1-nemotron-nano/4B/hf"

du -sh "$HF_MODEL"

The table below compares two deployment strategies for a 4B-parameter model on NVIDIA L40 GPUs: running one replica per GPU (TP = 1, PP = 1) versus sharding a single replica across two GPUs (TP = 2). It highlights the trade-offs in interconnect usage, TensorRT-LLM support, and tuning complexity.

| Criterion                            | TP = 1, PP = 1 (one GPU per replica)                                                                                        | TP = 2 (shard across both GPUs)                                                                                                       |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| **L40 interconnect**                 | No cross-GPU traffic. The card has **no NVLink; only PCIe Gen4 ×16 (64 GB/s)**, so you stay off the slow bus. ([NVIDIA][1]) | Every attention and FFN layer does an all-reduce over PCIe, which can cancel out most of the speed-up for a model this small.         |
| **Software support in TensorRT-LLM** | Fully supported; Triton can run many single-GPU engines via *orchestrator* or *leader* mode. ([NVIDIA Docs][2])             | Works, but gives no concurrency benefit and complicates launch scripts.                                                               |
| **Tuning knobs**                     | Only one engine build: `--tp_size 1`.  Simpler GPU placement and easier scaling tests.                                      | Must align hidden-size and #heads to TP; increased build time; more configs to tune.                                                  |

[1]: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf "NVIDIA L40"
[2]: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/docs/llama_multi_instance.html "Running Multiple Instances of the LLaMa Model — NVIDIA Triton Inference Server"
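
The "must align hidden-size and #heads to TP" row can be checked before committing to a build. This is a minimal sketch, assuming the standard Hugging Face Llama config keys (`hidden_size`, `num_attention_heads`, `num_key_value_heads`) and a simplified rule that each value must divide evenly by the TP size. The function name and the example values are illustrative and are not read from the actual model.

```python
def tp_alignment_issues(config, tp_size):
    """List config fields that do not divide evenly by the tensor-parallel size."""
    issues = []
    for key in ("hidden_size", "num_attention_heads", "num_key_value_heads"):
        value = config.get(key)
        if value is not None and value % tp_size != 0:
            issues.append(f"{key}={value} is not divisible by tp_size={tp_size}")
    return issues


# Illustrative values only (not taken from the real model's config.json).
example = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 8}
print(tp_alignment_issues(example, 1))  # TP = 1 never needs alignment
print(tp_alignment_issues(example, 3))  # an odd TP size fails every check here
```

In practice you would load the dictionary from the `config.json` in the Hugging Face model directory with `json.load`.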


In [None]:
%%bash

HF_MODEL="models/llama-3.1-nemotron-nano/4B/hf"
TRT_CKPT="models/llama-3.1-nemotron-nano/4B/trt_ckpt/tp1"
mkdir -p "$TRT_CKPT"

python -W ignore ~/llama/convert_checkpoint.py \
      --model_dir "$HF_MODEL" \
      --output_dir "$TRT_CKPT" \
      --dtype bfloat16 \
      --tp_size 1 \
      --load_by_shard

## 🛠️ Step 3: Build TensorRT-LLM Engine

The following script constructs the **TensorRT-LLM** engine from the previously converted model checkpoint. This optimization enhances the model's inference performance by leveraging TensorRT's efficient execution capabilities.

**Checkpoint vs. Engine Directory Contents**

| Directory                                        | Files                                | Purpose                                                                                                                                                                                                                                                                   |
| ------------------------------------------------ | ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `models/llama-3.1-nemotron-nano/4B/trt_ckpt/tp1` | `config.json`<br>`rank0.safetensors` | **Intermediate checkpoint.**<br>• `rank0.safetensors` contains all BF16 weights for the single tensor-parallel rank.<br>• `config.json` records hidden size, number of layers/heads, tokenizer info, and that `tp_size = 1` (so the builder knows no sharding is needed). |
| `models/llama-3.1-nemotron-nano/4B/trt_llm/tp1`  | `config.json`<br>`rank0.engine`      | **Final TensorRT-LLM engine.**<br>• `rank0.engine` is the serialized TensorRT engine for the whole model.<br>• Includes the GEMM, activation, and rotary-positional-embedding kernels and the paged KV cache support you enabled (`--kv_cache_type paged`).          |



In [None]:
%%bash

TRT_CKPT="models/llama-3.1-nemotron-nano/4B/trt_ckpt/tp1"
TRT_ENGINE="models/llama-3.1-nemotron-nano/4B/trt_llm/tp1"

trtllm-build --checkpoint_dir "$TRT_CKPT" \
      --output_dir "$TRT_ENGINE" \
      --gemm_plugin auto \
      --max_num_tokens 16384 \
      --max_input_len 4096  \
      --max_seq_len 8192 \
      --kv_cache_type paged \
      --workers 4
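
Once the build finishes, the directory layout described in the table above can be verified with a few lines of Python. This is a rough sketch: the helper `missing_rank_files` and its `world_size` parameter are hypothetical names, and it assumes one `rank<N>` file per rank plus a `config.json`, as listed in the table.

```python
from pathlib import Path


def missing_rank_files(directory, suffix, world_size=1):
    """Return the expected per-rank files (plus config.json) that are absent."""
    root = Path(directory)
    expected = ["config.json"] + [f"rank{r}{suffix}" for r in range(world_size)]
    return [name for name in expected if not (root / name).is_file()]


# Paths from the cells above; an empty list means everything expected is present.
base = "models/llama-3.1-nemotron-nano/4B"
print(missing_rank_files(f"{base}/trt_ckpt/tp1", ".safetensors"))
print(missing_rank_files(f"{base}/trt_llm/tp1", ".engine"))
```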

**Local Testing of TensorRT-Optimized model**

The following cell runs a local test of the built engine. It calls `run.py` with the engine directory, the Hugging Face tokenizer directory, and a sample prompt, so you can inspect the generated output before deploying.

If you get an error in the cell below, update the Transformers library:
```bash
pip install -U transformers
```

In [None]:
%%bash

HF_MODEL="models/llama-3.1-nemotron-nano/4B/hf"
TRT_ENGINE="models/llama-3.1-nemotron-nano/4B/trt_llm/tp1"

PROMPT='{"role":"system","content":"detailed thinking off"}
{"role":"user","content":"What are the typical symptoms of an infected appendix?"}
{"role":"assistant","content":"<think>\n</think>"}'

python ~/run.py \
    --engine_dir     "$TRT_ENGINE" \
    --tokenizer_dir  "$HF_MODEL" \
    --input_text     "$PROMPT" \
    --max_output_len 1000
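
The prompt passed to `run.py` above is three JSON objects, one per line. If you want to generate that string rather than type it by hand, a sketch along the following lines should work. The helper name `build_prompt` is hypothetical, and compact separators are used only to reproduce the spacing of the hand-written prompt.

```python
import json


def build_prompt(messages):
    """Serialize chat messages as one compact JSON object per line."""
    return "\n".join(json.dumps(m, separators=(",", ":")) for m in messages)


messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "What are the typical symptoms of an infected appendix?"},
    {"role": "assistant", "content": "<think>\n</think>"},
]
print(build_prompt(messages))
```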

## 🛠️ Step 4: Deploying Triton with Inflight Batching

The following scripts set up and configure Triton Inference Server for the model engine using [**inflight batching**](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/batcher.html) and [**orchestrator mode**](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/docs/llama_multi_instance.html#orchestrator-mode). Inflight batching lets new requests join a running batch as soon as slots free up, and orchestrator mode is used here to run one model replica per GPU.

This deployment uses the C++ TensorRT-LLM backend with the executor API. The model repository contains the following models:

| Model            | Description                                                                                           |
|------------------|-------------------------------------------------------------------------------------------------------|
| **ensemble**     | This model is used to chain the preprocessing, tensorrt_llm, and postprocessing models together.      |
| **preprocessing**| This model is used for tokenizing, meaning the conversion from prompts (string) to input_ids (ints).  |
| **tensorrt_llm** | This model is a wrapper of your TensorRT-LLM model and is used for inferencing.                       |
| **postprocessing**| This model is used for de-tokenizing, meaning the conversion from output_ids (ints) to outputs (string). |
| **tensorrt_llm_bls** | This model can also be used to chain the preprocessing, tensorrt_llm, and postprocessing models together. |

**Memory math: 4B model on a 48 GB L40**

| Component (per replica)                   | Footprint (BF16) | Notes                                           |
| ----------------------------------------- | ---------------- | ----------------------------------------------- |
| Weights + constant buffers                | **≈ 8.0 GB**     | 4B params × 2 bytes × (1 + small overhead)      |
| CUDA Graph & scratch                      | ≈ 1 GB           | TensorRT kernels, persistent workspace          |
| **KV cache** (*paged*, 16384 tokens cap)  | **≈ 4.3 GB**     | 32 layers × 4096 dim × 2 bytes × 16384 tokens   |
| **Total per replica**                     | **≈ 13 GB**      | Sum of the rows above, rounded                  |

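The arithmetic in the Notes column can be reproduced directly. The sketch below simply evaluates those formulas in decimal gigabytes. The layer count, hidden dimension, and 1 GB scratch figure are the table's own estimates, not values read from the engine.

```python
GB = 1e9  # decimal gigabytes, matching the table

params = 4e9          # 4B parameters
bytes_per_value = 2   # BF16
layers, hidden_dim, max_tokens = 32, 4096, 16384  # figures from the table's notes

weights_gb = params * bytes_per_value / GB
scratch_gb = 1.0      # table's estimate for CUDA graph and workspace
kv_cache_gb = layers * hidden_dim * bytes_per_value * max_tokens / GB
total_gb = weights_gb + scratch_gb + kv_cache_gb

print(f"weights  = {weights_gb:.1f} GB")
print(f"KV cache = {kv_cache_gb:.1f} GB")
print(f"total    = {total_gb:.1f} GB of 48 GB")
```

Changing `max_tokens` or `bytes_per_value` shows how quickly the KV cache term grows relative to the fixed weight footprint.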

In [None]:
%%bash

TRITON_REPO="models/llama-3.1-nemotron-nano/4B/triton"
mkdir -p "$TRITON_REPO"

# Copy the inflight-batching model templates into the Triton model repository
cp -r ~/all_models/inflight_batcher_llm/* "$TRITON_REPO"

ls "$TRITON_REPO"

In [None]:
%%bash

# Shared settings for all models in the Triton repository
ENGINE_DIR="models/llama-3.1-nemotron-nano/4B/trt_llm/tp1"
TOKENIZER_DIR="models/llama-3.1-nemotron-nano/4B/hf"
MODEL_FOLDER="models/llama-3.1-nemotron-nano/4B/triton"
TRITON_MAX_BATCH_SIZE=16 # Reduce this value if you encounter CUDA out-of-memory (OOM) errors

# How many parallel workers should Triton spin up for each model?
# For preprocessing / postprocessing (CPU), INSTANCE_COUNT controls how many tokenizers / detokenizers run in parallel.
# Raising it improves throughput until the CPU cores or the GPU generation stage saturate.
INSTANCE_COUNT=16 # Requests that can be tokenized and detokenized in parallel
MAX_QUEUE_DELAY_US=1000 # In microseconds (1000 µs = 1 ms): a short wait that helps collect more requests into a batch; set to zero for the lowest possible latency
MAX_QUEUE_SIZE=512 # Allow some queuing under bursty load, or set to zero for pure online inferencing
FILL_TEMPLATE_SCRIPT="$HOME/tools/fill_template.py"
LOGITS_DATA_TYPE="TYPE_FP32"
DECOUPLED_MODE=false # Non-streaming responses; set to true if clients need token streaming

python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATA_TYPE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_US},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},logits_datatype:${LOGITS_DATA_TYPE},encoder_input_features_data_type:TYPE_FP16
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATA_TYPE}


# For starting one model replica per available GPU
GPU_NUM=4
GPU_IDS="0;1;2;3"
    
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt gpu_device_ids:${GPU_IDS}

sed -i '/instance_group *\[/,/\]/d' "$MODEL_FOLDER/tensorrt_llm/config.pbtxt"

cat <<EOF >> "$MODEL_FOLDER/tensorrt_llm/config.pbtxt"
instance_group [
  {
    count: ${GPU_NUM}
    kind: KIND_CPU
  }
]
EOF
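
After filling the templates, it can be useful to see which placeholders are still unset. The sketch below assumes the `config.pbtxt` templates mark parameters with `${name}` placeholders. The helper name and the sample text are illustrative. Leftover placeholders are not necessarily errors, since optional parameters may be left unset on purpose, but the list helps spot a misspelled key in a `fill_template.py` call.

```python
import re

# Matches ${identifier} placeholders (assumed template syntax).
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unfilled_placeholders(config_text):
    """Return the sorted, de-duplicated names of ${...} placeholders still present."""
    return sorted(set(PLACEHOLDER.findall(config_text)))


# Illustrative fragment, not copied from a real config.pbtxt.
sample = 'max_batch_size: 16\nparameters { key: "engine_dir" value: { string_value: "${engine_dir}" } }'
print(unfilled_placeholders(sample))
```

To inspect a real file, pass the result of `Path(".../tensorrt_llm/config.pbtxt").read_text()` to the function.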

In [None]:
%%bash

# The following command starts an OpenAI-Compatible Frontend for Triton Inference Server
# This is used to run predictions in Label Studio

MODEL_FOLDER="models/llama-3.1-nemotron-nano/4B/triton"
TOKENIZER_DIR="models/llama-3.1-nemotron-nano/4B/hf"

stop_tritonserver

# Run multiple instances of the model (one replica per GPU):
# start in multi-process (MPI) orchestrator mode
export TRTLLM_ORCHESTRATOR=1 

nohup python3 -W ignore /opt/tritonserver/python/openai/openai_frontend/main.py --model-repository $MODEL_FOLDER --backend tensorrtllm --tokenizer $TOKENIZER_DIR --openai-port 9000 --enable-kserve-frontends &> /work/triton-server-log.txt &
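# Alternative: OpenAI endpoint only, without the KServe HTTP/gRPC frontends
# (note that genai-perf in Step 5 connects to the KServe gRPC port 8001):
# nohup python3 -W ignore /opt/tritonserver/python/openai/openai_frontend/main.py --model-repository $MODEL_FOLDER --backend tensorrtllm --tokenizer $TOKENIZER_DIR --openai-port 9000 &> /work/triton-server-log.txt &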

The following Bash cell waits until the server is up by polling the log file for Uvicorn's startup message.

In [None]:
%%bash

LOG_FILE="/work/triton-server-log.txt"

# Poll the log file until the OpenAI frontend (Uvicorn) reports that it is running
wait_for_triton_start() {
    echo "Waiting for Triton Inference Server to start..."
    while true; do
        # 2>/dev/null: the log file may not exist yet during the first iterations
        if grep -q 'Uvicorn running' "$LOG_FILE" 2>/dev/null; then
            echo "✅ Uvicorn has started successfully."
            break
        else
            echo "❌ Uvicorn has NOT started yet. Retrying in 5 seconds..."
            sleep 5
        fi
    done
}

# Wait for Triton to start
wait_for_triton_start
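
The log-polling loop above never gives up. As an alternative, here is a sketch of a bounded wait written in Python. The names `http_ok` and `wait_until` are hypothetical, and the commented example assumes the OpenAI-compatible frontend answers `GET /v1/models` on port 9000 once it is ready, so adjust the URL if your setup differs.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, timeout_s=600.0, interval_s=5.0):
    """Call `probe()` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Assumed readiness endpoint; requires the server from the previous cell.
# ready = wait_until(lambda: http_ok("http://localhost:9000/v1/models"))
```

Separating the probe from the retry loop keeps the timeout logic reusable for other checks, such as grepping the log file.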

To test the model, run the following code in a terminal window:
```bash
MODEL=ensemble

TEXT="An appendectomy is a surgical procedure to remove the appendix when it's infected. Symptoms typically include abdominal pain starting around the navel and shifting to the lower right abdomen, nausea, vomiting, and fever. Diagnosis is often made based on symptoms, physical examination, and imaging like ultrasound or CT scan. The procedure can be performed using open surgery or minimally invasive laparoscopic techniques. Recovery usually takes a few weeks, with minimal complications."

curl -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "'"${MODEL}"'",
        "messages": [
          {"role": "system", "content": "detailed thinking off\n"},
          {"role": "assistant", "content":"<think>\n</think>\n"},
          {"role": "user",
           "content": "Given this medical text:\n\n'"${TEXT}"' \n\nGenerate direct, succinct, and unique medical questions covering symptoms, diagnosis, treatments, or patient management strategies. Output ONLY 5 questions clearly separated by new lines."
           }
        ],
        "temperature": 0.2,
        "max_tokens": 500
}' | jq -r '.choices[0].message.content'

```
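
The same request can be sent from Python using only the standard library. This is a sketch mirroring the curl call above: the helper names `build_chat_payload` and `chat` are hypothetical, and the URL and response shape are taken from the curl and `jq` expressions in the example.

```python
import json
import urllib.request


def build_chat_payload(model, text, temperature=0.2, max_tokens=500):
    """Build the same chat-completions request body as the curl example."""
    instruction = (
        f"Given this medical text:\n\n{text} \n\n"
        "Generate direct, succinct, and unique medical questions covering symptoms, "
        "diagnosis, treatments, or patient management strategies. "
        "Output ONLY 5 questions clearly separated by new lines."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "detailed thinking off\n"},
            {"role": "assistant", "content": "<think>\n</think>\n"},
            {"role": "user", "content": instruction},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(payload, url="http://localhost:9000/v1/chat/completions"):
    """POST the payload and return the first choice's message content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires the server from Step 4 to be running):
# print(chat(build_chat_payload("ensemble", "Your medical text here.")))
```

Building the body with `json.dumps` avoids the shell-quoting pitfalls of embedding `${TEXT}` inside a single-quoted JSON string.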

## 🛠️ Step 5: Performance Profiling with `genai-perf`

To evaluate the performance of the deployed Triton Inference Server, execute the following Bash commands in a terminal session within Jupyter. This script uses [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html) to profile the model, generating performance metrics and visualizations.

- Each request simulates:
    - An **input prompt** of around **200 tokens** (standard deviation of 10 tokens)
    - Exactly **200 output tokens**.
- Each run draws from a pool of **1000 synthetic prompts** and uses a **25-second** measurement interval (`--measurement-interval 25000`, in milliseconds).
- **Concurrency** sweeps from **1** to **1024** requests *in flight* at the same time; the first run at concurrency 1 is a warm-up.
- Each request totals about **400 tokens** (input plus output).

In [None]:
%%bash

# Run the following code in a terminal window for a better output format

TOKENIZER_DIR="models/llama-3.1-nemotron-nano/4B/hf"

export INPUT_PROMPTS=1000
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export MODEL=ensemble

# Running multiple GenAI-Perf calls (the first for warm-up)
for concurrency in 1 1 2 4 8 16 32 48 64 128 256 512 1024; do
    genai-perf profile -m $MODEL \
        --service-kind triton \
        --backend tensorrtllm \
        --num-prompts $INPUT_PROMPTS \
        --random-seed 1234 \
        --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
        --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
        --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
        --output-tokens-stddev 0 \
        --output-tokens-mean-deterministic \
        --tokenizer $TOKENIZER_DIR \
        --concurrency $concurrency \
        --measurement-interval 25000 \
        --profile-export-file model_profile_${concurrency}.json \
        --url "localhost:8001" \
        --generate-plots
done

In [None]:
import json
import os

import plotly.graph_objects as go

# Settings
concurrency_levels = [1, 2, 4, 8, 16, 32, 48, 64, 128, 256, 512, 1024]  # Concurrency values you tested
base_path = "artifacts/ensemble-triton-tensorrtllm-concurrency{}"

# Storage for points
x = []  # Time to first token (s)
y = []  # Total system throughput (tokens/sec)
labels = []  # Concurrency values

# Read each JSON file
for concurrency in concurrency_levels:
    folder = base_path.format(concurrency)
    json_path = os.path.join(folder, f"model_profile_{concurrency}_genai_perf.json")
    
    # Read JSON file
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Extract values
    time_to_first_token_ms = data['time_to_first_token']['avg']  # milliseconds
    output_token_throughput = data['output_token_throughput']['avg']  # tokens per second

    # Append
    x.append(time_to_first_token_ms / 1000)  # ms → seconds
    y.append(output_token_throughput)
    labels.append(concurrency)


# Now plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x, y=y,
    mode='lines+markers+text',
    text=labels,
    textposition="middle center",
    line=dict(width=2),
))

fig.update_layout(
    xaxis_title="Time to first token (s)",
    yaxis_title="Throughput (tokens/s)",
    plot_bgcolor="rgba(240, 248, 255, 1)",
    font=dict(size=14),
    width=1000,
    height=800,
)

fig.show()
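
One way to read the resulting curve is to pick the concurrency with the highest throughput that still meets a latency target. The sketch below illustrates that selection rule. The function name, the 2-second budget, and the demo numbers are made up for illustration and are not measured results.

```python
def best_under_budget(points, ttft_budget_s):
    """Pick the highest-throughput point whose TTFT is within the budget.

    `points` is a list of (concurrency, ttft_seconds, tokens_per_second) tuples.
    Returns None when no point meets the budget.
    """
    eligible = [p for p in points if p[1] <= ttft_budget_s]
    return max(eligible, key=lambda p: p[2]) if eligible else None


# Made-up numbers purely for illustration; substitute list(zip(labels, x, y))
# from the plotting cell above to use your own measurements.
demo = [(1, 0.05, 60.0), (16, 0.20, 700.0), (128, 1.50, 2500.0), (1024, 12.0, 2600.0)]
print(best_under_budget(demo, ttft_budget_s=2.0))
```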