<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>Triton Inference Server (TRT-LLM) v25.02</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


# 02 - Deploying a Triton Inference Server for LLMs

## 🚀 Introduction

In this end‑to‑end tutorial, we’ll take a pre-trained LLM (e.g., Llama 3.3 70B Instruct) and:
1. Authenticate and download weights from Hugging Face.
2. Convert the model into a TensorRT-LLM engine for high‑performance inference.
3. Test the optimized engine locally.
4. Package and deploy the engine on NVIDIA Triton Inference Server with inflight batching.
5. Send sample requests to Triton and validate responses.
6. Profile performance using Triton’s `genai-perf` tool.

> 🛠️ **Important Environment Note:**
> This notebook is designed to run on **UCloud**, using the **NVIDIA Triton Inference Server (TRT-LLM) app, version `v25.02`**.
> If you encounter unexpected errors, **double-check you are using the correct app version**, and that your session includes **GPU resources**.

## 🛠️ Step 1: Hugging Face Authentication

The following code creates a secure input widget for your Hugging Face token, which is required to authenticate and download the [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model from the Hugging Face Hub.

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value
hf_model="meta-llama/Llama-3.3-70B-Instruct"
hf_model_path="models/llama-3.3/70B/hf"

snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

## 🛠️ Step 2: Convert the Model Checkpoint to TensorRT-LLM Format

The first cell below reports the size of the downloaded Hugging Face checkpoint. The second cell creates the output directory and converts the Llama 3.3 70B Instruct checkpoint into the **TensorRT-LLM** checkpoint format, in FP16 and sharded for 4-way tensor parallelism (`--tp_size 4`).

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"

du -sh "$HF_MODEL"

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp4"
mkdir -p "$TRT_CKPT"

python -W ignore ~/llama/convert_checkpoint.py \
      --model_dir "$HF_MODEL" \
      --output_dir "$TRT_CKPT" \
      --dtype float16 \
      --tp_size 4 \
      --pp_size 1 \
      --load_by_shard
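
As a rough sanity check on why the checkpoint is split with `--tp_size 4`, the FP16 weight footprint per GPU can be estimated with simple arithmetic. The sketch below assumes about 70 billion parameters (taken from the model name, not measured) and 2 bytes per FP16 weight. The helper name `weight_gib_per_gpu` is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead, so real memory use is higher.

```python
def weight_gib_per_gpu(num_params: float, bytes_per_param: int,
                       tp_size: int, pp_size: int = 1) -> float:
    """Approximate weight memory per GPU in GiB, assuming an even split across ranks."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / (tp_size * pp_size) / 1024**3


NUM_PARAMS = 70e9  # assumption: roughly 70B parameters, from the model name
FP16_BYTES = 2     # float16 stores each weight in 2 bytes

total_gib = weight_gib_per_gpu(NUM_PARAMS, FP16_BYTES, tp_size=1)
per_gpu_gib = weight_gib_per_gpu(NUM_PARAMS, FP16_BYTES, tp_size=4)
print(f"{total_gib:.1f} GiB total, {per_gpu_gib:.1f} GiB per GPU")
```

Under these assumptions the weights alone come to about 130 GiB, or about 33 GiB per GPU with 4-way tensor parallelism.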

## 🛠️ Step 3: Build TensorRT-LLM Engine

The following Bash script builds the **TensorRT-LLM** engine from the previously converted Llama 3.3 70B Instruct checkpoint. The build produces one engine file per tensor-parallel rank and fixes the limits that the engine accepts at runtime: the maximum prompt length, the maximum total sequence length, and the maximum number of tokens per batch.

In [None]:
%%bash

TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp4"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp4"

trtllm-build --checkpoint_dir "$TRT_CKPT" \
      --output_dir "$TRT_ENGINE" \
      --max_num_tokens 32768 \
      --max_input_len 4096  \
      --max_seq_len 8192 \
      --use_paged_context_fmha enable \
      --workers 4
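
The length limits passed to `trtllm-build` interact. As far as the build options are documented, `--max_input_len` caps the prompt length and `--max_seq_len` caps prompt plus generated tokens for a single request. The sketch below shows what that implies for the output budget of one request; the helper `max_new_tokens_allowed` is hypothetical and only mirrors this arithmetic, it is not part of TensorRT-LLM.

```python
MAX_INPUT_LEN = 4096  # --max_input_len from the build command
MAX_SEQ_LEN = 8192    # --max_seq_len from the build command


def max_new_tokens_allowed(prompt_tokens: int,
                           max_input_len: int = MAX_INPUT_LEN,
                           max_seq_len: int = MAX_SEQ_LEN) -> int:
    """Return how many tokens a request may still generate under the build limits."""
    if prompt_tokens > max_input_len:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, engine accepts at most {max_input_len}"
        )
    return max_seq_len - prompt_tokens


print(max_new_tokens_allowed(200))   # short prompt: 7992 tokens left for output
print(max_new_tokens_allowed(4096))  # longest allowed prompt: 4096 tokens left
```

With these settings, even the longest accepted prompt leaves room for 4096 generated tokens.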

### Checkpoint vs. Engine Directory Contents

| Directory                                | Files                                                                    | Purpose                                                                                                                                                                                                                                   |
|------------------------------------------|--------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `models/llama-3.3/70B/trt_ckpt/tp4`      | `config.json`<br>`rank0.safetensors`<br>`rank1.safetensors`<br>`rank2.safetensors`<br>`rank3.safetensors` | **Intermediate checkpoint.**<br>- Each `rank*.safetensors` holds the raw FP16 weight shards for one tensor‑parallel rank.<br>- `config.json` describes model dimensions, tokenizer settings, tensor‑parallel layout, etc., for the builder. |
| `models/llama-3.3/70B/trt_llm/tp4`       | `config.json`<br>`rank0.engine`<br>`rank1.engine`<br>`rank2.engine`<br>`rank3.engine`            | **Final TensorRT-LLM engine.**<br>- Each `*.engine` is a serialized, optimized TensorRT engine for one TP rank.<br>- Includes fused GEMM and activation kernels, memory‑layout transforms, and paged‑context FMHA (if enabled).      |
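
The file layout in the table can be checked programmatically before moving on. The following is a minimal sketch, assuming the naming pattern shown above (`config.json` plus one `rank<i>` file per tensor-parallel rank); the helper `missing_rank_files` is hypothetical.

```python
from pathlib import Path


def missing_rank_files(directory: str, tp_size: int, suffix: str) -> list[str]:
    """Return the expected file names that are absent from `directory`."""
    expected = ["config.json"] + [f"rank{i}.{suffix}" for i in range(tp_size)]
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

For example, `missing_rank_files("models/llama-3.3/70B/trt_llm/tp4", 4, "engine")` should return an empty list after a successful build, and the same call with the checkpoint directory and `"safetensors"` checks the conversion step.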


### Local Testing of TensorRT-Optimized model

The following Bash script runs a local smoke test of the optimized Llama 3.3 70B Instruct engine. It sets the tokenizer and engine paths, then launches `run.py` on 4 MPI ranks (one per GPU) with a sample prompt and generates up to 50 output tokens. A coherent answer confirms that the engine loads and runs correctly.

If you get an error in the cell below, update the Transformers library:
```bash
pip install -U transformers
```

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp4"

PROMPT="What are the typical symptoms of an infected appendix?"

mpirun -n 4 python ~/run.py \
    --max_output_len=50 \
    --tokenizer_dir "$HF_MODEL" \
    --engine_dir "$TRT_ENGINE" \
    --input_text "$PROMPT"

## 🛠️ Step 4: Deploying Triton with Inflight Batching

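The following Bash scripts set up and configure Triton Inference Server for the Llama 3.3 70B Instruct model using **inflight batching**. The first cell copies the `inflight_batcher_llm` model repository template, the second fills in the `config.pbtxt` files (batch size, instance count, queue settings, engine and tokenizer paths), and the third starts the server.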

In [None]:
%%bash

TRITON_REPO="models/llama-3.3/70B/triton"
mkdir -p "$TRITON_REPO"

cp -r ~/all_models/inflight_batcher_llm/* "$TRITON_REPO"

ls "$TRITON_REPO"

In [None]:
%%bash

# Fill in the config.pbtxt templates for every model in the Triton repository
ENGINE_DIR="models/llama-3.3/70B/trt_llm/tp4"
TOKENIZER_DIR="models/llama-3.3/70B/hf"
MODEL_FOLDER="models/llama-3.3/70B/triton"
TRITON_MAX_BATCH_SIZE=16 # reduce this value if you encounter CUDA out-of-memory (OOM) errors
INSTANCE_COUNT=1 # one instance of the model, which is distributed across 4 GPUs
MAX_QUEUE_DELAY_US=1000 # in microseconds (1000 = 1 ms); helps collect more requests into batches at very low overhead, or set to zero if you want the lowest possible latency
MAX_QUEUE_SIZE=512 # Allow some queuing under bursty load, or set to zero for pure online inferencing
FILL_TEMPLATE_SCRIPT="$HOME/tools/fill_template.py"
LOGITS_DATA_TYPE="TYPE_FP32"
DECOUPLED_MODE=false # decoupled mode off: each request returns one complete response (no token streaming)

python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATA_TYPE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_US},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},logits_datatype:${LOGITS_DATA_TYPE},encoder_input_features_data_type:TYPE_FP16
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATA_TYPE}
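
Each call above passes a comma-separated list of `key:value` pairs that are substituted for `${key}` placeholders in the corresponding `config.pbtxt`. The sketch below illustrates that mechanism with Python's `string.Template`. It is a simplified stand-in written for this explanation, not the `fill_template.py` script shipped with the backend, and the example template text is made up.

```python
from string import Template


def fill_template(template_text: str, substitutions: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string."""
    values = {}
    for pair in substitutions.split(","):
        key, value = pair.split(":", 1)
        values[key] = value
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_text).safe_substitute(values)


# Illustrative template fragment (not copied from a real config.pbtxt).
example = (
    "max_batch_size: ${triton_max_batch_size}\n"
    'string_value: "${decoupled_mode}"\n'
    "unset: ${other}"
)
print(fill_template(example, "triton_max_batch_size:16,decoupled_mode:false"))
```

Placeholders without a matching key, such as `${other}` here, stay in the output unchanged, which makes it easy to spot parameters that were never filled in.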

In [None]:
%%bash

# The following Bash command starts an OpenAI-Compatible Frontend for Triton Inference Server
# This is used to run predictions in Label Studio

MODEL_FOLDER="models/llama-3.3/70B/triton"
TOKENIZER_DIR="models/llama-3.3/70B/hf"

stop_tritonserver

nohup mpirun -np 4 python3 -W ignore /opt/tritonserver/python/openai/openai_frontend/main.py --model-repository $MODEL_FOLDER --backend tensorrtllm --tokenizer $TOKENIZER_DIR --openai-port 9000 --enable-kserve-frontends &> /work/triton-server-log.txt &

The following Bash cell waits until the server is ready by polling the log file for Uvicorn's startup message, which appears once the Llama 3.3 70B Instruct model has loaded and the OpenAI-compatible frontend is listening.

In [None]:
%%bash

LOG_FILE="/work/triton-server-log.txt"

# Function to wait for Triton to start by monitoring the log file
wait_for_triton_start() {
    echo "Waiting for Triton Inference Server to start..."
    while true; do
        # Check the log for the Uvicorn startup message
        if grep -q 'Uvicorn running' "$LOG_FILE" 2>/dev/null; then
            echo "✅ Uvicorn has started successfully."
            break
        else
            echo "❌ Uvicorn has not started yet. Retrying in 5 seconds..."
            sleep 5
        fi
    done
}

# Wait for Triton to start
wait_for_triton_start
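
The loop above waits indefinitely, so a failed startup would never be reported. A Python variant with a timeout is sketched below under the same assumption, namely that the string `Uvicorn running` in the log marks a successful start. The function name `wait_for_log_line` and the default timeout of 900 seconds are hypothetical choices, not values from the Triton documentation.

```python
import time
from pathlib import Path


def wait_for_log_line(log_file: str, needle: str,
                      timeout_s: float = 900.0, poll_s: float = 5.0) -> bool:
    """Return True once `needle` appears in `log_file`, False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_file)
    while True:
        # The log may not exist yet right after the server process is launched.
        if path.is_file() and needle in path.read_text(encoding="utf-8", errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A call such as `wait_for_log_line("/work/triton-server-log.txt", "Uvicorn running")` returns `False` on timeout, at which point the log file is the first place to look for the cause.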

To test the model, run the following code in a terminal window:
```bash
MODEL=ensemble

TEXT="An appendectomy is a surgical procedure to remove the appendix when it's infected. Symptoms typically include abdominal pain starting around the navel and shifting to the lower right abdomen, nausea, vomiting, and fever. Diagnosis is often made based on symptoms, physical examination, and imaging like ultrasound or CT scan. The procedure can be performed using open surgery or minimally invasive laparoscopic techniques. Recovery usually takes a few weeks, with minimal complications."

curl -s http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "'"${MODEL}"'",
        "messages": [
          {
            "role": "user",
            "content": "Given this medical text:\n\n'"${TEXT}"' \n\nGenerate direct, succinct, and unique medical questions covering symptoms, diagnosis, treatments, or patient management strategies. Output ONLY 5 questions clearly separated by new lines."
          }
        ],
        "temperature": 0.2,
        "max_tokens": 500
}' | jq -r '.choices[0].message.content'

```
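
The same request can be sent from Python. The sketch below uses only the standard library and assumes the endpoint, model name, and response shape used in the `curl` command above (`/v1/chat/completions` on port 9000, model `ensemble`, answer in `choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       temperature: float = 0.2, max_tokens: int = 500) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(base_url: str, payload: dict, timeout_s: float = 120.0) -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("http://localhost:9000", build_chat_request("ensemble", "List three symptoms of appendicitis.")))` sends one request. Building the payload with `json.dumps` also avoids the shell quoting needed to embed text in the `curl` command.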

## 🛠️ Step 5: Performance Profiling with `genai-perf`

To evaluate the performance of the deployed Llama 3.3 70B Instruct model on Triton Inference Server, execute the following Bash commands in a terminal session within Jupyter. This script uses [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html) to profile the model, generating performance metrics and visualizations.

- Each request simulates:
    - An **input prompt** of around **200 tokens** (standard deviation of 10 tokens)
    - Exactly **200 output tokens**.
- **1000 synthetic prompts** are generated per test, and each test uses a measurement interval of **25 seconds** (`--measurement-interval 25000`, in milliseconds).
- **Concurrency** sweeps from **1** to **500** requests *in flight* at the same time. The low-concurrency runs at 1 and 5 act as warm-up and are left out of the plot below.
- Each request therefore processes about 400 tokens in total (input plus output).

In [None]:
%%bash

# Run the following code in a terminal window for a better output format

TOKENIZER_DIR="models/llama-3.3/70B/hf"

export INPUT_PROMPTS=1000
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export MODEL=ensemble

# Running multiple GenAI-Perf calls (the first for warm-up)
for concurrency in 1 5 10 30 50 100 250 400 500; do
    genai-perf profile -m $MODEL \
        --service-kind triton \
        --backend tensorrtllm \
        --num-prompts $INPUT_PROMPTS \
        --random-seed 1234 \
        --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
        --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
        --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
        --output-tokens-stddev 0 \
        --output-tokens-mean-deterministic \
        --tokenizer $TOKENIZER_DIR \
        --concurrency $concurrency \
        --measurement-interval 25000 \
        --profile-export-file model_profile_${concurrency}.json \
        --url "localhost:8001" \
        --generate-plots
done
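
The sweep above can be put into perspective with some quick arithmetic. The sketch below estimates, for each concurrency level, roughly how many prompt and generated tokens are held by all in-flight requests at once, and converts a measured output-token throughput into completed requests per second. Both helper names are hypothetical, and the input length is treated as exactly 200 tokens although it varies slightly in the real runs.

```python
INPUT_TOKENS = 200   # mean synthetic input length
OUTPUT_TOKENS = 200  # fixed output length
CONCURRENCY_LEVELS = [1, 5, 10, 30, 50, 100, 250, 400, 500]


def peak_tokens_in_flight(concurrency: int,
                          input_tokens: int = INPUT_TOKENS,
                          output_tokens: int = OUTPUT_TOKENS) -> int:
    """Approximate prompt + generated tokens held by all concurrent requests."""
    return concurrency * (input_tokens + output_tokens)


def requests_per_second(output_token_throughput: float,
                        output_tokens: int = OUTPUT_TOKENS) -> float:
    """Completed requests per second implied by an output-token throughput."""
    return output_token_throughput / output_tokens


for c in CONCURRENCY_LEVELS:
    print(f"concurrency={c:>3}  approx. tokens in flight: {peak_tokens_in_flight(c):>7,}")
```

At a concurrency of 500 this comes to about 200,000 tokens, so at the upper end of the sweep a share of the requests can be expected to wait in the queue rather than run at the same time.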

In [None]:
import json
import os

import plotly.graph_objects as go

# Settings
concurrency_levels = [10, 30, 50, 100, 250, 400, 500]  # Concurrency values you tested
base_path = "artifacts/ensemble-triton-tensorrtllm-concurrency{}"

# Storage for points
x = []  # Time to first token (s)
y = []  # Total system throughput (tokens/sec)
labels = []  # Concurrency values

# Read each JSON file
for concurrency in concurrency_levels:
    folder = base_path.format(concurrency)
    json_path = os.path.join(folder, f"model_profile_{concurrency}_genai_perf.json")
    
    # Read JSON file
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Extract values
    time_to_first_token_ms = data['time_to_first_token']['avg']  # milliseconds
    output_token_throughput = data['output_token_throughput']['avg']  # tokens per second

    # Append
    x.append(time_to_first_token_ms / 1000)  # ms → seconds
    y.append(output_token_throughput)
    labels.append(concurrency)


# Now plot
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x, y=y,
    mode='lines+markers+text',
    text=labels,
    textposition="middle center",
    line=dict(width=2),
))

fig.update_layout(
    xaxis_title="Time to first token (s)",
    yaxis_title="Throughput (tokens/s)",
    plot_bgcolor="rgba(240, 248, 255, 1)",
    font=dict(size=14),
    width=800,
    height=400,
)

fig.show()
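
The curve typically shows throughput rising with concurrency while time to first token also grows, so choosing a concurrency level is a trade-off. One simple way to read the plot is to pick the highest-throughput point that stays within a latency budget. The sketch below does that; the helper `best_concurrency` and the budget value are hypothetical, and the sample numbers are made up for illustration and are not measurements.

```python
def best_concurrency(points, ttft_budget_s: float):
    """Pick the (concurrency, ttft_s, throughput) tuple with the highest
    throughput among those whose time to first token fits the budget.
    Returns None if no point qualifies."""
    eligible = [p for p in points if p[1] <= ttft_budget_s]
    return max(eligible, key=lambda p: p[2]) if eligible else None


# Illustrative values only (not measured): (concurrency, TTFT in s, tokens/s)
sample_points = [(10, 0.2, 300.0), (50, 0.8, 900.0), (100, 2.5, 1200.0)]
print(best_concurrency(sample_points, ttft_budget_s=1.0))
```

With the lists from the plotting cell, the call becomes `best_concurrency(zip(labels, x, y), ttft_budget_s=1.0)`.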