<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>Triton Inference Server (TRT-LLM) v24.08</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


## Hugging Face Authentication

The following code creates a secure input widget for your Hugging Face token, which is required to authenticate and download the [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model from the Hugging Face Hub.

In [1]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

Password(description='Hugging Face Token:')

In [2]:
token = pwd.value
hf_model="meta-llama/Llama-3.3-70B-Instruct"
hf_model_path="models/llama-3.3/70B/hf"

snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)

Fetching 53 files:   0%|          | 0/53 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

'/work/ucloud-workshop-11-12-2024/models/llama-3.3/70B/hf'

## Convert the model to TensorRT Format

The following Bash script sets up the required directories and executes the conversion of the Llama-3.3-70B-Instruct model checkpoint from Hugging Face format to TensorRT for optimized performance.

In [3]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"

# Modify rope_scaling properties
[ ! -f "$HF_MODEL/config.json.bak" ] && cp "$HF_MODEL/config.json" "$HF_MODEL/config.json.bak"
jq '.rope_scaling = {"factor": 8.000000001, "type": "linear"}' "$HF_MODEL/config.json" > /tmp/config.tmp && mv /tmp/config.tmp "$HF_MODEL/config.json"

du -sh "$HF_MODEL"

263G	models/llama-3.3/70B/hf


In [4]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp2_pp2"
mkdir -p "$TRT_CKPT"

python ~/llama/convert_checkpoint.py \
      --model_dir "$HF_MODEL" \
      --output_dir "$TRT_CKPT" \
      --dtype bfloat16 \
      --tp_size 2 \
      --pp_size 2 \
      --load_by_shard \
      --workers 20

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
0.12.0.dev2024080600


0it [00:00, ?it/s]
0it [00:00, ?it/s][A

0it [00:00, ?it/s][A[A


0it [00:00, ?it/s][A[A[A


5it [00:00, 17.71it/s][A[A[A

5it [00:00, 14.28it/s][A[A


4it [00:00,  8.85it/s][A[A[A
4it [00:00,  8.85it/s][A
5it [00:00,  8.17it/s][A

8it [00:00,  8.95it/s][A[A


6it [00:01,  3.96it/s][A[A[A
13it [00:02,  5.23it/s][A
13it [00:02,  5.29it/s][A


12it [00:02,  3.31it/s][A[A[A
16it [00:02,  6.59it/s][A


13it [00:02,  3.53it/s][A[A[A

12it [00:02,  3.43it/s][A[A


15it [00:02,  4.22it/s][A[A[A

21it [00:03,  6.48it/s][A[A


19it [00:03,  5.60it/s][A[A[A

19it [00:03,  5.56it/s][A[A
20it [00:03,  5.65it/s][A
23it [00:03,  6.73it/s][A

22it [00:03,  6.88it/s][A[A


27it [00:03,  8.42it/s][A[A[A
27it [00:04,  8.19it/s][A

26it [00:04,  4.80it/s][A[A


34it [00:04,  7.62it/s][A[A[A
35it [00:05,  7.60it/s][A
36it [00:05,  8.38it/s][A


29it [00:05,  5.76it/s][A[A[A

29it [00:05,  5.79it/s][A[A


33it [00:05,  8.08it/s][A[A[A

33it [0

Total time of converting checkpoints: 00:02:53


## Build TensorRT Engine

The following Bash script constructs the TensorRT engine from the previously converted Llama-3.3-70B-Instruct model checkpoint. This optimization enhances the model's inference performance by leveraging TensorRT's efficient execution capabilities.

In [5]:
%%bash

TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp2_pp2"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp2_pp2"

trtllm-build --checkpoint_dir "$TRT_CKPT" \
      --output_dir "$TRT_ENGINE" \
      --max_num_tokens 4096 \
      --max_input_len 255000 \
      --max_seq_len 256000 \
      --use_paged_context_fmha enable \
      --workers 4

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[12/17/2024-11:45:05] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set gemm_plugin to None.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set nccl_plugin to auto.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set lookup_plugin to None.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set lora_plugin to None.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set moe_plugin to auto.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set context_fmha to True.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set paged_kv_cache to True.
[12/17/2024-11:45:05] [TRT-LLM] [I] Set remove_input_padding to True.
[12/17/2024-

## Local Testing of TensorRT-Optimized model

The following Bash script performs a local test of the optimized Llama-3.3-70B-Instruct model. It sets the necessary environment variables and runs the `run.py` script with a sample prompt to evaluate the model's inference performance.

If you get an error in the cell below, update the Transformer library:
```bash
pip install -U Transformers
```

In [6]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp2_pp2"

PROMPT="The capital of Indonesia is"

mpirun -n 4 python ~/run.py \
    --max_output_len=10 \
    --tokenizer_dir $HF_MODEL \
    --engine_dir $TRT_ENGINE \
    --input_text "$PROMPT"

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 1
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 4, MPI local size: 4, rank: 3
[Tenso

## Deploying Triton with Inflight Batching

The following Bash scripts set up and configure Triton Inference Server for the Llama-3.3-70B-Instruct model using inflight batching. This deployment optimizes inference performance by managing batch sizes and instance counts effectively.

In [7]:
%%bash

TRTITON_REPO="models/llama-3.3/70B/triton"
mkdir -p "$TRTITON_REPO"

cp -r ~/all_models/inflight_batcher_llm/* "$TRTITON_REPO"

ls "$TRTITON_REPO"

ensemble
postprocessing
preprocessing
tensorrt_llm
tensorrt_llm_bls


In [8]:
%%bash

ENGINE_DIR="models/llama-3.3/70B/trt_llm/tp2_pp2"
TOKENIZER_DIR="models/llama-3.3/70B/hf"
MODEL_FOLDER="models/llama-3.3/70B/triton"
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=4
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT="$HOME/tools/fill_template.py"
DECOUPLED_MODE=false

python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}

In [9]:
%%bash

# Start the Triton server

MODEL_FOLDER="models/llama-3.3/70B/triton"
stop_tritonserver
nohup mpirun -np 4 tritonserver --model-repository=$MODEL_FOLDER &> /work/triton-server-log.txt &

Terminated


## Testing the Triton Inference Server

The following Bash commands verify that the Triton server and the deployed Llama-3.3-70B-Instruct model are running correctly. The first command checks the repository status, and the second sends a sample generation request to the mode

In [10]:
%%bash

LOG_FILE="/work/triton-server-log.txt"

# Function to wait for Triton to start by monitoring the log file
wait_for_triton_start() {
    echo "Waiting for Triton Inference Server to start..."
    while true; do
        # Check for all required startup messages
        if grep -q 'Started GRPCInferenceService at 0.0.0.0:8001' "$LOG_FILE" &&
           grep -q 'Started HTTPService at 0.0.0.0:8000' "$LOG_FILE" &&
           grep -q 'Started Metrics Service at 0.0.0.0:8002' "$LOG_FILE"; then
                echo "Triton Inference Server is ready."
                break
                
        else
            echo "Triton not ready yet. Retrying in 5 seconds..."
            sleep 5   
        fi
    done
}

# Wait for Triton to start
wait_for_triton_start

curl -X POST http://localhost:8000/v2/repository/index -H "Content-Type: application/json" -d '{"ready": true}'| jq '.[] | select(.state == "READY")'

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": ""}' | jq -r '.text_output'

Waiting for Triton Inference Server to start...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton not ready yet. Retrying in 5 seconds...
Triton Inference Server is ready.


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   289  100   274  100    15   269k  15120 --:--:-- --:--:-- --:--:--  282k


{
  "name": "ensemble",
  "version": "1",
  "state": "READY"
}
{
  "name": "postprocessing",
  "version": "1",
  "state": "READY"
}
{
  "name": "preprocessing",
  "version": "1",
  "state": "READY"
}
{
  "name": "tensorrt_llm",
  "version": "1",
  "state": "READY"
}
{
  "name": "tensorrt_llm_bls",
  "version": "1",
  "state": "READY"
}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   625    0   543  100    82    124     18  0:00:04  0:00:04 --:--:--   143


What is ML? Machine Learning Analytics?
Machine learning is a subset of Machine learning is a subset of AI that involves training algorithms, which enables systems using data analysis of data to learn from data and improve their performance on their performance on some task, can be able to improve


## Performance Profiling with `genai-perf`

To evaluate the performance of the deployed Llama-3.3-70B-Instruct model on Triton Inference Server, execute the following Bash commands in a terminal session within Jupyter. This script uses [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html) to profile the model, generating performance metrics and visualizations.

In [13]:
%%bash

TOKENIZER_DIR="models/llama-3.3/70B/hf"

genai-perf profile -m ensemble \
    # Specifies the use of Triton Inference Server.
    --service-kind triton \
    # Utilizes the TensorRT backend optimized for large language models.
    --backend tensorrtllm \
    # Number of prompts to use for the profiling test.
    --num-prompts 100 \
    # Sets the seed for random number generation to ensure reproducibility.
    --random-seed 1234 \
    # Defines the mean for synthetic input token lengths.
    --synthetic-input-tokens-mean 200 \
    # Defines the standard deviation for synthetic input token lengths.
    --synthetic-input-tokens-stddev 0 \
    # Defines the mean for synthetic output token lengths.
    --output-tokens-mean 100 \
    # Defines the standard deviation for synthetic output token lengths.
    --output-tokens-stddev 0 \
    # Ensures deterministic output token lengths.
    --output-tokens-mean-deterministic \
    # Specifies the tokenizer configuration file.
    --tokenizer $TOKENIZER_DIR/tokenizer.json \
    # Number of concurrent requests to simulate during profiling.
    --concurrency 500 \
    # Time interval (in milliseconds) for measurements.
    --measurement-interval 4000 \
    # File to export the profiling results.
    --profile-export-file model_profile.json \
    # URL of the Triton Inference Server (gRPC API).
    --url localhost:8001 \
    # Generates visual plots of the profiling results.
    --generate-plots


2024-12-17 11:51 [INFO] genai_perf.parser:90 - Profiling these models: ensemble
2024-12-17 11:51 [INFO] genai_perf.wrapper:147 - Running Perf Analyzer : 'perf_analyzer -m ensemble --async --input-data artifacts/ensemble-triton-tensorrtllm-concurrency1/llm_inputs.json -i grpc --streaming -u localhost:8001 --shape max_tokens:1 --shape text_input:1 --concurrency-range 1 --service-kind triton --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export.json'




[3m                                  LLM Metrics                                   [0m
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃[1m [0m[1m             Statistic[0m[1m [0m┃[1m [0m[1m   avg[0m[1m [0m┃[1m [0m[1m   min[0m[1m [0m┃[1m [0m[1m   max[0m[1m [0m┃[1m [0m[1m   p99[0m[1m [0m┃[1m [0m[1m   p90[0m[1m [0m┃[1m [0m[1m   p75[0m[1m [0m┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│[36m [0m[36m  Request latency (ms)[0m[36m [0m│[32m [0m[32m12,03…[0m[32m [0m│[32m [0m[32m12,03…[0m[32m [0m│[32m [0m[32m12,04…[0m[32m [0m│[32m [0m[32m12,04…[0m[32m [0m│[32m [0m[32m12,04…[0m[32m [0m│[32m [0m[32m12,03…[0m[32m [0m│
│[36m [0m[36mOutput sequence length[0m[36m [0m│[32m [0m[32m806.00[0m[32m [0m│[32m [0m[32m806.00[0m[32m [0m│[32m [0m[32m806.00[0m[32m [0m│[32m [0m[32m806.00[0m[32m [0m│[32m [0m[32m806.00[0m[32m [

bash: line 6: --service-kind: command not found
bash: line 8: --backend: command not found
bash: line 10: --num-prompts: command not found
bash: line 12: --random-seed: command not found
bash: line 14: --synthetic-input-tokens-mean: command not found
bash: line 16: --synthetic-input-tokens-stddev: command not found
bash: line 18: --output-tokens-mean: command not found
bash: line 20: --output-tokens-stddev: command not found
bash: line 23: --output-tokens-mean-deterministic: command not found
bash: line 24: --tokenizer: command not found
bash: line 26: --concurrency: command not found
bash: line 28: --measurement-interval: command not found
bash: line 30: --profile-export-file: command not found
bash: line 32: --url: command not found
bash: line 34: --generate-plots: command not found


CalledProcessError: Command 'b'\nTOKENIZER_DIR="models/llama-3.3/70B/hf"\n\ngenai-perf profile -m ensemble \\\n    # Specifies the use of Triton Inference Server.\n    --service-kind triton \\\n    # Utilizes the TensorRT backend optimized for large language models.\n    --backend tensorrtllm \\\n    # Number of prompts to use for the profiling test.\n    --num-prompts 100 \\\n    # Sets the seed for random number generation to ensure reproducibility.\n    --random-seed 1234 \\\n    # Defines the mean for synthetic input token lengths.\n    --synthetic-input-tokens-mean 200 \\\n    # Defines the standard deviation for synthetic input token lengths.\n    --synthetic-input-tokens-stddev 0 \\\n    # Defines the mean for synthetic output token lengths.\n    --output-tokens-mean 100 \\\n    # Defines the standard deviation for synthetic output token lengths.\n    --output-tokens-stddev 0 \\\n    # Ensures deterministic output token lengths.\n    --output-tokens-mean-deterministic \\\n    # Specifies the tokenizer configuration file.\n    --tokenizer $TOKENIZER_DIR/tokenizer.json \\\n    # Number of concurrent requests to simulate during profiling.\n    --concurrency 500 \\\n    # Time interval (in milliseconds) for measurements.\n    --measurement-interval 4000 \\\n    # File to export the profiling results.\n    --profile-export-file model_profile.json \\\n    # URL of the Triton Inference Server.\n    --url localhost:8001 \\\n    # Generates visual plots of the profiling results.\n    --generate-plots\n'' returned non-zero exit status 127.