<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested using <strong>Triton Inference Server (TRT-LLM) v24.08</strong> and machine type <code>u3-gpu4</code> on UCloud.
</p>


## Hugging Face Authentication

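The first cell below creates a password input widget for your Hugging Face token. The second cell uses that token to download the [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model from the Hugging Face Hub into `models/llama-3.3/70B/hf`. The model is gated, so the account that owns the token must already have been granted access to it.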

In [None]:
from IPython.display import display
from ipywidgets import Password
from huggingface_hub import snapshot_download

pwd = Password(description="Hugging Face Token:")
display(pwd)

In [None]:
token = pwd.value
hf_model = "meta-llama/Llama-3.3-70B-Instruct"
hf_model_path = "models/llama-3.3/70B/hf"

snapshot_download(
    repo_id=hf_model,
    local_dir=hf_model_path,
    token=token
)
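Before moving on, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the weights are stored as `*.safetensors` shards at the top level of the download directory; `summarize_snapshot` is a hypothetical helper name, not part of `huggingface_hub`.

```python
from pathlib import Path


def summarize_snapshot(model_dir):
    """Return (has_config, shard_count, total_bytes) for a downloaded model directory.

    Assumption: weight shards are top-level *.safetensors files.
    """
    root = Path(model_dir)
    has_config = (root / "config.json").is_file()
    shards = sorted(root.glob("*.safetensors"))
    total_bytes = sum(shard.stat().st_size for shard in shards)
    return has_config, len(shards), total_bytes
```

Calling `summarize_snapshot("models/llama-3.3/70B/hf")` after the download returns whether `config.json` is present, how many shards were found, and their combined size in bytes.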

## Convert the Model to TensorRT Format

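The following Bash cells prepare and convert the Llama-3.3-70B-Instruct checkpoint. The first cell backs up `config.json` and overrides its `rope_scaling` entry. The second cell creates the output directory and converts the Hugging Face checkpoint to a TensorRT-LLM checkpoint in `bfloat16`, sharded for tensor parallelism 2 and pipeline parallelism 2 (four GPUs in total).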

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"

# Modify rope_scaling properties
[ ! -f "$HF_MODEL/config.json.bak" ] && cp "$HF_MODEL/config.json" "$HF_MODEL/config.json.bak"
jq '.rope_scaling = {"factor": 8.000000001, "type": "linear"}' "$HF_MODEL/config.json" > /tmp/config.tmp && mv /tmp/config.tmp "$HF_MODEL/config.json"

du -sh "$HF_MODEL"
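For readers who prefer to stay in Python, the `jq` edit above can be sketched as follows. This is an illustrative equivalent under the same assumptions as the Bash cell (a `config.json` in the model directory, a one-time backup to `config.json.bak`); `override_rope_scaling` is a hypothetical helper name.

```python
import json
import shutil
from pathlib import Path


def override_rope_scaling(model_dir, factor=8.000000001, scaling_type="linear"):
    """Back up config.json once, then replace its rope_scaling entry."""
    config_path = Path(model_dir) / "config.json"
    backup_path = Path(model_dir) / "config.json.bak"

    # Keep the first backup only, so repeated runs never overwrite the original.
    if not backup_path.exists():
        shutil.copy2(config_path, backup_path)

    config = json.loads(config_path.read_text())
    config["rope_scaling"] = {"factor": factor, "type": scaling_type}
    config_path.write_text(json.dumps(config, indent=2))
    return config["rope_scaling"]
```

Because the backup is written only when it does not yet exist, the original values stay available in `config.json.bak` however many times the cell is re-run.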

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp2_pp2"
mkdir -p "$TRT_CKPT"

python ~/llama/convert_checkpoint.py \
      --model_dir "$HF_MODEL" \
      --output_dir "$TRT_CKPT" \
      --dtype bfloat16 \
      --tp_size 2 \
      --pp_size 2 \
      --load_by_shard \
      --workers 20
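The `--tp_size` and `--pp_size` values determine how many GPU ranks the model is split across, which is why later cells launch four MPI processes. The sketch below works through that arithmetic and gives a rough per-GPU weight footprint. It assumes roughly 70 billion parameters, 2 bytes per parameter for `bfloat16`, and a perfectly even split across ranks; real shards will deviate somewhat from this estimate.

```python
def world_size(tp_size, pp_size):
    """Number of ranks (GPUs) needed: tensor-parallel size times pipeline-parallel size."""
    return tp_size * pp_size


def weight_gib_per_gpu(num_params, bytes_per_param, tp_size, pp_size):
    """Rough weight memory per GPU in GiB, assuming an even split across all ranks."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / world_size(tp_size, pp_size) / 2**30


ranks = world_size(tp_size=2, pp_size=2)           # 4 ranks
per_gpu = weight_gib_per_gpu(70e9, 2, 2, 2)        # about 32.6 GiB of weights per GPU
print(f"{ranks} ranks, about {per_gpu:.1f} GiB of weights per GPU")
```

This estimate covers weights only; the KV cache and activations need additional memory on each GPU.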

## Build TensorRT Engine

The following Bash script builds the TensorRT-LLM engines from the previously converted Llama-3.3-70B-Instruct checkpoint. The build options fix the limits of the resulting engine: up to 255,000 input tokens and 256,000 total tokens per sequence, at most 4,096 batched tokens per engine step, and paged context FMHA enabled.

In [None]:
%%bash

TRT_CKPT="models/llama-3.3/70B/trt_ckpt/tp2_pp2"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp2_pp2"

trtllm-build --checkpoint_dir "$TRT_CKPT" \
      --output_dir "$TRT_ENGINE" \
      --max_num_tokens 4096 \
      --max_input_len 255000 \
      --max_seq_len 256000 \
      --use_paged_context_fmha enable \
      --workers 4
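A sequence length of 256,000 tokens has a direct memory cost in the KV cache. The sketch below estimates that cost from fields found in the model's `config.json`. It assumes the cache stores keys and values at 2 bytes per value and ignores paging overhead. The example numbers (80 layers, 64 attention heads, 8 key-value heads, hidden size 8192) are assumed here for illustration, so read the real values from your own `config.json`.

```python
def kv_cache_bytes_per_token(config, bytes_per_value=2):
    """Bytes of KV cache needed for one token across all layers.

    Keys and values are each stored per layer, per key-value head, per head dimension.
    """
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    return (
        2  # one tensor for keys, one for values
        * config["num_hidden_layers"]
        * config["num_key_value_heads"]
        * head_dim
        * bytes_per_value
    )


def kv_cache_gib(config, num_tokens, bytes_per_value=2):
    """Total KV cache in GiB for num_tokens cached tokens, summed over all GPUs."""
    return kv_cache_bytes_per_token(config, bytes_per_value) * num_tokens / 2**30


# Assumed example values; replace with the contents of config.json.
example_config = {
    "hidden_size": 8192,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "num_hidden_layers": 80,
}
print(kv_cache_bytes_per_token(example_config))          # 327680 bytes per token
print(round(kv_cache_gib(example_config, 256000), 3))    # 78.125 GiB for one full-length sequence
```

Under these assumptions a single full-length sequence needs about 78 GiB of KV cache in total, shared across the four GPUs, which is worth keeping in mind when choosing `--max_seq_len`.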

## Local Testing of TensorRT-Optimized Model

The following Bash script performs a local test of the optimized Llama-3.3-70B-Instruct model. It sets the tokenizer and engine paths and runs the `run.py` script on four MPI ranks with a short sample prompt, to confirm that the engine loads and generates text.

If you get an error in the cell below, update the Transformers library:
```bash
pip install -U transformers
```

In [None]:
%%bash

HF_MODEL="models/llama-3.3/70B/hf"
TRT_ENGINE="models/llama-3.3/70B/trt_llm/tp2_pp2"

PROMPT="The capital of Indonesia is"

mpirun -n 4 python ~/run.py \
    --max_output_len=10 \
    --tokenizer_dir "$HF_MODEL" \
    --engine_dir "$TRT_ENGINE" \
    --input_text "$PROMPT"

## Deploying Triton with Inflight Batching

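The following Bash cells set up and configure Triton Inference Server for the Llama-3.3-70B-Instruct model using inflight batching. They copy the `inflight_batcher_llm` model repository template, fill in its `config.pbtxt` files (engine and tokenizer paths, a Triton maximum batch size of 4, and 4 instances each for the preprocessing, postprocessing, and BLS models), and start the server on four MPI ranks.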

In [None]:
%%bash

TRITON_REPO="models/llama-3.3/70B/triton"
mkdir -p "$TRITON_REPO"

cp -r ~/all_models/inflight_batcher_llm/* "$TRITON_REPO"

ls "$TRITON_REPO"

In [None]:
%%bash

ENGINE_DIR="models/llama-3.3/70B/trt_llm/tp2_pp2"
TOKENIZER_DIR="models/llama-3.3/70B/hf"
MODEL_FOLDER="models/llama-3.3/70B/triton"
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=4
MAX_QUEUE_DELAY_MS=0  # passed to max_queue_delay_microseconds below, so the unit is microseconds despite the name
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT="$HOME/tools/fill_template.py"
DECOUPLED_MODE=false

python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
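Each `fill_template.py` call above takes a `config.pbtxt` file and a comma-separated list of `key:value` pairs. The sketch below illustrates the general idea of that substitution with Python's `string.Template`. It is a simplified illustration, not the script itself: the real script may differ in details such as default values and in-place writing, and `parse_substitutions` and `fill` are hypothetical names.

```python
from string import Template


def parse_substitutions(spec):
    """Turn 'key1:value1,key2:value2' into a dict.

    Splitting on the first colon only keeps values that contain ':' intact.
    """
    pairs = {}
    for item in spec.split(","):
        key, value = item.split(":", 1)
        pairs[key] = value
    return pairs


def fill(template_text, spec):
    """Replace ${key} placeholders; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(parse_substitutions(spec))


snippet = "max_batch_size: ${triton_max_batch_size}\ndecoupled: ${decoupled_mode}"
print(fill(snippet, "triton_max_batch_size:4,decoupled_mode:false"))
```

Using `safe_substitute` means a placeholder without a matching key stays in the output as `${...}`, which makes unfilled settings easy to spot by searching the generated file.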

In [None]:
%%bash

# Start the Triton server

MODEL_FOLDER="models/llama-3.3/70B/triton"
stop_tritonserver
nohup mpirun -np 4 tritonserver --model-repository=$MODEL_FOLDER &> /work/triton-server-log.txt &
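Because the server is started in the background, the cell returns before the engines are loaded. One way to wait for it from Python is to poll Triton's HTTP readiness endpoint, `/v2/health/ready`, with an overall deadline. The sketch below assumes the default HTTP port 8000; the function names and the 30-minute default deadline are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout_s=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=http_ready, deadline_s=1800.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or deadline_s seconds have passed."""
    start = clock()
    while clock() - start < deadline_s:
        if probe():
            return True
        sleep(interval_s)
    return False
```

Calling `wait_until_ready()` in a Python cell blocks until the server reports ready and returns `False` if the deadline passes first, so a failed startup does not hang the notebook indefinitely.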

## Testing the Triton Inference Server

The following Bash commands verify that the Triton server and the deployed Llama-3.3-70B-Instruct model are running correctly. The cell first waits for the startup messages to appear in the server log, then lists the models that report `READY` in the repository index, and finally sends a sample generation request to the model.

In [None]:
%%bash

LOG_FILE="/work/triton-server-log.txt"

# Function to wait for Triton to start by monitoring the log file
wait_for_triton_start() {
    echo "Waiting for Triton Inference Server to start..."
    while true; do
        # Check for all required startup messages
        if grep -q 'Started GRPCInferenceService at 0.0.0.0:8001' "$LOG_FILE" &&
           grep -q 'Started HTTPService at 0.0.0.0:8000' "$LOG_FILE" &&
           grep -q 'Started Metrics Service at 0.0.0.0:8002' "$LOG_FILE"; then
            echo "Triton Inference Server is ready."
            break
        else
            echo "Triton not ready yet. Retrying in 5 seconds..."
            sleep 5
        fi
    done
}

# Wait for Triton to start
wait_for_triton_start

curl -s -X POST http://localhost:8000/v2/repository/index -H "Content-Type: application/json" -d '{"ready": true}' | jq '.[] | select(.state == "READY")'

curl -s -X POST http://localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": ""}' | jq -r '.text_output'
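The same generation request can be sent from Python using only the standard library. This sketch mirrors the `curl` call above: the same endpoint, the same JSON fields, and the same `text_output` field in the response. The function names and the 120-second timeout are illustrative.

```python
import json
import urllib.request


def build_generate_payload(prompt, max_tokens=50):
    """Request body with the same fields as the curl example."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }


def generate(prompt, max_tokens=50,
             url="http://localhost:8000/v2/models/ensemble/generate",
             timeout_s=120.0):
    """POST the prompt to the ensemble model and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["text_output"]
```

With the server running, `generate("What is ML?")` returns the model's completion as a string.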

## Performance Profiling with `genai-perf`

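To evaluate the performance of the deployed Llama-3.3-70B-Instruct model on Triton Inference Server, run the following cell, or execute the same commands in a terminal session within Jupyter. The script uses [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html) to profile the `ensemble` model over gRPC (port 8001). It draws from 100 synthetic prompts of 200 input tokens each, requests 100 output tokens per response, and runs at a concurrency of 500. The resulting metrics are exported to `model_profile.json`, and plots are generated alongside them.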

In [None]:
%%bash

TOKENIZER_DIR="models/llama-3.3/70B/hf"

genai-perf profile -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --num-prompts 100 \
    --random-seed 1234 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --output-tokens-mean-deterministic \
    --tokenizer $TOKENIZER_DIR \
    --concurrency 500 \
    --measurement-interval 4000 \
    --profile-export-file model_profile.json \
    --url localhost:8001 \
    --generate-plots
