<img src="./images/DLI_Header.png" style="width: 400px;">

# 3. Optimizing inference with the NVIDIA TensorRT-LLM library 

In this lab, we look at the NVIDIA TensorRT-LLM library and how it optimizes the execution of large language models. We will use it to deploy Llama 13B, initially on a single GPU and then across multiple GPUs using its Tensor and Pipeline Parallelism capabilities.  

We will conclude this notebook by comparing the latency between our baseline implementation using the Transformers library and the TensorRT-LLM Tensor and Pipeline parallel deployments. In the next notebook, we will look at how to serve our TensorRT-LLM optimized model to customers/users using Triton Inference Server. 

To summarize, in this notebook we will: 
* Review the features of the NVIDIA TensorRT-LLM library. 
* Learn how to set up the development environment, including installing the TensorRT-LLM library. 
* Learn how to prepare a checkpoint of Llama 2 (or another Transformers-based model) for inference with TensorRT-LLM. 
* Run inference of the model on a single GPU. 
* Extend the execution to multiple GPUs using Tensor and Pipeline Parallelism. 
* Profile the single and multi-GPU pipelines to capture information about throughput and latency. 

**[3.1 NVIDIA TensorRT-LLM](#3.1)<br>** 
**[3.2 Overall Inference Pipeline with NVIDIA TensorRT-LLM](#3.2)<br>** 
**[3.3 Download and Install NVIDIA TensorRT-LLM library](#3.3)<br>** 
**[3.4 Download Llama weights](#3.4)<br>** 
**[3.5 Build TensorRT-LLM engines](#3.5)<br>** 
**[3.6 Run the TensorRT-LLM engine](#3.6)<br>** 
&nbsp;&nbsp;&nbsp;&nbsp;[3.6.1 Inference on 1 GPU ](#3.6.1)<br> 
&nbsp;&nbsp;&nbsp;&nbsp;[3.6.2 Inference on 4 GPUs ](#3.6.2)<br> 

# 3.1 NVIDIA TensorRT-LLM

## Introduction 


In 2020, OpenAI demonstrated that training a large language model in a self-supervised way on a large volume of data can significantly improve the capabilities of GPT models ([refer to the paper for more details](https://arxiv.org/abs/2005.14165)). The largest GPT-3 variant has 175 billion parameters, which occupy about 350 GB even when represented in half precision. Putting such a model on a single GPU is therefore impossible, making multi-GPU or even multi-node deployment a necessity. To address the challenges of latency and memory footprint, an inference library needs highly efficient kernels, careful memory usage, and support for model parallelism.</br>
[NVIDIA’s TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is an open-source library that delivers optimized inference performance for the latest large language models on NVIDIA GPUs. It builds on the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives, largely inspired by the former FasterTransformer library.
It offers a convenient Python API, so you can experiment with new LLMs and customize them quickly, at peak performance, without requiring deep knowledge of C++ or NVIDIA CUDA.</br>
TensorRT-LLM comes with several popular models pre-defined, which can easily be modified and extended to fit custom needs. See the list of supported [models](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models).
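
The 350 GB figure quoted above for GPT-3 follows directly from the parameter count and the number of bytes per parameter. The snippet below is a minimal sketch of that arithmetic; the helper name `weight_memory_gb` is hypothetical, and the estimate covers the weights only (no KV cache, activations, or runtime overhead).

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Weights-only footprint of two model sizes in several precisions.
for name, params in [("GPT-3 175B", 175e9), ("Llama 13B", 13e9)]:
    for dtype in ("fp16", "int8", "int4"):
        print(f"{name:>10} {dtype}: {weight_memory_gb(params, dtype):6.1f} GB")
```

For 175 billion parameters in fp16 this gives 350 GB, and for a 13 billion parameter model 26 GB, which is why lower-precision formats and model parallelism matter for deployment.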

It also comes with a wide range of other features, including:</br> 
* A number of attention variants and caching methods:
    * Multi-head Attention (MHA)
    * Multi-query Attention (MQA)
    * Grouped-query Attention (GQA)
    * Paged KV Cache for the Attention
* Support for a range of data types and quantization methods:
    * INT4/INT8 Weight-Only Quantization (W4A16 & W8A16)
    * FP8
    * SmoothQuant
    * GPTQ, AWQ
* Advanced features: 
    * In-flight Batching
    * Tensor Parallelism
    * Pipeline Parallelism
    * Greedy-search
    * Beam-search
    * RoPE
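
To make the "weight-only" entry in the list above more concrete, here is a minimal sketch of symmetric INT8 weight-only quantization (the W8A16 idea: weights stored as 8-bit integers, activations left in 16-bit floating point). It is a pure-Python illustration under simplifying assumptions (one scale per output row, no calibration) and does not reproduce TensorRT-LLM's kernels; the function names are hypothetical.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the signed 8-bit range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights before the matrix multiply."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.7]]
q_rows, scales = quantize_per_channel(weights)
restored = dequantize(q_rows, scales)
max_error = max(
    abs(a - b) for ra, rb in zip(weights, restored) for a, b in zip(ra, rb)
)
print("int8 weights:", q_rows)
print("largest reconstruction error:", max_error)
```

Each weight is reconstructed to within half a quantization step of its row, while the stored weights take half the memory of fp16.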

This section of the notebook discusses how TensorRT-LLM can be used to optimize the Llama 2 model. It explains the optimization workflow for both single-GPU and multi-GPU deployments. 

### Tensor and Pipeline Parallelism 

Under the hood, TensorRT-LLM relies on MPI and NVIDIA NCCL for inter- and intra-node communication. With this software stack, large Transformer models can run in Tensor Parallelism mode on multiple GPUs to reduce computational latency. Tensor Parallelism and Pipeline Parallelism can also be combined to execute models with billions or even trillions of parameters (amounting to terabytes of weights) in multi-GPU and multi-node environments. 

We have discussed the techniques below in the lecture but let us revisit them before diving into the implementation detail: 
- Data Parallelism (DP) is a technique used during training. Every GPU holds an identical copy of the model but receives different data to process. The GPUs execute the forward pass in parallel and exchange gradients during the backward pass, so that all devices apply a synchronized weight update based on the averaged gradients. 
- Tensor Parallelism (TP) is a technique used during both training and inference. Instead of splitting the data across multiple GPUs, the tensors of selected layers are split. With Tensor Parallelism across 8 GPUs, the weight tensor of each affected layer is split into 8 segments, each processed on a separate GPU in parallel. The results are gathered at the end of the step. 
- Pipeline Parallelism (PP) is likewise used during both training and inference. Here, individual layers are not split into pieces; instead, consecutive groups of layers are placed on different GPUs. For example, when running a 10-layer neural network across 2 GPUs, the first five layers are deployed on the first GPU and the remaining five on the second. Each GPU processes data sequentially, and the second GPU must wait for the results from the first. 

The diagram below demonstrates the difference between Tensor and Pipeline parallelism. 

<div style="text-align:center"><img src="./images/image3.png" style="width: 1000px;"></div>
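
The difference between the two schemes can also be shown with a toy example. The sketch below simulates both on the CPU in pure Python: "GPUs" are just loop iterations, layers are small integer matrices, and the function names are hypothetical. It assumes the number of rows and the number of layers divide evenly by the parallel size.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def tensor_parallel_matvec(weight, x, tp_size):
    """Split the output rows of ONE layer across tp_size workers, then gather."""
    rows_per_rank = len(weight) // tp_size
    partial_outputs = []
    for rank in range(tp_size):
        shard = weight[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        partial_outputs.append(matvec(shard, x))  # would run on "GPU" `rank`
    # Concatenating the partial results plays the role of the gather step.
    return [y for part in partial_outputs for y in part]


def pipeline_parallel_forward(layers, x, pp_size):
    """Give each of pp_size stages a contiguous block of WHOLE layers."""
    layers_per_stage = len(layers) // pp_size
    for stage in range(pp_size):
        start = stage * layers_per_stage
        for weight in layers[start:start + layers_per_stage]:
            x = matvec(weight, x)  # would run on "GPU" `stage`
        # Here the activations `x` would be sent to the next stage.
    return x


layer = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(tensor_parallel_matvec(layer, [1, 1], tp_size=2))  # same as matvec(layer, [1, 1])

layers = [[[1, 1], [0, 1]], [[2, 0], [0, 2]], [[1, 0], [1, 1]], [[0, 1], [1, 0]]]
print(pipeline_parallel_forward(layers, [1, 2], pp_size=2))
```

In the tensor-parallel case every worker touches every layer but only part of its weights; in the pipeline-parallel case every worker owns complete layers but must wait for the previous stage's output.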


### Optimizations in TensorRT-LLM library 

  

TensorRT-LLM allows us to speed up the inference pipeline achieving lower latency and higher throughput compared to the common deep learning frameworks. Below are the key optimization techniques that allow TensorRT-LLM to achieve its performance: 
1. <b>Layer Fusion</b></br> 
During the model pre-processing stage, certain layers can be fused into a single execution kernel. This considerably reduces GPU memory traffic and kernel launch overhead, increasing the arithmetic intensity of the model and thus accelerating computation at inference time. For example, all operations in the multi-head attention block can be combined into a single kernel. 
2. <b>Autoregressive models: Keys/Values caching. </b></br> 
In the generation phase, a common optimization is to provide the MultiHeadAttention kernel with a cache containing the values of the past K and V elements that have already been computed. That cache is known as the KV cache. The diagram below illustrates the process. TensorRT-LLM uses that technique to accelerate its generation phase. In TensorRT-LLM, there is one KV cache per Transformer layer, which means that there are as many KV caches as layers in a model. The current version of TensorRT-LLM supports two different types of KV caches: contiguous and paged KV caches.<br/> 
<div style="text-align:center"> 
<img src="./images/KV_caching v2.PNG" style="width: 50%;position:relative;"><br/> 
<em>Keys/Values caching</em> 
</div> 
<br/><br/> 
3. <b>Usage of MPI and NCCL to enable inter/intra node communication and support model parallelism. </b></br> 
TensorRT-LLM adds support for systems with multiple GPUs and nodes. It is enabled using TensorRT plugins that wrap communication primitives from the NCCL library, as well as a custom plugin that optimizes the All-Reduce primitive in the presence of all-to-all connections between GPUs (through NVSwitch in DGX systems).</br> 
Tensor Parallelism usually leads to more balanced executions but requires more memory bandwidth between the GPUs. Pipeline Parallelism reduces the need for high-bandwidth communication but may incur load-balancing issues and may be less efficient in terms of GPU utilization.</br>
4. <b>Reduced precision inference</b></br> 
TensorRT-LLM has kernels that support inference in fp32 as well as in reduced precision: fp16, bf16, fp8, int8 and int4. Reduced precision accelerates inference by lowering the amount of data transferred and the memory required. fp16 computations are hardware accelerated by Tensor Cores on all GPU architectures starting from Volta, int8 starting from Turing, and fp8 by the Transformer Engine starting from Hopper.</br>
5. <b>Other optimizations include:</b></br> 
TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching) for higher serving throughput. 
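
The Keys/Values caching described in item 2 above can be illustrated with a toy single-head attention loop. This is a minimal sketch, not TensorRT-LLM code: the query, key and value projections are assumed to be the identity, there is a single layer, and all names are hypothetical. It shows that decoding with a cache produces the same outputs as recomputing K and V for the whole prefix at every step.

```python
import math


def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Stores the K and V vectors of every token processed so far (one layer)."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_with_cache(tokens):
    """Process tokens one by one, computing K/V only for the newest token."""
    cache, outputs = KVCache(), []
    for tok in tokens:
        cache.append(tok, tok)  # toy assumption: q = k = v = token embedding
        outputs.append(attention(tok, cache.keys, cache.values))
    return outputs


def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step (no caching)."""
    outputs = []
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [list(t) for t in prefix]    # rebuilt from scratch each step
        values = [list(t) for t in prefix]
        outputs.append(attention(prefix[-1], keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(decode_with_cache(tokens) == decode_without_cache(tokens))
```

Without the cache, step *t* redoes the K/V work for all *t* tokens; with it, each step adds one entry, at the price of keeping the cache in GPU memory.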

# 3.2 Overall Inference Pipeline with NVIDIA TensorRT-LLM
The diagram below lists all the steps involved in using the TensorRT-LLM library to deploy large models to production. In the following sections, we will go through them one at a time. 

<div style="text-align:center">
<img src="./images/TRTLLM_pipeline.png" style="width: 30%"/>
</div>

## 3.3 Download and install NVIDIA TensorRT-LLM library
Starting with the 23.10 release, Triton includes a container with the TensorRT-LLM Backend and the Python Backend. This container should have everything needed to run a TensorRT-LLM model. You can find this container on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).

In this lab, we have already cloned the backend repository for you using the following commands: </br>
```
git clone -b release/0.6.1 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend 
```
We then fetched the TensorRT-LLM library as a submodule:</br>
```
git submodule update --init --recursive
git lfs install
git lfs pull
```
Finally, we installed the TensorRT-LLM library in the Triton container: </br>
```
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
```

## 3.4 Download Llama Weights
We already downloaded the model weights from the [Meta website](https://llama.meta.com/). Please view the [license agreement](https://ai.meta.com/llama/license/). If you would like to use a Llama model outside of this course, please [register with Meta](https://llama.meta.com/llama-downloads).

Once downloaded, we converted the weights into Hugging Face format using the `src/transformers/models/llama/convert_llama_weights_to_hf.py` Python script from the Transformers library. 

The <b>/dli/task/weights</b> folder's content should look similar to the following:
<div style="text-align:center">
<img src="./images/llama_weights_folder_new.png" style="width: 50%"/>
</div>



## 3.5 Build TensorRT-LLM engines
### Build on 1 GPU

In this section, we are going to build TensorRT-LLM engines from the Hugging Face Llama weights, first on 1 GPU, and then on 4 GPUs using model parallelism.   

In [1]:
%cd /dli/task/tensorrtllm_backend/tensorrt_llm/examples/llama
%pip install -r requirements.txt
%pip install --upgrade protobuf

/dli/task/tensorrtllm_backend/tensorrt_llm/examples/llama


[notice] A new release of pip is available: 23.3 -> 24.0
[notice] To update, run: python3 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting protobuf
  Downloading protobuf-5.26.1-cp37-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading protobuf-5.26.1-cp37-abi3-manylinux2014_x86_64.whl (302 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.2
    Uninstalling protobuf-4.25.2:
      Successfully uninstalled protobuf-4.25.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of t

Define the output directory in which to store the compiled engine: 

In [2]:
%cd /dli/task
trt_engine_1gpu="/dli/task/trt-engines/llama_13b/fp16/1-gpu"
!mkdir -p $trt_engine_1gpu

/dli/task


The build command takes a number of parameters, each of which affects the performance of the engine in a different way (memory used for KV caching, batching policy, quantization, and so on). 
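
As an example of how such parameters translate into memory, the sketch below estimates the KV cache needed for one sequence of maximum length. It is a rough estimate under stated assumptions: the Llama 13B shape is taken to be 40 layers with 40 attention heads of dimension 128 (no grouped-query attention), values are stored in fp16, and batch size, paging granularity and runtime overhead are ignored. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Memory for K and V across all layers for `num_tokens` cached tokens."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed Llama 13B shape: 40 layers, 40 attention heads of dimension 128.
max_tokens = 2048 + 512  # --max_input_len + --max_output_len used in this section
per_sequence = kv_cache_bytes(40, 40, 128, max_tokens)
print(f"KV cache per full-length sequence: {per_sequence / 2**20:.0f} MiB")
```

Under these assumptions a single full-length sequence needs about 2000 MiB of KV cache, so the maximum sequence lengths chosen at build time directly bound how many requests can be batched on a GPU.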

In [3]:
hf_weights_dir = "/dli/task/weights"
!python tensorrtllm_backend/tensorrt_llm/examples/llama/build.py  \
                --model_dir $hf_weights_dir \
                --dtype float16 \
                --use_gpt_attention_plugin float16  \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --use_gemm_plugin float16  \
                --output_dir $trt_engine_1gpu  \
                --max_input_len 2048 --max_output_len 512 \
                --use_rmsnorm_plugin float16  \
                --enable_context_fmha

[05/09/2024-13:30:51] [TRT-LLM] [I] Serially build TensorRT engines.
[05/09/2024-13:30:51] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 125, GPU 27412 (MiB)
[05/09/2024-13:30:53] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2234, GPU 27762 (MiB)
[05/09/2024-13:30:53] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[05/09/2024-13:30:57] [TRT-LLM] [I] Loading HF LLaMA ... from /dli/task/weights
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00, 13.11it/s]
[05/09/2024-13:31:01] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:04
[05/09/2024-13:31:01] [TRT-LLM] [I] Loading weights from HF LLaMA...
[05/09/2024-13:31:03] [TRT-LLM] [I] Weights loaded. Total time: 00:00:01
[05/09/2024-13:31:04] [TRT-LLM] [I] Context FMHA Enabled
[05/09/2024-13:31:04] [TRT-LLM] [I] Remove Padding Enabled
[05/09/2024-13:31:04] [TRT-LLM] [I] Paged KV Cache Enabled
[05/09/2024-

In [4]:
# Check your output! You should now have a .engine file in the folder
!ls $trt_engine_1gpu

config.json  llama_float16_tp1_rank0.engine  model.cache


### Build on 4 GPUs
Large language models can be huge, and GPU memory can become a limitation.
Pipeline and Tensor Parallelism (PP and TP) are efficient ways to work around the memory limit of a single GPU: they split the model into parts at training and inference time and distribute them among multiple GPUs.</br>
Let's see it in action on Llama 13B.
The world size is the total number of parts, and therefore GPUs, that the model is split across.
For example, with PP=2 and TP=2, the world size is equal to 4.
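
The relationship between the world size and the two parallel dimensions can be sketched as follows. The rank layout used here (consecutive ranks belong to the same tensor-parallel group) is an assumption made for illustration; the actual placement is decided by the library, and both function names are hypothetical.

```python
def world_size(tp_size: int, pp_size: int) -> int:
    """Total number of processes (one per GPU) needed for the deployment."""
    return tp_size * pp_size


def rank_to_coords(rank: int, tp_size: int, pp_size: int):
    """Map a global rank to (pp_rank, tp_rank), assuming TP ranks are adjacent."""
    if not 0 <= rank < tp_size * pp_size:
        raise ValueError(f"rank {rank} is outside a world of size {tp_size * pp_size}")
    return rank // tp_size, rank % tp_size


tp_size, pp_size = 2, 2
print("world size:", world_size(tp_size, pp_size))
for rank in range(world_size(tp_size, pp_size)):
    pp_rank, tp_rank = rank_to_coords(rank, tp_size, pp_size)
    print(f"rank {rank}: pipeline stage {pp_rank}, tensor shard {tp_rank}")
```

With TP=2 and PP=2 this enumerates four ranks: two pipeline stages, each holding two tensor shards.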

Prepare the output directory: 

In [5]:
trt_engine_4gpus= "/dli/task/trt-engines/llama_13b/fp16/4-gpus"
!mkdir -p $trt_engine_4gpus

Build your engine using the `--tp_size` and `--pp_size` flags.
The value of `--world_size` must be equal to `tp_size * pp_size`.

In [6]:
!python tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
    --model_dir $hf_weights_dir \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_rmsnorm_plugin float16 \
    --use_inflight_batching \
    --remove_input_padding \
    --enable_context_fmha \
    --paged_kv_cache \
    --max_input_len 2048 --max_output_len 512 \
    --output_dir $trt_engine_4gpus \
    --world_size 4 \
    --tp_size 2 \
    --pp_size 2

[05/09/2024-13:32:22] [TRT-LLM] [I] Serially build TensorRT engines.
[05/09/2024-13:32:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 125, GPU 27412 (MiB)
[05/09/2024-13:32:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2234, GPU 27762 (MiB)
[05/09/2024-13:32:24] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[05/09/2024-13:32:25] [TRT-LLM] [I] Loading HF LLaMA ... from /dli/task/weights
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00, 13.24it/s]
[05/09/2024-13:32:26] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:00
[05/09/2024-13:32:26] [TRT-LLM] [I] Loading weights from HF LLaMA...
[05/09/2024-13:32:28] [TRT-LLM] [I] Weights loaded. Total time: 00:00:02
[05/09/2024-13:32:28] [TRT-LLM] [I] Context FMHA Enabled
[05/09/2024-13:32:28] [TRT-LLM] [I] Remove Padding Enabled
[05/09/2024-13:32:28] [TRT-LLM] [I] Paged KV Cache Enabled
[05/09/2024-

In [7]:
# Check your output! You should have 4 engines in the folder, one for each rank
!ls $trt_engine_4gpus

config.json			    llama_float16_tp2_pp2_rank2.engine
llama_float16_tp2_pp2_rank0.engine  llama_float16_tp2_pp2_rank3.engine
llama_float16_tp2_pp2_rank1.engine  model.cache
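
The manual check above can also be scripted. The sketch below counts the `.engine` files in an output directory and compares the count with the expected world size; it relies only on the file layout visible in the listings in this notebook (one `.engine` file per rank plus a `config.json`), and the helper names are hypothetical.

```python
from pathlib import Path


def count_engines(engine_dir: str) -> int:
    """Count the serialized engines (*.engine files) in a build output folder."""
    return len(list(Path(engine_dir).glob("*.engine")))


def check_engine_dir(engine_dir: str, tp_size: int, pp_size: int) -> bool:
    """Return True if there is one engine per rank and a config.json file."""
    expected = tp_size * pp_size
    has_config = (Path(engine_dir) / "config.json").is_file()
    return count_engines(engine_dir) == expected and has_config
```

For the build above, `check_engine_dir(trt_engine_4gpus, tp_size=2, pp_size=2)` should return `True` once all four ranks have been built.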


## 3.6 Run the TensorRT-LLM engine

### Run on 1 GPU

Your engines are now ready to run using the TensorRT-LLM library! 
The tokenizer files are stored next to the weights in the Llama folder. 

In [9]:
!python tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
    --engine_dir=$trt_engine_1gpu \
    --max_output_len 128 \
    --tokenizer_dir $hf_weights_dir \
    --input_text "How do I count in French ? 1 un"

Running the float16 engine ...
Input: "How do I count in French ? 1 un"
Output: "2 deux 3 trois 4 quatre 5 cinq 6 six 7 sept 8 huit 9 neuf 10 dix 11 onze 12 douze 13 treize 14 quatorze 15 quinze 16 seize 17 dix-sept 18 dix-huit 19 dix-neuf 20 vingt 21 vingt et un 22 vingt et deux 23 vingt et trois 24 vingt et quatre 25 vingt et cinq 26 vingt et"


### Run on 4 GPUs

Use the `mpirun` command to launch the run script for multi-GPU execution, with one process per GPU. 

In [10]:
!mpirun -n 4 --allow-run-as-root python tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
    --engine_dir=$trt_engine_4gpus \
    --max_output_len 128 \
    --tokenizer_dir $hf_weights_dir \
    --input_text "How do I count in French? 1 un "

Running the float16 engine ...
Input: "How do I count in French? 1 un "
Output: "2 deux 3 trois 4 quatre 5 cinq 6 six 7 sept 8 huit 9 neuf 10 dix 11 onze 12 douze 13 treize 14 quatorze 15 quinze 16 seize 17 dix-sept 18 dix-huit 19 dix-neuf 20 vingt 21 vingt et un 22 vingt et deux 23 vingt et trois 24 vingt et quatre 25 vingt et cinq 26 vingt et six"


## 3.7 Exercise - Build and Run on 2 GPUs
Now practice on your own: try to create a TensorRT-LLM engine for 2 GPUs using only Tensor Parallelism.</br>
1) Prepare your output directory </br>
2) Build the engine </br>
3) Run and test your engine </br>

Fill in the <<<< FIXME >>>> placeholders in the cells below. If you are stuck, check the solutions by clicking on the ... just under each cell.

In [11]:
# 1) Prepare your output directory
trt_engine_2gpus= "/dli/task/trt-engines/llama_13b/fp16/2-gpus"
!mkdir -p $trt_engine_2gpus

In [None]:
# SOLUTION 
# 1) Prepare your output directory
trt_engine_2gpus= "/dli/task/trt-engines/llama_13b/fp16/2-gpus"
!mkdir -p $trt_engine_2gpus

In [12]:
# 2) Build the engine 
!python tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
    --model_dir $hf_weights_dir \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_rmsnorm_plugin float16 \
    --use_inflight_batching \
    --remove_input_padding \
    --enable_context_fmha \
    --paged_kv_cache \
    --max_input_len 2048 --max_output_len 512 \
    --output_dir $trt_engine_2gpus \
    --world_size 2 \
    --tp_size 2 \
    --pp_size 1

[05/09/2024-13:44:16] [TRT-LLM] [I] Serially build TensorRT engines.
[05/09/2024-13:44:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 125, GPU 27412 (MiB)
[05/09/2024-13:44:18] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2234, GPU 27762 (MiB)
[05/09/2024-13:44:18] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[05/09/2024-13:44:20] [TRT-LLM] [I] Loading HF LLaMA ... from /dli/task/weights
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:00<00:00, 13.26it/s]
[05/09/2024-13:44:24] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:03
[05/09/2024-13:44:24] [TRT-LLM] [I] Loading weights from HF LLaMA...
[05/09/2024-13:44:26] [TRT-LLM] [I] Weights loaded. Total time: 00:00:02
[05/09/2024-13:44:27] [TRT-LLM] [I] Context FMHA Enabled
[05/09/2024-13:44:27] [TRT-LLM] [I] Remove Padding Enabled
[05/09/2024-13:44:27] [TRT-LLM] [I] Paged KV Cache Enabled
[05/09/2024-

In [None]:
# SOLUTION 
# 2) Build the engine 
!python tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
    --model_dir $hf_weights_dir \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_rmsnorm_plugin float16 \
    --use_inflight_batching \
    --remove_input_padding \
    --enable_context_fmha \
    --paged_kv_cache \
    --max_input_len 2048 --max_output_len 512 \
    --output_dir $trt_engine_2gpus \
    --world_size 2 \
    --tp_size 2 \
    --pp_size 1

In [13]:
# 3) Run and test your engine
!mpirun -n 2 --allow-run-as-root python tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
    --engine_dir=$trt_engine_2gpus \
    --max_output_len 128 \
    --tokenizer_dir $hf_weights_dir \
    --input_text "How do I count in German? 1 eins "

Running the float16 engine ...
Input: "How do I count in German? 1 eins "
Output: "2 zwei 3 drei 4 vier 5 fünf 6 sechs 7 sieben 8 acht 9 neun 10 zehn 11 elf 12 zwölf 13 dreizehn 14 vierzehn 15 fünfzehn 16 sechzehn 17 siebzehn 18 achtzehn 19 neunzehn 20 zwanzig 21 einundzwanzig 22 zweiundzwanzig 23 dreiundzwanzig 24 vierundzwanzig 25 fünfund"


In [None]:
# SOLUTION
# 3) Run and test your engine
!mpirun -n 2 --allow-run-as-root python tensorrtllm_backend/tensorrt_llm/examples/llama/run.py \
    --engine_dir=$trt_engine_2gpus \
    --max_output_len 128 \
    --tokenizer_dir $hf_weights_dir \
    --input_text "How do I count in French? 1 un "

<h2 style="color:green;">Congratulations!</h2>

Please proceed to [Inference of the Llama 13B model with Triton Inference Server and TensorRT-LLM as a backend.](04_TRTLLMAndTritonRunRemoteInferenceOfTheLlama.ipynb)
