# Deploy Llama3 with TensorRT-LLM

Welcome!

In this notebook, we walk through converting Llama3 into a TensorRT engine. TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations for efficient inference on NVIDIA GPUs.

Once the TensorRT engine is built, you can run it with the `run.py` example script used at the end of this notebook, or use the engine as an input to the Triton Inference Server.

See the [GitHub repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!

Deployment powered by [Brev.dev](https://twitter.com/brevdev). The original version is available as a Brev [notebook](https://console.brev.dev/notebook/llama3-tensorrtllm-deployment).

## Step 1 - Install TensorRT-LLM

TensorRT-LLM is already installed in the `nlp-1.3` and `Llama3` Jupyter kernels, so the install command below is left commented out. Choose either kernel to run the rest of the notebook.

In [None]:
#!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
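
To confirm that the active kernel already provides TensorRT-LLM before uncommenting the install command, a minimal check along these lines may help. It is only a sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the same name used in the pip command (`tensorrt_llm`).

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the package metadata uses the name from the pip command above.
version = installed_version("tensorrt_llm")
if version is None:
    print("tensorrt_llm not found in this kernel; run the pip command above.")
else:
    print(f"tensorrt_llm {version} is already installed.")
```

If the check reports that the package is missing, switch to one of the kernels named above or run the pip command.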

## Step 2 - Download Llama3 model weights

Llama3 is a gated model, which means you need to request access on its Hugging Face repository and generate a Hugging Face (HF) token. Approval usually takes about 20 minutes! The good news is that the Llama3 model has already been downloaded to `/data/ai/models/nlp/llama/models_llama3`, so the download cells below are left commented out.

In [None]:
# import huggingface_hub

In [None]:
# huggingface_hub.login("<ENTER TOKEN HERE>")

In [None]:
# huggingface_hub.snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", local_dir="llama3-hf")
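
Since the weights may come either from the shared location or from a fresh download, a small helper can pick whichever directory is actually present. This is a sketch under stated assumptions: `resolve_model_dir` is a hypothetical helper, the shared path is the one passed to `convert_checkpoint.py` later in this notebook, and the presence of `config.json` is used as a simple indicator of a Hugging Face checkpoint directory.

```python
from pathlib import Path


def resolve_model_dir(preferred: Path, fallback: Path) -> Path:
    """Return `preferred` if it contains a config.json, otherwise `fallback`."""
    if (preferred / "config.json").is_file():
        return preferred
    return fallback


# Shared copy used by the conversion cell in Step 3 of this notebook.
shared_dir = Path("/data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-Instruct-hf")
# Target directory of the commented-out snapshot_download call above.
downloaded_dir = Path("llama3-hf")

model_dir = resolve_model_dir(shared_dir, downloaded_dir)
print(f"Using model weights from: {model_dir}")
```

The helper only chooses a path; it does not download anything. If it falls back to `llama3-hf`, the commented-out download cells above still need to be run first.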

## Step 3 - Convert checkpoints into safetensors and build the TRT engine

There are two substeps here. The first converts the raw Hugging Face model into a checkpoint stored as safetensors, a safe and fast format for storing tensors.

Next we build the TensorRT engine. This is where the magic happens: `trtllm-build` takes the converted safetensors checkpoint and compiles it into a TensorRT engine. Engines are optimized versions of models built to run lightning fast on the current machine.
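
The two substeps can also be assembled from Python, which makes the data flow explicit: the output directory of the conversion becomes the checkpoint directory of the build. The sketch below only constructs and prints the command lines. The function names are hypothetical, while the flags and paths are the ones used in this notebook's cells.

```python
import shlex
from typing import List


def convert_command(model_dir: str, checkpoint_dir: str, dtype: str = "bfloat16") -> List[str]:
    """Command for substep 1: Hugging Face weights -> safetensors checkpoint."""
    return [
        "python", "convert_checkpoint.py",
        "--model_dir", model_dir,
        "--output_dir", checkpoint_dir,
        "--dtype", dtype,
    ]


def build_command(checkpoint_dir: str, engine_dir: str, dtype: str = "bfloat16") -> List[str]:
    """Command for substep 2: safetensors checkpoint -> TensorRT engine."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--gpt_attention_plugin", dtype,
        "--gemm_plugin", dtype,
    ]


checkpoint_dir = "./llama3-safetensors"
commands = [
    convert_command(
        "/data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-Instruct-hf",
        checkpoint_dir,
    ),
    build_command(checkpoint_dir, "./llama3engine_bf16_1gpu"),
]
for cmd in commands:
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the dtype in one argument avoids a mismatch between the checkpoint precision and the plugin precision, which are both `bfloat16` in this notebook.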

In [1]:
!wget -L https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py

--2024-05-10 14:23:48--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17658 (17K) [text/plain]
Saving to: ‘convert_checkpoint.py’


2024-05-10 14:23:48 (43.5 MB/s) - ‘convert_checkpoint.py’ saved [17658/17658]



In [9]:
!python convert_checkpoint.py --model_dir /data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-Instruct-hf \
    --output_dir ./llama3-safetensors \
    --dtype bfloat16

--------------------------------------------------------------------------

  Local host:   c0904a-s17
  Local device: mlx5_1
--------------------------------------------------------------------------
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
0.10.0.dev2024050700
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:14<00:00,  3.61s/it]
Weights loaded. Total time: 00:00:10
Total time of converting checkpoints: 00:01:02
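
Before building the engine, it can be worth checking that the conversion actually wrote files to the output directory. The helper below is a hypothetical sketch that lists file names and sizes. It makes no assumption about which files TensorRT-LLM writes, since that may vary by version.

```python
from pathlib import Path
from typing import Dict


def summarize_dir(directory: str) -> Dict[str, int]:
    """Map each regular file directly inside `directory` to its size in bytes."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}


files = summarize_dir("./llama3-safetensors")
if not files:
    print("No files found; check that the conversion step completed.")
for name, size in files.items():
    print(f"{name}: {size / 2**20:.1f} MiB")
```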


In [10]:
!trtllm-build --checkpoint_dir llama3-safetensors \
    --output_dir ./llama3engine_bf16_1gpu \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16

--------------------------------------------------------------------------

  Local host:   c0904a-s17
  Local device: mlx5_1
--------------------------------------------------------------------------
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
[05/10/2024-16:05:37] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set nccl_plugin to float16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set lookup_plugin to None.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set lora_plugin to None.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set moe_plugin to float16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set context_fmha to True.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/10/2024-16:05:37] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/10/

## Step 4 - Run the model using the example script!

In [5]:
# Uncomment on the first run: the next cell uses examples/run.py from this clone.
# !git clone https://github.com/NVIDIA/TensorRT-LLM.git

Cloning into 'TensorRT-LLM'...
remote: Enumerating objects: 14993, done.
remote: Counting objects: 100% (7097/7097), done.
remote: Compressing objects: 100% (1958/1958), done.
remote: Total 14993 (delta 5436), reused 6102 (delta 5086), pack-reused 7896
Receiving objects: 100% (14993/14993), 204.58 MiB | 40.19 MiB/s, done.
Resolving deltas: 100% (10606/10606), done.
Updating files: 100% (2216/2216), done.
Filtering content: 100% (14/14), 204.50 MiB | 129.91 MiB/s, done.


In [11]:
!python ./TensorRT-LLM/examples/run.py --engine_dir=llama3engine_bf16_1gpu \
    --max_output_len 100 \
    --tokenizer_dir /data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-Instruct-hf \
    --input_text "How do I count to nine in French?"

--------------------------------------------------------------------------

  Local host:   c0904a-s17
  Local device: mlx5_1
--------------------------------------------------------------------------
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024050700
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 15323 MiB
[TensorRT-LLM][INFO] Allocated 192.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15320 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 1
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 467840. Allocating 61320724480 bytes.
Input [Text 0]: "How do I count to nine in F
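
The paged KV cache line in the log above can be checked with a short calculation. The sketch below assumes the published Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a bfloat16 cache at 2 bytes per value. These figures are assumptions, not values printed in this notebook. Only the token count and byte total come from the log.

```python
# Assumed Llama 3 8B architecture values; not printed anywhere in this notebook.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to cache one token: a key and a value vector per head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Values reported in the log above.
max_tokens = 467840
reported_bytes = 61320724480

per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
total = per_token * max_tokens

print(per_token)                # 131072
print(total)                    # 61320724480
print(total == reported_bytes)  # True
```

Under these assumptions each cached token costs 128 KiB, and the product with the reported token capacity matches the allocation in the log exactly.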