# Deploy Llama3 with TensorRT-LLM

Welcome!

In this notebook, we will walk through converting Llama3 into the TensorRT format. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Once the TensorRT engine is built, you can run it with the `run.py` script shown at the end of this notebook or use the engine as an input to the Triton Inference Server.

See the [GitHub repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!

Deployment is powered by [Brev.dev](https://twitter.com/brevdev). Here is the link to the [notebook](https://console.brev.dev/notebook/llama3-tensorrtllm-deployment).

## Step 1 - Install TensorRT-LLM

TensorRT-LLM is already installed in the `nlp-1.3` and `Llama3` Jupyter kernels, so you can choose either of these to run the rest of the notebook. If you are working in a different environment, uncomment the cell below to install it first.

In [None]:
#!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

## Step 2 - Download Llama3 model weights

Llama3 is a gated model, which means you'll need to request access on its Hugging Face repository and generate a Hugging Face (HF) token. Approval usually takes about 20 minutes! The good news is that we have already downloaded the Llama3 model, which is located at `/data/ai/models/nlp/llama/models_llama3`, so the download cells below are left commented out.

In [None]:
# import huggingface_hub

In [None]:
# huggingface_hub.login("<ENTER TOKEN HERE>")

In [None]:
# huggingface_hub.snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", local_dir="llama3-hf")
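
Because the weights are already on disk, a quick check like the one below can tell you whether the download cells are needed at all. This is a minimal sketch: the helper name `has_hf_checkpoint` is hypothetical, the `Meta-Llama-3-8B-hf` subdirectory is the one used in Step 3, and the test for "looks like a checkpoint" (a `config.json` plus at least one weight file) is an assumption rather than an official rule.

```python
from pathlib import Path

# Location of the pre-downloaded models, as stated in the text above.
LOCAL_MODEL_ROOT = Path("/data/ai/models/nlp/llama/models_llama3")


def has_hf_checkpoint(model_dir) -> bool:
    """Return True if model_dir looks like a Hugging Face checkpoint.

    Assumption: a usable checkpoint contains a config.json and at least
    one weight file (*.safetensors or *.bin).
    """
    model_dir = Path(model_dir)
    if not (model_dir / "config.json").is_file():
        return False
    has_safetensors = any(model_dir.glob("*.safetensors"))
    has_bin = any(model_dir.glob("*.bin"))
    return has_safetensors or has_bin


# Subdirectory name taken from the convert_checkpoint.py cell in Step 3.
model_dir = LOCAL_MODEL_ROOT / "Meta-Llama-3-8B-hf"
if has_hf_checkpoint(model_dir):
    print(f"Using local weights at {model_dir}")
else:
    print("Local weights not found; run the commented download cells instead.")
```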

## Step 3 - Convert checkpoints into safetensors and build the TRT engine

There are 2 substeps here. The first is converting the raw Hugging Face model into a TensorRT-LLM checkpoint stored as safetensors, which is a safe and fast format for storing tensors.

Next, we build the TensorRT engine. This is where the magic happens. We take the converted safetensors checkpoint and compile it into a TensorRT engine. Engines are optimized versions of models built to run lightning fast on the current machine.

In [None]:
!wget -L https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py

In [None]:
!python convert_checkpoint.py --model_dir /data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-hf \
    --output_dir ./llama3-safetensors \
    --dtype bfloat16
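
As a rough sanity check on the `--dtype bfloat16` choice, the sketch below estimates how much memory the weights alone occupy. It assumes the nominal 8 billion parameters implied by the model name (the exact count differs slightly) and ignores the KV cache and activations; the helper name is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_footprint_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights in GiB (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: a nominal 8e9 parameters for the "8B" model.
NOMINAL_PARAMS = 8e9
for dtype in ("float32", "bfloat16"):
    size = weight_footprint_gib(NOMINAL_PARAMS, dtype)
    print(f"{dtype}: ~{size:.1f} GiB")
```

Under these assumptions the bfloat16 weights come to roughly 14.9 GiB, half the float32 figure, which is one reason a single GPU with enough memory can hold an engine of this size.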

In [None]:
!trtllm-build --checkpoint_dir llama3-safetensors \
    --output_dir ./llama3engine_bf16_1gpu \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16
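
Once both cells have finished, it can help to confirm that each output directory was actually populated. The sketch below assumes that `convert_checkpoint.py` and `trtllm-build` each write a `config.json` next to per-rank weight files (`*.safetensors` for the checkpoint, `*.engine` for the engine). That layout may vary between TensorRT-LLM versions, and the helper name `summarize_output_dir` is hypothetical.

```python
from pathlib import Path


def summarize_output_dir(path, weight_suffix):
    """Report whether an output directory holds a config and weight files.

    weight_suffix is ".safetensors" for the converted checkpoint and
    ".engine" for the built engine (assumed file layout, see above).
    """
    path = Path(path)
    if not path.is_dir():
        return {"exists": False, "has_config": False,
                "weight_files": [], "size_gib": 0.0}
    weight_files = sorted(path.glob(f"*{weight_suffix}"))
    total_bytes = sum(f.stat().st_size for f in weight_files)
    return {
        "exists": True,
        "has_config": (path / "config.json").is_file(),
        "weight_files": [f.name for f in weight_files],
        "size_gib": total_bytes / 2**30,
    }


# Directory names taken from the two cells above.
for directory, suffix in [
    ("llama3-safetensors", ".safetensors"),
    ("llama3engine_bf16_1gpu", ".engine"),
]:
    print(directory, summarize_output_dir(directory, suffix))
```

If either directory reports `exists: False` or an empty `weight_files` list, the corresponding cell most likely failed and its log is worth rereading before moving on.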

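## Step 4 - Run the model using the example script!

The `--tokenizer_dir` argument must point to a directory that contains the Llama3 tokenizer files. `llama3-hf` exists only if you ran the download cells in Step 2; otherwise, pass the pre-downloaded model directory used in Step 3.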

In [None]:
!git clone https://github.com/NVIDIA/TensorRT-LLM.git

In [None]:
!python ./TensorRT-LLM/examples/run.py --engine_dir=llama3engine_bf16_1gpu \
    --max_output_len 100 \
    --tokenizer_dir llama3-hf \
    --input_text "How do I count to nine in French?"
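
To try several prompts without retyping the command, the call above can be wrapped in a small helper. This is a sketch only: `build_run_command` is a hypothetical name, it uses just the `run.py` flags that appear in the cell above, and the tokenizer directory passed in the example is the Step 3 model directory (swap in `llama3-hf` if you ran the download cells).

```python
import shlex
import sys


def build_run_command(prompt, tokenizer_dir,
                      engine_dir="llama3engine_bf16_1gpu",
                      max_output_len=100,
                      script="./TensorRT-LLM/examples/run.py"):
    """Build the argument list for one run.py invocation."""
    return [
        sys.executable, script,
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,
    ]


# Any directory holding the Llama3 tokenizer files works here.
TOKENIZER_DIR = "/data/ai/models/nlp/llama/models_llama3/Meta-Llama-3-8B-hf"

prompts = [
    "How do I count to nine in French?",
    "How do I count to nine in Spanish?",
]
for prompt in prompts:
    # Print the command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_run_command(prompt, TOKENIZER_DIR)))
```

Passing the arguments as a list (for example to `subprocess.run(cmd, check=True)`) avoids shell quoting problems when a prompt contains quotes or other special characters.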