# Run inference on Mistral 7B using NVIDIA TensorRT-LLM

Welcome!

In this notebook, we walk through converting Mistral 7B into the TensorRT format. TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and building TensorRT engines that contain state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorRT-LLM was recently featured in the Phind-70B release as their preferred framework for performing inference!

See the [GitHub repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!

A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running. A number means it has completed. If your notebook is misbehaving, you can stop a long-running process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or by restarting it (Kernel tab -> Restart Kernel). Note that restarting the kernel requires you to re-run every cell from the beginning.

Deployment powered by [Brev.dev](https://x.com/brevdev) 🤙


#### Step 1 - Install TensorRT-LLM

We first install TensorRT-LLM and some additional packages that are used during the conversion process.

In [1]:
!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
!pip uninstall -y mpmath
!pip install mpmath==1.3.0
!pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting tensorrt_llm
  Downloading https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.9.0.dev2024022000-cp310-cp310-linux_x86_64.whl (1229.9 MB)
Collecting accelerate==0.25.0 (from tensorrt_llm)
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting build (from tensorrt_llm)
  Downloading build-1.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting colored (from tensorrt_llm)
  Downloading colored-2.2.4-py3-none-any.whl.metadata (3.6 kB)
Collecting cuda-python (from tensorrt_llm)
  Downloading cuda_python-12.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting diffusers==0.15.0 (from tensorrt_llm)
  Downloading diffusers-0.15.0-py3-none-any.whl.metadata (19 kB)
Collecting lark (from tensorrt_llm)
  Downloading lark-1
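
Before moving on, it can help to confirm which versions the install cell actually produced, since `--pre` pulls whatever dev build is current. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is an illustrative choice, not part of TensorRT-LLM.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Packages touched by the install cell above.
for name in ("tensorrt_llm", "mpmath", "ipywidgets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `tensorrt_llm` shows as not installed, re-running the install cell (and restarting the kernel afterwards) is usually the first thing to try.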

#### Step 2 - Convert Mistral to the TensorRT format

In [2]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .

--2024-02-27 07:33:47--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63559 (62K) [text/plain]
Saving to: ‘./convert_checkpoint.py’


2024-02-27 07:33:47 (6.98 MB/s) - ‘./convert_checkpoint.py’ saved [63559/63559]

--2024-02-27 07:33:47--  https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21871 (21K) [text/plain]
Saving to: ‘./run.py’


2024-02-27 07:33:47 (12.2
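
Note that the cell above fetches the helper scripts from the `main` branch, while `pip` installed a dated dev build (`0.9.0.dev2024022000` in this run). If the two drift apart, one option is to fetch the scripts from a specific tag or commit instead. The sketch below only builds the URLs; the `ref` value passed in the second call is a hypothetical placeholder, not a real tag name.

```python
RAW_BASE = "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM"

# The three files downloaded by the wget commands above.
EXAMPLE_FILES = (
    "examples/llama/convert_checkpoint.py",
    "examples/run.py",
    "examples/utils.py",
)


def raw_url(path, ref="main"):
    """Build the raw.githubusercontent.com URL for a file at a given branch, tag, or commit."""
    return f"{RAW_BASE}/{ref}/{path}"


# Same URLs as the wget cell (ref defaults to "main").
for path in EXAMPLE_FILES:
    print(raw_url(path))

# Pinned variant; "<tag-or-commit>" is a placeholder you would replace yourself.
print(raw_url(EXAMPLE_FILES[0], ref="<tag-or-commit>"))
```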

In [3]:
!python convert_checkpoint.py --model_dir mistralai/Mistral-7B-v0.1 --output_dir ./tllm_checkpoint_1gpu_mistral --dtype float16

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
0.9.0.dev2024022000
config.json: 100%|█████████████████████████████| 571/571 [00:00<00:00, 4.44MB/s]
model.safetensors.index.json: 100%|████████| 25.1k/25.1k [00:00<00:00, 81.3MB/s]
Downloading shards:   0%|                                 | 0/2 [00:00<?, ?it/s]
model-00001-of-00002.safetensors:   0%|             | 0.00/9.94G [00:00<?, ?B/s]
model-00001-of-00002.safetensors:   0%|     | 21.0M/9.94G [00:00<01:09, 142MB/s]
model-00001-of-00002.safetensors:   0%|     | 41.9M/9.94G [00:00<00:59, 168MB/s]
model-00001-of-00002.safetensors:   1%|     | 62.9M/9.94G [00:00<00:53, 184MB/s]
model-00001-of-00002.safetensors:   1%|     | 83.9M/9.94G [00:00<00:51, 192MB/s]
model-00001-of-00002.safetensors:   1%|      | 115M/9.94G [00:00<00:48, 201MB/s]
model-00001-of-00002.safetensors:   1%|      | 147M/9.94G [00:00<00:47, 205MB/s]
model-00001-of-00002.safetensors:   1%|      | 178M/9.94G [00:00<00:46, 208MB/s]
model-00
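
The `--dtype float16` flag means each weight is stored in 2 bytes, so the checkpoint size can be estimated from the parameter count. The sketch below is a back-of-envelope calculation, assuming the architecture values from the published Mistral-7B-v0.1 `config.json` (they are not shown in this notebook) and counting only the weight matrices and norm vectors.

```python
# Assumed values from the published Mistral-7B-v0.1 config.json.
HIDDEN = 4096
INTERMEDIATE = 14336
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8
VOCAB = 32000
HEAD_DIM = HIDDEN // N_HEADS  # 128


def mistral_param_count():
    """Count parameters: embeddings, per-layer attention/MLP/norms, final norm, LM head."""
    kv_dim = N_KV_HEADS * HEAD_DIM
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
    norms = 2 * HIDDEN                                     # two RMSNorm vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN                        # input embedding + untied LM head
    return N_LAYERS * per_layer + embeddings + HIDDEN      # + final norm


params = mistral_param_count()
fp16_mib = params * 2 / 2**20  # 2 bytes per float16 weight
print(f"parameters: {params:,}")
print(f"float16 weights: {fp16_mib:,.1f} MiB")
```

Under these assumptions the estimate comes to roughly 13.8k MiB, which is in line with the engine size and the TensorRT-managed GPU allocation reported in the run log at the end of this notebook.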

In [4]:
!mkdir -p mistral_engine
!trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral --output_dir ./mistral_engine --gemm_plugin float16 --max_input_len 32256

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
[02/27/2024-07:38:29] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set gemm_plugin to float16.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set lookup_plugin to None.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set lora_plugin to None.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set moe_plugin to float16.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set context_fmha to True.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set paged_kv_cache to True.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set remove_input_padding to True.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set multi_block_mode to False.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set enable_xqa to True.
[02/27/2024-07:38:29] [TRT-LLM] [I] Set attention_qk_half_accumulation to False
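
If you would rather drive the build from Python than from a `!` shell line (for example, to stop the notebook when the build fails), the same command can be assembled as an argument list. This is a minimal sketch: the helper name `trtllm_build_argv` is hypothetical, and it only uses the flags that already appear in the cell above.

```python
import shlex


def trtllm_build_argv(checkpoint_dir, engine_dir, dtype="float16", max_input_len=32256):
    """Assemble the trtllm-build command used above as an argv list."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--gemm_plugin", dtype,
        "--max_input_len", str(max_input_len),
    ]


argv = trtllm_build_argv("./tllm_checkpoint_1gpu_mistral", "./mistral_engine")
print(shlex.join(argv))  # dry run: show the command without executing it

# To actually run the build and raise on a non-zero exit code:
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list rather than a single string avoids shell quoting issues if a directory name contains spaces.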

In [7]:
!python3 run.py --max_output_len=50 --tokenizer_dir mistralai/Mistral-7B-v0.1 --engine_dir=./mistral_engine --max_attention_window_size=4096 --input_text "Swap memory is"

[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022000
[TensorRT-LLM][INFO] Engine version 0.9.0.dev2024022000 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 13815 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13966, GPU 14090 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13967, GPU 14100 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +13812, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 14004, GPU 17044 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 14004, GPU 17052 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] Allocat
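
The run command above caps attention at `--max_attention_window_size=4096`, while the engine was built with `--max_input_len 32256`. A rough calculation shows why the window matters for memory. The sketch below assumes the published Mistral-7B-v0.1 architecture (32 layers, 8 key/value heads, head dimension 128) and a float16 KV cache; it estimates the cache for a single sequence and ignores paging granularity and whatever the runtime actually pre-allocates.

```python
# Assumed Mistral-7B-v0.1 architecture values (not shown in this notebook).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # float16


def kv_cache_bytes(num_tokens):
    """Estimate KV cache size for one sequence: keys and values for every layer."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # 2 = K and V
    return num_tokens * per_token


for label, tokens in (("attention window", 4096), ("max input length", 32256)):
    print(f"{label} ({tokens} tokens): {kv_cache_bytes(tokens) / 2**20:.0f} MiB")
```

Under these assumptions, each cached token costs 128 KiB, so limiting the window to 4096 tokens keeps the per-sequence cache near 512 MiB instead of about 4 GiB for a full 32256-token context.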