## Using TensorRT-LLM to Run Sovereign Models

### Initial Setup

Complete the following steps on a Linux machine before launching this notebook. They start the required Docker container and install the libraries that TensorRT-LLM needs. The same steps are also highlighted in the installation guide.

docker run --rm -it --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --volume ${PWD}:/workspace --workdir /workspace nvidia/cuda:12.4.1-devel-ubuntu22.04

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

pip install jupyterlab
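Before moving on, it can help to confirm that the tools installed above are actually on the `PATH` inside the container. The snippet below is a minimal sketch of such a check; the helper name `missing_tools` and the exact tool list are illustrative assumptions, not part of TensorRT-LLM.

```python
import shutil

# Tools that the setup commands above install or rely on.
# This list is an assumption for illustration; adjust it to your environment.
REQUIRED_TOOLS = ["python3", "pip", "mpirun", "git", "git-lfs"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```

If anything is reported as missing, rerunning the `apt-get` line above inside the container is the first thing to try.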

### Install TensorRT-LLM

In [None]:
!pip install -q ipywidgets
!pip install tensorrt_llm -U -q --extra-index-url https://pypi.nvidia.com

In [None]:
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py -P .

### Swallow Model

In [None]:
!git lfs install

In [None]:
# Clone HF model repo

!git clone https://huggingface.co/tokyotech-llm/Swallow-70b-instruct-v0.1

In [7]:
# Convert the HF checkpoint to TRT-LLM checkpoint format for a single node with 8 GPUs (tensor parallelism of 8).

!python3 convert_checkpoint.py --model_dir ./Swallow-70b-instruct-v0.1 \
                         --output_dir ./tllm_checkpoint_8gpu_tp8_swallow \
                         --dtype float16 --tp_size 8

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Total time of converting checkpoints: 00:05:04
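
Once the conversion finishes, a quick sanity check is to confirm that the output directory holds one weight shard per tensor-parallel rank. The sketch below assumes the TensorRT-LLM checkpoint layout of a `config.json` plus `rank<N>.safetensors` files; that layout and the helper name `missing_checkpoint_files` are assumptions that may not match every TensorRT-LLM version.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, tp_size):
    """List expected checkpoint files that are absent from `checkpoint_dir`.

    Assumes one `rank<N>.safetensors` shard per tensor-parallel rank
    plus a shared `config.json`.
    """
    expected = ["config.json"] + [f"rank{r}.safetensors" for r in range(tp_size)]
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_checkpoint_files("./tllm_checkpoint_8gpu_tp8_swallow", tp_size=8)
print("Missing files:", missing if missing else "none")
```

An empty result suggests the checkpoint is ready to be passed to `trtllm-build`.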


In [None]:
# Build the TRT engine for a single node with 8 GPUs.

!trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8_swallow \
             --output_dir ./swallow/70B/trt_engines/fp16/8-gpu/ \
             --gemm_plugin auto

In [None]:
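# Run inference. The engine was built with tp_size 8, so launch one MPI rank per GPU.

!mpirun -n 8 --allow-run-as-root python3 run.py --engine_dir ./swallow/70B/trt_engines/fp16/8-gpu \
                 --max_output_len 500 \
                 --tokenizer_dir ./Swallow-70b-instruct-v0.1 \
                 --input_text "your prompt here" \
                 --use_py_session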

### Tame Model

In [None]:
# Clone HF model repo

!git clone https://huggingface.co/yentinglin/Llama-3-Taiwan-70B-Instruct

In [None]:
# Convert the HF checkpoint to TRT-LLM checkpoint format for a single node with 8 GPUs (tensor parallelism of 8).

!python3 convert_checkpoint.py --model_dir ./Llama-3-Taiwan-70B-Instruct \
                         --output_dir ./tllm_checkpoint_8gpu_tp8_tame \
                         --dtype float16 --tp_size 8
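
To see why `--tp_size 8` is a sensible choice for a 70B model in `float16`, a back-of-the-envelope estimate of weight memory per GPU is useful. The sketch below assumes a nominal 70e9 parameters, 2 bytes per parameter, and an even split of weights across tensor-parallel ranks; it ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_gb_per_gpu(num_params, bytes_per_param, tp_size):
    """Approximate weight memory per GPU in GB (10**9 bytes), assuming an even split."""
    return num_params * bytes_per_param / tp_size / 1e9


# Assumptions: ~70e9 parameters (the nominal "70B"), float16 = 2 bytes per parameter.
for tp in (1, 2, 4, 8):
    gb = weight_gb_per_gpu(70e9, 2, tp)
    print(f"tp_size={tp}: ~{gb:.1f} GB of weights per GPU")
```

Under these assumptions the weights alone come to roughly 140 GB in total, or about 17.5 GB per GPU with eight-way tensor parallelism.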

In [None]:
# Build the TRT engine for a single node with 8 GPUs.

!trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp8_tame \
             --output_dir ./tame/70B/trt_engines/fp16/8-gpu/ \
             --gemm_plugin auto

In [None]:
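# Run inference. The engine was built with tp_size 8, so launch one MPI rank per GPU.

!mpirun -n 8 --allow-run-as-root python3 run.py --engine_dir ./tame/70B/trt_engines/fp16/8-gpu \
                 --max_output_len 500 \
                 --tokenizer_dir ./Llama-3-Taiwan-70B-Instruct \
                 --input_text "your prompt here" \
                 --use_py_session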
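
The two inference cells differ only in their engine and tokenizer paths, so the command can be assembled by a small helper when you want to script several prompts. This is a minimal sketch: `build_run_command` is a hypothetical helper, it reuses only the `run.py` flags shown above, and it assumes a multi-GPU engine is launched through `mpirun` with one rank per GPU.

```python
import shlex


def build_run_command(engine_dir, tokenizer_dir, prompt, num_gpus=8, max_output_len=500):
    """Build the argument list for running run.py on a multi-GPU engine.

    Assumes one MPI rank per GPU; `--allow-run-as-root` is an Open MPI flag
    that is typically needed when running as root inside a container.
    """
    return [
        "mpirun", "-n", str(num_gpus), "--allow-run-as-root",
        "python3", "run.py",
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,
        "--use_py_session",
    ]


cmd = build_run_command(
    engine_dir="./tame/70B/trt_engines/fp16/8-gpu",
    tokenizer_dir="./Llama-3-Taiwan-70B-Instruct",
    prompt="your prompt here",
)
# Print the shell-quoted command for inspection.
# To execute it, pass `cmd` to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when a prompt contains quotes or other special characters.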