# 🤙 TensorRT Llama3 on NVIDIA Brev

<div style="background: linear-gradient(90deg, #00ff87 0%, #60efff 100%); padding: 1px; border-radius: 8px; margin: 20px 0;">
    <div style="background: #0a0a0a; padding: 20px; border-radius: 7px;">
        <p style="color: #60efff; margin: 0;"><strong>⚡ Powered by Brev</strong> | Converted from <a href="https://github.com/unslothai/notebooks/blob/main/nb/tensorrt-llama3.ipynb" style="color: #00ff87;">Unsloth Notebook</a></p>
    </div>
</div>

## 📋 Configuration

<div style="text-align: left;">

| Parameter | Value |
|:----------|:------|
| **Model** | TensorRT Llama3 |
| **Recommended GPU** | L4 |
| **Min VRAM** | 16 GB |
| **Batch Size** | 2 |
| **Categories** | fine-tuning |

</div>

## 🔧 Key Adaptations for Brev

- ✅ Replaced Colab-specific installation with conda-based Unsloth
- ✅ Converted magic commands to subprocess calls
- ✅ Removed Google Drive dependencies
- ✅ Updated paths from Colab's `/content/` to `/workspace/`
- ✅ Added `device_map="auto"` for multi-GPU support
- ✅ Optimized batch sizes for NVIDIA GPUs

## 📚 Resources

- [Unsloth Documentation](https://docs.unsloth.ai/)
- [Brev Documentation](https://docs.nvidia.com/brev)
- [Original Notebook](https://github.com/unslothai/notebooks/blob/main/nb/tensorrt-llama3.ipynb)


# Deploy Llama3 with TensorRT-LLM

Welcome!

In this notebook, we will walk through converting Llama3 into the TensorRT format. TensorRT-LLM provides users with an easy-to-use Python API to define large language models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Once the TensorRT engine is built, you can use the run.py script provided at the end of this notebook or use this engine as an input to the Triton Inference Server.

See the [GitHub repo](https://github.com/NVIDIA/TensorRT-LLM) for more examples and documentation!

### Step 1 - Install TensorRT-LLM

We first install TensorRT-LLM. 

In [None]:
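import subprocess
import sys

# Each pip argument must be its own list element; one space-separated string is parsed as a single invalid requirement.
subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "--pre", "tensorrt_llm",
                       "--extra-index-url", "https://pypi.nvidia.com"])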
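
Before moving on, it can help to confirm that the install actually landed in the interpreter running this notebook. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of TensorRT-LLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether the Step 1 install is visible from this interpreter.
version = installed_version("tensorrt_llm")
if version is None:
    print("tensorrt_llm is not installed in this environment")
else:
    print(f"tensorrt_llm {version} is installed")
```

If this reports that the package is missing right after a successful install, the notebook kernel is probably using a different environment from the one pip installed into.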

### Step 2 - Download Llama3 model weights

Llama3 is a gated model, which means you'll need to request access on the model's Hugging Face repository and generate a Hugging Face (HF) token. Approval usually takes about 20 minutes!

In [None]:
import huggingface_hub

In [None]:
huggingface_hub.login("<ENTER TOKEN HERE>")

In [None]:
huggingface_hub.snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", local_dir="llama3-hf")
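
To avoid pasting the token into a notebook cell, one option is to read it from an environment variable. This is a sketch under stated assumptions: the variable name `HF_TOKEN` and the helper names `get_hf_token` and `login_and_download` are choices made for this example, while the repository ID and `local_dir` are the ones used in the cells above.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token first."
        )
    return token


def login_and_download(
    repo_id: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir: str = "llama3-hf",
) -> str:
    """Log in with the token from the environment, then download the model snapshot."""
    # Imported lazily so the helper above can be used without huggingface_hub installed.
    import huggingface_hub

    huggingface_hub.login(token=get_hf_token())
    return huggingface_hub.snapshot_download(repo_id, local_dir=local_dir)
```

Calling `login_and_download()` performs the same login and download as the two cells above, and the token never appears in the saved notebook.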

### Step 3 - Convert checkpoints into safetensors and build the TRT engine

There are 2 substeps here. The first is converting the raw Hugging Face model into safetensors, which is a safe and fast format for storing tensors.

Next, we build the TensorRT engine. This is where the magic happens: we take the converted safetensors model and compile it into a `TensorRT engine`. Engines are optimized versions of models built to run lightning fast on the current machine.

In [None]:
import subprocess

# Fetch the Llama checkpoint conversion script from the TensorRT-LLM repository.
subprocess.run(["wget", "-L", "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py"], check=True)

In [None]:
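import subprocess
import sys

# Use the notebook's own interpreter so the script sees the tensorrt_llm installed in Step 1.
subprocess.run([sys.executable, "convert_checkpoint.py", "--model_dir", "llama3-hf", "--output_dir", "./llama3-safetensors", "--dtype", "bfloat16"], check=True)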

In [None]:
import subprocess

# Build a single-GPU bfloat16 engine from the converted checkpoint.
subprocess.run(["trtllm-build", "--checkpoint_dir", "llama3-safetensors", "--output_dir", "./llama3engine_bf16_1gpu", "--gpt_attention_plugin", "bfloat16", "--gemm_plugin", "bfloat16"], check=True)
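
Both substeps write their results to a directory, so a quick file check can catch a failed conversion or build before Step 4. The sketch below makes assumptions about the output layout: that the converted checkpoint contains a `config.json` plus one or more `*.safetensors` files, and that the engine directory contains a `config.json` plus one or more `*.engine` files. Adjust the patterns if your TensorRT-LLM version writes different names. `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def missing_artifacts(directory: str, patterns: List[str]) -> List[str]:
    """Return the glob patterns that match no file inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing can match if the directory itself does not exist.
        return list(patterns)
    return [pattern for pattern in patterns if not any(root.glob(pattern))]


# Expected file patterns are assumptions about the output layout, not guarantees.
expected = {
    "llama3-safetensors": ["config.json", "*.safetensors"],
    "llama3engine_bf16_1gpu": ["config.json", "*.engine"],
}

for directory, patterns in expected.items():
    missing = missing_artifacts(directory, patterns)
    status = "ok" if not missing else f"missing {missing}"
    print(f"{directory}: {status}")
```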

### Step 4 - Run the model using the example script!

In [None]:
import subprocess

# Clone the repository to get the example run.py script.
subprocess.run(["git", "clone", "https://github.com/NVIDIA/TensorRT-LLM.git"], check=True)

In [None]:
import subprocess
import sys

subprocess.run([sys.executable, "./TensorRT-LLM/examples/run.py", "--engine_dir=llama3engine_bf16_1gpu", "--max_output_len", "100", "--tokenizer_dir", "llama3-hf", "--input_text", "How do I count to nine in French?"], check=True)
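
To try other prompts, including ones that contain quotes or other shell metacharacters, it is easiest to pass the prompt as its own argument-list element. The sketch below assembles the same `run.py` invocation with the flags used above; `build_run_command` is a hypothetical helper, and the defaults mirror the directories created earlier in this notebook.

```python
import shlex
import sys
from typing import List


def build_run_command(
    prompt: str,
    engine_dir: str = "llama3engine_bf16_1gpu",
    tokenizer_dir: str = "llama3-hf",
    max_output_len: int = 100,
    script: str = "./TensorRT-LLM/examples/run.py",
) -> List[str]:
    """Assemble the run.py invocation as an argument list (no shell quoting needed)."""
    return [
        sys.executable, script,
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,  # passed verbatim, even if it contains quotes
    ]


command = build_run_command('Translate "good morning" into French.')
# Show the shell-escaped form for reference.
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```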