# Llama 8B Throughput: vLLM vs TensorRT-LLM

Benchmark an unquantized (FP16/BF16) Llama 3 8B Instruct model on a Linux GPU box. The notebook builds a TensorRT-LLM engine from a Hugging Face checkpoint, runs inference with TensorRT-LLM, runs the same prompts with vLLM, and compares tokens/sec. Adjust the model and batch size for your hardware.

## Prerequisites
- NVIDIA GPU with CUDA 12.x drivers and enough VRAM for an 8B model in FP16/BF16.
- Python 3.10+ in a writable env (e.g., `python -m venv .venv && source .venv/bin/activate`).
- Export `HF_TOKEN` with a Hugging Face access token. `meta-llama/Meta-Llama-3-8B-Instruct` is a gated model, so the download fails without one.
- Make sure `nvidia-smi` shows your GPU and reports CUDA 12.x or newer.
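
To judge whether a GPU has "enough VRAM", a rough back-of-the-envelope estimate helps. The sketch below covers only weights and KV cache and ignores activations and framework overhead. The architecture numbers (parameter count, layers, KV heads, head dimension) are assumptions for Llama 3 8B; confirm them against the model's `config.json`.

```python
# Rough VRAM estimate for an unquantized 8B model (weights + KV cache only).
# All architecture values are assumptions; check the model's config.json.
PARAMS = 8.03e9          # approximate parameter count
BYTES_PER_VALUE = 2      # FP16 and BF16 both use 2 bytes per value
NUM_LAYERS = 32
NUM_KV_HEADS = 8         # grouped-query attention
HEAD_DIM = 128


def weight_gib(params=PARAMS, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed to hold the weights, in GiB."""
    return params * bytes_per_value / 2**30


def kv_cache_gib(batch_size, seq_len):
    """KV cache size in GiB for batch_size sequences of seq_len tokens."""
    # Factor 2 covers keys and values, stored per layer and per KV head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return batch_size * seq_len * per_token / 2**30


print(f"weights: {weight_gib():.1f} GiB")
print(f"KV cache (batch 4, 2176 tokens each): {kv_cache_gib(4, 2176):.2f} GiB")
```

Under these assumptions the weights alone come to roughly 15 GiB, so a 16 GB card leaves very little room for the KV cache.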

In [None]:
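# Install required packages (GPU drivers/CUDA 12.x assumed).
# TensorRT-LLM wheels are served from NVIDIA's package index.
# vLLM and TensorRT-LLM each pin a PyTorch version; if pip reports a conflict,
# install them in separate environments.
!pip install -q --upgrade \
  --extra-index-url https://pypi.nvidia.com \
  'vllm>=0.5.5' \
  'tensorrt_llm' \
  'transformers>=4.43' \
  'accelerate' \
  'datasets' \
  'polygraphy'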


## Configure benchmark
Tweak these for your box. Add prompts to raise `NUM_PROMPTS`, or increase `MAX_NEW_TOKENS`, to stress throughput, but watch VRAM. The TensorRT-LLM engine is written to `./trtllm_engine`.

In [None]:
from pathlib import Path
import json

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
ENGINE_DIR = Path("trtllm_engine")
RESULTS_FILE = Path("throughput_results.json")

PROMPTS = [
    "Explain the benefits of quantization for LLM inference.",
    "Give three bullet points on how to optimize GPU memory usage.",
    "Write a short summary of the HTTP/2 protocol.",
    "Provide two examples of vector databases and their use cases.",
]
NUM_PROMPTS = len(PROMPTS)
MAX_NEW_TOKENS = 128
WARMUP_ROUNDS = 1
MEASURE_ROUNDS = 3
TP_SIZE = 1  # set >1 if you want tensor parallel across multiple GPUs

print(f"Model: {MODEL_ID}")
print(f"Prompts: {NUM_PROMPTS}, max_new_tokens={MAX_NEW_TOKENS}, warmup={WARMUP_ROUNDS}, measure={MEASURE_ROUNDS}, tp_size={TP_SIZE}")
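
Because `NUM_PROMPTS` is derived from `len(PROMPTS)`, a larger batch needs a longer prompt list. One simple way to get one, sketched here with a hypothetical `make_batch` helper, is to cycle through the base prompts until the target size is reached.

```python
from itertools import cycle, islice


def make_batch(base_prompts, n):
    """Return n prompts by repeating base_prompts in their original order."""
    if not base_prompts:
        raise ValueError("base_prompts must not be empty")
    return list(islice(cycle(base_prompts), n))


# Placeholder prompts for illustration only.
base = ["prompt A", "prompt B", "prompt C"]
batch = make_batch(base, 8)
print(len(batch), batch[:4])
```

Repeated prompts are a stress-test convenience, not a realistic workload: a backend with prefix caching enabled can reuse work across duplicates, which inflates its numbers. Rebuild the TensorRT-LLM engine with a matching `--max_batch_size` if the batch grows.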


## Build TensorRT-LLM engine
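The build takes two steps. First, `convert_checkpoint.py` (from the Llama example in the TensorRT-LLM repository) converts a local HF checkpoint into a TensorRT-LLM checkpoint. Then `trtllm-build` compiles that checkpoint into an engine directory. Precision and tensor parallelism are set at conversion time, not at build time.

- For BF16: `--dtype bfloat16` (Ampere+)
- For FP16: `--dtype float16` (Turing+)

Run once; skip if you already have an engine in `ENGINE_DIR`.

In [None]:
ENGINE_DIR.mkdir(exist_ok=True)
HF_DIR = Path("hf_model")       # local copy of the HF checkpoint (name is an assumption)
CKPT_DIR = Path("trtllm_ckpt")  # intermediate TensorRT-LLM checkpoint (name is an assumption)

# Example build (BF16). Uncomment to run. Flag names follow recent releases
# (older ones use --max_output_len); confirm with `trtllm-build --help`.
# The script path assumes a clone of the TensorRT-LLM repo; it varies by release.
# !huggingface-cli download {MODEL_ID} --local-dir {HF_DIR}
# !python TensorRT-LLM/examples/llama/convert_checkpoint.py \
#   --model_dir {HF_DIR} \
#   --output_dir {CKPT_DIR} \
#   --dtype bfloat16 \
#   --tp_size {TP_SIZE}
# !trtllm-build \
#   --checkpoint_dir {CKPT_DIR} \
#   --output_dir {ENGINE_DIR} \
#   --max_batch_size {NUM_PROMPTS} \
#   --max_input_len 2048 \
#   --max_seq_len {2048 + MAX_NEW_TOKENS} \
#   --gemm_plugin auto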


## Benchmark with TensorRT-LLM
Uses the Python API to load the built engine and run generation. Tokens/sec counts both prompt and generated tokens.

In [None]:
import time
from statistics import mean
from tensorrt_llm import LLM, SamplingParams

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=MAX_NEW_TOKENS,
)

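# The LLM API takes the engine directory through `model`.
# Assumption: releases that default to the PyTorch backend may need the
# TensorRT-backed LLM class instead; check the LLM API docs for your version.
trt_llm = LLM(
    model=str(ENGINE_DIR),
    tokenizer=MODEL_ID,
    tensor_parallel_size=TP_SIZE,
)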

# Warmup
for _ in range(WARMUP_ROUNDS):
    _ = trt_llm.generate(PROMPTS, sampling_params)

trt_tps_runs = []
for _ in range(MEASURE_ROUNDS):
    start = time.perf_counter()
    outputs = trt_llm.generate(PROMPTS, sampling_params)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(o.prompt_token_ids) + len(o.outputs[0].token_ids) for o in outputs)
    tps = total_tokens / elapsed
    trt_tps_runs.append(tps)
    print(f"TensorRT-LLM run tokens: {total_tokens}, time: {elapsed:.3f}s, tokens/sec: {tps:.2f}")

trt_tokens_per_sec = mean(trt_tps_runs)
print(f"TensorRT-LLM avg tokens/sec: {trt_tokens_per_sec:.2f}")
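
Counting prompt and generated tokens together mixes prefill and decode work into one number. A minimal sketch of reporting both figures side by side follows. The `throughput` helper and the token counts are illustrative assumptions, not measurements.

```python
def throughput(prompt_lens, output_lens, elapsed_s):
    """Return (total_tps, generated_tps) for one timed batch."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    generated = sum(output_lens)
    total = sum(prompt_lens) + generated
    return total / elapsed_s, generated / elapsed_s


# Illustrative numbers only: four prompts, one of which stopped early at EOS.
total_tps, gen_tps = throughput(
    prompt_lens=[12, 15, 11, 14],
    output_lens=[128, 128, 96, 128],
    elapsed_s=2.0,
)
print(f"total: {total_tps:.1f} tok/s, generated only: {gen_tps:.1f} tok/s")
```

In the loop above, the two lists would come from `len(o.prompt_token_ids)` and `len(o.outputs[0].token_ids)`. The generated-only rate is usually the more comparable figure when prompts are short.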


## Benchmark with vLLM
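Runs the same prompts through vLLM’s Python API for a fair comparison. vLLM reserves most of the GPU memory at startup (`gpu_memory_utilization` defaults to 0.9), so release the TensorRT-LLM engine or restart the kernel before this cell if both do not fit on one GPU.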

In [None]:
from vllm import LLM as VLLM, SamplingParams as VLLMSamplingParams

vllm_sampling = VLLMSamplingParams(
    temperature=0.0,
    max_tokens=MAX_NEW_TOKENS,
)

vllm = VLLM(
    model=MODEL_ID,
    tensor_parallel_size=TP_SIZE,
)

# Warmup
for _ in range(WARMUP_ROUNDS):
    _ = vllm.generate(PROMPTS, vllm_sampling)

vllm_tps_runs = []
for _ in range(MEASURE_ROUNDS):
    start = time.perf_counter()
    outputs = vllm.generate(PROMPTS, vllm_sampling)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(o.prompt_token_ids) + len(o.outputs[0].token_ids) for o in outputs)
    tps = total_tokens / elapsed
    vllm_tps_runs.append(tps)
    print(f"vLLM run tokens: {total_tokens}, time: {elapsed:.3f}s, tokens/sec: {tps:.2f}")

vllm_tokens_per_sec = mean(vllm_tps_runs)
print(f"vLLM avg tokens/sec: {vllm_tokens_per_sec:.2f}")
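
The two benchmark cells repeat the same warmup and timing loop. One way to make sure both backends are measured identically is a shared harness, sketched below. `benchmark`, `fake_generate`, and `count_tokens` are hypothetical names, and the fake backend only stands in for `trt_llm.generate` or `vllm.generate` so the sketch runs without a GPU.

```python
import time
from statistics import mean


def benchmark(generate, prompts, count_tokens, warmup=1, rounds=3):
    """Time generate(prompts) and return (mean tokens/sec, per-round rates)."""
    for _ in range(warmup):
        generate(prompts)  # untimed, lets caches and kernels settle
    rates = []
    for _ in range(rounds):
        start = time.perf_counter()
        outputs = generate(prompts)
        elapsed = time.perf_counter() - start
        rates.append(count_tokens(outputs) / elapsed)
    return mean(rates), rates


def fake_generate(prompts):
    """Stand-in backend: sleeps briefly and returns whitespace 'tokens'."""
    time.sleep(0.01)
    return [p.split() for p in prompts]


avg, runs = benchmark(
    fake_generate,
    ["a b c", "d e"],
    count_tokens=lambda outs: sum(len(o) for o in outs),
)
print(f"avg: {avg:.1f} tok/s over {len(runs)} rounds")
```

With a real backend, `generate` would be a small lambda that closes over the sampling parameters, for example `lambda p: vllm.generate(p, vllm_sampling)`.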


## Compare and save results

In [None]:
results = {
    "model": MODEL_ID,
    "max_new_tokens": MAX_NEW_TOKENS,
    "num_prompts": NUM_PROMPTS,
    "warmup_rounds": WARMUP_ROUNDS,
    "measure_rounds": MEASURE_ROUNDS,
    "tp_size": TP_SIZE,
    "tensorrt_llm_tokens_per_sec": trt_tokens_per_sec,
    "vllm_tokens_per_sec": vllm_tokens_per_sec,
}

print(json.dumps(results, indent=2))
RESULTS_FILE.write_text(json.dumps(results, indent=2))
print(f"Saved to {RESULTS_FILE.resolve()}")
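
The saved JSON holds the two averages but not the comparison itself. A small sketch of deriving a relative speedup from the same keys follows. The `summarize` helper is hypothetical, and the values are placeholders, not measured results.

```python
import json


def summarize(results):
    """Report which backend was faster and by what factor."""
    trt = results["tensorrt_llm_tokens_per_sec"]
    vl = results["vllm_tokens_per_sec"]
    if min(trt, vl) <= 0:
        raise ValueError("throughput values must be positive")
    faster = "tensorrt_llm" if trt >= vl else "vllm"
    return {"faster": faster, "speedup": max(trt, vl) / min(trt, vl)}


# Placeholder values standing in for the contents of throughput_results.json.
example = json.loads(
    '{"tensorrt_llm_tokens_per_sec": 1200.0, "vllm_tokens_per_sec": 1000.0}'
)
summary = summarize(example)
print(f"{summary['faster']} is {summary['speedup']:.2f}x faster")
```

With only three measured rounds and four prompts, small differences are within run-to-run noise, so treat a ratio close to 1.0 as a tie.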


### Notes
- Keep prompts identical between runs for fairness.
- If the TensorRT-LLM build fails, check the CUDA version and GPU compatibility. If BF16 is unavailable, convert the checkpoint with `--dtype float16` (passed to `convert_checkpoint.py`) and rebuild.
- If you run out of memory, reduce `MAX_NEW_TOKENS` or the batch size, or use tensor parallelism across GPUs.
- For larger models or stricter performance targets, tune TensorRT-LLM build and runtime options (e.g., KV cache size, pipeline parallelism).