# Running Nemotron-49B-v1.5 with TensorRT-LLM on NVIDIA GPUs

This notebook shows how to run the `Nemotron-49B-v1.5` model with NVIDIA's TensorRT-LLM for high-performance inference.

This notebook is divided into two parts:
- **Part 1:** Demonstrates how to use the direct Python API for inference, including batch generation.
- **Part 2:** Covers how to deploy the model with an OpenAI-compatible web server and interact with it using an OpenAI client.

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)

- Model card: [nvidia/Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)
- TensorRT-LLM Docs: [https://nvidia.github.io/TensorRT-LLM/](https://nvidia.github.io/TensorRT-LLM/)


## Table of Contents

- [Part 1: Inference with the Python API](#Part-1:-Inference-with-the-Python-API)
  - [Prerequisites](#Prerequisites)
  - [Setup](#Setup)
  - [Inference with Python API](#Inference-with-Python-API)
  - [Batch Generation](#Batch-Generation)
- [Part 2: OpenAI-Compatible Server](#Part-2:-OpenAI-Compatible-Server)
  - [Client Setup and Examples](#Client-Setup-and-Examples)
  - [Reasoning Modes (`think` vs. `no_think`)](#Reasoning-Modes-(`think`-vs.-`no_think`))
  - [Interaction with `curl`](#Interaction-with-`curl`)
  - [Cleanup](#Cleanup)
- [Resource Notes](#Resource-Notes)
- [Conclusion](#Conclusion)


## Part 1: Inference with the Python API

### Prerequisites

**Hardware:** This notebook needs NVIDIA GPUs with enough combined VRAM to hold the 49B-parameter model. At least **2 NVIDIA GPUs** are recommended, although the outputs recorded below come from a machine with a single NVIDIA H200. Building the TensorRT-LLM engine is memory-intensive.

**Software:**
- Python 3.10+
- NVIDIA GPU with CUDA 12.x
- TensorRT-LLM installed from source or via the container.

### Setup

In [None]:
# Environment checks
import sys

import tensorrt_llm
import torch

print(f"Python: {sys.version}")
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")

Python: 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0]


[2025-09-17 18:08:58] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-09-17 18:08:58] INFO config.py:66: Polars version 1.25.2 available.
2025-09-17 18:09:00,970 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc4
TensorRT-LLM available
TensorRT-LLM version: 1.1.0rc4
CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H200


### Inference with Python API

In [None]:
from tensorrt_llm import LLM, SamplingParams

MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
# Load model
llm = LLM(MODEL_ID, trust_remote_code=True)

# Set sampling parameters
params = SamplingParams(temperature=0.6, max_tokens=200)

# Generate text
result = llm.generate(["What is Nemotron Super?"], params)
print(result[0].outputs[0].text)

Loading Model: [1/3] Downloading HF model
Downloaded model to /root/.cache/huggingface/hub/models--nvidia--Llama-3_3-Nemotron-Super-49B-v1_5/snapshots/2051b8a94fa75052394507a5c81cd7c2b70e9a4b
Time: 0.350s
Loading Model: [2/3] Loading HF model to memory
Time: 1.126s
Loading Model: [3/3] Building TRT-LLM engine
Time: 324.062s
Loading model done.
Total latency: 325.539s
rank 0 using MpiPoolSession to spawn MPI processes
2025-09-17 17:13:04,901 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend


[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 49
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled


Processed requests: 100%|██████████| 1/1 [00:11<00:00, 11.50s/it]

 How does it work?
Nemotron Super is a powerful and versatile audio effect plugin designed for music producers, sound designers, and audio engineers. It is a dynamic equalizer that offers a wide range of tonal shaping capabilities, allowing users to enhance or cut specific frequency ranges in their audio signals. The plugin is known for its intuitive interface and advanced features, making it a valuable tool in both mixing and mastering contexts.

### Key Features of Nemotron Super:

1. **Dynamic Equalization**: Nemotron Super allows for dynamic EQ adjustments, meaning it can automatically adjust the gain of specific frequency bands based on the input signal. This is particularly useful for taming problematic frequencies that only appear intermittently in a track.

2. **Multi-band Processing**: The plugin typically supports multiple bands, each of which can be set to boost or cut frequencies within a specific range. This allows for precise control over the tonal balance of an audio sig




### Batch Generation


In [None]:
# Multiple prompts for batch generation
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:",
]

results = llm.generate(prompts, params)

for i, r in enumerate(results):
    print(f"\nPrompt {i + 1}: {prompts[i]!r}")
    print(r.outputs[0].text)

Processed requests: 100%|██████████| 3/3 [00:08<00:00,  2.95s/it]


Prompt 1: 'Hello, my name is'
 Dr. John Lee and I am a licensed clinical psychologist with over 20 years of experience in the mental health field. I have worked in various settings including hospitals, community mental health centers, and private practice. My areas of expertise include trauma, anxiety, depression, and relationship issues. I am trained in several evidence-based therapies such as Cognitive Behavioral Therapy (CBT), Dialectical Behavior Therapy (DBT), and Eye Movement Desensitization and Reprocessing (EMDR). I believe in a collaborative approach to therapy, working with clients to identify their strengths and develop practical strategies to achieve their goals. I am committed to providing a safe, non-judgmental space for individuals to explore their concerns and work towards healing and growth.
I am originally from the Midwest and have lived in several parts of the country, which has given me a broad perspective on different cultures and backgrounds. I value diversity an
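
The two progress bars recorded above give a rough sense of what batching buys. The following is a back-of-envelope sketch, not a benchmark. It assumes every request generated the full `max_tokens=200` (the outputs shown are cut off mid-sentence, which suggests this but does not confirm it), and it ignores any warm-up cost that the first `generate` call may have paid.

```python
# Rough throughput comparison from the progress bars recorded in this notebook.
# Assumption: each request produced exactly MAX_TOKENS tokens.
MAX_TOKENS = 200

# Single prompt: 1 request at 11.50 s/it.
single_seconds = 11.50

# Batch: 3 requests at an average of 2.95 s/it.
batch_requests = 3
batch_seconds = batch_requests * 2.95

single_tps = MAX_TOKENS / single_seconds
batch_tps = batch_requests * MAX_TOKENS / batch_seconds

print(f"single request: {single_tps:.1f} tokens/s")
print(f"batch of {batch_requests}:     {batch_tps:.1f} tokens/s")
print(f"ratio:          {batch_tps / single_tps:.1f}x")
```

Treat the ratio as indicative only. A fair comparison would time both calls after a warm-up request and count the generated tokens directly.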




---

## Part 2: OpenAI-Compatible Server

Next, we'll launch a server that's compatible with the OpenAI API. This is a convenient way to integrate the model into existing applications that use the OpenAI client libraries.

Start the OpenAI-compatible server with the following command in your terminal. This will build the TensorRT engine, which may take some time.

```bash
trtllm-serve "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5" \
    --trust_remote_code \
    --host 0.0.0.0 \
    --port 5000
```

You should see the following output when the server is ready:
```
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
```
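
Because the engine build can take several minutes, a client script may want to wait for the server instead of failing on the first connection attempt. The sketch below polls the `/v1/models` endpoint (the same one queried in the client cell) using only the standard library. The function names, the 15-minute default, and the polling interval are illustrative choices, not part of TensorRT-LLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_server(url: str, max_wait_s: float = 900.0, poll_s: float = 5.0) -> bool:
    """Poll `url` until it responds with 200 or `max_wait_s` has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if server_is_up(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example (hypothetical usage, run once the server command has been started):
# ready = wait_for_server("http://127.0.0.1:5000/v1/models")
```

The function returns `False` rather than raising, so the caller decides whether a timeout is fatal.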

### Client Setup and Examples


In [None]:
import requests
from openai import OpenAI

BASE_URL = "http://127.0.0.1:5000/v1"
API_KEY = "tensorrt_llm"
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# Get the served model name
model_id = requests.get(f"{BASE_URL}/models").json()["data"][0]["id"]

# Basic chat completion
response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Give me 3 bullet points about TensorRT-LLM."}],
    temperature=0.6,
    max_tokens=256,
)
print("--- Simple Chat Response ---")
print(response.choices[0].message.content)

# Streaming chat completion
print("\n--- Streaming Chat Response ---")
stream = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Write a short poem about GPUs."}],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Some chunks can arrive with an empty `choices` list; skip those safely.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

[2025-09-17 18:17:46] INFO _client.py:1025: HTTP Request: POST http://127.0.0.1:5000/v1/chat/completions "HTTP/1.1 200 OK"
[2025-09-17 18:17:46] INFO _client.py:1025: HTTP Request: POST http://127.0.0.1:5000/v1/chat/completions "HTTP/1.1 200 OK"


<think>
Okay, the user is asking for three bullet points about TensorRT-LLM. Let me start by recalling what I know about TensorRT-LLM. It's related to NVIDIA's TensorRT, which is a toolkit for optimizing deep learning models for inference. But TensorRT-LLM specifically targets large language models, right?

First, I should explain what TensorRT-LLM is. It's an open-source library, so that's a key point. It's designed to optimize and deploy LLMs efficiently. Maybe mention that it's built on TensorRT, which is known for high-performance inference.

Next, the main features. Optimization techniques like model pruning, quantization, and efficient attention mechanisms come to mind. These help reduce the model size and speed up inference without losing much accuracy. Also, integration with frameworks like Hugging Face Transformers would be important for developers.

Third, use cases. TensorRT-LLM is used in applications requiring real-time or low-latency responses, such as chatbots, virtual a

### Reasoning Modes (`think` vs. `no_think`)


In [None]:
# Reasoning ON (default)
reasoning_prompt = (
    "I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?"
)
messages_on = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user", "content": reasoning_prompt},
]
print("--- Reasoning ON ---")
response_on = client.chat.completions.create(
    model=model_id,
    messages=messages_on,
    temperature=0.0,
    max_tokens=512,
)
print(response_on.choices[0].message.content)


# Reasoning OFF using /no_think
messages_off = [
    {"role": "system", "content": "You are a helpful reasoning assistant.\n/no_think"},
    {"role": "user", "content": reasoning_prompt},
]
print("\n--- Reasoning OFF ---")
response_off = client.chat.completions.create(
    model=model_id,
    messages=messages_off,
    temperature=0.0,
    max_tokens=256,
)
print(response_off.choices[0].message.content)
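
With reasoning on, the response content recorded earlier in this notebook begins with a `<think>` block. An application will usually want to separate that block from the final answer. The helper below is a minimal sketch that assumes the reasoning is wrapped in `<think>...</think>` inside `message.content`, as in that recorded output. `split_reasoning` is a hypothetical name, and the sample strings are hand-written rather than captured from the model.

```python
import re

# Non-greedy match of the first <think>...</think> block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a response string."""
    match = THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    if "<think>" in text:
        # Opening tag without a closing tag: the response was cut off
        # (for example by max_tokens) while the model was still reasoning.
        before, _, after = text.partition("<think>")
        return after.strip(), before.strip()
    # No reasoning block at all.
    return "", text.strip()


sample = "<think>\n5 - 2 = 3, then 3 + 3 = 6.\n</think>\n\nYou have 6 apples."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

An empty answer alongside non-empty reasoning is a hint that `max_tokens` was too small for the model to finish thinking.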

### Interaction with `curl`

**Chat Completion:**
```bash
curl -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tensorrt_llm" \
  -d '{
    "model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    "messages": [
      {"role": "user", "content": "Give me 3 bullet points about TensorRT-LLM."}
    ],
    "temperature": 0.6,
    "max_tokens": 256
  }'
```

**Streaming Chat Completion:**
```bash
curl -N -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tensorrt_llm" \
  -d '{
    "model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    "messages": [
      {"role": "user", "content": "Write a short poem about GPUs."}
    ],
    "stream": true
  }'
```
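
The streaming request prints raw server-sent events rather than plain text. The sketch below shows one way to pull the text out of such a stream, assuming the server follows the OpenAI streaming convention of `data: {json}` lines terminated by `data: [DONE]`. The function name and the sample lines are illustrative; the sample was written by hand, not captured from `trtllm-serve`.

```python
import json


def iter_stream_text(lines):
    """Yield content deltas from an iterable of raw SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample in the assumed format.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Silicon "}}]}',
    'data: {"choices": [{"delta": {"content": "dreams"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```

The same generator can be fed the lines of a live response, for example from a subprocess pipe or an HTTP client that iterates over response lines.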


### Cleanup

To stop the OpenAI-compatible server, press `CTRL+C` in the terminal where it is running.


## Resource Notes

- **Engine Build Time**: The initial build of the TensorRT-LLM engine is slow; in the run recorded above, the build step alone took about 324 seconds. This is a one-time cost per model and hardware configuration.
- **Hardware**: Nemotron-49B-v1.5 is a large model. Multi-GPU is highly recommended for acceptable performance.
- **Quantization**: TensorRT-LLM supports various quantization techniques (like INT8, FP8) to reduce memory usage and improve performance. Refer to the official documentation for more details.
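
To make the hardware and quantization notes concrete, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 49 billion parameters (taken from the model name, not an exact count) and a 2-byte BF16 baseline. It leaves out the KV cache, activations, and engine workspace, so real requirements are higher.

```python
# Weight-only memory estimate; excludes KV cache, activations, and workspace.
NUM_PARAMS = 49e9  # assumption: nominal size from the model name

# Bytes per parameter for each precision (BF16 baseline is an assumption).
BYTES_PER_PARAM = {
    "BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
}


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in GiB."""
    return num_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>5}: {weight_gib(NUM_PARAMS, nbytes):6.1f} GiB")
```

On these assumptions, 8-bit weights take roughly half the space of BF16 weights, which is the main reason quantization can reduce the number of GPUs a deployment needs.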

## Conclusion

In this notebook, you have learned how to:
- Run inference with the Nemotron-49B-v1.5 model using the TensorRT-LLM Python API.
- Perform batch generation for multiple prompts.
- Deploy the model as an OpenAI-compatible server.
- Interact with the server using both a Python client and `curl`.

This provides a solid foundation for integrating high-performance LLM inference into your applications.
