# Deploying NVIDIA Nemotron-3-Nano with TensorRT LLM

This notebook walks you through running the `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B` model with TensorRT LLM.

[TensorRT LLM](https://nvidia.github.io/TensorRT-LLM/) is NVIDIA's open-source library for accelerating and optimizing LLM inference performance on NVIDIA GPUs. TensorRT LLM supports this model through the AutoDeploy workflow, which is described in the [AutoDeploy documentation](https://nvidia.github.io/TensorRT-LLM/features/auto_deploy/auto-deploy.html).

For more details on the model, see the [model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8).

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 60 GB VRAM for BF16, ≥ 32 GB for FP8) and CUDA 12.x
- Python 3.10+
- TensorRT-LLM (you can refer to NVIDIA documentation, or pull this [container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.2.0rc5))

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide. 

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ikYKeRmXqG8MJjxsgROJM4S2V)

## Prerequisites & environment

Set up a containerized environment for TensorRT-LLM by running the following command in a terminal.

```shell
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5
```

You now have TRT-LLM set up! 

In [1]:
# Run this only if pip is not found
!python -m ensurepip --default-pip

Looking in links: /tmp/tmpua832ur3


In [None]:
%pip install torch openai

## Verify GPU

Check that CUDA is available and the GPU is detected correctly.


In [3]:
# Environment check
import sys
import torch

print(f"Python: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")


Python: 3.12.12 (main, Dec  9 2025, 19:02:36) [Clang 21.1.4 ]
CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H100 80GB HBM3
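The prerequisites list at least 60 GB of VRAM for BF16 and 32 GB for FP8. The check above can be extended to compare each GPU against those minimums. This is a minimal sketch: the helper names `fits_precision` and `report_gpus` are hypothetical, the thresholds are interpreted as GiB (an assumption), and the check is per GPU, so it does not account for sharding across several devices.

```python
# Minimum VRAM per precision, taken from the prerequisites above.
# Assumption: the thresholds are interpreted as GiB (2**30 bytes).
MIN_VRAM_GIB = {"bf16": 60, "fp8": 32}


def fits_precision(total_memory_bytes: int, precision: str) -> bool:
    """Return True if a GPU with this much memory meets the stated minimum."""
    return total_memory_bytes / 2**30 >= MIN_VRAM_GIB[precision.lower()]


def report_gpus() -> None:
    """Print, for each visible GPU, which precisions meet the minimum."""
    import torch  # imported lazily so fits_precision works without torch

    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        ok = [p for p in MIN_VRAM_GIB if fits_precision(total, p)]
        print(f"GPU[{i}]: {total / 2**30:.1f} GiB, meets minimum for: {ok}")
```

Calling `report_gpus()` in a notebook cell prints one line per detected GPU.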


## OpenAI-compatible server

Start a local OpenAI-compatible server with TensorRT-LLM. Run all of the following commands in a terminal inside the running Docker container.

### Create a YAML file with the required configuration

```shell
cat > nano_v3.yaml<<EOF
runtime: trtllm
compile_backend: torch-cudagraph
max_batch_size: 64
max_seq_len: 16384
enable_chunked_prefill: true
attn_backend: flashinfer
model_factory: AutoModelForCausalLM
skip_loading_weights: false
free_mem_ratio: 0.65
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 320, 384]
kv_cache_config:
  # disable kv_cache reuse since not supported for hybrid/ssm models
  enable_block_reuse: false
transforms:
  detect_sharding:
    sharding_dims: ['ep', 'bmm']
    allreduce_strategy: 'AUTO'
    manual_config:
      head_dim: 128
      tp_plan:
        # mamba SSM layer
        "in_proj": "mamba"
        "out_proj": "rowwise"
        # attention layer
        "q_proj": "colwise"
        "k_proj": "colwise"
        "v_proj": "colwise"
        "o_proj": "rowwise"
        # NOTE: consider not sharding shared experts and/or
        # latent projections at all, keeping them replicated.
        # To do so, comment out the corresponding entries.
        # moe layer: SHARED experts
        "up_proj": "colwise"
        "down_proj": "rowwise"
        # MoLE: latent projections: simple shard
        "fc1_latent_proj": "gather"
        "fc2_latent_proj": "gather"
  multi_stream_moe:
    stage: compile
    enabled: true
  insert_cached_ssm_attention:
      cache_config:
        mamba_dtype: float32
  fuse_mamba_a_log:
    stage: post_load_fusion
    enabled: true
EOF
```


### Load the model

#### BF16 version

```shell
TRTLLM_ENABLE_PDL=1 trtllm-serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
--host 0.0.0.0 \
--port 8000 \
--backend _autodeploy \
--trust_remote_code \
--reasoning_parser deepseek-r1 \
--tool_parser qwen3_coder \
--extra_llm_api_options nano_v3.yaml
```

#### Alternative: Load the FP8 quantized version for faster inference and lower memory usage

```shell
TRTLLM_ENABLE_PDL=1 trtllm-serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8" \
--host 0.0.0.0 \
--port 8000 \
--backend _autodeploy \
--trust_remote_code \
--reasoning_parser deepseek-r1 \
--tool_parser qwen3_coder \
--extra_llm_api_options nano_v3.yaml
```

Your server is now running!
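The server may need some time to load the model before it accepts requests. A minimal sketch of waiting for it from the notebook follows. It assumes the server answers `GET /v1/models` with HTTP 200 once it is ready, as OpenAI-compatible servers usually do; the helper name `wait_for_server` and the timeout values are hypothetical.

```python
import http.client
import time
import urllib.request


def wait_for_server(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        time.sleep(interval_s)
    return False


# Example (not executed here):
# wait_for_server("http://localhost:8000/v1/models")
```

The function returns `False` instead of raising, so a notebook cell can decide whether to retry or stop.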

### Use the API

Use the OpenAI-compatible client to send requests to the TensorRT-LLM server.

Note: The model supports two modes: reasoning on (default) and reasoning off. To turn reasoning off, pass `enable_thinking: False` in `chat_template_kwargs`, as shown below.

In [4]:
from openai import OpenAI

# Set up the client; the server was started on port 8000
BASE_URL = "http://0.0.0.0:8000/v1"
API_KEY = "null"  # placeholder value; the OpenAI client requires a non-empty key
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" # set this to the model you loaded

In [6]:
# Reasoning on (default)
print("Reasoning on")
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 bullet points about TensorRT-LLM."}
    ],
    temperature=1,
    max_tokens=256,
)
print(response.choices[0].message.reasoning_content, response.choices[0].message.content)

print("\n")

# Reasoning off
print("Reasoning off")
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 bullet points about TensorRT-LLM."}
    ],
    temperature=0,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(response.choices[0].message.reasoning_content, response.choices[0].message.content)

Reasoning on
We need to respond with 3 bullet points about TensorRT-LLM. Provide concise, factual points. Should we add any extra explanation? Probably just bullet list with three points. Use markdown bullet points. Should be concise.
 
- **High-performance inference engine**: TensorRT-LLM leverages NVIDIA’s TensorRT and optimized kernels to accelerate transformer-based language models, delivering up to 4-5× higher throughput and lower latency compared with naïve PyTorch implementations.

- **Model-specific optimizations**: It provides specialized kernels and runtime tricks (eourgate, paged attention, speculative decoding) that tightly integrate with the model architecture, enabling efficient handling of very large vocabularies and context lengths.

- **Seamless deployment**: The library supports ONNX, Hugging Face 🤗 Transformers, and TensorRT-compatible models, allowing developers to export existing models directly to TensorRT-LLM for production inference on GPUs,
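The output above prints the reasoning trace and the final answer back to back, which makes them hard to tell apart. A minimal sketch of labeling the two parts follows. `reasoning_content` is not a standard OpenAI field, so the sketch assumes it may be `None` or missing (for example when reasoning is off); the helper name `format_message` is hypothetical.

```python
from types import SimpleNamespace


def format_message(message) -> str:
    """Render the reasoning trace and the final answer under separate labels."""
    # Read both fields defensively: either may be None or absent.
    reasoning = getattr(message, "reasoning_content", None) or ""
    content = getattr(message, "content", None) or ""
    return f"[reasoning]\n{reasoning.strip()}\n\n[answer]\n{content.strip()}"


# Stand-in for `response.choices[0].message`, so the sketch runs without a server.
demo = SimpleNamespace(reasoning_content="Keep it short.\n", content="- Point one")
print(format_message(demo))
```

With a live server, `print(format_message(response.choices[0].message))` would take the place of the final `print` call in the cell above.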

In [7]:
# Streaming chat completion
print("Streaming response:")
stream = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"}
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming response:

The first five prime numbers are:

**2, 3, 5, 7, 11**.
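The streaming loop above prints only `delta.content`, so any streamed reasoning is dropped. A minimal sketch of collecting both parts follows. It assumes reasoning deltas arrive in a `delta.reasoning_content` attribute, mirroring the non-streaming field; that attribute name, the helper `collect_stream`, and the fake chunks are assumptions for illustration.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning text and answer text from streamed chat chunks."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta
        # Assumption: reasoning deltas arrive in `delta.reasoning_content`.
        reasoning_piece = getattr(delta, "reasoning_content", None)
        if reasoning_piece:
            reasoning_parts.append(reasoning_piece)
        content_piece = getattr(delta, "content", None)
        if content_piece:
            content_parts.append(content_piece)
    return "".join(reasoning_parts), "".join(content_parts)


def _chunk(**delta):
    """Build a fake chunk with the same attribute layout as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta))])


# Fake stream so the sketch runs without a server.
fake_stream = [
    _chunk(reasoning_content="List primes. "),
    _chunk(reasoning_content="2, 3, 5, 7, 11."),
    SimpleNamespace(choices=[]),
    _chunk(content="The first five primes are "),
    _chunk(content="2, 3, 5, 7, 11."),
]
reasoning, answer = collect_stream(fake_stream)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a live server, passing the `stream` object from the cell above to `collect_stream` would replace the fake stream.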

### Tool calling

Use the OpenAI tools schema to call functions via the TensorRT-LLM endpoint.

In [8]:
# Tool calling via OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False
)

print(completion.choices[0].message.reasoning_content, completion.choices[0].message.content)
print(completion.choices[0].message.tool_calls)

Okay, the user wants to calculate a 15% tip on a $50 bill. Let me see. The tool provided is calculate_tip, which requires bill_total and tip_percentage. The bill_total is $50, so as an integer that's 50. The tip percentage is 15, so I need to plug those into the function. Let me check if there are any required parameters. The tool's required fields are both bill_total and tip_percentage, which the user provided. So I should call calculate_tip with 50 and 15. The function should return the tip amount, which is 50 multiplied by 0.15, resulting in $7.50. I need to make sure the JSON is correctly formatted with the arguments as integers.
 


[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-9bdee2d401a8466e84d2654e7a5786ee', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
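The model only proposes the tool call; running the function is left to the caller. A minimal sketch of that step follows, using a stand-in object shaped like the tool call printed above. The local `calculate_tip` implementation, the `TOOL_REGISTRY` mapping, and the `run_tool_call` helper are hypothetical.

```python
import json
from types import SimpleNamespace


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Local implementation of the tool declared in TOOLS."""
    return bill_total * tip_percentage / 100


# Map tool names from the schema to local callables.
TOOL_REGISTRY = {"calculate_tip": calculate_tip}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and return a `tool` message carrying the result."""
    func = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps({"result": result}),
    }


# Stand-in mirroring the tool call printed above, so the sketch runs offline.
demo_call = SimpleNamespace(
    id="chatcmpl-tool-demo",
    function=SimpleNamespace(
        name="calculate_tip",
        arguments='{"bill_total": 50, "tip_percentage": 15}',
    ),
)
tool_message = run_tool_call(demo_call)
print(tool_message["content"])  # {"result": 7.5}
```

In the OpenAI tools flow, the assistant message containing the tool call and this `tool` message are appended to `messages`, and a second `chat.completions.create` call lets the model phrase the final answer. Whether the served chat template handles that second turn as expected is worth verifying against your deployment.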


### Controlling Reasoning Budget

The `reasoning_budget` parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character. 

If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at `reasoning_budget + 500` tokens to prevent excessive generation.
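The cut-off rule described above can be illustrated with a toy function that operates on a list of token strings. This is only a sketch of the rule as stated, not the model's or the server's implementation; the name `cut_reasoning` and the `grace` parameter are hypothetical.

```python
def cut_reasoning(tokens, budget, grace=500):
    """Apply the described rule: stop at the first newline at or after
    `budget` tokens, or hard-stop at `budget + grace` tokens."""
    if len(tokens) <= budget:
        return list(tokens)  # budget never reached, nothing to cut
    hard_cap = budget + grace
    for i in range(budget, min(len(tokens), hard_cap)):
        if "\n" in tokens[i]:
            return list(tokens[: i + 1])  # graceful stop at the newline
    return list(tokens[:hard_cap])  # forced stop at budget + grace


# Newline found shortly after the budget: the trace ends there.
graceful = cut_reasoning(["a"] * 10 + ["\n"] + ["b"] * 5, budget=8)
# No newline at all: the trace is cut at budget + grace tokens.
forced = cut_reasoning(["a"] * 1000, budget=100)
print(len(graceful), len(forced))  # 11 600
```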


In [None]:
from typing import Any, Dict, List
import openai
from transformers import AutoTokenizer


class ThinkingBudgetClient:
    def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
        self.base_url = base_url
        self.api_key = api_key
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

    def chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        reasoning_budget: int = 512,
        max_tokens: int = 1024,
        **kwargs,
    ) -> Dict[str, Any]:
        assert (
            max_tokens > reasoning_budget
        ), f"reasoning_budget must be smaller than max_tokens. Given {max_tokens=} and {reasoning_budget=}"

        # 1. first call chat completion to get reasoning content
        response = self.client.chat.completions.create(
            model=model, 
            messages=messages, 
            max_tokens=reasoning_budget, 
            **kwargs
        )
        
        reasoning_content = response.choices[0].message.reasoning_content or ""
        
        if "</think>" not in reasoning_content:
            # reasoning content is too long, closed with a period (.)
            reasoning_content = f"{reasoning_content}.\n</think>\n\n"
        
        reasoning_tokens_used = len(
            self.tokenizer.encode(reasoning_content, add_special_tokens=False)
        )
        remaining_tokens = max_tokens - reasoning_tokens_used
        
        assert (
            remaining_tokens > 0
        ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase max_tokens or lower reasoning_budget."

        # 2. append reasoning content to a copy of messages (the caller's list stays unchanged) and call completion
        messages = [*messages, {"role": "assistant", "content": reasoning_content}]
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            continue_final_message=True,
        )
        
        response = self.client.completions.create(
            model=model, 
            prompt=prompt, 
            max_tokens=remaining_tokens, 
            **kwargs
        )

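        response_data = {
            # Remove only the closing tag; str.strip("</think>") would strip any of
            # those characters from both ends of the text.
            "reasoning_content": reasoning_content.strip().removesuffix("</think>").strip(),
            "content": response.choices[0].text,
            "finish_reason": response.choices[0].finish_reason,
        }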
        return response_data

In [10]:
# Client
model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" # set this to the model you loaded

client = ThinkingBudgetClient(
    base_url="http://0.0.0.0:8000/v1",
    api_key="null",
    tokenizer_name_or_path=model_id
)

In [12]:
resp = client.chat_completion(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    temperature=1,
    max_tokens=512,
    reasoning_budget=128
)
print("Reasoning:", resp["reasoning_content"], "\nContent:", resp["content"])

Reasoning: We need to write a haiku about GPUs. A haiku is 5-7-5 syllable structure. Provide 3 lines, 5 syllables, 7 syllables, 5 syllables. Might mention graphics, processing, cores, etc. Provide a nice poetic haiku. Ensure correct syllable count.

Let's craft: "Silicon veins pulse / Parallel rivers compute dreams / Light builds worlds anew"

Check syllables:

Line1: "Silicon veins pulse" -> Sil-i-con (3) veins (1) pulse (1) = 5? Actually "Silicon" is 3 syllables (sil-i-con. 
Content: 
Silicon veins pulse  
Parallel rivers compute dreams  
Light builds worlds anew
