# Running NVIDIA Nemotron Nano 9B v2 with TensorRT-LLM

This notebook walks you through running the `nvidia/NVIDIA-Nemotron-Nano-9B-v2` model locally with TensorRT-LLM.

[TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) is NVIDIA’s open-source library for accelerating and optimizing LLM inference performance on NVIDIA GPUs.

For more details on the model, see the [model card](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard).

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended) and CUDA 12.x 
- Python 3.10+
- TensorRT-LLM (install it by following the NVIDIA documentation, or pull the [release container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.1.0rc4))

## Prerequisites & environment

Set up a containerized environment and Jupyter kernel for TensorRT-LLM.

If you use the TensorRT-LLM container mentioned above, connect your notebook to a Jupyter server running inside it.
One approach is:

```shell
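# On the host: start the container (port 8888 for Jupyter, port 8000 for the TensorRT-LLM server)
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -p 8888:8888 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc4

# Inside the container: install and start Jupyter
pip install jupyter
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root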
```
This should print a URL such as `http://127.0.0.1:8888/?token=xxxxxxxxxxxxxxxx`.

Then connect your notebook to the Jupyter server running in the Docker container by pasting that URL when selecting a kernel.

## Verify GPU

Check that CUDA is available and the GPU is detected correctly.


In [None]:
# Environment check
import sys

import tensorrt_llm
import torch

print(f"Python: {sys.version}")
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")
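
The prerequisites recommend at least 24 GB of VRAM, which the check above does not report. The following is a minimal sketch of a memory check; `vram_report` and `detect_devices` are hypothetical helper names, and the threshold is interpreted in decimal gigabytes so that nominal 24 GB cards pass.

```python
try:
    import torch
except ImportError:  # lets the pure helper below run without a GPU stack
    torch = None

GB = 1_000_000_000  # decimal gigabyte, matching how GPU memory is marketed


def vram_report(devices, min_gb=24.0):
    """Return (name, total_gb, meets_recommendation) for each (name, total_bytes) pair."""
    return [(name, total / GB, total / GB >= min_gb) for name, total in devices]


def detect_devices():
    """Collect (name, total_bytes) for every CUDA device visible to PyTorch."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_properties(i).total_memory)
        for i in range(torch.cuda.device_count())
    ]


for name, total_gb, ok in vram_report(detect_devices()):
    status = "OK" if ok else "below the recommended 24 GB"
    print(f"{name}: {total_gb:.1f} GB ({status})")
```

A GPU below the threshold may still work with a smaller `max_seq_len` or `max_batch_size`; the 24 GB figure is a recommendation, not a hard limit.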

## Loading the model

In [None]:
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,
)

# Load model
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    max_seq_len=32768,  # 32K tokens, covering prompt plus generated output
    max_batch_size=4,
    kv_cache_config=kv_cache_config,
)
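
`max_seq_len` bounds the total length of a request, that is, prompt tokens plus generated tokens. The sketch below shows one way to keep a request inside that budget before submitting it. `clamp_max_tokens` is a hypothetical helper, the numbers are illustrative, and how the runtime reacts to an over-budget request may differ between TensorRT-LLM versions.

```python
def clamp_max_tokens(prompt_tokens, requested_max_tokens, max_seq_len):
    """Return the largest output budget that keeps prompt + output within max_seq_len."""
    if prompt_tokens >= max_seq_len:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room within max_seq_len={max_seq_len}"
        )
    return min(requested_max_tokens, max_seq_len - prompt_tokens)


# Illustrative numbers only: an 8000-token prompt in an 8192-token window
print(clamp_max_tokens(prompt_tokens=8000, requested_max_tokens=512, max_seq_len=8192))  # prints 192
```

The prompt token count would come from tokenizing the prompt with the model's tokenizer.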

## Generate responses for single or batch prompts

Use `SamplingParams` to control generation and run single and batched prompts.

In [None]:
# Set sampling parameters
params = SamplingParams(
    max_tokens=512,
    temperature=0.6,
    top_p=0.95,
    add_special_tokens=False,
)
# Generate text
result = llm.generate(["Write a haiku about GPUs"], params)
print(result[0].outputs[0].text)

In [None]:
# Multiple prompts for batch generation
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:",
]

results = llm.generate(prompts, params)

for i, r in enumerate(results):
    print(f"\nPrompt {i + 1}: {prompts[i]!r}")
    print(r.outputs[0].text)
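
The prompts above are sent as raw text. For chat-style behavior, including the `/think` and `/no_think` reasoning toggle, one option is to format the prompt with the model's chat template first. The sketch below assumes the Hugging Face `transformers` tokenizer for this model ships a chat template; `build_chat_prompt` and `load_tokenizer` are hypothetical helper names.

```python
def build_chat_prompt(tokenizer, user_text, reasoning=True):
    """Render a system + user exchange into a single prompt string via the chat template."""
    messages = [
        {"role": "system", "content": "/think" if reasoning else "/no_think"},
        {"role": "user", "content": user_text},
    ]
    # tokenize=False returns text; add_generation_prompt=True appends the assistant header
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


def load_tokenizer(model_id="nvidia/NVIDIA-Nemotron-Nano-9B-v2"):
    """Load the Hugging Face tokenizer (downloads files on first use)."""
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)
```

The resulting string can then be passed to `llm.generate`, for example `llm.generate([build_chat_prompt(load_tokenizer(), "Write a haiku about GPUs", reasoning=False)], params)`.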

## OpenAI-compatible server

Start a local OpenAI-compatible server with TensorRT-LLM to serve the model.


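Run the command below in a terminal. The `<(...)` process substitution requires bash or a compatible shell.

(Optional) If you are running in Docker, you can stop the Jupyter server in the container, switch the notebook to another Python 3.10+ kernel, and start the `trtllm-serve` server in the container instead.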

```shell
trtllm-serve "nvidia/NVIDIA-Nemotron-Nano-9B-v2" \
  --host 0.0.0.0 --port 8000 \
  --max_seq_len 32768 \
  --max_batch_size 4 \
  --extra_llm_api_options <(echo '{"kv_cache_config":{"enable_block_reuse":false}}')
```


Your server is now running! 
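
Model loading can take some time, so requests sent right after launching the command may fail. A minimal sketch of a readiness check follows; it polls the `/v1/models` route and uses only the standard library. `wait_for_server` and `models_endpoint_ready` are hypothetical helper names, and the wait times are arbitrary defaults.

```python
import json
import time
import urllib.request


def models_endpoint_ready(base_url, timeout_s=5.0):
    """Return True if GET {base_url}/models answers with a non-empty model list."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):  # connection refused, timeout, or invalid JSON
        return False
    return isinstance(payload, dict) and bool(payload.get("data"))


def wait_for_server(base_url, max_wait_s=600.0, interval_s=5.0,
                    probe=models_endpoint_ready, sleep=time.sleep):
    """Poll `probe` until it succeeds or `max_wait_s` elapses; return the final status."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For the server started above, the call would be `wait_for_server("http://127.0.0.1:8000/v1")`.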

### Use the API

Use the OpenAI-compatible client to send requests to the local TensorRT-LLM server. 

Note: The model supports two modes, reasoning ON (default) and reasoning OFF. Toggle between them by including `/think` or `/no_think` in the "system" message.

The `/think` and `/no_think` keywords can also be placed in "user" messages for turn-level reasoning control.
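
As a small illustration of the system-level toggle, the sketch below adds the keyword to a message list without modifying the original. `with_reasoning` is a hypothetical helper, and placing the keyword on its own line ahead of existing system text is an assumption about formatting.

```python
def with_reasoning(messages, enabled):
    """Return a copy of `messages` whose system message carries /think or /no_think."""
    keyword = "/think" if enabled else "/no_think"
    copied = [dict(m) for m in messages]  # shallow copies so the input stays unchanged
    if copied and copied[0].get("role") == "system":
        rest = copied[0].get("content") or ""
        copied[0]["content"] = f"{keyword}\n{rest}" if rest else keyword
        return copied
    # No system message yet: add one that holds only the keyword
    return [{"role": "system", "content": keyword}] + copied


print(with_reasoning([{"role": "user", "content": "Hello"}], enabled=False))
# prints [{'role': 'system', 'content': '/no_think'}, {'role': 'user', 'content': 'Hello'}]
```

The helper does not remove a keyword that is already present, so apply it to messages that do not yet contain one.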

In [None]:
import requests
from openai import OpenAI

# Setup client
BASE_URL = "http://127.0.0.1:8000/v1"
API_KEY = "tensorrt_llm"
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# Get model ID
model_id = requests.get(f"{BASE_URL}/models", timeout=10).json()["data"][0]["id"]

# Basic chat completion
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Give me 3 bullet points about TensorRT-LLM."},
    ],
    temperature=0.6,
    max_tokens=256,
)
print("Response:", response.choices[0].message.content)

print("\n" + "=" * 50 + "\n")

# Streaming chat completion
print("Streaming response:")
stream = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a short haiku about GPUs."},
    ],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Some chunks may carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
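
If the complete text is needed after streaming, the pieces can be accumulated instead of printed. This is a minimal sketch; `collect_stream_text` is a hypothetical helper, and the demonstration uses hand-built stand-in chunks rather than a live server response.

```python
from types import SimpleNamespace


def collect_stream_text(stream):
    """Join the text deltas of a chat-completion stream into a single string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (for offline testing)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_stream = [_fake_chunk("Silicon "), _fake_chunk(None), SimpleNamespace(choices=[]), _fake_chunk("dreams")]
print(collect_stream_text(demo_stream))  # prints Silicon dreams
```

A stream can be consumed only once, so either print the chunks as they arrive or collect them, not both in separate loops.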

### Tool calling

Use the OpenAI tools schema to call functions via the TensorRT-LLM endpoint.

In [None]:
# Tool calling via OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill",
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied",
                    },
                },
                "required": ["bill_total", "tip_percentage"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model=model_id,  # reuse the model ID reported by the server in the previous cell
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"},
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    stream=False,
)

print(completion.choices[0].message.content)
print(completion.choices[0].message.tool_calls)
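
The endpoint only returns the requested call; running the function is left to the client. The sketch below shows one way to execute returned tool calls locally and package the results as OpenAI-style `tool` messages. The `calculate_tip` implementation, `LOCAL_TOOLS`, and `run_tool_calls` are hypothetical, and the demonstration uses a hand-built stand-in for a tool call rather than a real response.

```python
import json
from types import SimpleNamespace


def calculate_tip(bill_total, tip_percentage):
    """Hypothetical local implementation of the tool declared in TOOLS."""
    return bill_total * tip_percentage / 100


LOCAL_TOOLS = {"calculate_tip": calculate_tip}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return one `tool` role message per call."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        func = LOCAL_TOOLS.get(call.function.name)
        if func is None:
            content = json.dumps({"error": f"unknown tool {call.function.name}"})
        else:
            args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            content = json.dumps({"result": func(**args)})
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


# Offline demonstration with a stand-in shaped like one returned tool call
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="calculate_tip", arguments='{"bill_total": 50, "tip_percentage": 15}'),
)
print(run_tool_calls([fake_call]))
# prints [{'role': 'tool', 'tool_call_id': 'call_0', 'content': '{"result": 7.5}'}]
```

With a real response, pass `completion.choices[0].message.tool_calls` to `run_tool_calls`. In the usual OpenAI pattern, the assistant message and these `tool` messages are appended to the conversation and sent back for a final answer; whether the served chat template accepts `tool` role messages is an assumption to verify against your server.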