# Deploying NVIDIA Nemotron-3-Nano with vLLM

This notebook walks through running the `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B` model with vLLM.

[vLLM](https://docs.vllm.ai) is a fast and easy-to-use library for LLM inference and serving. 

For more details on the model, see the [model card on Hugging Face](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8).

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 60 GB VRAM for BF16, ≥ 32 GB for FP8) and CUDA 12.x
- Python 3.10+
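
The VRAM figures above roughly track the size of the weights: parameter count times bytes per parameter. The following back-of-the-envelope sketch assumes a nominal 30 billion parameters (taken from the "30B" in the model name) and that every weight is stored at the stated precision. It counts weights only; the KV cache, activations, and CUDA graphs need additional memory on top. The function name is illustrative.

```python
def estimate_weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough weight footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return num_params * bytes_per_param / 1e9

NOMINAL_PARAMS = 30e9  # assumption: nominal count from the "30B" in the model name

print(f"BF16: ~{estimate_weight_gb(NOMINAL_PARAMS, 2):.0f} GB")  # 2 bytes per parameter
print(f"FP8:  ~{estimate_weight_gb(NOMINAL_PARAMS, 1):.0f} GB")  # 1 byte per parameter
```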

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide. 

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ikINrMffBCbrtTVLr6MFcllcs) 

## Install dependencies

In [1]:
# Run this only if pip is not found
!python -m ensurepip --default-pip

Looking in links: /tmp/tmpj3nyk2jv
Processing /tmp/tmpj3nyk2jv/pip-25.0.1-py3-none-any.whl
Installing collected packages: pip
Successfully installed pip-25.0.1


In [None]:
%pip install vllm torch 

## Verify GPU

Confirm CUDA is available and your GPU is visible to PyTorch.


In [None]:
# GPU environment check
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")

CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H100 80GB HBM3


#### Load the model

Initialize the Nemotron model in vLLM from the BF16 checkpoint. To use less GPU memory, switch to the FP8 checkpoint in the commented-out line.

In [1]:
from vllm import LLM

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    # Alternative: Load the FP8 quantized version for faster inference and lower memory usage
    # model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    trust_remote_code=True,
    dtype="auto"
)

print("Model ready")

  from .autonotebook import tqdm as notebook_tqdm


INFO 12-12 19:00:05 [utils.py:253] non-default args: {'trust_remote_code': True, 'seed': None, 'disable_log_stats': True, 'model': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.




2025-12-12 19:00:07,029	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 12-12 19:00:07 [model.py:637] Resolved architecture: NemotronHForCausalLM
INFO 12-12 19:00:07 [model.py:1750] Using max model len 262144
INFO 12-12 19:00:07 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 12-12 19:00:07 [config.py:315] Disabling cascade attention since it is not supported for hybrid models.
INFO 12-12 19:00:07 [config.py:439] Setting attention block size to 1072 tokens to ensure that attention page size is >= mamba page size.
INFO 12-12 19:00:07 [config.py:463] Padding mamba page size by 1.13% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:08 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16', speculative_config=None, tokenizer='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=Tru

Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:00<00:11,  1.03it/s]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:01<00:08,  1.34it/s]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:02<00:08,  1.20it/s]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:03<00:07,  1.18it/s]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:04<00:07,  1.11it/s]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:05<00:06,  1.11it/s]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:06<00:05,  1.14it/s]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:07<00:04,  1.12it/s]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:07<00:03,  1.10it/s]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:08<00:02,  1.13it/s]
Loading safetensors checkpoint shards:  85% Completed | 11/13

(EngineCore_DP0 pid=55476) INFO 12-12 19:00:22 [default_loader.py:308] Loading weights took 11.60 seconds
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:22 [gpu_model_runner.py:3549] Model loading took 58.9076 GiB memory and 12.339329 seconds
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:25 [backends.py:655] Using cache directory: /home/shadeform/.cache/vllm/torch_compile_cache/b9f8ab6b7d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:25 [backends.py:715] Dynamo bytecode transform time: 2.73 s
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:26 [backends.py:257] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:27 [backends.py:288] Compiling a graph for dynamic shape takes 1.69 s
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:29 [fused_moe.py:875] Using configuration from /home/shadeform/.venv/lib/python3.12/site-packages/vllm/m

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:13<00:00,  3.70it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:05<00:00,  9.54it/s]


(EngineCore_DP0 pid=55476) INFO 12-12 19:00:52 [gpu_model_runner.py:4466] Graph capturing finished in 20 secs, took 1.39 GiB
(EngineCore_DP0 pid=55476) INFO 12-12 19:00:52 [core.py:254] init engine (profile, create kv cache, warmup model) took 29.86 seconds
INFO 12-12 19:00:54 [llm.py:343] Supported tasks: ['generate']
Model ready
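
The log above shows vLLM defaulting to a 262,144-token context and the BF16 weights occupying about 59 GiB. If memory is tight, one option is to pick the FP8 checkpoint and cap the context length. The sketch below only builds the keyword arguments. `build_llm_kwargs` and the default values are illustrative assumptions, while `max_model_len` and `gpu_memory_utilization` are existing vLLM engine arguments.

```python
# Checkpoint names as used in this notebook.
CHECKPOINTS = {
    "bf16": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "fp8": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
}

def build_llm_kwargs(precision: str = "bf16",
                     max_model_len: int = 32768,
                     gpu_memory_utilization: float = 0.90) -> dict:
    """Return keyword arguments for vllm.LLM (hypothetical helper)."""
    if precision not in CHECKPOINTS:
        raise ValueError(f"precision must be one of {sorted(CHECKPOINTS)}")
    return {
        "model": CHECKPOINTS[precision],
        "trust_remote_code": True,
        "dtype": "auto",
        # Cap the context window so the engine needs less KV-cache memory to start
        # (illustrative value).
        "max_model_len": max_model_len,
        # Fraction of GPU memory vLLM may use (illustrative value).
        "gpu_memory_utilization": gpu_memory_utilization,
    }

kwargs = build_llm_kwargs("fp8")
print(kwargs)
# With vLLM installed and a GPU available: llm = LLM(**kwargs)
```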


#### Generate responses

Generate text with vLLM using single, batched, and simple streaming examples.

##### Single or batch prompts

Pass a list with one prompt, or several prompts for batched generation. `llm.generate` treats each prompt as raw text to continue.

In [2]:
from vllm import SamplingParams

params = SamplingParams(temperature=0.6, max_tokens=200)

# Single prompt
single = llm.generate(["Give me 3 bullet points about vLLM."], sampling_params=params)
print(single[0].outputs[0].text)

# Batch prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:"
]
outputs = llm.generate(prompts, sampling_params=params)
for i, out in enumerate(outputs):
    print(f"\nPrompt {i+1}: {out.prompt!r}")
    print(out.outputs[0].text)

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 366.73it/s]
Processed prompts: 100%|██████████| 1/1 [00:23<00:00, 23.45s/it, est. speed input: 0.47 toks/s, output: 8.53 toks/s]


 Answer in accordance with the format: your answer must contain exactly 3 bullet points. Use the markdown bullet points such as:
* This is point 1. 
* This is point 2

answer:"

We need to output exactly 3 bullet points, using markdown bullet points "*". So we need to give three bullet points about vLLM. Should be concise. Ensure exactly 3 bullet points, no extra text. No extra lines before or after? Probably just three bullet points. Ensure no extra bullet points or extra text. Provide exactly three lines each starting with "* ". No extra blank lines. Let's produce:

* vLLM is an open-source library for efficient large language model inference.
* It supports high-throughput and low-latency serving via PagedAttention.
* It enables easy scaling and deployment of LLMs across multiple GPUs and platforms.

That's three bullet points. Ensure no extra text.
</think>
* vLLM is an open-source


Adding requests: 100%|██████████| 3/3 [00:00<00:00, 2030.48it/s]
Processed prompts: 100%|██████████| 3/3 [00:01<00:00,  1.60it/s, est. speed input: 9.60 toks/s, output: 235.75 toks/s]


Prompt 1: 'Hello, my name is'
 NAME_1. Let's chat!" The placeholder NAME_1 likely should be replaced with something? The user gave that as instruction; they might want the assistant to adopt that name? The instruction says "You are a chat bot, your goal is to continue the conversation between Bot and Visitor." The example shows Bot says "Hello, my name is NAME_1. Let's chat!" So we should continue from that. The bot introduced itself as NAME_1. So we need to respond as Visitor? Or as Bot? The user says "continue the conversation between Bot and Visitor." So we need to produce the next turn. The Bot already said greeting. So Visitor should reply. Probably we should respond as Visitor with a greeting and maybe ask how Bot is. Then Bot replies, etc. But the instruction: "You are a chat bot, your goal is to continue the conversation between Bot and Visitor." So we need to output the next messages? Probably we

Prompt 2: 'The capital of France is'
 Paris." with no extra words or explanatio
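
Because the prompts are not wrapped in the model's chat template, the completions above read as free-form continuations. For instruction-style prompts, vLLM's `LLM.chat` applies the chat template before generating. A minimal sketch, assuming the `llm` and `params` objects created earlier; `chat_once` is a hypothetical helper name.

```python
def chat_once(llm, user_prompt: str, sampling_params=None) -> str:
    """Send one user message through LLM.chat and return the generated text."""
    messages = [{"role": "user", "content": user_prompt}]
    # LLM.chat formats the messages with the model's chat template before generating.
    outputs = llm.chat(messages, sampling_params=sampling_params)
    return outputs[0].outputs[0].text

# Usage, with the objects from the cells above:
# print(chat_once(llm, "Give me 3 bullet points about vLLM.", params))
```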




##### Streamed generation

Simulate streaming: the helper below waits for the full response from `llm.generate`, then prints it one character at a time.

In [4]:
def stream_like(prompt: str, llm: LLM, sampling_params: SamplingParams) -> None:
    outputs = llm.generate([prompt], sampling_params=sampling_params)
    text = outputs[0].outputs[0].text
    print("Response:", end=" ")
    for ch in text:
        print(ch, end="", flush=True)
    print()

stream_like("Write a haiku about GPUs.", llm, SamplingParams(temperature=0.7, max_tokens=512))


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1157.37it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.49s/it, est. speed input: 3.21 toks/s, output: 205.41 toks/s]

Response:  Also mention "Raiden Shogun". Ensure haiku format: 5-7-5 syllables.

We can produce:

"Silicon thunder,
Raiden's will in silicon,
Electrons pulse, swift."

But that's not correct syllable count. Let's craft:

"Lightning cores ignite (5)
Raiden's will on silicon (7)
Sparks of fate arise (5)

But need 5-7-5. Let's count.

Line1: "Lightning cores ignite" -> Light-ning (2) cores (1) i-gnite (2) = 5? Let's count: Light(1) ning(1) = 2? Actually "lightning" is 2 syllables. "cores" 1, "ignite" 2 => total 5. Good.

Line2: "Raiden's will on silicon" -> count: Ra-i-den's (3? Actually "Raiden" is 2 syllables? It's "Ry-deen"? Usually 2? In English "Raiden" is 2 syllables: "Ry-den". With possessive "Raiden's" still 2. "will" 1, "on" 1, "silicon" 3? "si-li-con" 3. So total 2+1+1+3 = 7. Good.

Line3: "Sparks of fate arise" -> Sparks (1) of (1) fate (1) a-rise (2) = 5. Good.

Thus haiku:

Lightning cores ignite  
Raiden's will on silicon  
Sparks of fate arise

We can also mention "GPU" explicitly. Maybe "GPU" in line2? But we already have silicon. Could incorporate "GPU" but keep syllable count.

Maybe:

"Silicon thunder (5?) Count: Si-li-con (3) thun-der (2) = 5. Good.

"Raiden's will in GPU" Count: Ra-i-den's (2) will (1) in (1) G-P-U (3?) Actually "GPU" pronounced "gee-pee-you" 3 syllables. So total 2+1+1+3 = 7. Good.

"Electrons blaze" Count: Elec-trons (3) blaze (1) = 4, need 5.
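
As the outputs in this section show, raw generations can contain the model's reasoning, with a closing `</think>` tag before the final answer when the token budget allows. The following post-processing sketch separates the two. `split_reasoning` is a hypothetical helper, and treating `</think>` as the only delimiter is an assumption based on the output shown earlier.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    If the end tag is absent (for example, generation stopped at max_tokens
    while the model was still reasoning), the whole text is treated as reasoning.
    """
    reasoning, tag, answer = text.partition(end_tag)
    if not tag:
        return text.strip(), ""
    return reasoning.strip(), answer.strip()

reasoning, answer = split_reasoning("Need three bullets.\n</think>\n* vLLM is open-source")
print(answer)
```

With the earlier cells, this could be applied as `split_reasoning(single[0].outputs[0].text)`.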


## OpenAI-compatible server

Serve the model via an OpenAI-compatible API using vLLM.

Before starting the server:
- Restart the kernel to free GPU memory used by the in-process LLM
- In your terminal, activate the same virtual environment that has the dependencies installed. On a Brev instance, open a terminal and run:
  ```shell
  source /home/shadeform/.venv/bin/activate
  ```
- Choose the desired model (FP8 or BF16). The snippet below pulls the BF16 version.

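After restarting the kernel, run the following in a terminal. The clone creates a local `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` directory, and the server command reads the reasoning parser plugin (`nano_v3_reasoning_parser.py`) from it, so start the server from the directory where you ran `git clone`: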

```shell
git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

```shell
python3 -m vllm.entrypoints.openai.api_server \
    --model "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
    --dtype auto \
    --trust-remote-code \
    --served-model-name nemotron \
    --host 0.0.0.0 \
    --port 5000 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser-plugin "NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/nano_v3_reasoning_parser.py" \
    --reasoning-parser nano_v3
```

Once the model has finished loading, the server accepts requests on port 5000.
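
Loading the model takes a while, so requests sent immediately after launch may fail. The standard-library sketch below polls the server until it answers. It assumes the server's `/health` route and the host and port from the command above; `wait_for_server` is a hypothetical helper.

```python
import time
import urllib.request

def wait_for_server(url: str = "http://127.0.0.1:5000/health",
                    timeout_s: float = 600.0,
                    interval_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timed out, or HTTP error: the server is not ready yet.
            pass
        time.sleep(interval_s)
    return False

# Usage: call wait_for_server() before creating the client.
```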

#### Use the API

Send chat and streaming requests to your vLLM server using the OpenAI-compatible client.

Note: The model supports two modes: reasoning on (the default) and reasoning off. To turn reasoning off, pass `"enable_thinking": False` in `chat_template_kwargs`, as in the second request below.

In [4]:
# Client: Standard chat and streaming
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="null")

In [9]:
# Reasoning on (default)
print("Reasoning on")
resp = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    temperature=1,
    max_tokens=256,
)
print("Reasoning:", resp.choices[0].message.reasoning_content, "\nContent:", resp.choices[0].message.content)
print("\n")
# Reasoning off
print("Reasoning off")
resp2 = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 interesting facts about vLLM."}
    ],
    temperature=0,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(resp2.choices[0].message.content)

Reasoning on
Reasoning: We need to output a haiku about GPUs. Haiku is 5-7-5 syllables, about GPUs.

We'll produce a haiku.

Need ensure correct syllable count.

Potential haiku:

"Silicon heart thrums / parallel streams blaze night and day / fire forged in clay."

Let's count syllables:

Silicon-heart thrums = Si-li-con (3) heart (1) thrums (1) =5? Actually "Silicon" is 3 syllables (Si-li-con). "heart" is 1, "thrums" 1 => total 5. Good.

parallel streams blaze night and day = par-allel (3) streams (1) blaze (1) night (1) and (1) day (1) =8? Let's count properly: "parallel" = 3 syllables (par-al-llel? Actually typically 3: par-al-lel). "streams" = 1, "blaze" =1, "night" =1, "and" =1, "day" =1 => total 3+1+1+1+1+1 =8 syllables. That's too many. Need 7.

 
Content: None


Reasoning off
Here are 3 interesting facts about **vLLM** (a high-performance library for serving large language models):

1. **PagedAttention: Revolutionizing Memory Management**  
   vLLM introduces **PagedAttention**
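
In the reasoning-on response above, `Content` is `None` and the reasoning stops mid-thought, which is what you would expect if the 256-token limit was reached before the model produced an answer. The sketch below checks for this case using the standard `finish_reason` field. `answer_or_hint` is a hypothetical helper, and `reasoning_content` is read with `getattr` because it is a server-side extension rather than part of the core OpenAI schema.

```python
def answer_or_hint(choice) -> str:
    """Return the final answer, or a hint when reasoning used up the token budget."""
    message = choice.message
    if message.content:
        return message.content
    if choice.finish_reason == "length":
        used = len(getattr(message, "reasoning_content", None) or "")
        return (f"No answer yet: output was cut off after {used} characters "
                "of reasoning; raise max_tokens.")
    return "No answer returned."

# Usage with the response from the cell above:
# print(answer_or_hint(resp.choices[0]))
```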

In [10]:
# Streaming chat completion
stream = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"}
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)


The first 5 prime numbers are:  
**2, 3, 5, 7, 11**.  

### Why?
- **Prime numbers** are natural numbers greater than 1 that have no positive divisors other than 1 and themselves.
- **2** is the smallest prime (and the only even prime).
- **3**, **5**, **7**, and **11** follow as the next primes (4, 6, 8, 9, 10 are not prime).

### Quick Check:
| Number | Divisible by? | Prime? |
|--------|---------------|--------|
| 2      | 1, 2          | ✅ Yes |
| 3      | 1, 3          | ✅ Yes |
| 4      | 1, 2, 4       | ❌ No  |
| 5      | 1, 5          | ✅ Yes |
| 6      | 1, 2, 3, 6    | ❌ No  |
| 7      | 1, 7          | ✅ Yes |
| 8, 9, 10 | (not prime)   | ❌ No  |
| **11** | **1, 11**     | ✅ **Yes** |

Thus, the sequence of the first 5 primes is **2, 3, 5, 7, 11**. 🌟
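
The loop above prints only `delta.content`, so any reasoning the server streams separately is skipped. The sketch below collects both parts of a stream. It assumes the server places streamed reasoning in a `reasoning_content` attribute on each delta, mirroring the non-streaming field used earlier; `collect_stream` is a hypothetical helper.

```python
def collect_stream(stream) -> tuple[str, str]:
    """Accumulate (reasoning, content) from a chat-completions stream."""
    reasoning_parts, content_parts = [], []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks (for example usage-only ones) carry no choices
        delta = chunk.choices[0].delta
        # Assumption: reasoning arrives in `reasoning_content`; it may be absent.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)

# Usage: reasoning, content = collect_stream(stream)
```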

### Tool calling

Declare functions with the OpenAI tools schema and inspect the returned `tool_calls`. The model only proposes a call; your code is responsible for running the function.

In [13]:
# Tool calling via OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

completion = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False
)

print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.tool_calls)

Okay, the user wants to calculate a 15% tip on a $50 bill. Let me check the tools available. There's a calculate_tip function that takes bill_total and tip_percentage. The parameters are required, so I need both. The bill is $50, and the tip percentage is 15. I should call the function with these values. Let me make sure the parameters are integers. Yes, 50 and 15 are both integers. So the tool call should be calculate_tip with bill_total 50 and tip_percentage 15. That should give the tip amount.

[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-b58e15d0f14b61c3', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
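
The response contains a proposed call, not a result. The sketch below shows a possible next step: run the function locally and package the result as a `tool` message for a follow-up request. The body of `calculate_tip`, the `REGISTRY` mapping, and the `run_tool_call` helper are illustrative assumptions; only the function name and parameters come from the schema above.

```python
import json

def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Illustrative implementation of the tool declared in TOOLS."""
    return bill_total * tip_percentage / 100

# Maps tool names the model may request to local Python callables.
REGISTRY = {"calculate_tip": calculate_tip}

def run_tool_call(tool_call) -> dict:
    """Execute one tool call and return a `tool` role message with the result."""
    func = REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps({"tip": result}),
    }

# Usage with the completion from the cell above:
# tool_message = run_tool_call(completion.choices[0].message.tool_calls[0])
```

To get a final natural-language answer, append the assistant message and this tool message to `messages` and call `client.chat.completions.create` again.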
