# Running NVIDIA Nemotron Nano 2 VL with vLLM

This notebook walks through running the `nvidia/Nemotron-Nano-12B-v2-VL-BF16` model locally with vLLM.

[vLLM](https://docs.vllm.ai) is a fast and easy-to-use library for LLM inference and serving. 

For more details on the model [click here](TODO)

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x 
- Python 3.10+

## Prerequisites & environment

Set up a clean Python environment for running vLLM locally.

Create and activate a virtual environment. The sample here uses Conda but feel free to choose whichever tool you prefer.
Run these commands in a terminal before using this notebook:

```bash
conda create -n nemotron-vllm-env python=3.10 -y
conda activate nemotron-vllm-env
```

If you are running the notebook locally, install `ipykernel` and switch the kernel to this environment:
- Install `ipykernel`:
```bash
pip install ipykernel
```
- Kernel → Change kernel → Python (nemotron-vllm-env)

## Install dependencies

In [None]:
!VLLM_USE_PRECOMPILED=1 pip install git+https://github.com/vllm-project/vllm.git@main
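
Because the command above installs from the `main` branch, the resulting version changes over time. A minimal sketch for recording which versions ended up in the environment, using only the standard library. The `installed_version` helper is an illustrative name, not part of vLLM:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for name in ("vllm", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Noting the reported versions alongside any results makes it easier to reproduce a run later.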

## Verify GPU

Confirm CUDA is available and your GPU is visible to PyTorch.


In [None]:
# GPU environment check
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")
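
The prerequisites also call for Python 3.10+, a BF16-capable GPU, and roughly 24 GB of VRAM. The sketch below checks those as well. It assumes the 24 GB recommendation is meant in decimal gigabytes, and the helper names are illustrative:

```python
import sys

MIN_VRAM_GB = 24.0  # recommendation from the prerequisites list (assumed decimal GB)


def bytes_to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


def has_enough_vram(total_bytes: int, min_gb: float = MIN_VRAM_GB) -> bool:
    """True when the device memory meets the recommended minimum."""
    return bytes_to_gb(total_bytes) >= min_gb


print(f"Python 3.10+: {sys.version_info >= (3, 10)}")

try:
    import torch
except ImportError:  # lets the helpers above be reused without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        print(f"GPU[{i}]: {bytes_to_gb(total):.1f} GB, "
              f"meets recommendation: {has_enough_vram(total)}")
```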

## Load the model
Initialize the Nemotron model in vLLM with BF16 for efficient GPU inference.

In [None]:
from vllm import LLM

vlm = LLM(
    model="nvidia/Nemotron-Nano-12B-v2-VL-BF16",
    trust_remote_code=True,
    dtype="bfloat16",
)

print("Model ready")
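
If the default settings exhaust GPU memory, the `LLM` constructor accepts options that bound the context length and the share of GPU memory vLLM reserves. A hedged sketch follows. The numeric values are illustrative assumptions, not tuned recommendations for this model, and should be adjusted to your hardware:

```python
# Keyword arguments for vllm.LLM; the numeric values are placeholders to tune.
llm_kwargs = {
    "model": "nvidia/Nemotron-Nano-12B-v2-VL-BF16",
    "trust_remote_code": True,
    "dtype": "bfloat16",
    "max_model_len": 16384,               # cap on prompt + generated tokens per request
    "gpu_memory_utilization": 0.90,       # fraction of GPU memory vLLM may use
    "limit_mm_per_prompt": {"image": 1},  # at most one image per prompt
}

for key, value in llm_kwargs.items():
    print(f"{key} = {value!r}")

# To apply the settings, pass them to the constructor instead of the call above:
# vlm = LLM(**llm_kwargs)
```

Lowering `max_model_len` reduces the memory needed for the KV cache, at the cost of shorter prompts and responses.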

## Generate responses

Once the model has loaded, you can generate responses from prompts that combine text and images.

### Single or batch prompts

Send one prompt or a list to run batched generation.

In [None]:
from vllm import SamplingParams
from PIL import Image

# temperature=0.0 selects greedy decoding
params = SamplingParams(temperature=0.0, max_tokens=1024)

# Replace with the path to an image file on your machine
image1 = Image.open("example_image1.png")

# Single prompt
single = vlm.generate({
    "prompt": "<image>\nDescribe the image in detail.",
    "multi_modal_data": {"image": image1}}, sampling_params=params)
print(single[0].outputs[0].text)


# Batch prompts
image2 = Image.open("example_image2.png")
prompts = [
    {
        "prompt": "<image>\nDescribe the image in detail.",
        "multi_modal_data": {"image": image1},
    },
    {
        "prompt": "<image>\nWhat color bars are used in the image?",
        "multi_modal_data": {"image": image2},
    },
]
outputs = vlm.generate(prompts, sampling_params=params)
for i, out in enumerate(outputs):
    print(f"\nPrompt {i+1}: {out.prompt!r}")
    print(out.outputs[0].text)
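
Each prompt above follows the same shape: an `<image>` placeholder, a newline, the question, and the image under `multi_modal_data`. A small sketch that builds those dictionaries from (question, image) pairs. `build_image_prompt` and `build_batch` are hypothetical helper names introduced here for illustration:

```python
from typing import Any, Dict, List, Tuple


def build_image_prompt(question: str, image: Any) -> Dict[str, Any]:
    """Build one prompt dict in the format used in the cell above."""
    return {
        "prompt": f"<image>\n{question}",
        "multi_modal_data": {"image": image},
    }


def build_batch(pairs: List[Tuple[str, Any]]) -> List[Dict[str, Any]]:
    """Build a list of prompt dicts from (question, image) pairs."""
    return [build_image_prompt(question, image) for question, image in pairs]


# Usage with the objects defined earlier in the notebook:
# prompts = build_batch([
#     ("Describe the image in detail.", image1),
#     ("What color bars are used in the image?", image2),
# ])
# outputs = vlm.generate(prompts, sampling_params=params)
```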

### Streamed generation

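Print the response character by character to mimic streaming. `LLM.generate` returns only after the full completion is available, so this simulates the effect rather than emitting tokens as they are produced.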

In [None]:
def stream_like(prompt: str, llm: LLM, sampling_params: SamplingParams) -> None:
    outputs = llm.generate([prompt], sampling_params=sampling_params)
    text = outputs[0].outputs[0].text
    print("Response:", end=" ")
    for ch in text:
        print(ch, end="", flush=True)
    print()

stream_like("Write a haiku about GPUs.", vlm, SamplingParams(temperature=0.7, max_tokens=80))


## OpenAI-compatible server 

Serve the model via an OpenAI-compatible API using vLLM.

Before starting the server:
- Restart the kernel to free GPU memory used by the in-process LLM
- Ensure your terminal uses the same virtual environment, with the dependencies installed
- Use `--video-pruning-rate` to set the Efficient Video Sampling (EVS) pruning rate. The default is 0.

After restarting the kernel, run this in a terminal:

```shell
git clone https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16
git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2

vllm serve nvidia/Nemotron-Nano-12B-v2-VL-BF16 \
  --port 8033 \
  --trust-remote-code \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --tool-parser-plugin "NVIDIA-Nemotron-Nano-9B-v2/nemotron_toolcall_parser_no_streaming.py" \
  --tool-call-parser "nemotron_json"
```

The server is ready to accept requests once model loading finishes.
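
Loading the model can take a while, so it helps to confirm the server is reachable before sending requests. A minimal sketch that polls the OpenAI-compatible `/v1/models` endpoint with the standard library. The default `base_url` assumes the port used by the client later in this notebook, and `wait_for_server` is an illustrative helper name:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://127.0.0.1:8033",
                    timeout_s: float = 600.0,
                    poll_s: float = 5.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(poll_s)
    return False


# Usage (blocks until the server answers or the timeout expires):
# print("Server ready:", wait_for_server())
```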

### Use the API

Send chat and streaming requests to your local vLLM server using the OpenAI-compatible client.

Note: The model supports two modes, reasoning ON (default) and reasoning OFF. Toggle between them by passing `/think` or `/no_think` in the "system" message.

The `/think` and `/no_think` keywords can also be placed in "user" messages for turn-level reasoning control.
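
The toggle described above can be wrapped in a small helper. This is a sketch that assumes the incoming message list has no system message yet. `with_reasoning` is a hypothetical name, not part of vLLM or the OpenAI client:

```python
from typing import Any, Dict, List


def with_reasoning(messages: List[Dict[str, Any]], enabled: bool) -> List[Dict[str, Any]]:
    """Return a new message list with a leading system message that sets the mode."""
    keyword = "/think" if enabled else "/no_think"
    return [{"role": "system", "content": keyword}, *messages]


# Example: the same user turn with reasoning switched off.
messages = with_reasoning(
    [{"role": "user", "content": "Summarize this image in one sentence."}],
    enabled=False,
)
print(messages[0])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.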

In [None]:
# Client: Standard chat and streaming
from openai import OpenAI

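# The port must match the one the vLLM server listens on.
# vLLM does not check the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://127.0.0.1:8033/v1", api_key="null")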

# Simple chat completion
resp = client.chat.completions.create(
    model="nvidia/Nemotron-Nano-12B-v2-VL-BF16",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": [
            {"type": "text", "text": "Give me 3 interesting facts about this image."},
            {"type": "image_url", "image_url": {"url": "https://blogs.nvidia.com/wp-content/uploads/2025/08/gamescom-g-assist-nv-blog-1280x680-1.jpg"}},
        ]},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(resp.choices[0].message.content)

# Streaming chat completion
stream = client.chat.completions.create(
    model="nvidia/Nemotron-Nano-12B-v2-VL-BF16",
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this video in detail."},
            {"type": "video_url", "video_url": {"url": "https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4"}},
        ]},
    ],
    temperature=0.0,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
print()
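
The requests above reference media by public URL. For a local image, a common approach with OpenAI-compatible servers is to embed the file as a base64 data URL in the same `image_url` field. The sketch below assumes the file extension identifies the image type. `image_to_data_url` and `image_part` are illustrative helper names:

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build an `image_url` content part like the one in the request above."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}


# Usage inside a user message:
# {"role": "user", "content": [
#     {"type": "text", "text": "Describe the image in detail."},
#     image_part("example_image1.png"),
# ]}
```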


### Tool calling

Call functions using the OpenAI Tools schema and inspect returned tool_calls.

In [None]:
# Tool calling via OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

completion = client.chat.completions.create(
    model="nvidia/Nemotron-Nano-12B-v2-VL-BF16",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    stream=False
)

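# `content` may be None when the model replies with a tool call only
print(completion.choices[0].message.content)
# Each entry holds the function name and its JSON-encoded arguments
print(completion.choices[0].message.tool_calls)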
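
The model only proposes a tool call; running the function is left to the client. A minimal sketch of that step, assuming each entry of `tool_calls` exposes `function.name` and a JSON string in `function.arguments`, as in the OpenAI client. The local `calculate_tip` implementation and the `run_tool_call` dispatcher are illustrative:

```python
import json
from types import SimpleNamespace


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Local implementation matching the schema declared in TOOLS."""
    return bill_total * tip_percentage / 100


# Maps tool names from the schema to local Python callables.
LOCAL_TOOLS = {"calculate_tip": calculate_tip}


def run_tool_call(tool_call):
    """Execute one tool call object and return the function's result."""
    fn = LOCAL_TOOLS[tool_call.function.name]
    kwargs = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    return fn(**kwargs)


# Stand-in with the same attributes as an entry of `message.tool_calls`,
# so the sketch runs without a server.
fake_call = SimpleNamespace(
    function=SimpleNamespace(
        name="calculate_tip",
        arguments='{"bill_total": 50, "tip_percentage": 15}',
    )
)
print(run_tool_call(fake_call))  # 7.5
```

With a live server, the same function can be applied to each item in `completion.choices[0].message.tool_calls` when that list is not empty.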