# Running Nemotron-49B-v1.5 with vLLM on NVIDIA GPUs

This notebook shows how to run the `Nemotron-49B-v1.5` model with vLLM, a high-throughput library for LLM inference and serving.

This notebook is divided into two parts:
- **Part 1:** Demonstrates how to use the direct vLLM Python API for inference, including batch generation and pseudo-streaming.
- **Part 2:** Covers how to deploy the model with an OpenAI-compatible web server for chat, streaming, and reasoning-mode control, with tool calling enabled at launch.

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)

- Model card: [nvidia/Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)
- vLLM Docs: [https://docs.vllm.ai/](https://docs.vllm.ai/)

## Table of Contents

- [Part 1: Inference with the Python API](#Part-1:-Inference-with-the-Python-API)
  - [Prerequisites](#Prerequisites)
  - [Setup](#Setup)
  - [Loading the Model](#Loading-the-Model)
  - [Single and Batch Generation](#Single-and-Batch-Generation)
  - [Streaming (Pseudo)](#Streaming-(Pseudo))
- [Part 2: OpenAI-Compatible Server](#Part-2:-OpenAI-Compatible-Server)
  - [Launch Server](#Launch-Server)
  - [Client Setup](#Client-Setup)
  - [Chat and Streaming](#Chat-and-Streaming)
  - [Reasoning Modes (`think` vs. `no_think`)](#Reasoning-Modes-(`think`-vs.-`no_think`))
  - [Interaction with `curl`](#Interaction-with-`curl`)
  - [Cleanup](#Cleanup)
- [Resource Notes](#Resource-Notes)
- [Conclusion](#Conclusion)


## Part 1: Inference with the Python API

### Prerequisites

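**Hardware:** The run recorded in this notebook used a single NVIDIA H200 with `tensor_parallel_size=1`. In BF16 the weights alone take roughly 98 GB (49B parameters × 2 bytes), so on GPUs with less memory you need **2 or more NVIDIA GPUs** and a matching `tensor_parallel_size`.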
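
As a rough sizing aid, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is an approximation only: it ignores the KV cache and activations, the helper names are illustrative, and the 80 GB and 141 GB figures are example GPU sizes rather than requirements stated by the model card.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(
    num_params: float,
    gpu_mem_gb: float,
    bytes_per_param: int = 2,
    utilization: float = 0.95,
) -> int:
    """Smallest GPU count whose usable memory covers the weights.

    `utilization` mirrors vLLM's gpu_memory_utilization setting.
    KV cache and activations are NOT included, so treat this as a lower bound.
    """
    usable_gb = gpu_mem_gb * utilization
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / usable_gb)


print(weight_memory_gb(49e9))            # BF16 weights, 2 bytes per parameter
print(min_gpus_for_weights(49e9, 80))    # example: 80 GB GPUs
print(min_gpus_for_weights(49e9, 141))   # example: 141 GB GPUs
```

In practice the tensor-parallel size usually needs to be a value the model's attention heads divide evenly into, so round the result up to a supported count.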

**Software:**
- Python 3.10+
- CUDA 12.x
- PyTorch 2.3+
- vLLM 0.10.x


### Setup


In [None]:
# Install dependencies
%pip install -U "vllm>=0.10.2,<0.11" transformers torch "flashinfer-python>=0.1.6" openai

/home/ubuntu/.venv/bin/python3: No module named pip
Note: you may need to restart the kernel to use updated packages.


In [None]:
# GPU environment check
import torch
import platform

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU[{i}]: {props.name} | SM count: {props.multi_processor_count} | Mem: {props.total_memory / 1e9:.2f} GB")




CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H200


### Loading the Model

In [None]:
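from vllm import LLM, SamplingParams

# Hugging Face repo ID (underscores, matching the server command in Part 2)
MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"

llm = LLM(
    model=MODEL_ID,
    dtype="bfloat16",
    trust_remote_code=True,       # allow custom code from the model repo
    max_model_len=65536,          # maximum context length to plan KV cache for
    gpu_memory_utilization=0.95,  # fraction of each GPU's memory vLLM may use
    tensor_parallel_size=1,       # number of GPUs to shard the model across
)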

print("Model ready")


### Showcasing Reasoning Modes: `think` vs. `no_think`

The Nemotron model supports two reasoning modes, which can be controlled via the system message:

1.  **Reasoning ON (default):** The model generates a `<think>` block with its reasoning process before the answer.
2.  **Reasoning OFF (`/no_think`):** By adding `/no_think` to the system prompt, the model provides a direct answer without the `<think>` block. This is useful for simple tasks where you want a concise response.

The `llm.generate` calls in Part 1 send raw prompts without applying the model's chat template, so this feature is demonstrated in the OpenAI-compatible server section (Part 2).
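
The toggle described above can be sketched as a small helper that builds the chat messages. This is a minimal illustration: `build_messages` and its default system prompt are hypothetical names, and the only behavior taken from the text is that appending `/no_think` to the system message turns reasoning off.

```python
def build_messages(
    user_prompt: str,
    reasoning: bool = True,
    system_prompt: str = "You are a helpful assistant.",
) -> list[dict]:
    """Build chat messages, appending /no_think when reasoning is disabled."""
    system = system_prompt if reasoning else f"{system_prompt}\n/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


# Reasoning ON: the system prompt is left unchanged.
print(build_messages("What is 2 + 2?")[0]["content"])

# Reasoning OFF: /no_think is appended on its own line.
print(build_messages("What is 2 + 2?", reasoning=False)[0]["content"])
```

The resulting list has the shape a chat endpoint expects, so it only takes effect when something applies the chat template to it.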


### Single and Batch Generation


In [None]:
from vllm import SamplingParams

params = SamplingParams(temperature=0.6, max_tokens=200)

# Single prompt
single = llm.generate(["What is Nemotron Super?"], sampling_params=params)
print(single[0].outputs[0].text)

# Batch prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:"
]
outputs = llm.generate(prompts, sampling_params=params)
for i, out in enumerate(outputs):
    print(f"\nPrompt {i+1}: {out.prompt!r}")
    print(out.outputs[0].text)


 It’s a new version of the popular Nemotron font, which was originally designed for the Commodore 64 computer. Version 2.0 of Nemotron Super includes a full set of 96 characters, including uppercase and lowercase letters, numbers, and symbols. The font is designed to look like the text you'd see on a retro computer or video game screen. It’s perfect for creating a nostalgic or vintage look in your designs.

Nemotron Super is a monospaced font, meaning that each character takes up the same amount of horizontal space. This makes it ideal for use in programming, coding, or any situation where alignment is important. The font’s design is inspired by the pixelated text of early computing and gaming, giving it a distinctive and charming aesthetic.

One of the standout features of Nemotron Super is its versatility. It can be used in a variety of contexts, from web design and digital interfaces to print media and logos. The font’s retro style can add a unique




Prompt 1: 'Hello, my name is'
 (Your Name), and I am a [Your Profession/Student/Parent, etc.] from [Your Location]. I am writing to express my strong support for the proposed regulations to [Briefly Mention the Policy or Regulation, e.g., "strengthen background checks for firearm purchases" or "increase funding for renewable energy projects"]. 

I believe that [Policy/Regulation] is a critical step towards [Explain the Main Benefit, e.g., "ensuring public safety" or "addressing climate change"]. As someone who [Personal Connection, e.g., "has been affected by gun violence" or "is passionate about environmental conservation"], I have seen firsthand the importance of [Reiterate the Policy's Goal, e.g., "preventing tragedies" or "promoting clean energy"].

The current [Current Situation, e.g., "lax regulations on gun sales" or "reliance on fossil fuels"] poses significant risks to [Affected Group or Community, e.g.,

Prompt 2: 'The capital of France is'
 Paris, and the capital of Japan i

### Streaming (Pseudo)

vLLM’s offline `LLM.generate` API is designed for high throughput and returns complete `RequestOutput` objects once generation has finished. For true token-by-token streaming, the OpenAI-compatible server (covered in Part 2) is the recommended approach.

However, we can simulate streaming by iterating through the characters of the final generated text. This shows the output progressively in a notebook, but every token has already been generated before the first character is printed, so it does not reflect true streaming inference.


In [4]:
def stream_like(prompt: str, llm: LLM, sampling_params: SamplingParams) -> None:
    outputs = llm.generate([prompt], sampling_params=sampling_params)
    text = outputs[0].outputs[0].text
    print("Response:", end=" ")
    for ch in text:
        print(ch, end="", flush=True)
    print()

stream_like("Write a haiku about GPUs.", llm, SamplingParams(temperature=0.7, max_tokens=80))



Response:  Here are some ideas to consider: their speed, power consumption, heat generation, parallel processing capabilities, use in gaming, scientific computing, or machine learning.

Silent, swift and hot,
GPUs crunch numbers with flair,
Lighting up the screen.

Another one:

Circuits blaze with speed,
Billions of cores work as one,
Gaming's heart beats fast.

And another:

Cool


---

## Part 2: OpenAI-Compatible Server

vLLM offers an OpenAI-compatible server that allows you to use familiar tools like the OpenAI Python client and `curl`. This is the recommended way to use features like chat templates, streaming, and tool calling.

### Launch Server

Run the following command in your terminal to start the server.

```bash
python -m vllm.entrypoints.openai.api_server \
    --model "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5" \
    --dtype bfloat16 \
    --trust-remote-code \
    --served-model-name nemotron \
    --host 0.0.0.0 \
    --port 5000 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
```

### Client Setup

In [None]:
from openai import OpenAI

# This assumes the server is running on localhost:5000
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="dummy")

### Chat and Streaming


In [None]:
# Simple chat completion
resp = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about vLLM."}
    ],
    temperature=0.6,
    max_tokens=256,
)
print("--- Simple Chat Response ---")
print(resp.choices[0].message.content)

# Streaming chat completion
print("\n--- Streaming Chat Response ---")
stream = client.chat.completions.create(
    model="nemotron",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a short poem about GPUs."}
    ],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
print()


<think>
Okay, the user is asking for three bullet points about vLLM. Let me start by recalling what vLLM is. I know it's a framework for running large language models efficiently. The main points should highlight its key features.

First, I remember that vLLM is designed for high throughput and low latency. That's important for applications needing quick responses. So maybe the first bullet can be about efficient inference with techniques like PagedAttention.

Second, it's built on a modular architecture. This allows for customization, like integrating different models or backends. That's a good point for developers who need flexibility.


<think>
Okay, the user wants a short poem about GPUs. Let me start by recalling what GPUs are. They're graphics processing units, right? Used for rendering images, video, and also for parallel computing tasks like machine learning.

Hmm, I need to make the poem engaging and not too technical. Maybe focus on their speed, power, and applications. Words

### Reasoning Modes (`think` vs. `no_think`)


In [None]:
# Reasoning ON (default)
reasoning_prompt = "I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?"
messages_on = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user", "content": reasoning_prompt},
]
print("--- Reasoning ON ---")
response_on = client.chat.completions.create(
    model="nemotron",
    messages=messages_on,
    temperature=0.0,
    max_tokens=512,
)
print(response_on.choices[0].message.content)


# Reasoning OFF using /no_think
messages_off = [
    {"role": "system", "content": "You are a helpful reasoning assistant.\n/no_think"},
    {"role": "user", "content": reasoning_prompt},
]
print("\n--- Reasoning OFF ---")
response_off = client.chat.completions.create(
    model="nemotron",
    messages=messages_off,
    temperature=0.0,
    max_tokens=256,
)
print(response_off.choices[0].message.content)
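
With reasoning on, the response text contains the `<think>` block followed by the answer, so applications often want to separate the two. A minimal sketch, assuming the reasoning is wrapped in literal `<think>...</think>` tags as in the outputs shown earlier. `split_reasoning` is a hypothetical helper, and it also handles a block left unclosed because generation stopped at `max_tokens`.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer)."""
    match = _THINK_RE.search(text)
    if match:
        reasoning = match.group(1).strip()
        answer = (text[: match.start()] + text[match.end():]).strip()
        return reasoning, answer
    if "<think>" in text:
        # Unclosed block: output was cut off before </think>, so no answer yet.
        return text.split("<think>", 1)[1].strip(), ""
    # No reasoning block at all: the whole text is the answer.
    return "", text.strip()


reasoning, answer = split_reasoning(
    "<think>\n5 - 2 = 3, then 3 + 3 = 6.\n</think>\n\nYou have 6 apples."
)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

An empty answer is a useful signal that `max_tokens` was too small for the reasoning to finish.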


### Interaction with `curl`

You can also interact with the server directly using `curl`.

**Chat completion:**
```bash
curl -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.0
  }'
```

**Streaming chat completion:**
```bash
curl -N -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [
      {"role": "user", "content": "Write a short poem about GPUs."}
    ],
    "stream": true
  }'
```
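
The streaming request returns server-sent events: each event is a line beginning with `data: ` that carries a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below shows one way to pull the text deltas out of such lines. `iter_sse_content` is a hypothetical helper, and the sample lines are hand-written to match that format, not captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style streaming `data:` lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                yield content


# Hand-written sample in the streaming format (not real server output).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Silicon "}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "dreams"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```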


### Cleanup

To stop the OpenAI-compatible server, press `CTRL+C` in the terminal where it is running.


---
## Resource Notes

- **Hardware**: Nemotron-49B-v1.5 is a large model. For optimal performance, running on a multi-GPU setup with high-speed interconnects (like NVLink) is recommended.
- **Quantization**: vLLM supports various quantization techniques that can significantly reduce the memory footprint of the model, allowing it to run on smaller GPUs.
- **Chat Templates**: When using the OpenAI-compatible server, vLLM automatically applies the correct chat template for the model, which is crucial for getting properly formatted and accurate responses in conversational tasks.
- **Tool Calling**: `--enable-auto-tool-choice` lets the model decide when to emit a tool call, and `--tool-call-parser llama3_json` selects the parser that converts those calls into OpenAI-style `tool_calls` in the response.
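
To make the tool-calling note concrete, here is a minimal sketch of the client side: a tool described in the OpenAI `tools` schema and a dispatcher that executes a call the model requested. The tool `get_gpu_count`, its data, and the helper names are hypothetical and exist only for illustration. With a live server the schema would be passed as `tools=TOOLS` to `client.chat.completions.create`, and each entry of `message.tool_calls` supplies `function.name` and `function.arguments` (a JSON string).

```python
import json


def get_gpu_count(node: str) -> int:
    """Hypothetical tool: look up the GPU count for a named node."""
    return {"node-a": 8, "node-b": 4}.get(node, 0)


# Tool description in the OpenAI function-calling schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_gpu_count",
            "description": "Return the number of GPUs on a named node.",
            "parameters": {
                "type": "object",
                "properties": {"node": {"type": "string"}},
                "required": ["node"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_gpu_count": get_gpu_count}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the requested tool and return its result as a JSON string."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**kwargs))


# Simulated call, shaped like tool_call.function.name / .arguments.
print(dispatch_tool_call("get_gpu_count", '{"node": "node-a"}'))
```

The returned string would then be sent back to the model in a message with role `tool` so it can compose the final answer.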

## Conclusion

In this notebook, you have learned how to:
- Run inference with the Nemotron-49B-v1.5 model using the vLLM Python API.
- Deploy the model as an OpenAI-compatible server.
- Interact with the server using both the OpenAI Python client and `curl` for chat and streaming.
- Switch between the model's `think` and `/no_think` reasoning modes through the system prompt.

This notebook provides a solid foundation for building applications with Nemotron and vLLM.
