# Deploying NVIDIA Nemotron-3-Nano with SGLang

This notebook walks you through running the `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B` model with SGLang.

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.

For more details, see the [model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8).

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 60 GB VRAM for BF16, ≥ 32 GB for FP8) and CUDA 12.x
- Python 3.10+
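
As a rough sketch of how the VRAM thresholds above map to a checkpoint choice, the hypothetical helper `pick_variant` below (not part of SGLang or the model card) takes the available GPU memory in GB and returns the variant that should fit. With PyTorch installed, the memory figure could come from `torch.cuda.get_device_properties(0).total_memory / 1e9`. The thresholds are the approximate ones listed above, and a single GPU is assumed.

```python
from typing import Optional

# Thresholds taken from the prerequisites above (GB of GPU memory).
BF16_MIN_GB = 60
FP8_MIN_GB = 32


def pick_variant(vram_gb: float) -> Optional[str]:
    """Return the checkpoint suffix that should fit, or None if neither does."""
    if vram_gb >= BF16_MIN_GB:
        return "BF16"
    if vram_gb >= FP8_MIN_GB:
        return "FP8"
    return None


for vram in (80, 48, 24):
    print(f"{vram} GB -> {pick_variant(vram)}")
```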

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide. 

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ikQZX0ZDTSCGE7YkqxiOKwKsj) 

In [1]:
# Run this only if pip is not found
!python -m ensurepip --default-pip

Looking in links: /tmp/tmp3hfrr9so


## Install dependencies

In [None]:
%pip install sglang torch openai transformers

## Verify GPU

Confirm CUDA is available and your GPU is visible to PyTorch.


In [3]:
# GPU environment check
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")

CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H100 80GB HBM3


## Start SGLang Server

SGLang runs as a separate server process. 

Before starting the server, ensure that your notebook and terminal are in the same virtual environment.

Within Brev, open a terminal and run:
```shell
source /home/shadeform/.venv/bin/activate
```

Then, choose the desired model (FP8 or BF16) and run the following command in the terminal.

### Load the BF16 version

```shell
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--host 0.0.0.0 --port 5000 --log-level warning --trust-remote-code --tool-call-parser qwen3_coder --reasoning-parser deepseek-r1
```

### Alternative: Load the FP8 quantized version for faster inference and lower memory usage

```shell
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--host 0.0.0.0 --port 5000 --log-level warning --trust-remote-code --tool-call-parser qwen3_coder --reasoning-parser deepseek-r1
```
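
Loading the weights can take a while, so it may help to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route with the standard library. The helper names `models_endpoint_ok` and `wait_for_server`, the timeout values, and the choice of `/v1/models` as a readiness signal are assumptions for illustration, not an SGLang-provided API.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def models_endpoint_ok(base_url: str) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(
    base_url: str = "http://localhost:5000/v1",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Poll until the probe succeeds or timeout_s elapses."""
    probe = probe or models_endpoint_ok
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server()` in a notebook cell returns `True` once the server responds, or `False` if the timeout expires first.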

## Generate responses


In [1]:
## Client Setup
from openai import OpenAI

# The model name used when launching the server (the --model-path value).
# Use the -FP8 name instead if you launched the FP8 checkpoint.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"  # SGLang server doesn't require an API key by default

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
print(f"OpenAI client configured to use server at: {BASE_URL}")
print(f"Using model: {SERVED_MODEL_NAME}")

OpenAI client configured to use server at: http://localhost:5000/v1
Using model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16


### Simple vs streamed generation


In [2]:
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."}
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)
print("\n")

User wants 3 bullet points about SGLang. Provide concise bullet points. Probably about SGLang being a programming language for tensor parallelism, etc. Provide three bullet points.
 - **High‑performance tensor‑parallel programming** – SGLang provides a Python‑embedded DSL that lets you write models once and automatically generate optimized kernels for CPUs, GPUs, and TPUs, handling parallelism and memory layout behind the scenes.  
- **Unified graph‑level and operator‑level optimizations** – It analyses the entire computation graph to fuse operators, schedule overlapping work, and select the best parallelism strategy (e.g., data‑parallel, model‑parallel, or hybrid) without manual tuning.  
- **Seamless integration with existing frameworks** – SGLang works with PyTorch, TensorFlow, and JAX models, offering drop‑in acceleration through simple annotations or a thin wrapper, enabling developers to speed up inference/training without rewriting their code.




In [3]:
# Streaming chat completion
stream = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"}
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)


The first 5 prime numbers are:

1. **2** (the only even prime number)  
2. **3**  
3. **5**  
4. **7**  
5. **11**

### Why these are prime:
- **2**: Divisible only by 1 and 2.  
- **3**: Divisible only by 1 and 3.  
- **5**: Divisible only by 1 and 5.  
- **7**: Divisible only by 1 and 7.  
- **11**: Divisible only by 1 and 11.  

Numbers like 4, 6, 8, 9, and 10 are excluded because they have divisors other than 1 and themselves (e.g., 4 = 2×2, 6 = 2×3). Let me know if you'd like further clarification! 😊
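
The streaming loop above prints only `delta.content`. As a sketch, the hypothetical helper `collect_stream` below accumulates the streamed pieces into two strings, one for reasoning and one for the answer. It assumes the server may expose streamed reasoning as `delta.reasoning_content` when a reasoning parser is enabled. The field is read with `getattr` so the code still works if it is absent. The stand-in chunks only mimic the attribute layout of real chunks so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, str]:
    """Concatenate streamed deltas into (reasoning, answer) strings."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        # reasoning_content is an assumption: it may be missing or None.
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def _chunk(**delta):
    """Build a stand-in chunk with the same attribute layout as a real one."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta))])


fake_stream = [
    _chunk(reasoning_content="Primes: 2, 3, 5, 7, 11."),
    _chunk(content="2, 3, "),
    _chunk(content="5, 7, 11"),
]
reasoning, answer = collect_stream(fake_stream)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `fake_stream`.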

### Reasoning

Note: The model supports two modes: reasoning ON (the default) and reasoning OFF. To turn reasoning off, set `enable_thinking` to `False` in `chat_template_kwargs`, as shown below.

In [4]:
# Reasoning on (default)
print("Reasoning on")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    temperature=1,
    max_tokens=256,
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)
print("\n")
# Reasoning off
print("Reasoning off")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me 3 interesting facts about SGLang."}
    ],
    temperature=0,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)

Reasoning on
We need to output a haiku about GPUs. Probably 5-7-5 syllable structure. Something like "Silicon hearts beat fast / Parallel dreams in silicon / Gleam of shader light". Count syllables.

First line 5 syllables: "Silicon hearts beat fast" = Si-li-con (3) hearts (1) beat (1) fast (1) = 6? Let's count: Si-li-con (3), hearts (1) =4, beat (1)=5, fast (1)=6. Too many. Maybe "Silicon hearts beat" = Silicon (3) hearts (1) beat (1) =5 good.

Second line 7 syllables: "Parallel dreams take flight" = Par-al-lel (3) dreams (1)=4, take (1)=5, flight (1)=6. Need 7. Could be "Parallel dreams take flight" actually count: Par(1) al(2) lel(3) -> maybe 3? Actually "parallel" is 3 syllables (par-al-lel). So 3 + 1 (dreams) = 4, take 1 =5, flight 1 =6. Need 7. Add "high": "Parallel None


Reasoning off
Here are 3 interesting facts about **SGLang** (a programming language designed for efficient, scalable AI inference and training, particularly for large language models):

1. **Hardware-Aware Comp

### Tool calling

Describe functions with the OpenAI tools schema and inspect the `tool_calls` the model returns.

In [7]:
# Tool calling via OpenAI tools schema
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

completion = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
    ],
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False
)

print(completion.choices[0].message.reasoning_content)
print(completion.choices[0].message.tool_calls)

Okay, the user wants to calculate a 15% tip on a $50 bill. Let me check the tools available. There's a calculate_tip function that takes bill_total and tip_percentage. The parameters are both required. The bill is $50, so bill_total is 50. The tip percentage is 15. I need to call the function with these values. Let me make sure the parameters are integers. Yes, 50 and 15 are both integers. So the tool call should be calculate_tip with arguments bill_total=50 and tip_percentage=15. That should give the tip amount.

[ChatCompletionMessageFunctionToolCall(id='call_69429ca05ecc4764a8e63ffa', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function', index=-1)]
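
The model only returns a request to call `calculate_tip`; running the function is left to the client. The sketch below shows one way to do that, following the general OpenAI tools flow: parse the JSON `arguments` string, call a local function, and wrap the result in a `tool` role message. The local `calculate_tip` body, the `LOCAL_TOOLS` registry, and `run_tool_call` are assumptions for illustration, since the notebook defines only the schema.

```python
import json
from typing import Any, Dict


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Hypothetical local implementation matching the schema in TOOLS."""
    return bill_total * tip_percentage / 100


# Map tool names from the schema to local callables.
LOCAL_TOOLS = {"calculate_tip": calculate_tip}


def run_tool_call(call_id: str, name: str, arguments: str) -> Dict[str, Any]:
    """Execute one tool call and return a `tool` role message with the result."""
    kwargs = json.loads(arguments)  # the arguments arrive as a JSON string
    result = LOCAL_TOOLS[name](**kwargs)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Values copied from the tool call printed above.
tool_message = run_tool_call(
    call_id="call_69429ca05ecc4764a8e63ffa",
    name="calculate_tip",
    arguments='{"bill_total": 50, "tip_percentage": 15}',
)
print(tool_message)
```

To get a final natural-language answer, the assistant message holding the tool call and then `tool_message` would be appended to `messages` before calling `client.chat.completions.create` again. That second round trip has not been verified against this model here.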


### Controlling Reasoning Budget

The `reasoning_budget` parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character. 

If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at `reasoning_budget + 500` tokens to prevent excessive generation.
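
To make the cutoff rule concrete, the sketch below applies it to a list of already-decoded token strings. It only illustrates the rule as worded above. The function `reasoning_cutoff` and the token-list representation are hypothetical, not taken from SGLang or the model's implementation.

```python
from typing import Sequence

GRACE_TOKENS = 500  # hard-stop margin described above


def reasoning_cutoff(tokens: Sequence[str], reasoning_budget: int) -> int:
    """Return how many reasoning tokens are kept under the rule above.

    `tokens` is a hypothetical list of decoded token strings.
    """
    if len(tokens) <= reasoning_budget:
        return len(tokens)  # reasoning finished within the budget
    hard_limit = min(len(tokens), reasoning_budget + GRACE_TOKENS)
    for i in range(reasoning_budget, hard_limit):
        if "\n" in tokens[i]:
            return i + 1  # stop right after the first newline past the budget
    return hard_limit  # no newline found: forced termination


# Newline two tokens past the budget: the trace ends there.
print(reasoning_cutoff(["a"] * 10 + ["\n"] + ["b"] * 5, reasoning_budget=8))
# No newline at all: the trace is cut at reasoning_budget + 500.
print(reasoning_cutoff(["a"] * 1000, reasoning_budget=100))
```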


In [None]:
from typing import Any, Dict, List
import openai
from transformers import AutoTokenizer


class ThinkingBudgetClient:
    def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
        self.base_url = base_url
        self.api_key = api_key
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

    def chat_completion(
        self,
        model: str,
        messages: List[Dict[str, Any]],
        reasoning_budget: int = 512,
        max_tokens: int = 1024,
        **kwargs,
    ) -> Dict[str, Any]:
        assert (
            max_tokens > reasoning_budget
        ), f"reasoning_budget must be smaller than max_tokens. Given {max_tokens=} and {reasoning_budget=}"

        # 1. first call chat completion to get reasoning content
        response = self.client.chat.completions.create(
            model=model, 
            messages=messages, 
            max_tokens=reasoning_budget, 
            **kwargs
        )
        
        reasoning_content = response.choices[0].message.reasoning_content or ""
        
        if "</think>" not in reasoning_content:
            # No closing tag in the returned reasoning: end it with a period and an explicit </think>
            reasoning_content = f"{reasoning_content}.\n</think>\n\n"
        
        reasoning_tokens_used = len(
            self.tokenizer.encode(reasoning_content, add_special_tokens=False)
        )
        remaining_tokens = max_tokens - reasoning_tokens_used
        
        assert (
            remaining_tokens > 0
        ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase max_tokens or lower reasoning_budget."

        # 2. append reasoning content to a copy of messages (the caller's list is left untouched) and call completion
        messages = messages + [{"role": "assistant", "content": reasoning_content}]
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            continue_final_message=True,
        )
        
        response = self.client.completions.create(
            model=model, 
            prompt=prompt, 
            max_tokens=remaining_tokens, 
            **kwargs
        )

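        response_data = {
            # removesuffix drops the literal "</think>" tag; str.strip("</think>") would strip any of those characters
            "reasoning_content": reasoning_content.strip().removesuffix("</think>").strip(),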
            "content": response.choices[0].text,
            "finish_reason": response.choices[0].finish_reason,
        }
        return response_data

In [9]:
# Client
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
client = ThinkingBudgetClient(
    base_url="http://127.0.0.1:5000/v1",
    api_key="null",
    tokenizer_name_or_path=SERVED_MODEL_NAME
)

In [10]:
resp = client.chat_completion(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    temperature=1,
    max_tokens=512,
    reasoning_budget=128
)
print("Reasoning:", resp["reasoning_content"], "\nContent:", resp["content"])

Reasoning: We need to comply with policy. It's a simple request: write a haiku about GPUs. No problem. Provide a haiku (5-7-5 syllable). Should be about GPUs. Just produce.

We'll output a haiku. Make sure it's correct syllable count. Example:

Silicon dreams hum (5) – Let's count: Sil-i-con (3) dreams (1) hum (1) =5. Next line 7 syllables: "shaders paint scenes of light" count: sha-ders (2) paint (1) scenes (1) of (1) light (1. 
Content: 
Silicon dreams hum  
Shaders paint scenes of bright light  
Pixels rise, swift and clear
