# Running Nemotron-49B-v1.5 with SGLang on NVIDIA GPUs

This notebook shows how to run the `Nemotron-49B-v1.5` model (`nvidia/Llama-3_3-Nemotron-Super-49B-v1_5`) with SGLang's high-performance, OpenAI-compatible server.

This notebook will cover:
- Setting up the SGLang server for the Nemotron model.
- Performing basic and streaming chat completions.
- Running batch inference for multiple prompts.
- Showcasing the model's `think` vs `no_think` reasoning modes.

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)

- Model card: [nvidia/Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)


## Table of Contents
- [Prerequisites](#Prerequisites)
- [Setup](#Setup)
- [Start SGLang Server](#Start-SGLang-Server)
- [Client Setup](#Client-Setup)
- [Showcasing Reasoning Modes: `think` vs. `no_think`](#Showcasing-Reasoning-Modes:-`think`-vs.-`no_think`)
- [Chat Completion Examples](#Chat-Completion-Examples)
- [Batching](#Batching)
- [Asynchronous Batching](#Asynchronous-Batching)
- [Direct Interaction with `curl`](#Direct-Interaction-with-`curl`)
- [Resource Notes](#Resource-Notes)
- [Conclusion and Next Steps](#Conclusion-and-Next-Steps)


## Prerequisites

**Hardware:** This notebook is intended for a machine with at least **2 GPUs** and sufficient VRAM to hold the 49B-parameter model. The server launch command below does not pass `--tp` (tensor parallelism), so add that flag, or adjust it, to match the number of GPUs you want to use.

**Software:**
- Python 3.10+
- CUDA 12.x
- PyTorch 2.3+
- SGLang 0.5.3rc0 or later (installed in the Setup cell below)
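
To make the `--tp` adjustment mentioned above concrete, here is a minimal sketch that assembles the launch command as a string. The helper name `build_launch_command` and its defaults are illustrative and not part of SGLang; only the flags themselves (`--model-path`, `--host`, `--tp`, `--port`, `--trust-remote-code`) are SGLang server options.

```python
import shlex


def build_launch_command(model_path, tp=1, host="0.0.0.0", port=None):
    """Assemble an `sglang.launch_server` command line as one shell-safe string.

    Hypothetical helper for illustration; `tp` is the tensor-parallel degree.
    """
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--tp", str(tp),
        "--trust-remote-code",
    ]
    if port is not None:
        # Only pin the port when the caller asks for a specific one
        args += ["--port", str(port)]
    return shlex.join(args)


print(build_launch_command("nvidia/Llama-3_3-Nemotron-Super-49B-v1_5", tp=2))
```

The resulting string can be pasted into a terminal or passed to the launch helper used later in this notebook. Whether a given `tp` value works depends on the model architecture and the available GPUs, so treat the value as something to verify on your own hardware.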


## Setup


In [None]:
# Install SGLang and useful extras (run once per env)
%pip install --upgrade pip
%pip install uv
%uv pip install "sglang[all]>=0.5.3rc0"

In [None]:
# GPU environment check
import torch
import platform

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU[{i}]: {props.name} | SM count: {props.multi_processor_count} | Mem: {props.total_memory / 1e9:.2f} GB")


CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H200


## Start SGLang Server

SGLang runs as a separate server process. The following cell starts the server. You can also run this command in a terminal.

In [None]:
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

# This is equivalent to running the following command in your terminal
# python -m sglang.launch_server --model-path "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5" --host 0.0.0.0 --trust-remote-code

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
 --host 0.0.0.0 --log-level warning --trust-remote-code
"""
)

wait_for_server(f"http://localhost:{port}")


## Client Setup

In [None]:
from openai import OpenAI
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"

BASE_URL = f"http://localhost:{port}/v1"
API_KEY = "EMPTY"  # SGLang server doesn't require an API key by default

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
print(f"OpenAI client configured to use server at: {BASE_URL}")
print(f"Using model: {SERVED_MODEL_NAME}")

## Showcasing Reasoning Modes: `think` vs. `no_think`

As described in the [model card](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5), this model has two distinct reasoning modes.

1.  **Reasoning ON (`think` mode):** This is the default mode. The model first generates a `<think>` block where it outlines its step-by-step reasoning process before providing the final answer. This is ideal for complex, multi-step problems.
2.  **Reasoning OFF (`no_think` mode):** This mode is activated by adding `/no_think` to the system prompt. The model provides a direct, concise answer without the preceding thought process. This is better for simple, instruction-following tasks where latency is a concern.
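
The mode switch described in the list above can be wrapped in a small helper. This is only a sketch: `build_messages` is a hypothetical name, and the default system prompt is simply the one used in the cells of this section.

```python
def build_messages(user_prompt, reasoning=True,
                   system_prompt="You are a helpful reasoning assistant."):
    """Return a chat `messages` list; appends /no_think when reasoning is off."""
    if not reasoning:
        # The model reads the /no_think marker from the system prompt
        system_prompt = f"{system_prompt}\n/no_think"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


print(build_messages("How many apples do I have?", reasoning=False)[0]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.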

Let's see this in action with a simple reasoning problem.

### 1. Reasoning ON (Default Behavior)

We'll send a multi-step problem with a standard system prompt. We expect to see the model's thought process.


In [None]:
reasoning_prompt = "I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?"

print("--- Sending prompt with Reasoning ON ---")
response_on = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful reasoning assistant."},
        {"role": "user", "content": reasoning_prompt}
    ],
    temperature=0.0,
    max_tokens=512,
)

print("\n--- Response with Reasoning ON ---")
print(response_on.choices[0].message.content)


--- Sending prompt with Reasoning ON ---

--- Response with Reasoning ON ---
<think>
Okay, let's see. The problem is: I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?

Alright, starting with 5 apples. So the initial number is 5. Then I eat 2. Eating apples would mean subtracting them from the total, right? So 5 minus 2. Let me write that down: 5 - 2. That should be 3. So after eating 2, I have 3 apples left.

Then, my friend gives me 3 more. So adding 3 to the current number. The current number after eating is 3, so adding 3 would be 3 + 3. That equals 6. So putting it all together: start with 5, subtract 2, add 3. So 5 - 2 + 3. Let me check the order of operations here. Subtraction and addition are at the same level, so we do them left to right. 5 - 2 is 3, then 3 + 3 is 6. Yep, that seems right.

Wait, but sometimes people might get confused if there's a different order, but in this case, the operations are straightforward. So the answer shoul

### 2. Reasoning OFF (using `/no_think`)

Now we send the exact same prompt, but this time we add `/no_think` to the system message. We expect a direct answer without the `<think>` block.

In [None]:

print("--- Sending prompt with Reasoning OFF ---")
response_off = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful reasoning assistant.\n/no_think"},
        {"role": "user", "content": reasoning_prompt}
    ],
    temperature=0.0,
    max_tokens=512,
)

print("\n--- Response with Reasoning OFF ---")
print(response_off.choices[0].message.content)


--- Sending prompt with Reasoning OFF ---

--- Response with Reasoning OFF ---
Let's break it down step by step:

1. **Start with**: 5 apples  
2. **Eat 2**: 5 - 2 = 3 apples left  
3. **Friend gives you 3 more**: 3 + 3 = 6 apples  

**Final answer**: You now have **6 apples**. 🍎🍎🍎🍎🍎🍎
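
Because `think` mode returns the reasoning and the answer in a single string, client code often needs to separate the two. The sketch below is one way to do that with plain string operations; `split_reasoning` is a hypothetical helper, and it assumes the reasoning is wrapped in `<think>` and `</think>` as in the output shown earlier. It also covers the case where generation stops at `max_tokens` before the closing tag appears.

```python
def split_reasoning(text):
    """Split a response into (reasoning, answer).

    Covers three cases: no <think> block, a complete block, and a block left
    unclosed because generation was cut off. Text before <think> is ignored.
    """
    start_tag, end_tag = "<think>", "</think>"
    start = text.find(start_tag)
    if start == -1:
        return "", text.strip()
    body_start = start + len(start_tag)
    end = text.find(end_tag, body_start)
    if end == -1:
        # Truncated output: everything after <think> is reasoning, no answer yet
        return text[body_start:].strip(), ""
    return text[body_start:end].strip(), text[end + len(end_tag):].strip()


reasoning, answer = split_reasoning("<think>\n5 - 2 + 3 = 6\n</think>\n\nYou have 6 apples.")
print(answer)
```

An empty `answer` with non-empty `reasoning` is a hint that `max_tokens` was too small for the model to finish thinking.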


## Chat Completion Examples


### Basic and Streaming Chat Completion


In [None]:

print("=== Simple Chat Completion ===")
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."}
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.content)
print("\n")  # Add a blank line for clarity

# Streaming chat completion
print("=== Streaming Chat Completion ===")
stream = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a short poem about GPUs."}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
print("\n")  # Add a blank line after streaming output


=== Simple Chat Completion ===


<think>
Okay, the user is asking for three bullet points about SGLang. First, I need to recall what SGLang is. From what I remember, SGLang might be related to programming or a specific language. Wait, SGLang could stand for something like "Simple Graphics Language" or maybe it's a domain-specific language. Let me check my knowledge base.

Hmm, I think SGLang is a lightweight programming language designed for educational purposes. It's used to teach programming concepts, especially in the context of graphics or game development. It might have a simple syntax to make it accessible for beginners. Also, SGLang could be used in specific educational platforms or tools. Another point might be that it's used for creating simple games or animations, which helps students grasp programming fundamentals through interactive projects. 

Wait, I should make sure I'm not confusing it with another language. Let me think. There's also a possibility that SGLang refers to a language used in a particular 
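
The streaming loop above prints each delta as it arrives. When the full text is also needed afterwards, the deltas can be accumulated. The following is a minimal sketch: `collect_stream` is a hypothetical helper, and the stand-in chunks built with `SimpleNamespace` only mimic the `choices[0].delta.content` shape of the client's stream events so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream, echo=False):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
            if echo:
                print(content, end="", flush=True)
    return "".join(parts)


# Offline demo with stand-in chunks (assumed shape, not real server output)
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
    for t in ["Hello", ", ", "GPU", None]
]
print(collect_stream(fake_stream))  # Hello, GPU
```

With a live server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` can be passed in place of `fake_stream`.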

## Batching

In [None]:
# Batch chat prompts: requests are sent one after another,
# reusing the `client` created in Client Setup

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:"
]

# Convert to messages for chat.completions
messages_list = [[{"role": "user", "content": p}] for p in prompts]

responses = []
for messages in messages_list:
    out = client.chat.completions.create(
        model=SERVED_MODEL_NAME,
        messages=messages,
        temperature=0.6,
        max_tokens=512,
    )
    responses.append(out.choices[0].message.content)

for i, (p, r) in enumerate(zip(prompts, responses), start=1):
    print(f"\nPrompt {i}: {p!r}")
    print(r)



Prompt 1: 'Hello, my name is'
<think>
Okay, the user started with "Hello, my name is" and then the message was cut off. I need to figure out how to respond appropriately. Since the name wasn't provided, maybe they intended to type their name but got interrupted or there was a technical issue. I should acknowledge their greeting and invite them to complete their message. Let me make sure to keep the response friendly and open-ended so they feel comfortable finishing their thought. Something like, "Hello! It seems like your message might have been cut off. Could you please share your name with me?" That should work. I should check for any typos and ensure the tone is welcoming.
</think>

Hello! It seems like your message might have been cut off. Could you please share your name with me? I'd love to greet you properly! 😊

Prompt 2: 'The capital of France is'
<think>
Okay, so the user asked, "The capital of France is," and then left it open. I need to figure out the correct answer here

### Asynchronous Batching

In [None]:
import asyncio
from openai import AsyncOpenAI

# Use the async client for concurrent requests
async_client = AsyncOpenAI(base_url=BASE_URL, api_key=API_KEY)

async def get_completion(messages):
    """A helper function to get a single completion asynchronously."""
    return await async_client.chat.completions.create(
        model=SERVED_MODEL_NAME,
        messages=messages,
        temperature=0.6,
        max_tokens=512,
    )

async def main():
    # Create a list of tasks for all our prompts
    tasks = [get_completion(msg) for msg in messages_list]

    # Run all tasks concurrently and wait for them all to complete
    print("--- Sending batch requests concurrently ---")
    all_responses = await asyncio.gather(*tasks)
    print("--- All responses received ---")

    # Extract the content from each response
    responses_content = [resp.choices[0].message.content for resp in all_responses]

    # Print the results
    for i, (p, r) in enumerate(zip(prompts, responses_content), start=1):
        print(f"\nPrompt {i}: {p!r}")
        print(r)

# Run the asynchronous main function.
# Jupyter cells already run inside an event loop, so a top-level `await` works here;
# in a standalone script, call `asyncio.run(main())` instead.
await main()


--- Sending batch requests concurrently ---


--- All responses received ---

Prompt 1: 'Hello, my name is'
<think>
Okay, the user started with "Hello, my name is" but didn't finish. I need to respond appropriately. Since they mentioned their name, I should ask them to provide the rest. Maybe they got cut off or are testing the system. I should keep it friendly and encouraging. Let me make sure to prompt them to complete their name so I can address them properly. Also, check for any typos or if they intended to write more. But since the message is cut off, the best approach is to ask for the rest of their name. Keep the response simple and welcoming.
</think>

Hello! It seems like your message got cut off. Could you please share the rest of your name with me? I'd love to know what to call you as we chat! 😊

Prompt 2: 'The capital of France is'
<think>
Okay, so the user asked, "The capital of France is," and I need to figure out the answer. Let me start by recalling what I know about France. France is a country in Europe, know
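
`asyncio.gather` starts every request at once, which is fine for three prompts but may be more than you want for a large list. A common pattern is to cap the number of requests in flight with a semaphore. The sketch below illustrates that pattern; `gather_limited` is a hypothetical helper, and the `limit` value is an arbitrary example rather than a tuned recommendation.

```python
import asyncio


async def gather_limited(coro_factories, limit=4):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results are returned in the same order as the input factories.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:
            # The coroutine is only created once a slot is free
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))


# Usage in this notebook (needs the running server and `get_completion`):
# results = await gather_limited(
#     [lambda m=m: get_completion(m) for m in messages_list], limit=2
# )
```

Factories are used instead of ready-made coroutines so that no request is started before it acquires a slot.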

In [None]:
# Stop the server process.
# Run this cell last: the curl example in the next section still needs the server running.
if 'server_process' in globals() and server_process.poll() is None:
    server_process.terminate()
    server_process.wait()
    print("SGLang server stopped.")
else:
    print("No running server process found to terminate.")


## Direct Interaction with `curl`

For debugging or for use in environments where the OpenAI Python client is not available, you can interact with the SGLang server directly using `curl`.

The example below shows how to construct and execute a `curl` command to get a chat completion. We use Python's `subprocess` module to run the command and `json` to parse the output.


In [None]:
import subprocess, json

# Construct the JSON payload as a Python dictionary first
payload = {
    "model": SERVED_MODEL_NAME,
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.0
}

# Convert the dictionary to a JSON string
payload_str = json.dumps(payload)

# Form the curl command
# Note: Using f-strings and subprocess like this is convenient for notebooks,
# but be cautious about shell injection in production environments.
curl_command = f"""
curl -s http://localhost:{port}/v1/chat/completions \\
  -H "Content-Type: application/json" \\
  -d '{payload_str}'
"""

print("--- Executing Curl Command ---")
print(curl_command)

# Execute the command and load the JSON response
response_bytes = subprocess.check_output(curl_command, shell=True)
response = json.loads(response_bytes)

print("\n--- Server Response ---")
print_highlight(response)


--- Executing Curl Command ---

curl -s http://localhost:33272/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5", "messages": [{"role": "user", "content": "What is the capital of France?"}], "temperature": 0.0}'



--- Server Response ---
{'id': 'ef64aa03bc164ca2bd95ab73bd117f6e', 'object': 'chat.completion', 'created': 1758151210, 'model': 'nvidia/Llama-3_3-Nemotron-Super-49B-v1_5', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "<think>\nOkay, so the user is asking for the capital of France. Let me start by recalling what I know about France. France is a country in Europe, known for its rich history, culture, and famous landmarks. The capital is the city where the government is based, right? I think the capital of France is Paris. Wait, but I should make sure I'm not confusing it with other cities. Let me think.\n\nI remember that Paris is a major city in France, famous for the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. But is it definitely the capital? Sometimes countries change their capitals, but I don't think France has done that recently. Let me check some other facts. The president of France, Emmanuel Macron, his official residence is in Paris, I be
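
The shell-injection caveat noted in the cell above can be avoided by passing `curl` its arguments as a list, so no shell parses the payload. This also keeps prompts containing a single quote from breaking the `-d '...'` quoting. The sketch below shows the idea; `build_curl_args` and `post_chat_completion` are hypothetical helper names.

```python
import json
import subprocess


def build_curl_args(url, payload):
    """Return an argv list for curl; no shell is involved, so no quoting is needed."""
    return [
        "curl", "-s", url,
        "-H", "Content-Type: application/json",
        "-d", json.dumps(payload),
    ]


def post_chat_completion(url, payload):
    """Run curl without a shell and parse the JSON reply."""
    result = subprocess.run(
        build_curl_args(url, payload),
        capture_output=True, check=True, text=True,
    )
    return json.loads(result.stdout)


# Usage in this notebook (needs the running server):
# response = post_chat_completion(f"http://localhost:{port}/v1/chat/completions", payload)
```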

## Resource Notes

- **Hardware**: Nemotron-49B-v1.5 is a large model. Multi-GPU tensor parallelism (`--tp`) is highly recommended for acceptable performance.
- **Quantization**: For environments with limited resources, consider using quantized versions of the model if available. These can significantly reduce memory usage at the cost of some accuracy.
- **Network**: Ensure you have sufficient network and disk bandwidth for the initial model download, as the weights are very large.
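
As a rough guide to the hardware and quantization notes above, the memory needed for the weights alone can be estimated from the parameter count and the bytes stored per parameter. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a nominal 49 billion parameters, 2 bytes per parameter for 16-bit weights, and an even split across tensor-parallel ranks. It ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param=2.0, tp=1):
    """Approximate per-GPU memory for model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / tp / 1e9


NOMINAL_PARAMS = 49e9  # assumption: nominal size taken from the model name

print(weight_memory_gb(NOMINAL_PARAMS))                       # 16-bit weights, one GPU
print(weight_memory_gb(NOMINAL_PARAMS, tp=2))                 # 16-bit weights, two GPUs
print(weight_memory_gb(NOMINAL_PARAMS, bytes_per_param=1.0))  # 8-bit weights, one GPU
```

Comparing these figures with the per-GPU memory reported by the environment check gives a first indication of whether `--tp` or a quantized checkpoint is needed.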

## Conclusion and Next Steps
Congratulations! You successfully deployed the `Nemotron-49B-v1.5` model using SGLang.

In this notebook, you have learned how to:
- Set up your environment and install SGLang.
- Launch and manage an OpenAI-compatible SGLang server.
- Perform basic chat, streaming, and batch inference.
- Use the model's different reasoning modes.

You can adapt tensor parallelism, ports, and sampling parameters to your hardware and application needs.
