# Running Nemotron-49B-v1.5 with Hugging Face Transformers on NVIDIA GPUs

This notebook walks through running the `Nemotron-49B-v1.5` model with the Hugging Face Transformers library for direct inference and experimentation.

This notebook will cover:
- Loading the model and tokenizer with optimized configurations.
- Demonstrating the model's reasoning modes (`think` vs `no_think`).
- Basic chat completions and streaming responses.
- Batch processing for multiple prompts.

#### Launch on NVIDIA Brev
You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.

Once deployed, click on the "Open Notebook" button to get started with this guide.

[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-32vt7HcQjCUpafGyquLZwJdIm8F)

- Model card: [nvidia/Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)
- Transformers Docs: [https://huggingface.co/docs/transformers/](https://huggingface.co/docs/transformers/)


## Table of Contents

- [Prerequisites](#Prerequisites)
- [Setup](#Setup)
- [Load Model and Tokenizer](#Load-Model-and-Tokenizer)
- [Helper Functions](#Helper-Functions)
- [Showcasing Reasoning Modes: `think` vs. `no_think`](#Showcasing-Reasoning-Modes:-`think`-vs.-`no_think`)
- [Simple Chat Completion](#Simple-Chat-Completion)
- [Streaming](#Streaming)
- [Batching](#Batching)
- [Resource Notes](#Resource-Notes)
- [Conclusion](#Conclusion)

## Prerequisites

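**Hardware:** This notebook requires a machine with at least **2 NVIDIA GPUs** (or a single GPU with enough memory, such as the NVIDIA H200 used for the recorded outputs below) with sufficient combined VRAM to hold the 49B-parameter model, which needs about 98 GB in `bfloat16`. The model is distributed across the available GPUs automatically by `device_map="auto"`.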

**Software:**
- Python 3.10+
- CUDA 12.x
- PyTorch 2.3+
- Transformers, Accelerate, and other Hugging Face libraries

## Setup


In [None]:
# Install dependencies
%pip install -U "transformers==4.48.3" "accelerate>=1.0.0" "safetensors" "huggingface-hub>=0.25" "bitsandbytes>=0.44.1"
%pip install "flash-attn>=2.6.3" --no-build-isolation

In [None]:
# GPU environment check
import torch
import platform

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU[{i}]: {props.name} | SM count: {props.multi_processor_count} | Mem: {props.total_memory / 1e9:.2f} GB")


Python: 3.10.18
PyTorch: 2.8.0+cu128
CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H200 | SM count: 132 | Mem: 150.02 GB


## Load Model and Tokenizer

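We'll load the model and tokenizer using `AutoModelForCausalLM` and `AutoTokenizer`. The weights are loaded in `bfloat16`, which halves memory use relative to `float32` and is natively supported on modern NVIDIA GPUs. `trust_remote_code=True` allows the custom model code shipped in the model repository to run, and the EOS token doubles as the padding token.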


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

model.eval()

Loading checkpoint shards:   0%|          | 0/21 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.
Device set to use cuda:0


## Helper Functions

We'll define a couple of helper functions to build the prompt and generate text. This will make the subsequent examples cleaner.
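
To make concrete what a chat template does when the prompt is built, here is a minimal sketch that renders messages in a simplified Llama 3 style layout. The function name `render_llama3_style` is hypothetical, and the layout is an assumption based on the general Llama 3 format: the template actually shipped with this model's tokenizer may differ (for example in how it handles the system message or reasoning tags), so use `tokenizer.apply_chat_template` for real inference.

```python
from typing import Dict, List


def render_llama3_style(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages in a simplified Llama 3 style layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that it should write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]
rendered = render_llama3_style(demo_messages)
print(rendered)
```

The `add_generation_prompt` flag mirrors the parameter of the same name in `apply_chat_template`: without the trailing assistant header, the model has no cue that a reply is expected.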


In [2]:
# Helper: build chat prompt using tokenizer's chat template
from typing import List, Dict, Optional

def build_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        return_tensors=None,
    )

# Helper: generate text
@torch.inference_mode()
def generate_text(messages: List[Dict[str, str]], max_new_tokens: int = 512, temperature: float = 0.0, top_p: float = 1.0) -> str:
    prompt = build_prompt(messages)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0.0,
        temperature=temperature if temperature > 0.0 else None,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Return only the generated part by slicing off the prompt tokens.
    # Slicing token ids avoids relying on decode() reproducing the prompt text exactly.
    prompt_len = inputs["input_ids"].shape[1]
    gen_only = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=False)
    return gen_only.strip()


## Showcasing Reasoning Modes: `think` vs. `no_think`

This model supports two modes controlled via the system message:
- Reasoning ON (default): Model emits a `<think>` block followed by the answer.
- Reasoning OFF: Add `/no_think` to the system message for concise answers without `<think>`.
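
Because the reasoning arrives inline in the generated text, downstream code often needs to separate it from the final answer. The following is a minimal sketch of one way to do that. `split_reasoning` is a hypothetical helper, not part of Transformers or of this notebook, and it assumes the output contains at most one `<think>...</think>` block at the start and may end with `<|eot_id|>`.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer).

    If the closing tag is missing (for example because generation stopped at
    max_new_tokens), everything after <think> counts as reasoning and the
    answer is empty.
    """
    open_tag, close_tag, eot = "<think>", "</think>", "<|eot_id|>"
    text = text.strip().removesuffix(eot).strip()
    if not text.startswith(open_tag):
        # No reasoning block: the whole text is the answer.
        return "", text
    reasoning, found, answer = text[len(open_tag):].partition(close_tag)
    if not found:
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


sample = "<think>\n5 - 2 + 3 = 6\n</think>\n\nYou have 6 apples.<|eot_id|>"
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Returning an empty answer for an unterminated block makes truncated generations easy to detect, which is a signal to raise `max_new_tokens`.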


In [4]:
# 1) Reasoning ON (default)
reasoning_prompt = "I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?"
messages_on = [
    {"role": "system", "content": "You are a helpful reasoning assistant."},
    {"role": "user", "content": reasoning_prompt},
]
print("--- Reasoning ON ---")
print(generate_text(messages_on, temperature=0.0, max_new_tokens=512))

# 2) Reasoning OFF using /no_think
messages_off = [
    {"role": "system", "content": "You are a helpful reasoning assistant.\n/no_think"},
    {"role": "user", "content": reasoning_prompt},
]
print("\n--- Reasoning OFF ---")
print(generate_text(messages_off, temperature=0.0, max_new_tokens=256))


--- Reasoning ON ---


<think>
Okay, let's see. The problem is: I have 5 apples. I eat 2, then my friend gives me 3 more. How many apples do I have now?

Alright, starting with 5 apples. So the initial number is 5. Then I eat 2. Eating apples would mean subtracting them from the total, right? So 5 minus 2. Let me write that down: 5 - 2. That should be 3. So after eating 2 apples, I have 3 left.

Then, my friend gives me 3 more. Adding 3 to the current number. So 3 plus 3. That would be 6. Wait, is that right? Let me check again. Starting with 5, subtract 2, which is 3, then add 3. Yes, 3 + 3 equals 6. So the total should be 6 apples.

Hmm, but wait, sometimes there might be tricks in these problems. Let me make sure. The problem says "I eat 2" – does that mean I eat 2 of my own apples? Yes, because it's after stating I have 5. Then the friend gives me 3 more. So no, there's no trick here. It's straightforward subtraction and addition.

Alternatively, could it be interpreted differently? Like, if I eat 2 ap

## Simple Chat Completion


In [6]:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Give me 3 bullet points about SGLang."},
]
print(generate_text(messages, temperature=0.6, max_new_tokens=512))


<think>
Okay, the user is asking for three bullet points about SGLang. First, I need to figure out what SGLang is. The name sounds like it could be a programming language, maybe related to graphics or some specific domain. Let me think. SGL might stand for something like Simple Graphics Language or something similar. Alternatively, it could be a lesser-known or niche language.

Wait, I'm not sure if SGLang is a well-established language. Maybe it's a new or experimental one. Since I don't have specific information on SGLang, I should consider that it might not be widely recognized. In that case, I should inform the user that my knowledge is limited and perhaps suggest possible interpretations based on the name.

Alternatively, maybe SGLang refers to a language used in a specific context, like a scripting language for a particular application. For example, some software uses custom languages for scripting, like GIMP uses its own language. But without more context, it's hard to say.

The

## Streaming

For streaming, we use the `TextIteratorStreamer` from Transformers.
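
The streaming pattern is a producer and a consumer connected by a queue: generation runs in a background thread and pushes text chunks, while the main thread iterates over them as they arrive. The sketch below illustrates that pattern with the standard library only. `QueueStreamer` and `fake_generate` are hypothetical stand-ins written for this illustration, not the Transformers implementation.

```python
import queue
import threading
from typing import Iterator, List


class QueueStreamer:
    """Toy streamer: a producer puts text chunks, a consumer iterates over them."""

    _STOP = object()  # sentinel that marks the end of the stream

    def __init__(self) -> None:
        self._queue = queue.Queue()

    def put(self, chunk: str) -> None:
        self._queue.put(chunk)

    def end(self) -> None:
        self._queue.put(self._STOP)

    def __iter__(self) -> Iterator[str]:
        while True:
            item = self._queue.get()  # blocks until the producer supplies something
            if item is self._STOP:
                return
            yield item


def fake_generate(chunks: List[str], streamer: QueueStreamer) -> None:
    """Hypothetical producer standing in for a generate call that feeds a streamer."""
    for chunk in chunks:
        streamer.put(chunk)
    streamer.end()


demo_streamer = QueueStreamer()
worker = threading.Thread(target=fake_generate, args=(["GPUs ", "hum ", "softly."], demo_streamer))
worker.start()

received = []
for piece in demo_streamer:  # consume chunks while the producer thread is running
    received.append(piece)
worker.join()
print("".join(received))
```

The background thread is what keeps the consumer loop responsive: if generation ran in the main thread, nothing could be printed until it returned.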

In [9]:
from transformers import TextIteratorStreamer
import threading
import torch

messages_stream = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a short poem about GPUs."},
]

prompt_stream = build_prompt(messages_stream)
inputs = tokenizer(prompt_stream, return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

gen_kwargs = dict(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
    use_cache=False,  # Disable cache to avoid in-place operations
)

# Use torch.no_grad() to avoid inference mode issues
def generate_with_no_grad():
    with torch.no_grad():
        model.generate(**gen_kwargs)

thread = threading.Thread(target=generate_with_no_grad)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)

thread.join()
print("\n")


<think>
Okay, the user wants a short poem about GPUs. Let me start by recalling what GPUs are. They're Graphics Processing Units, right? Used for rendering images, video, and parallel processing. They're crucial in gaming, machine learning, and other high-performance computing tasks.

Hmm, I need to make the poem engaging and not too technical. Maybe start with something vivid, like the inside of a computer. Personifying the GPU could work. Words like "whirring heart" or "silicon core" might set the scene.

What do GPUs do? They handle complex calculations, process data in parallel. Maybe mention tasks like rendering worlds or training AI. Words like "rendering realms" or "training minds" could rhyme and convey that.

I should include some imagery related to light and speed. "Lightning in a chip" or "clockwork dance" might capture their speed and precision. Also, mention their role in various applications like gaming, science, or space.

Rhyme scheme? Maybe AABB or ABAB. Let's try quatrains with alt

## Batching

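This section shows two ways to handle multiple prompts: a sequential loop over `generate_text`, and true batching, which pads the prompts to a common length and runs them through `model.generate` together. True batching usually gives higher throughput because each forward pass serves every prompt in the batch.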
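
Padding direction matters for true batching with a decoder-only model, because new tokens are appended after the last position of every row. The sketch below uses made-up token ids and a hypothetical `pad_batch` helper to show the difference: with right padding, the shorter prompt ends in pad tokens, so generated tokens would be separated from the prompt; with left padding, every row ends in a real prompt token.

```python
from typing import List, Tuple


def pad_batch(
    seqs: List[List[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id lists to a common length and return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(s) for s in seqs)
    ids, mask = [], []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        if side == "left":
            ids.append(pad + s)
            mask.append([0] * len(pad) + [1] * len(s))
        else:
            ids.append(s + pad)
            mask.append([1] * len(s) + [0] * len(pad))
    return ids, mask


toy_batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids for two prompts
PAD_ID = 0  # made-up pad id for the illustration

left_ids, left_mask = pad_batch(toy_batch, PAD_ID, side="left")
right_ids, right_mask = pad_batch(toy_batch, PAD_ID, side="right")

print("left :", left_ids)   # every row ends with a real prompt token
print("right:", right_ids)  # the short row ends with pad tokens
```

With a Hugging Face tokenizer, the equivalent setting is `tokenizer.padding_side = "left"` before tokenizing the batch.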

In [11]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:"
]

for i, p in enumerate(prompts, start=1):
    messages = [{"role": "user", "content": p}]
    out = generate_text(messages, temperature=0.6, max_new_tokens=512)
    print(f"\nPrompt {i}: {p!r}")
    print(out)


Prompt 1: 'Hello, my name is'
<think>
Okay, the user started with "Hello, my name is" and then the message got cut off. They probably intended to introduce themselves but didn't finish. I should respond in a friendly and welcoming manner, encouraging them to provide their name and ask how I can assist them. Let me make sure the response is open-ended and helpful. Maybe something like, "Hello! It seems like your message might have been cut off. Could you please share your name and let me know how I can assist you today?" That should cover it and prompt them to complete their introduction and state their request.
</think>

Hello! It seems like your message might have been cut off. Could you please share your name and let me know how I can assist you today? 😊<|eot_id|>

Prompt 2: 'The capital of France is'
<think>
Okay, so the user asked, "The capital of France is," and then left it at that. I need to figure out the correct answer here. Let me start by recalling what I know about Fran

In [14]:
# True Batch Processing (Recommended)
# Process multiple prompts in a single forward pass for better efficiency

batch_prompts = [
    "Hello, my name is",
    "The capital of France is", 
    "Explain quantum computing in simple terms:",
    "List 3 benefits of GPUs",
]

def generate_batch(prompts: list, temperature: float = 0.6, max_new_tokens: int = 256):
    """Generate responses for multiple prompts in a single batch"""
    messages_batch = [[{"role": "user", "content": p}] for p in prompts]
    
    # Build prompts for all messages
    prompts_batch = [build_prompt(msgs) for msgs in messages_batch]
    
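    # Tokenize all prompts. Decoder-only models need left padding so that
    # generation continues directly from the last prompt token of every row.
    tokenizer.padding_side = "left"
    inputs_batch = tokenizer(prompts_batch, return_tensors="pt", padding=True, truncation=True)
    inputs_batch = {k: v.to(model.device) for k, v in inputs_batch.items()}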
    
    # Generate for all prompts at once
    with torch.no_grad():
        outputs = model.generate(
            **inputs_batch,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            use_cache=False,  # Disable cache to avoid conflicts
        )
    
    # Decode responses
    responses = []
    for i, output in enumerate(outputs):
        # Remove the input tokens to get only the generated part
        input_length = inputs_batch["input_ids"][i].shape[0]
        generated_tokens = output[input_length:]
        response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
        responses.append(response.strip())
    
    return responses

print("--- Processing batch requests efficiently ---")
responses = generate_batch(batch_prompts, temperature=0.6, max_new_tokens=256)

for i, (prompt, response) in enumerate(zip(batch_prompts, responses), 1):
    print(f"\nPrompt {i}: {prompt!r}")
    print(f"Response: {response}")

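A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Note: the output recorded below comes from a run that used right padding, as the warning above shows. That is the likely reason the response to Prompt 2 has no `<think>` block and drifts off topic. With `tokenizer.padding_side = "left"` set before tokenizing, the warning should not appear.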


--- Processing batch requests efficiently ---

Prompt 1: 'Hello, my name is'
Response: <think>
Okay, the user started with "Hello, my name is" and then stopped. Maybe they're testing the system or just beginning a conversation. I should respond in a friendly and welcoming manner. Let me make sure to acknowledge their greeting and invite them to share more. I'll keep it simple and open-ended so they feel comfortable continuing the conversation. Let me check for any typos or errors. Yep, looks good. Time to send a response that's both professional and approachable.
</think>

Hello! It's nice to meet you. How can I assist you today? 😊

Prompt 2: 'The capital of France is'
Response: games, I know this one! The capital of France is Paris! Is that right?  

Yes, that's correct! The capital of France is indeed Paris. Well done for knowing that!  

Now, let's try a slightly harder one. What is the capital of Australia?  

(And don't worry, I won't make you guess all of them - just a few for

## Resource Notes

- **Memory Management**: `device_map="auto"` distributes the model across the available GPUs automatically. In `bfloat16`, the weights alone take approximately 98 GB of VRAM (49B parameters × 2 bytes); activations and the KV cache need additional memory on top of that.
- **Quantization**: For systems with limited VRAM, consider 4-bit or 8-bit quantization with bitsandbytes, for example by passing `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` (or `load_in_8bit=True`) to the `from_pretrained` call.
- **Chat Templates**: The model uses Llama's chat template format. The helper functions in this notebook handle the proper formatting automatically.
- **Batch Processing**: True batch processing (running multiple prompts through the model together) gives higher throughput than sequential processing, but it needs more memory, because activations and padding grow with batch size and with the length of the longest prompt.
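
The memory figures above follow from simple arithmetic on the parameter count. The sketch below reproduces the estimate for the weights only, assuming a nominal 49 billion parameters taken from the model name. The quantized figures are rough lower bounds, since they ignore quantization metadata and any layers kept in higher precision, and none of the figures include activations or the KV cache. `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 49e9  # nominal parameter count, assumed from the model name

estimates = {
    label: weight_memory_gb(NUM_PARAMS, bits)
    for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]
}

for label, gigabytes in estimates.items():
    print(f"{label:>8}: ~{gigabytes:.1f} GB for weights")
```

Comparing these numbers with the per-GPU memory printed by the environment check in the Setup section gives a quick first indication of whether the model will fit without offloading.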

## Conclusion

In this notebook, you have learned how to:
- Load and configure the Nemotron-49B-v1.5 model using Hugging Face Transformers.
- Use the model's reasoning modes for different types of tasks.
- Implement both sequential and efficient batch processing.
- Stream responses for real-time applications.
- Build helper functions for cleaner code organization.

This notebook provides a solid foundation for integrating Nemotron with Hugging Face Transformers in your applications.
