# Running NVIDIA Nemotron Nano 9B v2 with Hugging Face Transformers

This notebook walks through running the `nvidia/NVIDIA-Nemotron-Nano-9B-v2` model with Hugging Face Transformers.

[Hugging Face Transformers](https://huggingface.co/docs/transformers) is a model-definition framework for state-of-the-art models across text, vision, audio, video, and multimodal tasks, supporting both training and inference.

For more details on the model, see the [model card](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/modelcard).

Prerequisites:
- NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x 
- Python 3.10+

## Prerequisites & environment

Set up a clean Python environment for running this locally.

Create and activate a virtual environment. The sample here uses Conda but feel free to choose whichever tool you prefer.
Run these commands in a terminal before using this notebook:

```bash
conda create -n nemotron-transformers-env python=3.10 -y
conda activate nemotron-transformers-env
```

If you are running the notebook locally, install `ipykernel` and switch the kernel to this environment:
- Installation
```bash
pip install ipykernel
```
- Kernel → Change kernel → Python (nemotron-transformers-env)

## Install dependencies

In [None]:
%pip install -U transformers accelerate torch causal-conv1d mamba-ssm
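After installing, it can help to confirm which versions were actually resolved in the active kernel. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are copied from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the %pip install command above.
REQUIRED = ["transformers", "accelerate", "torch", "causal-conv1d", "mamba-ssm"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

If any entry shows `NOT INSTALLED`, the kernel is probably not using the environment where the packages were installed.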

## Verify GPU

Check that CUDA is available and the GPU is detected correctly.


In [None]:
# GPU environment check
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")
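The prerequisites also recommend at least 24 GB of VRAM and BF16 support, which the check above does not cover. The sketch below extends it under a few assumptions: the helper `meets_vram_recommendation` and the constant `MIN_VRAM_BYTES` are hypothetical names, and the threshold is expressed in decimal gigabytes so that cards reporting slightly less than their nominal capacity are not rejected.

```python
MIN_VRAM_BYTES = 24 * 10**9  # 24 GB (decimal), per the prerequisites above


def meets_vram_recommendation(total_bytes, minimum=MIN_VRAM_BYTES):
    """Return True if a device's total memory reaches the recommended minimum."""
    return total_bytes >= minimum


try:
    import torch
except ImportError:  # keeps the sketch runnable where torch is absent
    torch = None

if torch is not None and torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        ok = meets_vram_recommendation(props.total_memory)
        print(f"GPU[{i}]: {props.total_memory / 10**9:.1f} GB, meets recommendation: {ok}")
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device detected; skipping VRAM and BF16 checks.")
```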

## Generating responses via pipeline
Use the Transformers `pipeline` API to generate replies quickly. When given a list of chat messages, the pipeline applies the model's chat template for you.

In [None]:
import torch
from transformers import pipeline
 
generator = pipeline(
    "text-generation",
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
 
messages = [
    {"role": "user", "content": "Explain Hugging Face Transformers."},
]
 
result = generator(
    messages,
    max_new_tokens=200,
    temperature=1.0,
)
 
print(result[0]["generated_text"])
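When the pipeline is called with a list of chat messages, `generated_text` usually holds the whole conversation as a list of role/content dictionaries, with the new assistant message last. The sketch below shows one way to pull out just the reply. The helper `last_assistant_reply` and the `mock_result` data are hypothetical; the mock is a hand-written stand-in, not real model output.

```python
def last_assistant_reply(result):
    """Return the content of the final assistant message from a chat pipeline result."""
    conversation = result[0]["generated_text"]
    if isinstance(conversation, str):
        # Plain-string prompts yield a plain string rather than a message list.
        return conversation
    for message in reversed(conversation):
        if message["role"] == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in the pipeline result")


# Hand-written stand-in for a pipeline result; the reply text is a placeholder.
mock_result = [{"generated_text": [
    {"role": "user", "content": "Explain Hugging Face Transformers."},
    {"role": "assistant", "content": "(placeholder reply)"},
]}]

print(last_assistant_reply(mock_result))
```

In the notebook, the same helper could be called as `last_assistant_reply(result)`.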

## Load the Nemotron model and tokenizer manually
Load `AutoTokenizer` and `AutoModelForCausalLM` directly for full control over inputs and generation settings.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

### Reasoning 
Nemotron Nano supports two reasoning modes: ON (default) and OFF. Toggle via `/think` or `/no_think` in the chat messages.

Note: You can include `/think` or `/no_think` in `system` or `user` messages for turn-level control.
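The toggle can be wrapped in a small helper so the same user messages are reused across both modes. This is a minimal sketch; `with_reasoning` is a hypothetical name, and it assumes the incoming messages do not already start with a system message.

```python
def with_reasoning(messages, enabled):
    """Return a new message list with a leading system toggle for reasoning."""
    toggle = "/think" if enabled else "/no_think"
    return [{"role": "system", "content": toggle}, *messages]


user_turn = [{"role": "user", "content": "Write a haiku about GPUs"}]

for message in with_reasoning(user_turn, enabled=False):
    print(message)
```

The helper returns a new list, so the original `user_turn` stays unchanged and can be passed through again with `enabled=True`.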

In [None]:
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))
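The decoded output above contains the prompt as well as the reply, and with reasoning enabled the reply mixes the reasoning trace with the final answer. The sketch below separates the two, assuming the trace is closed by a `</think>` tag (an assumption about the output format, worth checking against the model card). The helper `split_reasoning` is hypothetical, and the sample string is a placeholder, not real model output.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split decoded text into (reasoning, answer); reasoning is '' if no close tag."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Placeholder text, not real model output.
sample = "<think>Count syllables: 5-7-5.</think>\nSilicon rivers"
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)

# In the notebook, decode only the newly generated tokens first, for example:
# new_tokens = outputs[0][tokenized_chat.shape[-1]:]
# reasoning, answer = split_reasoning(tokenizer.decode(new_tokens))
```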


Compare the same prompt with reasoning disabled using `/no_think` to see differences in style and latency.

In [None]:
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))
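To put a number on the latency difference between the two modes, each `generate` call can be wrapped in a wall-clock timer. This is a minimal sketch; `timed` is a hypothetical helper, and the workload shown is a stand-in so the block runs without a model loaded.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a model loaded.
total, seconds = timed(sum, range(1_000_000))
print(f"total={total}, elapsed={seconds:.4f}s")

# In the notebook, the same helper could wrap generation, for example:
# outputs, seconds = timed(model.generate, tokenized_chat, max_new_tokens=1024)
```

Because the two modes usually produce different numbers of tokens, dividing the elapsed time by the count of newly generated tokens gives a fairer comparison than total time alone.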


### Streamed generation
Stream tokens as they are produced using `TextIteratorStreamer` to display output incrementally while `generate` runs in a background thread.

In [None]:
import threading
from transformers import TextIteratorStreamer

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Write a haiku about GPUs"},
]

inputs = tokenizer.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True, 
    return_tensors="pt"
).to(model.device)

# Create a TextIteratorStreamer for streaming output
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generation arguments; do_sample=True is required for temperature and top_p to take effect
gen_kwargs = dict(
    inputs=inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    eos_token_id=tokenizer.eos_token_id,
)

# Launch model.generate in a separate thread with the streamer
thread = threading.Thread(target=model.generate, kwargs=gen_kwargs)
thread.start()

# Print output as tokens arrive
print("Model output:")
for new_text in streamer:
    print(new_text, end="", flush=True)

thread.join()
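The producer/consumer pattern behind this cell can be illustrated without a model. The sketch below is a simplified stand-in, not the library's implementation: `ToyStreamer` and `fake_generate` are hypothetical names, and a worker thread pushes text chunks into a queue while the main thread iterates over them until a sentinel arrives.

```python
import queue
import threading


class ToyStreamer:
    """Simplified stand-in for a text streamer: a queue that is also an iterator."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer supplies a chunk
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(toy_streamer, chunks):
    """Hypothetical producer standing in for model.generate."""
    for chunk in chunks:
        toy_streamer.put(chunk)
    toy_streamer.end()


toy = ToyStreamer()
worker = threading.Thread(target=fake_generate, args=(toy, ["Hello", ", ", "world"]))
worker.start()

collected = "".join(toy)  # consume chunks as they arrive, like the loop above
worker.join()
print(collected)
```

Joining the chunks into a string, as done here with `collected`, is also a convenient way to keep the full streamed reply for later use in the notebook.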
