<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/Nemotron_3_Nano_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

In [1]:
!pip install accelerate -q

In [None]:
# Example: Install from source if pip fails
# First, ensure you have the necessary development tools
!pip install packaging ninja -q

# Then, install mamba-ssm
!pip install --no-index --no-cache-dir mamba-ssm -q

In [3]:
import torch
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Version: {torch.version.cuda}")

PyTorch Version: 2.9.0+cu126
CUDA Version: 12.6


In [None]:
# Install the causal convolution dependency
!pip install causal-conv1d -q

In [None]:
# Install the core mamba-ssm package using the isolation fix
!pip install mamba-ssm --no-build-isolation -q

In [None]:
# Install libraries for aggressive memory optimization and fast loading
!pip install bitsandbytes accelerate -q

## CASE1: Simple Function (Average Calculation)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 1. Configuration and Aggressive 4-bit Quantization
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

# Define the 4-bit configuration for maximum VRAM savings
# This is the most aggressive VRAM optimization available via transformers.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # Use NormalFloat 4-bit for best results
    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for speed
    bnb_4bit_use_double_quant=True   # Double Quantization for extra savings
)

# 2. Load Tokenizer and Model
print(f"Loading 4-bit model and tokenizer for {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Pass the quantization config to load the model in a compressed state
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,   # <--- THE CRITICAL VRAM FIX
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
print("Model loaded successfully.")

# 3. Prepare the Agentic Prompt (No change)
messages = [
    {
        "role": "system",
        "content": "You are a helpful and detailed Python coding agent. When planning is needed, use the <think></think> tag. Always provide the final code in a block."
    },
    {
        "role": "user",
        "content": "I have a list of numbers: [12, 5, 8, 20, 3]. I need a function to calculate the average and return it formatted to two decimal places."
    }
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Tokenize Input (No change)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# 5. Generate Response (Optimized for Speed)
print("Generating response...")
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,      # Crucial to force caching and address speed warnings
    eos_token_id=tokenizer.eos_token_id,
)
print("Response generation complete.")

# 6. Decode and Print
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
#generated_response = output_text[len(input_text):].strip()

generated_response = output_text[len(input_text) - 5:].strip()


In [2]:
print("\n--- Nemotron 3 Nano Agent Output ---")
print(generated_response)


--- Nemotron 3 Nano Agent Output ---
ion to calculate average of a list and return formatted to two decimal places. Provide example usage. Should be straightforward.
</think>
Here’s a small, self‑contained Python function that takes a list of numbers, computes their arithmetic mean, and returns the result as a string formatted to **exactly two decimal places**.

```python
def average_two_decimals(numbers):
    """
    Calculate the average of a list of numbers and return it formatted to two decimal places.

    Parameters
    ----------
    numbers : list of int/float
        The numbers whose average is to be computed.

    Returns
    -------
    str
        The average value as a string with exactly two digits after the decimal point.
        If the list is empty, returns "0.00".

    Example
    -------
    >>> average_two_decimals([12, 5, 8, 20, 3])
    '7.80'
    """
    if not numbers:                     # Guard against an empty list
        return "0.00"

    avg = sum(number

## CASE2: Agentic OOP Task (InventoryManager)

In [4]:
# 3. Prepare the Agentic Prompt (NEW TASK)
messages = [
    {
        "role": "system",
        "content": "You are a helpful and detailed Python coding agent. When planning is needed, use the <think></think> tag. Always provide the final code in a block."
    },
    {
        "role": "user",
        "content": "I need a Python class called InventoryManager that handles stock. It must have two methods: 1. add_item(self, item_name: str, quantity: int): Adds or updates the stock in the internal dictionary. 2. check_stock(self, item_name: str): Returns the current quantity of an item, or 0 if it is not found. After defining the class, show the steps to: a. Initialize the manager. b. Add 15 apples and 20 bananas. c. Check the stock of apples."
    }
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Tokenize Input
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# 5. Generate Response (Optimized for Speed and Full Output)
print("Generating full agentic response...")
output = model.generate(
    **inputs,
    max_new_tokens=512,  # Increased token limit for multi-step tasks
    do_sample=False,     # Fastest decoding (greedy)
    use_cache=True,      # Crucial to force caching
    eos_token_id=tokenizer.eos_token_id,
)
print("Response generation complete.")

# 6. Decode and Print
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
#generated_response = output_text[len(input_text):].strip()

Generating full agentic response...
Response generation complete.


In [7]:
generated_response = output_text[len(input_text) - 5:].strip()
print("\n--- Nemotron 3 Nano Agent Output ---")
print(generated_response)


--- Nemotron 3 Nano Agent Output ---
lass with methods add_item and check_stock, and then show steps to initialize, add items, check stock. Provide code block. Probably also show example usage. Provide explanation. Ensure clarity.
</think>
**Python class**

```python
class InventoryManager:
    """Simple manager that stores an inventory of items."""

    def __init__(self) -> None:
        """Create an empty inventory dictionary."""
        self._stock: dict[str, int] = {}

    def add_item(self, item_name: str, quantity: int) -> None:
        """
        Add a new item or update the quantity of an existing one.

        Parameters
        ----------
        item_name : str
            The name of the item (e.g., "apple").
        quantity : int
            How many units to add.  If the item already exists, the quantity
            is increased; otherwise it is created with the given amount.
        """
        if quantity < 0:
            raise ValueError("Quantity must be non‑negat

## CASE3: Scientific Reasoning Task: Projectile Height

In [11]:
import torch
# We assume AutoTokenizer, AutoModelForCausalLM, and BitsAndBytesConfig are already imported
# from the previous model loading cell.

# 3. Prepare the Agentic Prompt (NEW SCIENTIFIC TASK)
print("3. Defining scientific reasoning task...")
messages = [
    {
        "role": "system",
        "content": "You are a helpful and detailed Python coding agent. When planning is needed, use the <think></think> tag. Always provide the final code in a block."
    },
    {
        "role": "user",
        "content": "Write a single, well-commented Python function that solves the following problem: A projectile is launched straight up from a height of 5 meters with an initial velocity of 20 m/s. Write a function that accepts the time in seconds (t) and returns the projectile's height (h) using the physics formula: h = h_0 + v_0*t - (1/2)*g*t^2. Use an approximate value for the gravitational acceleration g=9.8 m/s^2 and format the result to one decimal place."
    }
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Tokenize Input
print("4. Tokenizing input and moving to GPU...")
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# 5. Generate Response (Optimized for Speed and Full Output)
print("5. Generating full scientific reasoning response...")
output = model.generate(
    **inputs,
    max_new_tokens=1024,  # Sufficient limit
    do_sample=False,     # Fastest decoding (greedy)
    use_cache=True,      # Crucial to force caching
    eos_token_id=tokenizer.eos_token_id,
)
print("Response generation complete.")

# 6. Decode and Prepare for Print (Note: Cell 6 will print it)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
# Use the small negative offset to cleanly capture the start of the response.
generated_response = output_text[len(input_text) - 5:].strip()

3. Defining scientific reasoning task...
4. Tokenizing input and moving to GPU...
5. Generating full scientific reasoning response...
Response generation complete.


In [12]:
print("\n--- Nemotron 3 Nano Agent Output ---")
print(generated_response)


--- Nemotron 3 Nano Agent Output ---
ction that solves the problem. It should accept time t and return height h formatted to one decimal place. Use g=9.8, h0=5, v0=20. Provide code with comments. Probably include docstring. Return formatted string? It says returns the projectile's height (h) using formula and format result to one decimal place. Could return a float rounded to one decimal place or a string with one decimal place. Usually format to one decimal place means maybe return a string with one decimal place. But could also return a float rounded to one decimal place. I'll return a float rounded to one decimal place using round(h,1). Or format as string f"{h:.1f}". Provide both? Probably return a float with one decimal place representation. I'll return a float rounded to one decimal place.

Write function: def projectile_height(t): ... docstring explaining parameters and return. Use constants. Compute h = h0 + v0*t - 0.5*g*t**2. Then return round(h,1). Also handle negative heigh

## CASE4: Combined: Theoretical Architecture Explanation

In [15]:
import torch
# Assuming 'tokenizer' and 'model' are loaded from the previous cell execution.

# 3. Prepare the Agentic Prompt (NEW THEORETICAL TASK)
print("3. Defining theoretical architecture task...")
messages = [
    {
        "role": "system",
        "content": "You are a helpful and detailed AI assistant. When planning is needed, use the <think></think> tag. Always provide a clear, structured explanation."
    },
    {
        "role": "user",
        "content": "Explain why the Nemotron-3 Nano 30B model's 1-million-token context window and its hybrid architecture are important for enterprise software development and code analysis."
    }
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 4. Tokenize Input
print("4. Tokenizing input and moving to GPU...")
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# 5. Generate Response (Optimized for Speed and Full Output)
print("5. Generating full theoretical explanation response...")
output = model.generate(
    **inputs,
    max_new_tokens=1024,  # Increased token limit for a detailed explanation
    do_sample=False,     # Fastest decoding (greedy)
    use_cache=True,      # Crucial to force caching
    eos_token_id=tokenizer.eos_token_id,
)
print("Response generation complete.")

# 6. Decode and Prepare for Print
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
# Retain the negative offset for clean output
generated_response = output_text[len(input_text) - 5:].strip()

3. Defining theoretical architecture task...
4. Tokenizing input and moving to GPU...
5. Generating full theoretical explanation response...
Response generation complete.


In [16]:
print("\n--- Nemotron 3 Nano Agent Output ---")
print(generated_response)


--- Nemotron 3 Nano Agent Output ---
30B model's 1-million-token context window and its hybrid architecture are important for enterprise software development and code analysis. Provide clear structured explanation. Use tags? The user said "Please respond with the format: wrap your entire response with double quotation marks." So we need to output a single string wrapped in double quotes. Inside we can have paragraphs etc. Must ensure the entire response is within double quotes. So output starts with " and ends with ". Inside we can have line breaks etc. Ensure no extra characters outside. Provide explanation. Let's craft.
</think>
"Enterprise software development and code analysis demand models that can ingest, process, and reason over extremely large codebases and documentation sets. The Nemotron‑3 Nano 30B model addresses this need through two key capabilities:

1. **1‑Million‑Token Context Window**
   - **Holistic Codebase Understanding**: A context window of one million tokens all