# Nemotron Nano 9B v2 NIM Setup and Inference

This notebook will guide you through:
1. Setting up your NGC API key
2. Logging into the NVIDIA Container Registry
3. Pulling and running the Nemotron Nano 9B v2 NIM container
4. Performing inference against the model


## Step 1: Enter Your NGC API Key

Get your API key from: https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/deploy


In [None]:
import getpass
import os
import subprocess
import time
import requests
import json

# Prompt for API key securely
NGC_API_KEY = getpass.getpass("Enter your NGC API Key: ")
os.environ['NGC_API_KEY'] = NGC_API_KEY

print("✓ API Key saved to environment")
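
When the notebook is re-run often, it can be convenient to reuse a key that is already exported in the shell and prompt only when it is missing. The following is a minimal sketch of that idea; `resolve_api_key` is a hypothetical helper, not part of any NVIDIA tooling, and it assumes the key lives in the `NGC_API_KEY` variable used above.

```python
import getpass
import os


def resolve_api_key(env=None, prompt=getpass.getpass):
    """Return the NGC API key from the environment, prompting only if absent.

    `env` and `prompt` are injectable so the helper can be exercised
    without a terminal (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY", "").strip()
    if not key:
        # Fall back to a hidden prompt, as the cell above does
        key = prompt("Enter your NGC API Key: ").strip()
    if not key:
        raise ValueError("NGC API key is empty")
    env["NGC_API_KEY"] = key
    return key
```

Calling `NGC_API_KEY = resolve_api_key()` would then replace the prompt in the cell above, and an empty key fails early instead of surfacing later as a login error.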


## Step 2: Login to NVIDIA Container Registry

This step authenticates Docker with the NVIDIA Container Registry (nvcr.io), using `$oauthtoken` as the username and your NGC API key as the password.


In [None]:
# Docker login to nvcr.io
print("Logging into nvcr.io...")

login_process = subprocess.Popen(
    ['docker', 'login', 'nvcr.io', '-u', '$oauthtoken', '--password-stdin'],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)

stdout, stderr = login_process.communicate(input=NGC_API_KEY)

if login_process.returncode == 0:
    print("✓ Successfully logged into nvcr.io")
    print(stdout)
else:
    print("✗ Login failed")
    print(stderr)


## Step 3: Setup Local Cache and Pull Container

This creates a local cache directory and pulls the NIM container image.


In [None]:
# Setup local cache directory
LOCAL_NIM_CACHE = os.path.expanduser("~/.cache/nim")
os.makedirs(LOCAL_NIM_CACHE, exist_ok=True)
print(f"✓ Created cache directory: {LOCAL_NIM_CACHE}")

# Pull the container image
print("\nPulling container image (this may take several minutes)...")
pull_result = subprocess.run(
    ['docker', 'pull', 'nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest'],
    capture_output=True,
    text=True
)

if pull_result.returncode == 0:
    print("✓ Container image pulled successfully")
else:
    print("✗ Failed to pull container")
    print(pull_result.stderr)


## Step 4: Run the NIM Container

**Note:** This will start the container in the background. The model may take a few minutes to load and be ready for inference.

**Important:** Make sure you have:
- NVIDIA GPU with appropriate drivers
- Docker with GPU support (NVIDIA Container Toolkit; the older `nvidia-docker` wrapper is deprecated)
- Sufficient disk space (~20GB+)
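
The checklist above can be partly automated before starting the container. The sketch below is a rough preflight check, assuming that finding `docker` and `nvidia-smi` on the `PATH` is a reasonable hint that Docker and the GPU driver are installed. It does not verify that Docker can actually reach the GPU. `check_prerequisites` and its parameters are hypothetical names.

```python
import shutil


def check_prerequisites(path=".", min_free_gb=20):
    """Return a dict describing whether basic host requirements look satisfied.

    Heuristic only: it checks for executables on PATH and free disk space,
    not for a working GPU runtime inside Docker.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "docker_on_path": shutil.which("docker") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
    }
```

Passing the cache directory as `path` (for example `check_prerequisites(LOCAL_NIM_CACHE)`) measures free space on the volume where the model files are cached.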


In [None]:
# Get current user ID
user_id = os.getuid()

# Check if container is already running
check_container = subprocess.run(
    ['docker', 'ps', '--filter', 'ancestor=nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest', '-q'],
    capture_output=True,
    text=True
)

if check_container.stdout.strip():
    print("⚠ Container is already running")
    container_id = check_container.stdout.strip().split('\n')[0]
    print(f"Container ID: {container_id}")
else:
    print("Starting NIM container...")
    print("This will run in detached mode. Monitor logs with: docker logs -f <container_id>")
    
    run_result = subprocess.Popen(
        [
            'docker', 'run', '-d',
            '--gpus', 'all',
            '--shm-size=16GB',
            '-e', 'NGC_API_KEY',  # forwarded from os.environ (set in Step 1), so the key stays off the command line
            '-v', f'{LOCAL_NIM_CACHE}:/opt/nim/.cache',
            '-u', str(user_id),
            '-p', '8000:8000',
            'nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest'
        ],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True
    )
    
    stdout, stderr = run_result.communicate()
    
    if run_result.returncode == 0:
        container_id = stdout.strip()
        print("✓ Container started successfully")
        print(f"Container ID: {container_id}")
        print("\nWaiting for model to load (this may take 2-5 minutes)...")
    else:
        print("✗ Failed to start container")
        print(stderr)
        container_id = None
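
If you need to change the host port or inspect the command before running it, the argument list can be built by a small function. This is a sketch under the same settings as the cell above; `build_run_command` is a hypothetical helper. It forwards `NGC_API_KEY` by name (`-e NGC_API_KEY`), which makes Docker read the value from the calling environment rather than from the argument list.

```python
def build_run_command(image, cache_dir, user_id, port=8000, shm_size="16GB"):
    """Return the `docker run` argument list used to start the NIM container.

    Hypothetical helper: mirrors the flags from the cell above, with the
    host port and shared-memory size exposed as parameters.
    """
    return [
        "docker", "run", "-d",
        "--gpus", "all",
        f"--shm-size={shm_size}",
        "-e", "NGC_API_KEY",  # value is taken from the calling environment
        "-v", f"{cache_dir}:/opt/nim/.cache",
        "-u", str(user_id),
        "-p", f"{port}:8000",  # host port : container port
        image,
    ]
```

The returned list can be printed for review and then passed to `subprocess.run(..., capture_output=True, text=True)`.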


## Step 5: Wait for Model to be Ready

This cell polls the `/v1/models` endpoint until the service responds with HTTP 200, which indicates the model is ready to accept requests.


In [None]:
import time

# Wait for the service to be ready
max_retries = 60  # 60 retries x 5 s sleep = about 5 minutes, plus request time
retry_interval = 5  # seconds

print("Checking if service is ready...")
for i in range(max_retries):
    try:
        response = requests.get('http://localhost:8000/v1/models', timeout=5)
        if response.status_code == 200:
            print(f"\n✓ Service is ready! (took {i * retry_interval} seconds)")
            print(f"Available models: {response.json()}")
            break
    except requests.exceptions.RequestException:
        pass
    
    if i % 6 == 0:  # Print every 30 seconds
        print(f"Waiting... ({i * retry_interval}s elapsed)")
    
    time.sleep(retry_interval)
else:
    print("\n✗ Service did not become ready in time. Check container logs with:")
    print(f"docker logs {container_id}")
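
The polling pattern above can be factored into a reusable function that is independent of HTTP. The sketch below is one way to do it; `wait_until` and its `sleep` parameter are hypothetical, and `sleep` is injectable only so the loop can be tested without real delays.

```python
import time


def wait_until(check, max_retries=60, interval=5.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the zero-based attempt index on success, or None if every
    attempt failed. Sleeps `interval` seconds after each failed attempt.
    """
    for attempt in range(max_retries):
        if check():
            return attempt
        sleep(interval)
    return None
```

Here `check` could be a function that requests `http://localhost:8000/v1/models`, returns `True` on status 200, and returns `False` when `requests.exceptions.RequestException` is raised.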


## Step 6: Perform Inference

Now let's test the model with a sample question!


In [None]:
# Perform inference
url = 'http://localhost:8000/v1/chat/completions'

payload = {
    "model": "nvidia/nvidia-nemotron-nano-9b-v2",
    "messages": [
        {"role": "user", "content": "Which number is larger, 9.11 or 9.8?"}
    ],
    "max_tokens": 64
}

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

print("Sending inference request...\n")
response = requests.post(url, json=payload, headers=headers, timeout=120)  # avoid hanging indefinitely if the server stalls

if response.status_code == 200:
    result = response.json()
    print("✓ Inference successful!\n")
    print("Response:")
    print(json.dumps(result, indent=2))
    print("\n" + "="*60)
    print("Model's Answer:")
    print(result['choices'][0]['message']['content'])
    print("="*60)
else:
    print(f"✗ Inference failed with status code: {response.status_code}")
    print(response.text)
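
The cell above indexes straight into `result['choices'][0]['message']['content']`, which raises an exception if the response has a different shape (for example, an empty `choices` list). A slightly more defensive extraction might look like the following sketch; `extract_answer` is a hypothetical helper.

```python
def extract_answer(result):
    """Return the first choice's message text, or None if the shape is unexpected."""
    try:
        return result["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        # Missing key, empty list, or a non-dict/non-list value along the way
        return None
```

A `None` return can then be handled explicitly, for instance by printing the raw JSON for inspection.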


## Step 7: Try Your Own Questions!

Use this cell to experiment with your own prompts.


In [None]:
# Custom inference function
def ask_nemotron(question, max_tokens=128):
    url = 'http://localhost:8000/v1/chat/completions'
    
    payload = {
        "model": "nvidia/nvidia-nemotron-nano-9b-v2",
        "messages": [
            {"role": "user", "content": question}
        ],
        "max_tokens": max_tokens
    }
    
    response = requests.post(url, json=payload, headers={'Content-Type': 'application/json'}, timeout=120)
    
    if response.status_code == 200:
        result = response.json()
        return result['choices'][0]['message']['content']
    else:
        return f"Error: {response.status_code} - {response.text}"

# Try it out!
question = "What are the three laws of robotics?"
answer = ask_nemotron(question)
print(f"Q: {question}\n")
print(f"A: {answer}")
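
`ask_nemotron` sends each question in isolation, so the model has no memory of earlier turns. A follow-up question needs the previous messages included in the payload. The sketch below shows one way to carry that history, assuming the endpoint accepts `assistant` messages in the `messages` list as OpenAI-style chat APIs do. `build_chat_payload` and `record_turn` are hypothetical helpers.

```python
def build_chat_payload(history, question,
                       model="nvidia/nvidia-nemotron-nano-9b-v2",
                       max_tokens=128):
    """Return a chat-completions payload containing prior turns plus the new question."""
    messages = list(history) + [{"role": "user", "content": question}]
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def record_turn(history, question, answer):
    """Return a new history list with the latest question and answer appended."""
    return list(history) + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
```

After each reply, call `history = record_turn(history, question, answer)` and pass `history` to `build_chat_payload` for the next question. Both functions copy the list, so the caller's history is never mutated in place.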


## Cleanup (Optional)

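When you're done, you can stop the container. The cell below only stops it; to remove the stopped container as well, run `docker rm <container_id>` afterwards.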


In [None]:
# Stop the container
if 'container_id' in locals() and container_id:
    print(f"Stopping container {container_id}...")
    subprocess.run(['docker', 'stop', container_id])
    print("✓ Container stopped")
else:
    # Try to find and stop any running nemotron containers
    result = subprocess.run(
        ['docker', 'ps', '--filter', 'ancestor=nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest', '-q'],
        capture_output=True,
        text=True
    )
    if result.stdout.strip():
        containers = result.stdout.strip().split('\n')
        for cid in containers:
            print(f"Stopping container {cid}...")
            subprocess.run(['docker', 'stop', cid])
        print("✓ All containers stopped")
    else:
        print("No running containers found")
