# Part 2: Local NIM Deployment

This notebook will guide you through deploying NVIDIA NIMs locally on your own infrastructure.

## Prerequisites

- NVIDIA GPU (compute capability ≥ 7.0)
- Docker installed
- NVIDIA Container Toolkit
- NGC API Key
- Sufficient disk space (~50GB per model)
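
Some of these prerequisites can be checked from Python before going further. The sketch below is only a rough check: `check_prerequisites` is a hypothetical helper name, the 50 GB threshold mirrors the estimate in the list above, and it does not verify compute capability or the NVIDIA Container Toolkit.

```python
import shutil

def check_prerequisites(path=".", min_free_gb=50, tools=("docker", "nvidia-smi")):
    """Report which CLI tools are on PATH and whether `path` has enough free disk space."""
    # shutil.which returns None when the executable cannot be found on PATH
    report = {tool: shutil.which(tool) is not None for tool in tools}
    free_gb = shutil.disk_usage(path).free / 1e9
    report["free_disk_gb"] = round(free_gb, 1)
    report["enough_disk"] = free_gb >= min_free_gb
    return report

print(check_prerequisites())
```

A `False` for `docker` or `nvidia-smi` only means the executable was not found on `PATH`; the environment checks in the next section are still the real test.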

## 1. Environment Check

Ensure GPU Availability

In [None]:
# Check GPU availability
!nvidia-smi

The next cell verifies that Docker can access the NVIDIA GPUs on the system. It runs `nvidia-smi` inside a temporary Ubuntu container with GPU support enabled.

In [None]:
# Verify Docker and NVIDIA runtime
!docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Time to set up NGC (NVIDIA GPU Cloud) access. The NGC API key is different from the NVIDIA API key we used earlier: this one lets us download the actual NIM containers.

We're also creating a cache directory. This is important - models are LARGE (5-100GB). The cache means:
- Download once, use many times
- Survive container restarts
- Share models between containers
- Quick model switching

The cache will be at ~/.cache/nim
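
Because the cache can grow large, it can be useful to see how much space it currently takes. The following is a minimal sketch; `dir_size_gb` is a hypothetical helper, not part of NIM, and it simply sums file sizes on disk.

```python
import os

def dir_size_gb(path):
    """Sum file sizes under `path` in GB; returns 0.0 if the directory is missing."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / 1e9

cache_dir = os.path.expanduser("~/.cache/nim")
print(f"{cache_dir}: {dir_size_gb(cache_dir):.2f} GB")
```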

In [None]:
import os
import subprocess
import time
import requests
import json

# Set up environment variables
NGC_API_KEY = os.getenv('NGC_API_KEY')
if not NGC_API_KEY:
    import getpass
    NGC_API_KEY = getpass.getpass("Enter your NGC API key: ")
    os.environ['NGC_API_KEY'] = NGC_API_KEY

# Set cache directory
LOCAL_NIM_CACHE = os.path.expanduser("~/.cache/nim")
os.makedirs(LOCAL_NIM_CACHE, exist_ok=True)
print(f"NIM cache directory: {LOCAL_NIM_CACHE}")

Get NVIDIA API Key

In [None]:
import os
import requests
import json
from openai import OpenAI
from dotenv import load_dotenv
from pathlib import Path

# Find the .env file in the project root
env_path = Path('.env')

# Load environment variables from .env file
# Use override=True to ensure values are loaded even if they exist in environment
load_dotenv(dotenv_path=env_path, override=True)

# Get API key from environment
nvidia_api_key = os.getenv("NVIDIA_API_KEY")

if not nvidia_api_key:
    print("❌ NVIDIA API Key not found in .env file!")
    print("👉 Please run 00_Workshop_Setup.ipynb first to set up your API key.")
    print(f"   (Looked for .env file at: {env_path.absolute()})")
    raise ValueError("NVIDIA_API_KEY not found. Please run the setup notebook first.")
else:
    print("✅ NVIDIA API Key loaded successfully from .env file")
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## 2. NGC Authentication

Log into NGC

In [None]:
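# Docker login to NGC (the username is the literal string $oauthtoken)
result = subprocess.run(
    ["docker", "login", "nvcr.io", "--username", "$oauthtoken", "--password-stdin"],
    input=NGC_API_KEY, capture_output=True, text=True,
)
# Docker reports failures on stderr, so show that stream when the login fails
print("Login result:", result.stdout.strip() if result.returncode == 0 else result.stderr.strip())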

## 3. Deploy Your First NIM

The startup process:
1. Pull NIM image (done in setup notebook)
2. Check cache for model files
3. Load model into GPU memory (may take a while)
4. Start inference

In [None]:
# Define deployment parameters
CONTAINER_NAME = "llama3.1-8b-instruct" 
IMG_NAME = "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest"

# Stop existing container if running
!docker stop {CONTAINER_NAME} 2>/dev/null || true
!docker rm {CONTAINER_NAME} 2>/dev/null || true


This cell deploys the NIM container with the Llama 3.1 8B Instruct model. It constructs and executes a Docker command that:

**Container Configuration:**
- Runs in detached mode (`-d`) for background operation
- Enables GPU access with NVIDIA runtime
- Allocates 16GB shared memory for PyTorch operations
- Mounts the local cache directory to persist downloaded models
- Maps port 8000 for API access
- Runs with the current user's permissions to avoid file permission issues

**Key Environment Variables:**
- `NGC_API_KEY`: Authenticates with NVIDIA GPU Cloud to download the model
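- `LOCAL_NIM_CACHE`: Not passed to the container; this notebook variable holds the host path (`~/.cache/nim`) that is mounted at `/opt/nim/.cache` for model storage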

**Success/Failure Handling:**
- On success: Displays the container ID and confirms deployment
- On failure: Provides detailed troubleshooting steps including:
  - Docker status checks
  - Port conflict detection
  - GPU availability verification
  - Disk space validation
  - NGC authentication verification

The first run will download the model (5-10 minutes), while subsequent runs use the cached model for faster startup (30-60 seconds).
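
The configuration described above can also be written as an argument list instead of a shell string. This is a sketch under stated assumptions: `build_nim_run_args` is a hypothetical helper, it assumes a POSIX host (for `os.getuid`), and it relies on Docker's `-e NAME` form, which copies the variable's value from the calling environment so the key does not appear in the command itself.

```python
import os

def build_nim_run_args(container_name, image, cache_dir, api_key_env="NGC_API_KEY", port=8000):
    """Return the `docker run` argument list for the configuration described above."""
    return [
        "docker", "run", "-d", f"--name={container_name}",
        "--runtime=nvidia", "--gpus", "all",
        "--shm-size=16GB",
        "-e", api_key_env,                     # name only: value is taken from the caller's environment
        "-v", f"{cache_dir}:/opt/nim/.cache",  # persist downloaded models on the host
        "-u", str(os.getuid()),                # run as the current user
        "-p", f"{port}:{port}",
        image,
    ]

args = build_nim_run_args(
    "llama3.1-8b-instruct",
    "nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",
    os.path.expanduser("~/.cache/nim"),
)
print(" ".join(args))
```

The list can be passed straight to `subprocess.run(args, capture_output=True, text=True)` without `shell=True`.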

In [None]:
# Start NIM container
docker_cmd = f"""
docker run -d --name={CONTAINER_NAME} \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY={NGC_API_KEY} \
    -v {LOCAL_NIM_CACHE}:/opt/nim/.cache \
    -u $(id -u) \
    -p 8000:8000 \
    {IMG_NAME}
"""

print("Starting NIM container...")
result = subprocess.run(docker_cmd, shell=True, capture_output=True, text=True)

# Check if the command succeeded
if result.returncode == 0 and result.stdout.strip():
    container_id = result.stdout.strip()
    print(f"✅ Container started successfully!")
    print(f"Container ID: {container_id}")
else:
    print("❌ Failed to start container!")
    print(f"Return code: {result.returncode}")
    if result.stderr:
        print(f"Error message: {result.stderr}")
    if result.stdout:
        print(f"Output: {result.stdout}")
    
    # Common issues and solutions
    print("\nTroubleshooting tips:")
    print("1. Check if Docker is running: docker info")
    print("2. Check if image exists: docker images | grep llama")
    print("3. Check if port 8000 is already in use: docker ps -a")
    print(f"4. Check Docker logs: docker logs {CONTAINER_NAME}")
    print("5. Verify NGC authentication: echo $NGC_API_KEY")
    print("6. Check available disk space: df -h")
    print("7. Verify GPU is accessible: nvidia-smi")

The next cell defines a function that polls the health endpoint (`/v1/health/ready`) until the NIM is ready.

Once the ready message appears, the NIM can serve requests through the familiar OpenAI API format.
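
The polling pattern itself is independent of Docker and HTTP. As a minimal sketch, the hypothetical `poll_until` helper below retries any probe function and treats exceptions as "not ready yet"; the fake probe stands in for a real health request.

```python
import time

def poll_until(probe, max_attempts=60, sleep_time=5.0, sleep=time.sleep):
    """Call `probe()` until it returns a truthy value; give up after `max_attempts` tries.

    Exceptions raised by the probe count as "not ready yet".
    Returns the 1-based attempt number that succeeded, or None on timeout.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if probe():
                return attempt
        except Exception:
            pass  # treat any probe failure as "not ready"
        if attempt < max_attempts:
            sleep(sleep_time)
    return None

# Stand-in probe that fails twice, then reports ready on its third call
calls = {"n": 0}

def fake_probe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not up yet")
    return True

print(poll_until(fake_probe, sleep_time=0))  # 3
```

With a real service, the probe would wrap a `requests.get(health_url, timeout=5)` call and return whether the status code is 200.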

In [None]:
def wait_for_nim_ready(max_attempts=60, sleep_time=5):
    """Wait for NIM to be ready to serve requests"""
    print("Waiting for NIM to start (this may take a few minutes on first run)...")
    
    # Prefer the container's bridge IP; fall back to localhost if it is unavailable
    health_url = "http://localhost:8000/v1/health/ready"
    try:
        result = subprocess.run(['docker', 'inspect', CONTAINER_NAME],
                                capture_output=True, text=True)
        container_ip = json.loads(result.stdout)[0]['NetworkSettings']['IPAddress']
        if container_ip:  # empty when the container is not on the default bridge network
            health_url = f"http://{container_ip}:8000/v1/health/ready"
    except (ValueError, LookupError, OSError):
        pass  # keep the localhost fallback

    for attempt in range(max_attempts):
        try:
            response = requests.get(health_url, timeout=5)
            if response.status_code == 200:
                print("\n✅ NIM is ready!")
                return True
        except requests.exceptions.RequestException:
            pass  # container is not accepting connections yet
        
        print(".", end="", flush=True)
        time.sleep(sleep_time)
    
    print("\n❌ NIM failed to start")
    return False

# Wait for container to be ready
if wait_for_nim_ready():
    print("NIM is ready to serve requests!")
else:
    print("Check logs with: docker logs", CONTAINER_NAME)

This cell retrieves the container's internal IP address and tests the NIM API directly using that IP instead of localhost:8000.

**Why this is needed:**
Since this workshop runs on cloud GPU instances (like Brev), localhost connections often fail because:
- The cloud instance's network configuration may block local port forwarding
- Security policies in cloud environments can restrict localhost access
- Docker's bridge network might not properly route to the host's localhost

**The workaround:**
By using `docker inspect` to get the container's IP address (like 172.17.0.3), we bypass these cloud networking issues and connect directly to the container's network. This ensures reliable API access regardless of the cloud provider's network configuration.

The cell then verifies the NIM is working by requesting the available models list, confirming the API is ready to serve requests.
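
One caveat with this workaround: the top-level `NetworkSettings.IPAddress` field is empty when a container is attached only to a user-defined network, in which case the address sits under `NetworkSettings.Networks`. The sketch below (a hypothetical `extract_container_ip` helper operating on `docker inspect` JSON; the sample data is made up) checks both places.

```python
import json

def extract_container_ip(inspect_json):
    """Return the first non-empty IP found in `docker inspect` output, or None."""
    info = json.loads(inspect_json)
    if not info:
        return None
    settings = info[0].get("NetworkSettings", {})
    # Default bridge network: the address is exposed at the top level
    if settings.get("IPAddress"):
        return settings["IPAddress"]
    # User-defined networks: look through the per-network entries
    for network in (settings.get("Networks") or {}).values():
        if network.get("IPAddress"):
            return network["IPAddress"]
    return None

sample = json.dumps([{
    "NetworkSettings": {
        "IPAddress": "",
        "Networks": {"workshop": {"IPAddress": "172.18.0.2"}},
    }
}])
print(extract_container_ip(sample))  # 172.18.0.2
```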

In [None]:
import subprocess
import json

# Get container IP address
def get_container_ip(container_name):
    try:
        result = subprocess.run(['docker', 'inspect', container_name], 
                              capture_output=True, text=True)
        if result.returncode == 0:
            container_info = json.loads(result.stdout)
            ip = container_info[0]['NetworkSettings']['IPAddress']
            print(f"Container IP: {ip}")
            return ip
        else:
            print(f"Failed to get container info for '{container_name}'")
            print(f"Error: {result.stderr}")
            return None
    except Exception as e:
        print(f"Error getting container IP: {e}")
        return None

container_ip = get_container_ip(CONTAINER_NAME)

# If we have the IP, try connecting to it directly
if container_ip:
    try:
        response = requests.get(f"http://{container_ip}:8000/v1/models", timeout=5)
        if response.status_code == 200:
            print("✅ NIM is accessible via container IP!")
            print("Available models:", response.json())
        else:
            print(f"❌ Got status code {response.status_code} from container IP")
    except Exception as e:
        print(f"❌ Error connecting to container IP: {e}")

## 4. Test Local NIM

This cell shows how to use the OpenAI Python SDK with your local NIM - the same tool we used with the cloud API in Part 1.

**What's happening:**
1. Creates an OpenAI client, but points it to your local container instead of the cloud
2. No API key needed since it's running locally
3. Requests a poem about AI with streaming enabled
4. Prints the response chunk by chunk as it's generated

**The key insight:** You can switch between cloud and local NIMs by just changing the `base_url`. All your existing OpenAI code works unchanged. The commented section shows how easy it is to switch back to the cloud API - just swap the client configuration.

This demonstrates the power of NIMs: same API, same code, but now running on your own hardware.
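
To make the switch explicit, the client settings can be kept in one place. This is a minimal sketch: `nim_client_kwargs` is a hypothetical helper, the cloud URL is the one used in Part 1, and the placeholder key for the local case is arbitrary.

```python
import os

CLOUD_BASE_URL = "https://integrate.api.nvidia.com/v1"

def nim_client_kwargs(target, host="localhost", port=8000):
    """Return keyword arguments for OpenAI(...) for a 'local' or 'cloud' NIM."""
    if target == "local":
        # The local NIM does not check the key, but the SDK expects a non-empty value
        return {"base_url": f"http://{host}:{port}/v1", "api_key": "not-needed-for-local"}
    if target == "cloud":
        return {"base_url": CLOUD_BASE_URL, "api_key": os.environ["NVIDIA_API_KEY"]}
    raise ValueError(f"unknown target: {target!r}")

print(nim_client_kwargs("local", host="172.17.0.3"))
```

With the OpenAI SDK installed, `client = OpenAI(**nim_client_kwargs("local", host=container_ip))` would build the local client, and changing the first argument to `"cloud"` would build the cloud one.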

In [None]:
from openai import OpenAI
import subprocess
import json

# Create OpenAI client pointing to your local NIM
client = OpenAI(
    base_url=f"http://{container_ip}:8000/v1",
    api_key="not-needed-for-local",  # Local NIM doesn't require auth
)

# You can reference how we called this model via the API
# client = OpenAI(
#     base_url="https://integrate.api.nvidia.com/v1",
#     api_key=nvidia_api_key
# )

# Example: Streaming response
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Write a short poem about AI"}
    ],
    stream=True
)

print("Streaming response:")
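for chunk in stream:
    # Some servers send chunks with no choices (e.g. a final usage chunk); skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)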

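Note on output: Your poem will be different from the example shown because LLMs sample their output, so each run produces a different response. The request above does not set `temperature`, so the server's default sampling settings apply and add randomness.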
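
If more repeatable output is wanted, sampling parameters can be passed explicitly. The sketch below only builds the request arguments; `chat_request` is a hypothetical helper, `temperature` and `seed` are standard parameters of the OpenAI chat completions API, and whether a given NIM honors `seed` is an assumption to verify.

```python
def chat_request(prompt, deterministic=False, model="meta/llama-3.1-8b-instruct"):
    """Build keyword arguments for client.chat.completions.create(...)."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if deterministic:
        kwargs["temperature"] = 0  # minimize sampling randomness
        kwargs["seed"] = 42        # assumption: only helps if the server supports seeding
    return kwargs

print(chat_request("Write a short poem about AI", deterministic=True))
```

The result can be unpacked into the call used above, for example `client.chat.completions.create(**chat_request(prompt, deterministic=True))`.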

## Models available in local NIM deployment

This cell checks which models are available in your local NIM deployment.

**What it does:**
- Sends a request to the `/v1/models` endpoint
- Retrieves metadata about the deployed model
- Pretty-prints the information in readable JSON format

**What you'll see:**
- Model ID and version
- Creation timestamp
- Available permissions and settings
- Maximum context length (tokens the model can handle)

This is useful for confirming which model is actually running in your container and verifying the deployment was successful.
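
When only the model names matter, the response can be reduced to a list of IDs. This sketch assumes the OpenAI-style response shape (a `data` list whose entries carry an `id`); `model_ids` is a hypothetical helper and the sample payload is illustrative, not captured output.

```python
def model_ids(models_response):
    """Return the `id` of each entry in an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]

sample = {
    "object": "list",
    "data": [{"id": "meta/llama-3.1-8b-instruct", "object": "model"}],
}
print(model_ids(sample))  # ['meta/llama-3.1-8b-instruct']
```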


In [None]:
# Check available models
response = requests.get(f"http://{container_ip}:8000/v1/models")
models = response.json()
print("Available models:")
print(json.dumps(models, indent=2))

## 5. Clean Up

Before we move on to LoRA fine-tuning, let's properly clean up our deployment. This is important for a few reasons:

1. Frees up GPU memory for our next activities
2. Prevents port conflicts if we redeploy
3. Good practice for resource management

Don't worry - the model remains cached, so if you want to restart this NIM later, it'll start up in seconds, not minutes.
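
Once the container has been stopped and removed, a quick way to confirm that the port is free again is to try connecting to it. This is a rough sketch: `port_in_use` is a hypothetical helper, and it assumes the published port is reachable on the host's loopback address, which may not hold on every cloud instance.

```python
import socket

def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0

print("Port 8000 in use:", port_in_use(8000))
```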

In [None]:
# Stop and remove container
!docker stop {CONTAINER_NAME}
!docker rm {CONTAINER_NAME}

## Summary

You've learned how to:
- Deploy NIMs locally with Docker
- Test deployments

Next: Let's explore LoRA fine-tuning with NeMo!