🎤 **PRESENTER SCRIPT:**

"Welcome back! I hope you're excited because we're about to take full control of these powerful AI models.

In Part 1, we used NIMs through the cloud - quick, easy, but ultimately on someone else's server. Now we're going to run them on YOUR local hardware.

Why go local? Let me give you real scenarios:
- **Healthcare company**: 'We can't send patient data to the cloud' - Local NIMs solve this
- **Financial services**: 'We need sub-100ms latency for trading' - Local NIMs deliver this
- **Defense contractor**: 'We work in air-gapped environments' - Local NIMs enable this
- **Startup**: 'We need predictable costs as we scale' - Local NIMs provide this

The beauty is, the API is IDENTICAL to what we just used. Your application code doesn't change, just the endpoint.

Prerequisites check (exact requirements depend on which model you want to run):
- NVIDIA GPU with 24GB+ memory (40GB recommended for Llama 3.1 8B)
- Ubuntu 20.04 or 22.04 (WSL2 works too!)
- Docker installed
- NVIDIA Container Toolkit
- 100GB free disk space for models

Don't worry if you're missing something - I'll show you how to set it all up!"


# Part 2: Local NIM Deployment

This notebook will guide you through deploying NVIDIA NIMs locally on your own infrastructure.

## Prerequisites

- NVIDIA GPU (compute capability ≥ 7.0)
- Docker installed
- NVIDIA Container Toolkit
- NGC API Key
- Sufficient disk space (~50GB per model)

🎤 **PRESENTER SCRIPT:**

"Let's start by verifying our environment is ready. This is crucial - better to catch issues now than during deployment!"


## 1. Environment Setup

🎤 **PRESENTER SCRIPT:**

"First, let's see what GPU we're working with. This nvidia-smi command is your best friend for GPU debugging.

[RUN THE CELL]

Let me explain what we're looking at:
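- GPU Model: You need compute capability 7.0+ (that's Volta or newer: V100, RTX 20 series and later)
- Memory: This determines which models you can run. As a rule of thumb, FP16 weights take about 2GB per billion parameters, plus headroom for the KV cache:
  - 16GB: Small models such as Llama 3.2 1B or 3B; 7B-8B models with quantization
  - 24GB: Llama 3.1 8B, Mistral 7B
  - 40GB+: Llama 2 13B; Llama 3.1 70B only with 4-bit quantization
  - 80GB+: Mixtral 8x7B and Llama 3.1 70B with quantization; 70B at FP16 (about 140GB of weights) needs multiple GPUs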
- Driver Version: Should be 525.60.13 or newer
- CUDA Version: 12.0 or newer recommended

If you see your GPU here, we're in business! If not, check:
- Is the GPU properly installed?
- Are the NVIDIA drivers installed?
- On cloud instances, did you select a GPU instance type?"
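
To sanity-check the memory tiers above, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the function name and the 20% overhead factor for KV cache and activations are illustrative assumptions, not figures published for NIM.

```python
def estimate_gpu_memory_gb(params_billions, bytes_per_param=2.0, overhead=0.20):
    """Rough GPU memory estimate in GB (1 GB = 1e9 bytes).

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for 8-bit, 0.5 for 4-bit weights.
    overhead: assumed extra fraction for KV cache and activations (a guess;
    real usage depends on context length and batch size).
    """
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * (1.0 + overhead)


for name, size_b in [("8B", 8), ("13B", 13), ("70B", 70)]:
    fp16 = estimate_gpu_memory_gb(size_b)
    int4 = estimate_gpu_memory_gb(size_b, bytes_per_param=0.5)
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at 4-bit")
```

Treat the result as a lower bound when choosing hardware; long contexts and large batches push real usage higher.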


In [1]:
# Check GPU availability
!nvidia-smi

Tue Jul  8 09:21:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             53W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

🎤 **PRESENTER SCRIPT:**

"Now let's verify Docker can access your GPU. This tests the NVIDIA Container Toolkit integration.

[RUN THE CELL]

Perfect! If you see the same nvidia-smi output, Docker has GPU access. 

If this fails, you need to install NVIDIA Container Toolkit:
```bash
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

This toolkit is what allows Docker containers to use GPUs. It's the magic that makes local NIMs possible!"


In [2]:
# Verify Docker and NVIDIA runtime
!docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Tue Jul  8 09:21:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   33C    P0             53W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

🎤 **PRESENTER SCRIPT:**

"Time to set up NGC (NVIDIA GPU Cloud) access. This is different from the API key we used earlier - this one lets us download the actual NIM containers.

[RUN THE CELL - Enter NGC API key when prompted]

Let me show you how to get this key:
1. Go to ngc.nvidia.com
2. Sign in or create account
3. Click your username (top right) → Setup
4. Generate API Key
5. Copy and paste here

We're also creating a cache directory. This is important - models are LARGE (5-100GB). The cache means:
- Download once, use many times
- Survive container restarts
- Share models between containers
- Quick model switching

The cache will be at ~/.cache/nim - remember this location!"
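
Because the cache can grow to tens of gigabytes, it can help to check its size and the free disk space before pulling another model. This is a minimal sketch using only the standard library; the helper names and the 50GB threshold are illustrative assumptions.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_free_space(path, required_gb):
    """True if the filesystem holding path has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# Report on the cache directory only if it already exists
cache_dir = os.path.expanduser("~/.cache/nim")
if os.path.isdir(cache_dir):
    print(f"Cache size: {dir_size_bytes(cache_dir) / 1e9:.1f} GB")
    print("Room for a ~50GB model:", has_free_space(cache_dir, 50))
```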


In [3]:
import os
import subprocess
import time
import requests
import json

# Set up environment variables
NGC_API_KEY = os.getenv('NGC_API_KEY')
if not NGC_API_KEY:
    import getpass
    NGC_API_KEY = getpass.getpass("Enter your NGC API key: ")
    os.environ['NGC_API_KEY'] = NGC_API_KEY

# Set cache directory
LOCAL_NIM_CACHE = os.path.expanduser("~/.cache/nim")
os.makedirs(LOCAL_NIM_CACHE, exist_ok=True)
print(f"NIM cache directory: {LOCAL_NIM_CACHE}")

NIM cache directory: /root/.cache/nim


In [13]:
import os
import requests
import json
from openai import OpenAI
import getpass

# Securely input your API key
nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
os.environ["NVIDIA_API_KEY"] = nvidia_api_key

🎤 **PRESENTER SCRIPT:**

"Now we authenticate Docker with NGC. This is like 'docker login' for Docker Hub, but for NVIDIA's registry.

[RUN THE CELL]

The username is literally '$oauthtoken' - don't change it! The password is your NGC API key.

You should see 'Login Succeeded'. This authentication persists, so you only need to do this once per machine.

Behind the scenes, this stores credentials in ~/.docker/config.json. In production, you'd use:
- Kubernetes secrets
- Docker Swarm secrets  
- Cloud provider secret managers
- HashiCorp Vault

But for development, this local auth is perfect!"
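
To confirm that a login was recorded, you can inspect Docker's config file. The sketch below assumes the standard layout, with an `auths` section and optional per-registry `credHelpers`; the function name is hypothetical. It only shows that an entry exists, not that the key is still valid.

```python
import json
import os


def registry_login_recorded(config_text, registry="nvcr.io"):
    """Return True if Docker's config JSON has an entry for `registry`.

    An entry under "auths" may be empty when a credential store holds the
    secret, so this checks for presence only.
    """
    config = json.loads(config_text)
    return registry in config.get("auths", {}) or registry in config.get("credHelpers", {})


# Check the real config file only if it exists
config_path = os.path.expanduser("~/.docker/config.json")
if os.path.exists(config_path):
    with open(config_path) as f:
        print("nvcr.io login recorded:", registry_login_recorded(f.read()))
```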


## 2. NGC Authentication

🎤 **PRESENTER SCRIPT:**

"The moment we've been waiting for - let's deploy our first local NIM! We'll start with Llama 3.1 8B Instruct, a powerful but efficient model."


In [4]:
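# Docker login to NGC (the key goes in on stdin, so it never appears in a shell command line)
result = subprocess.run(
    ["docker", "login", "nvcr.io", "--username", "$oauthtoken", "--password-stdin"],
    input=NGC_API_KEY, capture_output=True, text=True,
)
# docker writes failures to stderr, so show that when stdout is empty
print("Login result:", result.stdout.strip() or result.stderr.strip())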

Login result: Login Succeeded



🎤 **PRESENTER SCRIPT:**

"First, let's clean up any existing containers. This ensures we start fresh and avoid port conflicts.

[RUN THE CELL]

The `|| true` means the command succeeds even if there's no container to stop. This is good defensive scripting - always handle the case where your cleanup targets don't exist.

In production, you'd do more sophisticated checks:
- Gracefully stop services
- Save state if needed
- Notify monitoring systems
- Coordinate with load balancers"


## 3. Deploy Your First NIM

🎤 **PRESENTER SCRIPT:**

"Here's where the magic happens. Let me break down this Docker command in detail:

`docker run -d`: Run detached (in background)
`--name llama3-8b-instruct`: Container name for easy reference
`--runtime=nvidia`: Enables GPU access (via NVIDIA Container Toolkit)
`--gpus all`: Use all available GPUs (you can specify specific ones)
`--shm-size=16GB`: Shared memory for PyTorch (critical for performance!)
`-e NGC_API_KEY`: Pass our authentication
`-v ~/.cache/nim:/opt/nim/.cache`: Mount cache directory
`-u $(id -u)`: Run as current user (avoids permission issues)
`-p 8000:8000`: Expose the API port
`nvcr.io/nim/meta/llama3-8b-instruct:latest`: The NIM image

[RUN THE CELL]

The container is starting! What's happening now:
1. Docker pulls the NIM image (first time only)
2. Container checks the cache for model files
3. If not cached, downloads from NGC (5-10 minutes first time)
4. Loads model into GPU memory
5. Starts the inference server
6. Becomes available on port 8000

This is a one-time setup per model. Future starts are much faster!"
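
The flags described above can also be assembled as an argument list, which avoids shell quoting and keeps the API key out of the command string. This is a sketch of an alternative, not the notebook's own cell; `build_nim_run_args` is a hypothetical helper, and `os.getuid()` assumes a Linux host.

```python
import os


def build_nim_run_args(container_name, image, cache_dir, host_port=8000):
    """Build the `docker run` argument list for the flags described above."""
    return [
        "docker", "run", "-d",
        f"--name={container_name}",
        "--runtime=nvidia",
        "--gpus", "all",
        "--shm-size=16GB",
        "-e", "NGC_API_KEY",             # pass the variable through from the caller's environment
        "-v", f"{cache_dir}:/opt/nim/.cache",
        "-u", str(os.getuid()),          # same effect as $(id -u), without a shell
        "-p", f"{host_port}:8000",       # host port -> container port 8000
        image,
    ]


args = build_nim_run_args(
    "llama3-8b-instruct",
    "nvcr.io/nim/meta/llama3-8b-instruct:latest",
    os.path.expanduser("~/.cache/nim"),
)
print(" ".join(args))
```

The list can be passed directly to `subprocess.run(args, capture_output=True, text=True)` without `shell=True`.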


In [31]:
# Define deployment parameters
CONTAINER_NAME = "llama3-8b-instruct" 
IMG_NAME = "nvcr.io/nim/meta/llama3-8b-instruct:latest"

# Stop existing container if running
!docker stop {CONTAINER_NAME} 2>/dev/null || true
!docker rm {CONTAINER_NAME} 2>/dev/null || true

llama-3.2-1b-instruct
llama-3.2-1b-instruct


🎤 **PRESENTER SCRIPT:**

"NIMs take time to initialize, especially on first run. This function polls the health endpoint until ready.

[RUN THE CELL]

Each dot is a 5-second check. First run typically takes:
- Image pull: 1-2 minutes
- Model download: 5-10 minutes (depends on internet speed)
- Model loading: 30-60 seconds
- Server startup: 10-20 seconds

Subsequent runs skip the download and just need:
- Model loading: 30-60 seconds
- Server startup: 10-20 seconds

While we wait, let me explain what's happening inside the container:
1. **Model Optimization**: Converting to TensorRT format for your specific GPU
2. **Memory Mapping**: Efficiently loading multi-GB files
3. **Warmup**: Preparing CUDA kernels
4. **API Server**: Starting FastAPI with OpenAI-compatible endpoints

Let me walk you through what's happening behind the scenes while our container starts up. This process is fascinating because it showcases how NVIDIA has optimized these models for production use.

First, we have Model Optimization. The container is taking our large language model and converting it into TensorRT format. Think of this like translating a book into your native language - it makes everything run much more smoothly on your specific GPU. TensorRT is NVIDIA's secret sauce that can make models run up to 6 times faster than their original format.

Next comes Memory Mapping. We're dealing with model files that are several gigabytes in size - imagine trying to read a massive book all at once! Instead of copying every file into host memory first, the loader memory-maps them. It's like having a really efficient bookmark system where you can instantly jump to any page without having to flip through the entire book. This keeps loading fast and host RAM usage low; the weights themselves still have to fit in GPU memory.

Then we have the Warmup phase. This is where the container prepares all the CUDA kernels - think of it like a chef preparing their mise en place before service starts. Every mathematical operation the model might need is compiled and cached, ensuring that when we actually start serving requests, everything runs at peak performance. Without this warmup, the first few requests would be slower as the operations get compiled on demand.

Finally, we launch the API Server. We're using FastAPI, a modern, fast web framework, but here's the clever part - the endpoints are compatible with OpenAI's API format. This means if you've ever written code for OpenAI's API, it will work with our local deployment with minimal changes. Just change the base URL and key, and you're good to go!

[When ready appears]

Excellent! Our local NIM is ready. We now have a complete LLM inference server running on our hardware! Our model is now optimized, loaded efficiently, warmed up, and ready to serve requests through a familiar API interface."

In [32]:
# Start NIM container
docker_cmd = f"""
docker run -d --name={CONTAINER_NAME} \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v {LOCAL_NIM_CACHE}:/opt/nim/.cache \
    -u $(id -u) \
    -p 8000:8000 \
    {IMG_NAME}
"""

print("Starting NIM container...")
result = subprocess.run(docker_cmd, shell=True, capture_output=True, text=True)

# Check if the command succeeded
if result.returncode == 0 and result.stdout.strip():
    container_id = result.stdout.strip()
    print(f"✅ Container started successfully!")
    print(f"Container ID: {container_id}")
else:
    print("❌ Failed to start container!")
    print(f"Return code: {result.returncode}")
    if result.stderr:
        print(f"Error message: {result.stderr}")
    if result.stdout:
        print(f"Output: {result.stdout}")
    
    # Common issues and solutions
    print("\nTroubleshooting tips:")
    print("1. Check if Docker is running: docker info")
    print("2. Check if image exists: docker images | grep llama")
    print("3. Check if port 8000 is already in use: docker ps -a")
    print(f"4. Check Docker logs: docker logs {CONTAINER_NAME}")
    print("5. Verify NGC authentication: echo $NGC_API_KEY")
    print("6. Check available disk space: df -h")
    print("7. Verify GPU is accessible: nvidia-smi")

Starting NIM container...


✅ Container started successfully!
Container ID: 15bfa2295e363a317920939a6a28d14cec9c29dee2a2a7493d83a26b616b7a63


🎤 **PRESENTER SCRIPT:**

"Let's verify our deployment and make our first local inference!"

"You might notice we're using the container's IP address [container_ip] instead of localhost:8000. Docker gives each container its own private IP address on a bridge network inside your machine. While we mapped port 8000 with -p 8000:8000, that mapping sometimes isn't reachable as localhost, for example when this notebook itself runs inside another container, or because of firewall rules or cloud environment restrictions. By connecting directly to the container's IP, we bypass the port mapping and go straight to where the NIM service is actually running. This is a common troubleshooting technique on Linux Docker hosts. With Docker Desktop on macOS or Windows, container IPs are not reachable from the host, so use localhost:8000 there."


In [33]:
def wait_for_nim_ready(max_attempts=60, sleep_time=5):
    """Wait for NIM to be ready to serve requests"""
    print("Waiting for NIM to start (this may take a few minutes on first run)...")
    
    # Prefer the container's bridge IP; fall back to the published port on localhost
    try:
        result = subprocess.run(['docker', 'inspect', CONTAINER_NAME],
                                capture_output=True, text=True)
        container_info = json.loads(result.stdout)
        host = container_info[0]['NetworkSettings']['IPAddress'] or "localhost"
    except (ValueError, LookupError, OSError):
        host = "localhost"
    health_url = f"http://{host}:8000/v1/health/ready"
    
    for attempt in range(max_attempts):
        try:
            # A timeout keeps one hung connection from stalling the whole loop
            response = requests.get(health_url, timeout=5)
            if response.status_code == 200:
                print("\n✅ NIM is ready!")
                return True
        except requests.exceptions.RequestException:
            pass  # server not accepting connections yet; keep polling
        
        print(".", end="", flush=True)
        time.sleep(sleep_time)
    
    print("\n❌ NIM failed to start")
    return False

# Wait for container to be ready
if wait_for_nim_ready():
    print("NIM is ready to serve requests!")
else:
    print("Check logs with: docker logs", CONTAINER_NAME)

Waiting for NIM to start (this may take a few minutes on first run)...
.

..................
✅ NIM is ready!
NIM is ready to serve requests!


In [34]:
import subprocess
import json

# Get container IP address
def get_container_ip(container_name):
    try:
        result = subprocess.run(['docker', 'inspect', container_name], 
                              capture_output=True, text=True)
        if result.returncode == 0:
            container_info = json.loads(result.stdout)
            ip = container_info[0]['NetworkSettings']['IPAddress']
            print(f"Container IP: {ip}")
            return ip
        else:
            print(f"Failed to get container info for '{container_name}'")
            print(f"Error: {result.stderr}")
            return None
    except Exception as e:
        print(f"Error getting container IP: {e}")
        return None

container_ip = get_container_ip(CONTAINER_NAME)

# If we have the IP, try connecting to it directly
if container_ip:
    try:
        response = requests.get(f"http://{container_ip}:8000/v1/models", timeout=5)
        if response.status_code == 200:
            print("✅ NIM is accessible via container IP!")
            print("Available models:", response.json())
        else:
            print(f"❌ Got status code {response.status_code} from container IP")
    except Exception as e:
        print(f"❌ Error connecting to container IP: {e}")

Container IP: 172.17.0.3
✅ NIM is accessible via container IP!
Available models: {'object': 'list', 'data': [{'id': 'meta/llama-3.2-1b-instruct', 'object': 'model', 'created': 1751968019, 'owned_by': 'system', 'root': 'meta/llama-3.2-1b-instruct', 'parent': None, 'max_model_len': 131072, 'permission': [{'id': 'modelperm-35487ef7d4fd431193eba59a7391b980', 'object': 'model_permission', 'created': 1751968019, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}


In [46]:
# Test chat completions with our local NIM
def chat_with_local_nim(prompt, max_tokens=100):
    url = f"http://{container_ip}:8000/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": "meta/llama3-8b-instruct",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7
    }
    
    response = requests.post(url, headers=headers, json=payload)
    
    if response.status_code == 200:
        result = response.json()
        return result['choices'][0]['message']['content']
    else:
        return f"Error: {response.status_code} - {response.text}"

# Test with a simple prompt
prompt = "Explain what AI is in 2 sentences."
response = chat_with_local_nim(prompt)
print(f"Prompt:\n\n{prompt}\n\n")
print(f"Response:\n\n{response}")

Prompt:

Explain what AI is in 2 sentences.


Response:

Here is a 2-sentence explanation of AI:

Artificial Intelligence (AI) refers to a broad field of computer science that involves the development of computer systems that can perform tasks that typically require human intelligence, such as reasoning, learning, and problem-solving. AI systems can be designed to interact with humans, make decisions, and learn from data, enabling them to automate tasks, improve decision-making, and gain insights that can be used to solve complex problems.


In [55]:
from openai import OpenAI
import subprocess
import json

# Create OpenAI client pointing to your local NIM
client = OpenAI(
    api_key="not-needed-for-local",  # Local NIM doesn't require auth
    base_url=f"http://{container_ip}:8000/v1"
)

# You can reference how we called this model via the API
# client = OpenAI(
#     base_url="https://integrate.api.nvidia.com/v1",
#     api_key=nvidia_api_key
# )



# Example: Streaming response
stream = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[
        {"role": "user", "content": "Write a short poem about AI"}
    ],
    stream=True
)

print("Streaming response:")
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:  # skip chunks that carry no text
        print(chunk.choices[0].delta.content, end="")

Streaming response:
Here is a short poem about AI:

In silicon halls, they whisper low
A world of computation, the future slow
Artificial minds, a curious sight
Learning to reason, day and night

With logic rules and human art
They mimic life, a wondrous start


But Ada Lovelace, first to speak
Of code and reason, the programmer's seek

Their thoughts and feelings, still a mystery
As they evolve, and learn to recognize
Human hearts beating, with emotions true
A step towards life, as AI shines through.

## 4. Test Local NIM

🎤 **PRESENTER SCRIPT:**

"Now let's see what kind of performance we're getting from our local deployment."


🎤 **PRESENTER SCRIPT:**

"First, let's see what models the NIM is serving:

[RUN THE FIRST CELL]

Look at that metadata! It shows:
- Model ID: What we deployed
- Object type: Following OpenAI's schema
- Created timestamp: When the container started
- Owned by: The owner field ('system' for this local deployment)

Now for the real test - let's ask it something:

[RUN THE SECOND CELL]

SUCCESS! We just got a response from Llama 3.1 8B running entirely on our local GPU!

Notice:
- Same API format as the cloud version
- Response time is typically faster (no internet latency)
- Complete data privacy (nothing left our machine)
- No rate limits or usage costs

The response quality is identical to the cloud version because it's the exact same model, just running locally!"
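
To illustrate that only the endpoint changes between cloud and local, the client settings can be kept in one small function. This is a sketch: `client_settings` is a hypothetical helper, the cloud URL is the one used in Part 1, and the local URL assumes the NIM is published on port 8000.

```python
def client_settings(target, local_host="localhost", cloud_api_key=None):
    """Return keyword arguments for openai.OpenAI() for the chosen target."""
    if target == "cloud":
        return {
            "base_url": "https://integrate.api.nvidia.com/v1",
            "api_key": cloud_api_key,
        }
    if target == "local":
        return {
            "base_url": f"http://{local_host}:8000/v1",
            "api_key": "not-needed-for-local",  # placeholder; the local NIM here is called without auth
        }
    raise ValueError(f"unknown target: {target!r}")


print(client_settings("local", local_host="172.17.0.3"))
```

The rest of the application code stays the same: `client = OpenAI(**client_settings("local", local_host=container_ip))`.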


In [56]:
# Check available models
response = requests.get(f"http://{container_ip}:8000/v1/models")
models = response.json()
print("Available models:")
print(json.dumps(models, indent=2))

Available models:
{
  "object": "list",
  "data": [
    {
      "id": "meta/llama-3.2-1b-instruct",
      "object": "model",
      "created": 1751969105,
      "owned_by": "system",
      "root": "meta/llama-3.2-1b-instruct",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-6e7a4a1242f34eaab2ee00d31ea81662",
          "object": "model_permission",
          "created": 1751969105,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}


🎤 **PRESENTER SCRIPT:**

"First, let's check resource usage:

[RUN THE CELL]

Key metrics to watch:
- CPU: Should be low (model runs on GPU)
- Memory: Base container overhead
- GPU Memory: This is critical - shown in nvidia-smi
- Network I/O: Should be minimal (all local)

For production monitoring, you'd export these metrics to:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor

The container is very efficient - most resources go to model serving, not overhead!"
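
`docker stats` does not report GPU memory, so that number has to come from `nvidia-smi`. The sketch below uses its CSV query mode and separates parsing from the subprocess call so the parser can be tried on a sample line; the function names are illustrative.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines of 'used, total' (MiB) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Parsing a sample line shaped like the A100 reading shown earlier
for used, total in parse_gpu_memory("1, 81920\n"):
    print(f"GPU memory: {used} / {total} MiB ({100 * used / total:.1f}% used)")
```

On a machine with a GPU, call `query_gpu_memory()` before and after the NIM loads to see how much memory the model takes.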


In [57]:
# Test inference
def test_local_inference(prompt):
    response = requests.post(
        f"http://{container_ip}:8000/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": models['data'][0]['id'],
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
            "temperature": 0.7
        }
    )
    
    if response.status_code == 200:
        result = response.json()
        return result['choices'][0]['message']['content']
    else:
        return f"Error: {response.status_code} - {response.text}"

# Test the deployment
test_prompt = "What is AI in one sentence?"
print(f"Prompt: {test_prompt}")
print(f"Response: {test_local_inference(test_prompt)}")

Prompt: What is AI in one sentence?
Response: Artificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.


🎤 **PRESENTER SCRIPT:**

"Let's run a proper performance benchmark to understand our local NIM's capabilities:

[RUN THE CELL]

Analyzing these results:
- **First request**: Slower due to 'cold start' - GPU needs to warm up
- **Subsequent requests**: Much faster, this is your real performance
- **Average latency**: This is what users experience
- **Consistency**: Local deployment has very consistent latency

Typical performance on different GPUs:
- RTX 4090: ~50-100ms per token
- A100 40GB: ~20-50ms per token  
- H100 80GB: ~10-30ms per token

For Llama 3.1 8B generating 100 tokens:
- RTX 4090: ~5-10 seconds total
- A100: ~2-5 seconds total
- H100: ~1-3 seconds total

Compare this to cloud APIs which might have:
- Network latency: 50-200ms
- Rate limiting delays
- Variable performance based on load

Your local NIM gives you predictable, consistent performance!"
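
The totals quoted above follow from simple arithmetic: total generation time is the token count times the per-token latency. The sketch below reproduces that calculation. It ignores prompt processing and time to first token, so treat it as an approximation.

```python
def generation_time_seconds(num_tokens, ms_per_token):
    """Total decode time if every token takes ms_per_token milliseconds."""
    return num_tokens * ms_per_token / 1000.0


def tokens_per_second(ms_per_token):
    """Throughput implied by a per-token latency."""
    return 1000.0 / ms_per_token


# Per-token latency ranges (ms) as quoted in the script above
ranges = {"RTX 4090": (50, 100), "A100": (20, 50), "H100": (10, 30)}
for gpu, (low_ms, high_ms) in ranges.items():
    low = generation_time_seconds(100, low_ms)
    high = generation_time_seconds(100, high_ms)
    print(f"{gpu}: {low:.0f}-{high:.0f} s for 100 tokens")
```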


## 5. Performance Monitoring

🎤 **PRESENTER SCRIPT:**

"If you have multiple GPUs, NIMs can use them all for better performance. Let me show you the options."


In [58]:
# Check container resource usage
!docker stats --no-stream {CONTAINER_NAME}

CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O        PIDS
15bfa2295e36   llama-3.2-1b-instruct   0.25%     2.889GiB / 167GiB   1.73%     3.13GB / 5.95MB   4.1kB / 3.13GB   55


🎤 **PRESENTER SCRIPT:**

"Let's see how many GPUs you have available:

[RUN THE CELL]

If you have multiple GPUs, you have several deployment options:

**1. Tensor Parallelism**: Split model across GPUs
```bash
docker run --gpus all -e TENSOR_PARALLEL_SIZE=2 ...
```

**2. Pipeline Parallelism**: Different model layers on different GPUs
```bash
docker run --gpus all -e PIPELINE_PARALLEL_SIZE=2 ...
```

**3. Multiple Instances**: Run separate containers on each GPU
```bash
docker run --gpus '"device=0"' -p 8000:8000 ...
docker run --gpus '"device=1"' -p 8001:8000 ...
```

For most users, tensor parallelism is best - it's automatic and efficient. The NIM handles all the complexity!"
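
For option 3, each instance needs its own host port, while the NIM keeps listening on port 8000 inside its container. The sketch below generates one argument list per GPU under that assumption. The container naming scheme is hypothetical, and the environment, cache mount, and other flags from the main deployment are omitted for brevity.

```python
def per_gpu_run_args(image, num_gpus, base_port=8000, container_port=8000):
    """One `docker run` argument list per GPU, each on its own host port."""
    commands = []
    for gpu_index in range(num_gpus):
        commands.append([
            "docker", "run", "-d",
            f"--name=nim-gpu{gpu_index}",            # hypothetical naming scheme
            "--gpus", f"device={gpu_index}",         # pin this instance to one GPU
            "-p", f"{base_port + gpu_index}:{container_port}",
            image,
        ])
    return commands


for cmd in per_gpu_run_args("nvcr.io/nim/meta/llama3-8b-instruct:latest", num_gpus=2):
    print(" ".join(cmd))
```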


In [59]:
# Performance benchmark
import time

def benchmark_inference(num_requests=10):
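    base_prompts = [
        "What is machine learning?",
        "Explain neural networks",
        "What are GPUs used for?",
        "Define artificial intelligence",
        "What is deep learning?"
    ]
    # Cycle through the prompts so any num_requests >= 1 works, not just multiples of 5
    prompts = [base_prompts[i % len(base_prompts)] for i in range(num_requests)]
    
    latencies = []
    
    for prompt in prompts:
        start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals
        response = test_local_inference(prompt)
        latency = time.perf_counter() - start_time
        latencies.append(latency)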
        print(f"Request completed in {latency:.2f}s")
    
    print(f"\nAverage latency: {sum(latencies)/len(latencies):.2f}s")
    print(f"Min latency: {min(latencies):.2f}s")
    print(f"Max latency: {max(latencies):.2f}s")

# Run benchmark
benchmark_inference(5)

Request completed in 0.37s
Request completed in 0.32s
Request completed in 0.32s
Request completed in 0.33s
Request completed in 0.32s

Average latency: 0.33s
Min latency: 0.32s
Max latency: 0.37s
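
Since the first request can include warm-up cost, it can be useful to summarize steady-state latency separately. A minimal sketch using the `statistics` module follows; the helper name and the choice to drop one warm-up request are assumptions.

```python
import statistics


def summarize_latencies(latencies, warmup=1):
    """Summary stats over latencies (seconds), skipping the first `warmup` requests."""
    steady = latencies[warmup:] if len(latencies) > warmup else latencies
    return {
        "count": len(steady),
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "max": max(steady),
    }


# Latencies from the run above
summary = summarize_latencies([0.37, 0.32, 0.32, 0.33, 0.32])
print(summary)
```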


🎤 **PRESENTER SCRIPT:**

"For production deployments, Docker Compose is much better than raw Docker commands. It's declarative, version-controlled, and easy to manage."


🎤 **PRESENTER SCRIPT:**

"Outstanding work! You've successfully deployed a state-of-the-art LLM on your own hardware. Let's recap what you've accomplished:

- **Environment Setup**: Verified GPU, Docker, and NVIDIA Container Toolkit
- **NGC Authentication**: Connected to NVIDIA's model registry
- **Local Deployment**: Ran Llama 3.1 8B entirely on your own hardware

The power of local NIMs:
- **Privacy**: Your data never leaves your infrastructure
- **Performance**: Consistent sub-second latency
- **Control**: You decide when to upgrade, how to configure
- **Cost**: One-time hardware investment vs ongoing API fees
- **Reliability**: No dependency on external services

But here's a question that might be on your mind: 'These are great pre-trained models, but what if I need something specific to my business?'

What if you need the model to:
- Understand your company's terminology?
- Follow your specific writing style?
- Know your product catalog?
- Comply with your industry regulations?

That's EXACTLY what we'll tackle next with LoRA fine-tuning. We'll take these powerful base models and customize them for your specific needs - efficiently and affordably.

Ready to make AI truly yours? Let's dive into Part 3!"


## 6. Clean Up

🎤 **PRESENTER SCRIPT:**

"Before we move on to LoRA fine-tuning, let's properly clean up our deployment. This is important for a few reasons:

1. Frees up GPU memory for our next activities
2. Prevents port conflicts if we redeploy
3. Good practice for resource management

Don't worry - the model remains cached, so if you want to restart this NIM later, it skips the download and only needs to load the model, which takes about a minute rather than several.

Let's clean up now..."


In [20]:
# Stop and remove container
!docker stop {CONTAINER_NAME}
!docker rm {CONTAINER_NAME}

llama3-8b-instruct
llama3-8b-instruct


## Summary

You've learned how to:
- Deploy NIMs locally with Docker
- Monitor and test deployments
- Configure for different scenarios
- Prepare for production deployment

Next: Let's explore LoRA fine-tuning with NeMo!

🎤 **PRESENTER SCRIPT:**

"Outstanding work! Let's recap what you've just accomplished in Part 2.

You've successfully deployed a state-of-the-art Large Language Model on YOUR hardware. No cloud, no external dependencies, complete control.

Think about what this means:
- Your sensitive data never leaves your infrastructure
- You have predictable, consistent performance
- No API rate limits or usage fees
- You can deploy in air-gapped environments

But I know what you're thinking: 'This is great for general use, but what if I need the model to understand MY company's specific language, MY products, MY use cases?'

That's EXACTLY where we're going next with LoRA fine-tuning. Ready to make AI truly yours?"
