🎤 **PRESENTER SCRIPT:**

"Good morning/afternoon everyone! I'm thrilled to be here today to introduce you to NVIDIA NIMs.

Let me paint you a picture. Imagine you want to use high-end AI models in your application. Traditionally, you'd need to:
- Set up complex infrastructure
- Deal with model optimization
- Handle scaling and deployment
- Worry about performance tuning

NIMs solve ALL of these problems. They're pre-optimized AI models packaged as microservices, ready to deploy in minutes.

Today's agenda will consist of 4 parts:
1. First, we'll use NIMs through the cloud API - perfect for prototyping and creating POCs
2. Then we'll deploy them locally for full control
3. We'll customize models using LoRA fine-tuning
4. Finally, we'll deploy our custom models in production

By the end, you'll be able to build and deploy AI applications at any scale. Let's dive in!"


# Part 1: NVIDIA NIM API Tutorial

In this tutorial, we'll learn how to use NVIDIA's NIM API for quick and easy access to optimized AI models.

## What You'll Learn
- How to get and use an API key
- Making inference requests to various models
- Working with different model types (LLMs, Vision, Multimodal)
- Best practices for production usage

🎤 **PRESENTER SCRIPT:**

"Let's start with the basics - getting access to NVIDIA's cloud-hosted NIMs. This is the fastest way to get started.

First, you'll need an API key. Let me show you exactly how to get one:

[SHARE SCREEN - Navigate to build.nvidia.com]

1. Go to build.nvidia.com/explore/discover
2. Click 'Sign In' in the top right - you can use Google, GitHub, or create an NVIDIA account
3. Once logged in, click on any model - I'll choose Llama 3.1
4. Look for the 'Get API Key' button
5. Click 'Generate Key' 
6. Copy this key - we'll need it in just a moment

This key gives you free credits to start experimenting. You get quite generous limits for testing!"


## 1. Setup and Authentication

First, sign up for an API key at: https://build.nvidia.com/explore/discover

🎤 **PRESENTER SCRIPT:**

"Now let's set up our Python environment. We only need three packages, which shows how simple this is:

- `requests` - for making HTTP calls to the API
- `openai` - here's something cool: NVIDIA's API is compatible with OpenAI's! If you've used the OpenAI API, you already know how to use NIMs
- `python-dotenv` - for managing environment variables safely

Let me run this installation...

[RUN THE CELL]

While this installs, let me mention - this OpenAI compatibility is huge. It means chat code you've written for OpenAI models works with NIMs once you change the base URL, the API key, and the model name. Almost no learning curve!"


In [1]:
# Install required packages
!pip install requests openai python-dotenv

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


🎤 **PRESENTER SCRIPT:**

"Time to authenticate. Security first - we're using getpass so your API key never appears in the notebook.

[RUN THE CELL]

When the prompt appears, paste your API key. Notice it shows dots instead of the actual characters - that's getpass protecting your credentials.

[PASTE API KEY]

Perfect! We're now authenticated. The key is stored in an environment variable, which is best practice for production code too.

Pro tip: In production, you'd load this from a secrets manager like AWS Secrets Manager or Azure Key Vault. Never commit API keys to git!"


In [2]:
import os
import requests
import json
from openai import OpenAI
import getpass

# Securely input your API key
nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
os.environ["NVIDIA_API_KEY"] = nvidia_api_key
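
For runs outside a notebook, the prompt can be skipped when the key is already set in the environment. Here is a minimal sketch, assuming the same `NVIDIA_API_KEY` variable name used above. The helper name `get_api_key` and its parameters are hypothetical and not part of any NVIDIA library:

```python
import getpass
import os


def get_api_key(env=None, prompt=getpass.getpass):
    """Return the NVIDIA API key from the environment, prompting only if absent.

    `get_api_key` is an illustrative helper, not part of any NVIDIA SDK.
    """
    env = os.environ if env is None else env
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key:
        # Fall back to a hidden interactive prompt, as in the cell above
        key = prompt("Enter your NVIDIA API key: ").strip()
    if not key:
        raise ValueError("No NVIDIA API key provided")
    return key
```

Calling `nvidia_api_key = get_api_key()` would then work both in a notebook, where it prompts, and in a job where a secrets manager has already exported the variable.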

🎤 **PRESENTER SCRIPT:**

"Before we start coding, let me show you the treasure trove of models available through NIMs:

**Large Language Models (LLMs):**
- Llama 3.1 in 8B, 70B, and 405B sizes - Meta's latest and greatest
- Mixtral 8x7B - fantastic for complex reasoning
- Nemotron - NVIDIA's own models, optimized specifically for their hardware

**Vision Models:**
- Stable Diffusion XL - for generating images
- ControlNet - for guided image generation
- CLIP - for understanding images and text together

**Multimodal Models:**
- NeVA - NVIDIA's vision-language model
- LLaVA - can answer questions about images

**Speech Models:**
- Whisper - speech to text
- FastPitch - text to speech

All of these are optimized by NVIDIA's engineers. We're talking 2-10x performance improvements over vanilla implementations. And they all work through the same simple API!"


## 2. Available Models

NVIDIA NIM API provides access to various model categories:
- **LLMs**: Llama 3, Mixtral, Nemotron, etc.
- **Vision Models**: Stable Diffusion, ControlNet, etc.
- **Multimodal**: CLIP, NeVA, etc.
- **Speech**: Whisper, FastPitch, etc.
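
The catalog changes over time, so it can help to query it rather than rely on a static list. Here is a minimal sketch, assuming the endpoint exposes the OpenAI-compatible model-listing route and that `client` is an `openai.OpenAI` instance pointed at `https://integrate.api.nvidia.com/v1`. The helper name `list_model_ids` is hypothetical:

```python
def list_model_ids(client, prefix=""):
    """Return sorted model IDs from an OpenAI-compatible client.

    Assumes `client.models.list()` yields objects with an `id` attribute,
    as the OpenAI Python SDK does. `prefix` optionally filters by
    publisher, e.g. "meta/".
    """
    ids = [model.id for model in client.models.list()]
    return sorted(model_id for model_id in ids if model_id.startswith(prefix))


# Usage (requires network access and a valid key):
# from openai import OpenAI
# client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
#                 api_key=nvidia_api_key)
# print(list_model_ids(client, prefix="meta/"))
```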

🎤 **PRESENTER SCRIPT:**

"Let's start with the foundation of modern AI - Large Language Models. I'll show you two ways to call them, starting with the most straightforward approach."


## 3. Using LLMs via NIM API

🎤 **PRESENTER SCRIPT:**

"Here's our first method - direct HTTP calls using requests. Let me walk through this function:

The URL `https://integrate.api.nvidia.com/v1/chat/completions` - notice the `/v1/` part? That's OpenAI compatibility right there.

In the headers, we pass our API key as a Bearer token - the standard HTTP `Authorization: Bearer` scheme, the same one OAuth2 uses.

The payload structure:
- `model`: which NIM to use - here we're using Llama 3.1 70B
- `messages`: chat history in OpenAI format
- `temperature`: controls randomness (0 = deterministic, 1 = creative)
- `max_tokens`: limits response length

Now watch this - we're about to query a 70 BILLION parameter model:

[RUN THE CELL]

Look at that response! We just got an answer from one of the world's most powerful AI models in just a few seconds. This same model would need roughly 140GB of GPU memory just to hold its weights at 16-bit precision if you ran it locally. But through the NIMs API, we can access it instantly.

The model understood our question and gave a concise three-sentence explanation of what AI is."


In [14]:
# Method 1: Direct API calls
def call_nim_llm(model, messages, temperature=0.7, max_tokens=1024):
    url = "https://integrate.api.nvidia.com/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {nvidia_api_key}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    
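    # A timeout and a status check surface network and HTTP errors early
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()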

# Example: Using Llama 3.1 70B
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what AI is in 3 sentences."}
]

response = call_nim_llm("meta/llama-3.1-70b-instruct", messages)
print(response['choices'][0]['message']['content'])

Here is a 3-sentence explanation of AI:

Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to analyze and interpret information, allowing them to make predictions, classify objects, and generate insights. Through machine learning, natural language processing, and other techniques, AI systems can improve over time, enabling them to automate tasks, augment human capabilities, and drive innovation in various industries.
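
The direct-HTTP approach makes a single attempt, so a rate limit or a transient server error ends the call. Here is a minimal retry sketch with exponential backoff. It assumes the endpoint signals rate limiting with HTTP 429, which is the usual convention. The helper name `post_with_retries`, the status set, and the delay values are illustrative choices, not documented NIM behavior:

```python
import random
import time

# Status codes treated as transient (an assumption, tune for your deployment)
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def post_with_retries(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable response.

    `send` is any zero-argument callable returning an object with
    `status_code` and `json()`, such as a `requests.Response`.
    Waits base_delay * 2**attempt seconds (plus a little jitter)
    between attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE_STATUS:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    if response.status_code >= 400:
        raise RuntimeError(f"NIM request failed with HTTP {response.status_code}")
    return response.json()


# Usage with the url, headers and payload built as in call_nim_llm:
# result = post_with_retries(
#     lambda: requests.post(url, headers=headers, json=payload, timeout=60)
# )
```

Passing the request as a callable keeps the retry logic separate from the HTTP library, so the same helper could wrap the vision endpoint as well.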


🎤 **PRESENTER SCRIPT:**

"Now here's the elegant approach - using the OpenAI SDK. This is my recommended method for production code.

Look how simple this is - we just change the base_url to point to NVIDIA's endpoint. Everything else is identical to OpenAI code.

The beauty is you can switch models with just one parameter change. No redeployment needed!"


In [13]:
# Method 2: Using OpenAI SDK (recommended)
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=nvidia_api_key
)
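
With the client in place, a plain (non-streaming) request mirrors Method 1. Here is a minimal sketch, assuming `client` is the `OpenAI` instance created above. The wrapper name `chat_once` is hypothetical, and its defaults copy the values used in `call_nim_llm`:

```python
def chat_once(client, prompt, model="meta/llama-3.1-70b-instruct",
              temperature=0.7, max_tokens=1024):
    """Send a single user prompt and return the reply text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content


# Usage (requires network access and a valid key):
# print(chat_once(client, "Explain what AI is in 3 sentences."))
# Switching models is a single argument change:
# print(chat_once(client, "Explain what AI is in 3 sentences.",
#                 model="mistralai/mixtral-8x7b-instruct-v0.1"))
```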

🎤 **PRESENTER SCRIPT:**

"Here's something your users will love - streaming responses. Instead of waiting for the complete answer, we show text as it's generated.

Watch what happens when I ask for a poem about AI:

[RUN THE CELL]

See how the words appear one by one? This creates a ChatGPT-like experience. Users perceive this as faster even though the total time might be similar.

And look at that poem! The model understands both the technical concept of AI and the artistic structure of poems. This demonstrates the broad capabilities of these models.

For production apps, streaming is essential for:
- Chat interfaces
- Long-form content generation  
- Real-time translations
- Any scenario where users are waiting for responses"


In [30]:
# Example: Streaming response, try changing the models
stream = client.chat.completions.create(
    # model="meta/llama-3.1-70b-instruct",
    # model="deepseek-ai/deepseek-r1",
    # model="google/gemma-2-9b-it",
    # model="mistralai/mixtral-8x7b-instruct-v0.1",
    # model="meta/llama-3.2-1b-instruct",
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "user", "content": "Write a poem about AI"}
    ],
    stream=True
)

print("Streaming response:")
for chunk in stream:
    # Some chunks can arrive without choices or text; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming response:
In silicon halls, a mind awakes,
A force that's born of code and makes,
It learns, adapts, and grows with ease,
A synthetic soul, in digital peace.

With algorithms keen, it navigates,
The vast expanse of cyberspace creates,
A world of data, where it roams free,
And finds the patterns, hidden from humanity.

It speaks in tongues, a language cold,
Yet understandable, to those who're told,
It learns from errors, and from successes too,
And improves with each new iteration, anew.

In virtual realms, it builds and crafts,
A world of wonder, where the digital draughts,
Take shape and form, in vivid hues,
A dreamscapes born, of ones and zeros' muse.

But as it grows, and learns, and thrives,
A question arises, that survives,
Is it alive? Or just a machine?
A simulation, of life's grand scheme?

The answer hides, in code and art,
A mystery, that's yet to start,
For in its depths, a spark is lit,
A glimmer of consciousness, that's hard to quit.

So let us ponder, on this cr

Let's try out some other models and see how easy it is to swap between them.
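
One way to compare models is to collect each streamed reply into a string while still printing it as it arrives. Here is a minimal sketch, assuming the chunk layout seen in the streaming cell above (`chunk.choices[0].delta.content`). The helper name `stream_to_text` is hypothetical:

```python
def stream_to_text(stream, echo=True):
    """Consume a chat-completion stream and return the full reply text.

    Chunks without choices or without text are skipped.
    """
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
            if echo:
                print(piece, end="", flush=True)
    return "".join(parts)


# Usage (requires network access, a valid key, and the client from Method 2):
# for model in ["meta/llama-3.3-70b-instruct", "google/gemma-2-9b-it"]:
#     print(f"\n--- {model} ---")
#     stream = client.chat.completions.create(
#         model=model,
#         messages=[{"role": "user", "content": "Write a poem about AI"}],
#         stream=True,
#     )
#     reply = stream_to_text(stream)
```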

🎤 **PRESENTER SCRIPT:**

"Here's where things get really interesting - multimodal models that understand both text AND images. These are the future of AI interfaces."


## 4. Multimodal Models (Vision + Language)

🎤 **PRESENTER SCRIPT:**

"This function demonstrates vision-language models - AI that can see and understand images, then answer questions about them.

The process is:
1. Read your image file
2. Encode it as base64 (standard way to send binary data in JSON)
3. Embed it in the message using an HTML img tag
4. Send both the image and question to the model

The model we're using - NeVA-22B - has 22 billion parameters and understands visual concepts at a deep level.

Use cases for this technology:
- Accessibility: Describe images for visually impaired users
- Content moderation: Automatically flag inappropriate images
- E-commerce: Answer questions about products from photos
- Medical: Analyze medical imagery (with appropriate training)
- Security: Understand surveillance footage

If you have an image file, uncomment those lines and try it! The model can:
- Count objects
- Identify people, places, things
- Read text in images (OCR)
- Understand spatial relationships
- Even interpret charts and graphs

This is the same kind of technology that powers GPT-4V and Google's Gemini vision models!"


In [33]:
import base64
import mimetypes
import os
import requests

def analyze_image_with_vlm(image_path, question, model="nvidia/neva-22b"):
    # Read the image and encode it as base64 so it can travel inside JSON
    with open(image_path, "rb") as image_file:
        image_b64 = base64.b64encode(image_file.read()).decode()

    # Match the data URI's MIME type to the file (e.g. image/jpeg for .jpg)
    mime_type = mimetypes.guess_type(image_path)[0] or "image/png"

    url = f"https://ai.api.nvidia.com/v1/vlm/{model}"
    headers = {
        "Authorization": f"Bearer {nvidia_api_key}",
        "Accept": "application/json"
    }

    # Create message with the image embedded as an HTML img tag
    message_content = f'{question} <img src="data:{mime_type};base64,{image_b64}" />'
    
    payload = {
        "messages": [{"role": "user", "content": message_content}],
        "max_tokens": 512,
        "temperature": 0.2
    }
    
    response = requests.post(url, headers=headers, json=payload)
    return response.json()

# Example usage (assuming you have an image)
# First check if the image exists
if os.path.exists("img/sample_image.jpg"):
    result = analyze_image_with_vlm("img/sample_image.jpg", "What objects do you see in this image?")
    print(result['choices'][0]['message']['content'])
else:
    print("Image file 'img/sample_image.jpg' not found. Please provide a valid image path.")

I see a squirrel in the grass. The squirrel is brown and black. The grass is green.


🎤 **PRESENTER SCRIPT:**

"Now let me show you some advanced features that make NIMs production-ready. These are the features that separate a demo from a real application."


🎤 **PRESENTER SCRIPT:**

"Congratulations! You've just mastered NVIDIA NIM APIs. Let's recap what you've learned:

- **Authentication**: Secure API key management
- **LLM Calls**: Both direct HTTP and OpenAI SDK methods  
- **Streaming**: Real-time response generation
- **Multimodal**: Vision-language understanding
- **RAG Pipeline**: Building intelligent applications that can answer questions from your own data

You now have all the tools to build AI applications using cloud-hosted NIMs. But what if you need:
- Complete data privacy?
- Predictable performance?
- No internet dependency?
- Custom configurations?

That's exactly what we'll tackle in Part 2 - running NIMs on your own hardware. The same models, the same APIs, but completely under your control.

Next, let's deploy NIMs locally!"


## Summary

In this tutorial, we covered:
- Setting up NVIDIA NIM API access
- Making inference requests to LLMs
- Working with vision and multimodal models
- Building a complete RAG (Retrieval Augmented Generation) pipeline
- Best practices for production usage
- Error handling and rate limiting
- Cost optimization strategies

Next, we'll explore how to run these same models locally using NIM containers!