# Introduction to NVIDIA Inference Microservice (NIM) APIs
NIMs expose an API similar to OpenAI’s, but they are designed to run a plethora of powerful models (such as LLaMA 3, Mistral, or custom vision/speech models), either cloud hosted or on NVIDIA infrastructure. A NIM is a single model plus its software stack, packaged into a container and specifically designed to run on NVIDIA RTX GPUs. NIMs can be used to run LLMs for chat, agents, and any other task that requires inference.

*TODO*: Explain the purpose of this environment: running LLMs locally is difficult without the right hardware, an API key is always needed, etc.

## Content Overview 

- [Prerequisites](#Prerequisites)
- [Basic NIM Call](#Basic-NIM-Call)
- [Try It Yourself](#Try-It-Yourself:-Explore-Another-Chat-Model-from-the-NIM-Catalog)

## Prerequisites
Prior to getting started, you will need an NVIDIA API Key from the NVIDIA API Catalog to access the models used in this notebook.  

Need an API Key? It's Free!
  1. Navigate to **[NVIDIA API Catalog](https://build.nvidia.com/explore/discover)**.
  2. Select any model, such as `llama-3.3-70b-instruct`.
  3. On the right panel above the sample code snippet, click on "Get API Key". This will prompt you to log in if you have not already.

In [None]:
import os
import getpass
from dotenv import load_dotenv

# Load environment variables from a .env file if available
load_dotenv()

# Try to get the API key from the environment
api_key = os.getenv("NVIDIA_API_KEY")

# Prompt if not set or invalid
if not api_key or not api_key.startswith("nvapi-"):
    print("NVIDIA API key not found or invalid.")
    api_key = getpass.getpass("🔐 Enter your NVIDIA API key: ").strip()
    if not api_key.startswith("nvapi-"):
        raise ValueError(f"{api_key[:5]}... is not a valid NVIDIA API key")
    # Set in environment for the current session
    os.environ["NVIDIA_API_KEY"] = api_key

## Basic NIM Call
Let's perform a basic API call to a NIM to learn about its format.

In [None]:
# NVIDIA NIM supports the OpenAI-compatible API interface, so this client works with NIM too.
from openai import OpenAI

# Initialize the client
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

- `base_url`: Tells the OpenAI client to send requests to NVIDIA’s NIM endpoint instead of OpenAI’s.
- `api_key`: Used for authentication. It is required whenever you call the endpoint from outside NVIDIA’s internal systems, which includes this course.

### Create a Chat Completion Request

In [None]:
completion = client.chat.completions.create(
  model="meta/llama-3.3-70b-instruct",
    
  messages=[{
      "role":"user",
      "content":"Tell me about Dumbledore."
  }],
    
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

- `model`: The model you want to use. In this case, it’s LLaMA 3.3 70B Instruct, hosted via NIM.
- `messages`: Follows the role-based chat message format of the OpenAI Chat Completions API, where you provide a list of messages, each with a role and its content. You can have:
    - role: "system" – sets overall behavior
    - role: "user" – the actual input or question
    - role: "assistant" – prior responses (for context)
- `temperature`: Controls randomness. Lower = more deterministic.
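- `top_p`: Controls nucleus sampling: the model samples only from the smallest set of tokens whose cumulative probability reaches `top_p`. It is an alternative to `temperature` for limiting randomness.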
- `max_tokens`: The maximum number of tokens in the output.
- `stream=True`: Enables streaming responses (partial chunks instead of waiting for the full response).
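
To make the role-based `messages` format more concrete, here is a minimal sketch of a multi-turn conversation. The helper name `build_messages` and the example texts are hypothetical (they are not part of the OpenAI SDK or NIM); the only thing taken from the API is that each message is a dict with a `role` and a `content` key.

```python
def build_messages(system_prompt, history, user_input):
    """Assemble a chat `messages` list (hypothetical helper, not an SDK function).

    history: list of (user_text, assistant_text) pairs from earlier turns.
    """
    # The system message comes first and sets the overall behavior.
    messages = [{"role": "system", "content": system_prompt}]
    # Earlier turns are replayed in order so the model has context.
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The new question always goes last.
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    system_prompt="You are a concise assistant.",
    history=[("Who is Dumbledore?", "The headmaster of Hogwarts.")],
    user_input="Which house was he in?",
)
for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the single-message list used above.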

### Handle & Print Streaming Output

The loop below iterates over each streamed chunk of text as it arrives.
- `chunk.choices[0].delta.content`: Contains only the newly generated text fragment. It can be `None` for chunks that carry no text, hence the check.
- `print(..., end="")`: Prints the text incrementally without adding a newline, creating a fluid chat experience.

In [None]:
for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
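
If you also want the complete reply as a single string (for example, to append it to `messages` as an `assistant` turn), the same loop can accumulate the fragments while printing them. This is only a sketch: `collect_stream` is a hypothetical helper name, and the `fake_chunks` below merely mimic the `choices[0].delta.content` shape so the snippet runs without an API call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print streamed fragments as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        # Defensive: skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text is not None:
            print(text, end="", flush=True)
            parts.append(text)
    print()  # finish the line once the stream ends
    return "".join(parts)


def _fake_chunk(text):
    # Stand-in with the same attribute shape as a streamed chunk (demo assumption).
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_chunks = [_fake_chunk("Albus "), _fake_chunk(None), _fake_chunk("Dumbledore")]
full_text = collect_stream(fake_chunks)
```

With a real request, call `collect_stream(completion)` instead. A stream can be consumed only once, so create a new `completion` before iterating again.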

---
This script:
1. Uses the OpenAI SDK
2. Points it to NVIDIA NIM’s OpenAI-compatible API endpoint
3. Requests a streamed chat response from `LLaMA 3.3 70B`
4. Outputs the text live as it’s generated

## Try It Yourself: Explore Another Chat Model from the NIM Catalog
You’ve just used [meta/llama-3.3-70b-instruct](https://build.nvidia.com/meta/llama-3_3-70b-instruct) for a streaming chat interaction. Now it’s your turn to explore the [NIM catalog](https://build.nvidia.com/search?q=Reasoning) and run a different model!  

Find a chat-capable model, such as:
- [DeepSeek-R1](https://build.nvidia.com/deepseek-ai/deepseek-r1)
- [Qwen2.5 7B Instruct](https://build.nvidia.com/qwen/qwen2_5-7b-instruct)
- [Llama-3.1 Nemotron Ultra 253B](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1)

Update the `model` field below with your chosen model:

In [None]:
completion = client.chat.completions.create(
    
  model="",
    
  messages=[{
      "role":"user",
      "content":"Tell me about Dumbledore."
  }],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
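
To compare several models without copying the cell each time, the request and the print loop can be wrapped in one function. This is a sketch under assumptions: `stream_reply` is a hypothetical helper, it expects the `client` created earlier in this notebook, and the model ID you pass must be one listed in the catalog.

```python
def stream_reply(client, model, prompt, max_tokens=1024):
    """Send one user prompt to `model`, print the reply as it streams, return the text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        top_p=0.7,
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for chunk in completion:
        # Skip chunks without choices or without new text.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
            parts.append(chunk.choices[0].delta.content)
    print()  # end the line after the last fragment
    return "".join(parts)
```

For example, `stream_reply(client, "meta/llama-3.3-70b-instruct", "Tell me about Dumbledore.")` reproduces the earlier cell. Swap in the ID of the model you chose to see how the answers differ.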