# VLM NIM Workshop

NVIDIA Inference Microservices (NIMs) are a collection of easy-to-use, API-driven microservices for interacting with AI models.

This workshop will focus on the Visual Language Models that are currently available. 

- <a href=https://build.nvidia.com/nvidia/neva-22b> Neva-22b </a>
- <a href=https://build.nvidia.com/microsoft/microsoft-kosmos-2> Kosmos-2 </a>
- <a href=https://build.nvidia.com/adept/fuyu-8b> Fuyu-8b </a>


To learn more about NIMs visit <a href=https://build.nvidia.com/explore/discover> ai.nvidia.com </a>

This workshop has four parts:

**Part 0.** Setup Environment  
**Part 1.** Text Chat  
**Part 2.** Image Chat  
**Part 3.** Video Understanding

## Part 0: Setup Environment 

Ensure that "fire.mp4" and "test_image.png" are in the same directory as the notebook.

Fill in the *api_key* variable with your API key that was generated from the NIM website. 

In [None]:
api_key = "FIX ME" #FIX ME 
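
As an alternative to pasting the key into the notebook, it could be read from an environment variable so it is not saved with the file. This is a minimal sketch: the variable name `NVIDIA_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not names the workshop defines.

```python
import os

def load_api_key(env_var="NVIDIA_API_KEY", fallback="FIX ME"):
    """Return the key stored in the environment, or the placeholder if it is not set."""
    # NVIDIA_API_KEY is a hypothetical variable name chosen for this example
    return os.environ.get(env_var, fallback)

api_key = load_api_key()
```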

Ensure you have launched the Jupyter notebook with the correct conda environment so all dependencies are present.

In [None]:
!pip install opencv-python requests pillow numpy flask matplotlib

In [None]:
import os, json, io, subprocess, base64
from threading import Thread 
from time import time 

from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import numpy as np
import requests 
import cv2

In [None]:
#Setup VLM NIM Urls 
neva_api_url = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"
kosmos2_api_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/kosmos-2"
fuyu8b_api_url = "https://ai.api.nvidia.com/v1/vlm/adept/fuyu-8b"

In [None]:
total_tokens_used = 0

## Part 1: Text Chat

This section shows a simple example of calling a VLM NIM with a POST request using only text input. The request consists of headers, which include your API key for authorization, and a payload with the content for the VLM.

The API key is presented as a Bearer token. The "Accept" header tells the server that we expect a JSON response. The request body is also JSON; the requests library sets the matching content type for us when the payload is passed through its *json* argument.

In [None]:
headers = {
  "Authorization": f"Bearer {api_key}",
  "Accept": "application/json"
}

The payload is JSON format and follows an API schema similar to the OpenAI API. For full details on the API Spec for NIMs, visit these pages:

- https://docs.api.nvidia.com/nim/reference/nvidia-neva-22b  
- https://docs.api.nvidia.com/nim/reference/microsoft-kosmos-2
- https://docs.api.nvidia.com/nim/reference/adept-fuyu-8b

In [None]:
payload = {
  "messages": [
    {
      "role": "user",
      "content": f'Can you tell me what you are capable of?'
    }
  ],
  "max_tokens": 1024,
  "temperature": 0.20,
  "top_p": 0.70,
  "stream": False
}

The payload is made up of two main parts, the messages and the hyperparameters. 

The "messages" key holds the list of all messages exchanged between the "user" and the "assistant". The "user" is the person interacting with the model and the "assistant" is the VLM.

In this notebook we only send single messages with the "user" role. To implement a full chat experience, the VLM's reply would be appended to the list with the "assistant" role, followed by the next "user" message. This creates a multi-turn chat in which the VLM can use the history of the conversation.
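
The multi-turn idea can be sketched as follows. The helper `add_turn` and the placeholder assistant reply are assumptions for illustration; in practice the reply would come from the model's response.

```python
def add_turn(messages, assistant_reply, next_user_message):
    """Return a new message list extended with the assistant reply and the next user message."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]

history = [{"role": "user", "content": "Can you tell me what you are capable of?"}]
# The reply below is a placeholder; normally it is response["choices"][0]["message"]["content"]
history = add_turn(history, "I can describe images and answer questions.", "Give me an example.")
print([m["role"] for m in history])
```

The extended list would then be sent as the "messages" value of the next payload.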

Below the "messages" field are a set of hyperparameters that can be controlled to tune the VLM. 

- max_tokens: Maximum number of tokens to generate for the response.
- temperature: Controls the randomness of the output. A higher temperature makes less likely tokens more likely to be chosen.
- top_p: Also controls the randomness of the output. Sampling is restricted to the most likely tokens whose cumulative probability reaches top_p, so a lower top_p makes the VLM choose only the more likely tokens.
- stream: Streaming responses can be used to get tokens as soon as they are generated instead of waiting for the complete response.

For simplicity, this notebook will not use streaming responses. To see how to do this, visit the documentation pages linked in the cell above.
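
To build intuition for temperature and top_p, here is a toy illustration on four made-up token scores. It is a simplified sketch of the general sampling idea, not a description of how the NIM implements it; the function names and the scores are assumptions.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over the logits after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Indices of the most likely tokens whose cumulative probability first reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.5, 0.1]          # made-up scores for four candidate tokens
cool = apply_temperature(logits, 0.2)  # low temperature: mass concentrates on the top token
warm = apply_temperature(logits, 1.0)  # higher temperature: mass is spread more evenly
print(round(cool[0], 3), round(warm[0], 3))
print(top_p_filter(warm, 0.7), top_p_filter(warm, 0.95))
```

With these scores, the lower temperature puts almost all probability on the first token, and the lower top_p keeps fewer candidate tokens than the higher one.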

In [None]:
response = requests.post(neva_api_url, headers=headers, json=payload)
response = response.json()

#print the full JSON response 
json_string = json.dumps(response, indent=4, sort_keys=True)
print(json_string)

#print only the reply 
print(response["choices"][0]["message"]["content"])
total_tokens_used += response["usage"]["total_tokens"]

With the headers and payload set up, the Python requests library can be used to send a POST request to the VLM API URL. The response is in JSON format and can be parsed to get the reply from the VLM.
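
If the request fails, for example because of an invalid API key, the parsed JSON may not contain the "choices" key and indexing into it raises an unhelpful KeyError. A small helper can make that failure clearer. This is a sketch: `extract_reply` is a hypothetical name, and the sample dictionaries are made up to mirror only the fields this notebook reads.

```python
def extract_reply(response_json):
    """Return (reply_text, total_tokens) from a parsed chat completion response."""
    choices = response_json.get("choices")
    if not choices:
        # The shape of error payloads is not assumed here, so the whole body is shown
        raise ValueError(f"No choices in response: {response_json}")
    reply = choices[0]["message"]["content"]
    tokens = response_json.get("usage", {}).get("total_tokens", 0)
    return reply, tokens

# Made-up response containing only the fields used in this notebook
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Hello"}}],
    "usage": {"total_tokens": 12},
}
reply, tokens = extract_reply(sample)
print(reply, tokens)
```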

This section covered how to interact with VLMs using only text input. The API schema and request style are the same when interacting with the LLM NIMs.

## Part 2: Image Chat

VLMs differ from LLMs in that they accept both visual and text input. This section covers the basics of providing images to a VLM.

### Part 2.1: Image Preprocessing

Images are often provided to multimodal models at a lower resolution, such as 224x224 or 336x336. This is based on the input size of the vision encoder used in the VLM.

To reduce our request size, we can preprocess the image to the input resolution used by the VLM. This is not strictly necessary, as the NIM itself will process the image to the correct input size, but we can reduce latency and avoid the extra API calls needed for large uploads by converting our image to JPEG and downsizing it before sending it with the request.

After the image processing is done, the image is converted to a base64 string. A base64-encoded image is a common way to serialize images when they are included directly in an API request.

The maximum image size supported when included directly in the request is 180,000 bytes. For larger files, the <a href=https://docs.api.nvidia.com/cloud-functions/reference/createasset> large asset API </a> can be used. This requires a few more API calls but allows large files to be given to the NIMs.

For this notebook, our images can be reduced enough to be uploaded directly in the chat completion requests, so the large asset API is not needed.
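
Base64 encoding grows the data by roughly a third, which matters when working against a size limit. The sketch below computes the encoded length for a given number of raw bytes. It assumes the 180,000 limit applies to the base64 string, as the assertion in the notebook's image processing code does; `b64_length` is a hypothetical helper.

```python
import base64

LIMIT = 180_000  # limit quoted above, assumed here to apply to the base64 string length

def b64_length(raw_byte_count):
    """Length of the padded base64 string produced from raw_byte_count bytes."""
    return 4 * ((raw_byte_count + 2) // 3)

raw = bytes(1000)  # 1000 zero bytes standing in for JPEG data
print(len(base64.b64encode(raw)), b64_length(1000))

# Largest raw size whose base64 string is still shorter than the limit
max_raw = (LIMIT - 1) // 4 * 3
print(max_raw, b64_length(max_raw), b64_length(max_raw + 1))
```

Under that assumption, the JPEG itself has to stay a little under 135,000 bytes.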

We will use this image for our testing: 

![Test Image](test_image.png)

In [None]:
def process_image(image):
    """ Resize image, encode as jpeg to shrink size then convert to b64 for upload """
    if isinstance(image, str):
        image = Image.open(image).convert("RGB")
    elif isinstance(image, Image.Image):
        image = image.convert("RGB")
        
    image = image.resize((336,336)) #Resize or center crop and padding to be square are common approaches 
    buf = io.BytesIO() #temporary buffer to save processed image 
    image.save(buf, format="JPEG") #save as jpeg to reduce size
    image = buf.getvalue()
    image_b64 = base64.b64encode(image).decode() #convert to b64 string
    assert len(image_b64) < 180_000, "Image too large to upload." #ensure image is small enough
    return image_b64
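
A plain resize to 336x336 stretches images that are not square. The code comment mentions center cropping as a common alternative; the sketch below computes the crop box for it. `center_crop_box` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

box = center_crop_box(1920, 1080)
# With Pillow, the box could be applied before resizing: image.crop(box).resize((336, 336))
print(box)
```

Cropping keeps the aspect ratio of objects intact but discards the edges of the image, so it is a trade-off rather than a strict improvement.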

After processing the image, this is what the VLM will see:

![Test Image](test_image_resized.jpeg)

In [None]:
headers = {
  "Authorization": f"Bearer {api_key}",
  "Accept": "application/json"
}

The headers are configured the same way as in Part 1.

In [None]:
image_b64 = process_image("test_image.png") #get the base 64 representation of our reduced size image 
payload = {
  "messages": [
    {
      "role": "user",
      "content": f'Describe what you see in this image. <img src="data:image/jpeg;base64,{image_b64}" />'
    }
  ],
  "max_tokens": 1024,
  "temperature": 0.20,
  "top_p": 0.70,
  "seed": 0,
  "stream": False
}

In the payload, the content field now includes our image: an image tag is appended to the prompt, and its source is the base64 string.

```<img src="data:image/jpeg;base64,{image_b64}" />```

Now when the POST request is sent, the VLM will take into account the image in our prompt!

In [None]:
response = requests.post(neva_api_url, headers=headers, json=payload)
response = response.json()

#print only the reply 
print(response["choices"][0]["message"]["content"])
total_tokens_used += response["usage"]["total_tokens"]

### Part 2.1.2 Exercise - Try on your own image with your own queries

In this part, try these APIs on your own images. Experiment with the four hyperparameters marked #TEST and see how the responses change.

In [None]:
image_b64 = process_image("#FIXME") #get the base 64 representation of our reduced size image 
payload = {
  "messages": [
    {
      "role": "user",
      "content": f'#FIXME <img src="data:image/jpeg;base64,{image_b64}" />'
    }
  ],
  "max_tokens": 1024, #TEST
  "temperature": 0.20, #TEST
  "top_p": 0.70, #TEST
  "seed": 0, #TEST
  "stream": False
}

response = requests.post(neva_api_url, headers=headers, json=payload)
response = response.json()

#print only the reply 
print(response["choices"][0]["message"]["content"])
total_tokens_used += response["usage"]["total_tokens"]

### Part 2.2 Abstracting the REST API

For easier use, we can wrap the API requests and image processing into a simple callable class.

In [None]:
class VLM:
    def __init__(self, url, api_key):
        """ Provide NIM API URL and an API key"""
        self.api_key = api_key
        self.url = url 
        self.headers = {"Authorization": f"Bearer {self.api_key}", "Accept": "application/json"}

    def _encode_image(self, image):
        """ Resize image, encode as jpeg to shrink size then convert to b64 for upload """

        if isinstance(image, str): #file path
            image = Image.open(image).convert("RGB")
        elif isinstance(image, Image.Image): #pil image 
            image = image.convert("RGB")
        elif isinstance(image, np.ndarray): #cv2 / np array image 
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = Image.fromarray(image)
        else:
            raise TypeError(f"Unsupported image input: {type(image)}") #fail early instead of sending "None" as the image
            
        image = image.resize((336,336))
        buf = io.BytesIO()
        image.save(buf, format="JPEG")
        image = buf.getvalue()
        image_b64 = base64.b64encode(image).decode()
        assert len(image_b64) < 180_000, "Image too large to upload."
        return image_b64

    def __call__(self, prompt, image):
        """ Call VLM object with the prompt and an image (file path, PIL image or cv2 / np array) """ 
        image_b64 = self._encode_image(image)

        #For simplicity, the image will be appended to the end of the prompt. 
        payload = {
              "messages": [
                {
                  "role": "user",
                  "content": f'{prompt} Here is the image: <img src="data:image/jpeg;base64,{image_b64}" />'
                }
              ],
              "max_tokens": 128,
              "temperature": 0.20,
              "top_p": 0.70,
              "stream": False
        }

        response = requests.post(self.url, headers=self.headers, json=payload) #use this object's headers, not the notebook-level variable
        response = response.json()
        reply = response["choices"][0]["message"]["content"]
        return reply, response #return reply and the full response

Now we have an easy-to-use VLM class that wraps a multimodal NIM.

Let's run some images against Neva, Kosmos and Fuyu to compare.

### Part 2.3 Comparing VLMs

In [None]:
#Create a VLM object for each supported model 
neva = VLM(neva_api_url, api_key)
kosmos2 = VLM(kosmos2_api_url, api_key)
fuyu8b = VLM(fuyu8b_api_url, api_key)

Adjust the *custom_prompt* variable if you want to try different prompts.   
Adjust the *image_path* variable if you want to try different images. 

In [None]:
custom_prompt = "Can you tell me about the image?" #CHANGE ME
image_path = "test_image.png" #CHANGE ME

In [None]:
#NEVA
start_time = time()
neva_response, full_response = neva(custom_prompt, image_path)
print(f"Neva Response: {neva_response}")
print(f"Neva Time: {time() - start_time} \n")
total_tokens_used += full_response["usage"]["total_tokens"]

In [None]:
#FUYU
start_time = time()
fuyu_response, full_response = fuyu8b(custom_prompt, image_path)
print(f"Fuyu-8b Response: {fuyu_response}")
print(f"Fuyu-8b Time: {time() - start_time} \n")
total_tokens_used += full_response["usage"]["total_tokens"]

In [None]:
#KOSMOS 
start_time = time() 
kosmos2_response, full_response = kosmos2(custom_prompt, image_path)
print(f"Kosmos-2 Response: {kosmos2_response}")
print(f"Kosmos-2 Time: {time() - start_time}")
total_tokens_used += full_response["usage"]["total_tokens"]

Now we can see how long each model takes to respond, as well as the differences in their outputs.

### Part 2.4 Grounding with Kosmos

The Kosmos model is designed with grounding capability, which allows it to localize areas in the image. This is particularly useful for questions that require more precise answers, such as counting and positioning.

In the full response from Kosmos, we get bounding boxes of the detected objects in addition to the text reply. This is a unique feature of the Kosmos model. The bounding box overlay can be viewed on the NIM demo for Kosmos: https://build.nvidia.com/microsoft/microsoft-kosmos-2

In [None]:
kosmos2_response, full_response = kosmos2("Can you show me where all the cars are?", "test_image.png")
print(json.dumps(full_response, indent=4, sort_keys=True))
total_tokens_used += full_response["usage"]["total_tokens"]

The bounding boxes are associated with specific substrings in the output. This allows parts of the output, such as "cars", to be grounded by a set of bounding boxes that indicate where cars are present in the image.
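
Because each phrase carries its own list of boxes, counting becomes a matter of counting boxes per phrase. The sketch below does this on made-up entities that use only the "phrase" and "bboxes" fields read by the visualization code in this notebook; `count_boxes` is a hypothetical helper and the coordinates are invented.

```python
def count_boxes(entities):
    """Map each grounded phrase to the number of bounding boxes attached to it."""
    counts = {}
    for entity in entities:
        phrase = entity["phrase"]
        counts[phrase] = counts.get(phrase, 0) + len(entity["bboxes"])
    return counts

# Invented example data: boxes are normalized [x0, y0, x1, y1] values
sample_entities = [
    {"phrase": "all the cars", "bboxes": [[0.10, 0.50, 0.20, 0.60], [0.40, 0.55, 0.50, 0.65]]},
    {"phrase": "the highway", "bboxes": [[0.00, 0.40, 1.00, 1.00]]},
]
print(count_boxes(sample_entities))
```

On a real response, the entities list would be taken from full_response["choices"][0]["message"]["entities"].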

### Part 2.5 Visualize the bounding boxes

The Kosmos model predicted two distinct entities, "all the cars" and "the highway". Let's draw bounding boxes around each of them.

In [None]:
im = Image.open('test_image.png')
width, height = im.size

# Create figure and axes
fig, ax = plt.subplots()

# Display the image
ax.imshow(im)

entities = full_response["choices"][0]["message"]["entities"]

#print(json.dumps(entities, indent=2, sort_keys=True))
bbox_colors = ['b', 'r', 'g']
count=0

for objects in entities:
    for bbox in objects["bboxes"]:
        
        x0_scaled = bbox[0]*width
        y0_scaled = bbox[1]*height
        x_len = (bbox[2]*width)-x0_scaled
        y_len = (bbox[3]*height)-y0_scaled
        
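        #cycle through the colors so more than three entities do not raise an IndexError
        rect = patches.Rectangle((x0_scaled, y0_scaled), x_len, y_len, linewidth=1, edgecolor=bbox_colors[count % len(bbox_colors)], facecolor='none')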
        plt.text(x0_scaled, y0_scaled, objects["phrase"])
        
        # Add the patch to the Axes
        ax.add_patch(rect)
    count+=1

plt.show()


## Part 3: Video Understanding

With the basics covered, we can now build a more useful pipeline that will give us insights into live streaming videos.

Video streams from security cameras, drones, etc. generate a lot of useful footage. However, it is a challenge to extract insights from it because of the sheer amount that gets recorded.

A common scenario is that we want to be alerted when certain events are caught on camera, such as a fire, smoke or a person appearing in a restricted area.

Instead of manually monitoring a live stream 24/7, we can use a VLM NIM to monitor a live stream for us. We can then tell the VLM what to look for such as a fire or smoke and the VLM can respond when it detects it on the live stream. 

### Part 3.1: Simple Video Pipeline 

We can start by building a pipeline that opens a video file and calls the VLM. The sample video we will use has two scenes. The first scene is aerial footage of a mountain landscape, which then transitions to a scene with fire and smoke.

We can set up the prompt to be "Is there a fire in the image? Answer yes or no". The VLM will follow our instructions and answer yes or no depending on whether it detects a fire in the frame. This reply could then be parsed by another script to take some action, such as sending a notification to the user.
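
Parsing the reply can be as simple as checking its first word. The sketch below is one way to do it; `is_yes` is a hypothetical helper, and a model may phrase its answer differently, so the check is deliberately conservative.

```python
import re

def is_yes(reply):
    """True if the first word of the reply is 'yes', ignoring case and leading whitespace."""
    match = re.match(r"\s*([A-Za-z]+)", reply)
    return bool(match) and match.group(1).lower() == "yes"

for text in ["Yes, there is a fire.", "No.", "Yesterday there was a fire."]:
    print(repr(text), is_yes(text))
```

A notification script could then act only when the function returns True.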

In [None]:
def display_image(frame, frame_number, prompt, reply):
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) 
    
    print_frame = 'Frame Number: ' + str(frame_number)
    
    font_scale = 1
    thickness = 3
    text_color=(120,215,21)
    cv2.putText(img, print_frame, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, font_scale, text_color, thickness)
    cv2.putText(img, f"VLM Input: {prompt}", (50, 120), cv2.FONT_HERSHEY_SIMPLEX, font_scale, text_color, thickness)
    cv2.putText(img, f"VLM Response: {reply}", (50, 190), cv2.FONT_HERSHEY_SIMPLEX, font_scale, text_color, thickness)
    
    annotated = Image.fromarray(img) #named so the local variable does not shadow this function
    scale = 0.25 #shrink the frame so it fits in the notebook output
    
    display(annotated.resize((int(annotated.width * scale), int(annotated.height * scale))))
    

In [None]:
#load video and run vlm in a loop with prompt 
neva = VLM(neva_api_url, api_key)
video_path = "fire.mp4"
prompt  = "Is there a fire in the image? Answer yes or no."

cap = cv2.VideoCapture(video_path) #open video file with openCV

count = 0
while True:
    ret, frame = cap.read()
    if not ret or frame is None: #end of video or read failure, stop instead of looping forever
        break

    # Skip the first 600 frames (about 20 sec at 30 fps), then only process every 60th frame
    if count < 600 or (count % 60) != 0:
        count += 1
        continue

    reply, full_response = neva(prompt, frame) #one request per sampled frame
    total_tokens_used += full_response["usage"]["total_tokens"]
    frame_number = count + 1

    # Display frame
    display_image(frame, frame_number, prompt, reply)

    count += 1
    if count > 1000:
        break

cap.release() #release the video file

The simple pipeline is still quite limited as it will only work on a video file and process it frame by frame. 

Often, it is not necessary to process every frame as some events will change slowly over time. 
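
Rather than hard-coding frame counts, the sampling interval can be expressed in seconds and converted using the video's frame rate. This is a minimal sketch with hypothetical helper names; with OpenCV the frame rate could be read with cap.get(cv2.CAP_PROP_FPS).

```python
def frame_stride(fps, seconds_between_checks):
    """Number of frames to advance between VLM calls (at least 1)."""
    return max(1, round(fps * seconds_between_checks))

def frames_to_process(total_frames, fps, seconds_between_checks, start_seconds=0.0):
    """Indices of the frames that would be sent to the VLM."""
    start = round(fps * start_seconds)
    return list(range(start, total_frames, frame_stride(fps, seconds_between_checks)))

# Assuming a 30 fps video: start 20 seconds in and check one frame every 2 seconds
print(frames_to_process(total_frames=1000, fps=30, seconds_between_checks=2, start_seconds=20))
```

Slow-changing events tolerate a longer interval, which reduces both latency pressure and token usage.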

We can adjust the pipeline to make it more friendly for video streaming use cases. 

Additionally, we want to see the output visually instead of through the command line. We can create an overlay window and print the VLM response on it.

We also want to take advantage of the VLM's natural language ability. The strength of the VLM is that it is flexible and interactive. To enable this, we can set up a REST API server that dynamically changes the prompt sent to the VLM while it is processing the video stream.

### Part 3.2 Interactive Video Pipeline

We can launch this more advanced pipeline in the background with a Python subprocess, then interact with it through REST requests from this notebook to change the prompt. When launched, an overlay window will appear that shows the VLM responses on top of the input video stream.

The video streaming pipeline is in the main.py script. The VLM class defined above is in vlm.py with some extra logic to handle threading and dynamic prompts. The API server is in api_server.py. Explore these scripts to see how this more complex pipeline is built.
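
To give a feel for what handling dynamic prompts across threads involves, here is a minimal sketch of a prompt holder protected by a lock. It is an illustration only: `PromptStore` is a hypothetical class and is not taken from vlm.py or api_server.py.

```python
import threading

class PromptStore:
    """Holds the current prompt so one thread can update it while another reads it."""

    def __init__(self, prompt):
        self._lock = threading.Lock()
        self._prompt = prompt

    def set(self, prompt):
        with self._lock:
            self._prompt = prompt

    def get(self):
        with self._lock:
            return self._prompt

store = PromptStore("Is there a fire? Answer yes or no.")
# An API handler thread would call set(); the video loop would call get() before each request
writer = threading.Thread(target=store.set, args=("Is there smoke? Answer yes or no.",))
writer.start()
writer.join()
print(store.get())
```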

In [None]:
models_available = {"neva":"https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b", "kosmos-2":"https://ai.api.nvidia.com/v1/vlm/microsoft/kosmos-2", "fuyu-8b":"https://ai.api.nvidia.com/v1/vlm/adept/fuyu-8b"}
launched_processes = []

Configure the cell below to adjust the model and video file used to run the pipeline. 

In [None]:
model_selection = "neva"  #["neva", "kosmos-2", "fuyu-8b"]
video_file = "fire.mp4" #fire.mp4 is the default sample video with this workshop. 

Launch Video Pipeline in subprocess 

In [None]:
port = 5432
model_url = models_available[model_selection]

In [None]:
import sys 
python_exe = sys.executable
print(python_exe)

log = open("log_file.txt", "w+")
process = subprocess.Popen([python_exe, "main.py", "--model_url", model_url, "--video_file", video_file, "--api_key", api_key, "--port", str(port), "--overlay"], stdout=log, stderr=log)
print(process)
launched_processes.append(process)

Once the subprocess is launched, a window should pop up that is playing the video with a text overlay showing the output of the VLM. 

Run the cell below to send an API request to change the prompt. You should see the output of the VLM change a few seconds after sending the new prompt. 

The prompt will run continuously once it is sent. You can run the cell below several times and adjust the *prompt* variable to see how the output changes.

In [None]:
#send prompt 
prompt = "Is there fire? Answer yes or no." #CHANGE ME

params = {"query":prompt}
url = f"http://0.0.0.0:{port}/query"
response = requests.get(url, params=params)
print(response)
print(response.text)
total_tokens_used += 610


* Note: If the REST API request hangs for a long time or is not able to connect, modify the *url* variable and replace 0.0.0.0 with the IP address of your computer. This can be found on the command line with the ipconfig (Windows) or ifconfig (Linux/macOS) commands.

Run the cell below to kill the subprocess that is running the streaming pipeline. You can also click on the overlay window and press the 'q' key to quit the process. 

In [None]:
#Clean up sub process
for p in launched_processes:
    p.terminate() 
log.close()
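
A process does not always exit right after terminate() is called. A slightly more defensive cleanup waits for it and falls back to kill() after a timeout. The sketch below demonstrates this on a stand-in process that just sleeps; `stop_process` is a hypothetical helper.

```python
import subprocess
import sys

def stop_process(process, timeout=5.0):
    """Ask the process to exit, then force-kill it if it is still running after the timeout."""
    if process.poll() is not None:  # already finished
        return process.returncode
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()

# Stand-in for the video pipeline: a child process that sleeps for a minute
demo = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(demo)
print(exit_code)
```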

### Check the number of tokens that you have consumed

In [None]:
print(f"Total Tokens Consumed: {total_tokens_used}")