SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.  
SPDX-License-Identifier: Apache-2.0

# VLM NIM Workshop

NVIDIA Inference Microservices (NIMs) are a collection of easy-to-use, API-driven microservices for interacting with AI models. 

This workshop focuses on the Visual Language Models (VLMs) that are currently available: 

- <a href=https://build.nvidia.com/nvidia/neva-22b> Neva-22b </a>
- <a href=https://build.nvidia.com/microsoft/microsoft-kosmos-2> Kosmos-2 </a>
- <a href=https://build.nvidia.com/adept/fuyu-8b> Fuyu-8b </a>
- <a href=https://build.nvidia.com/google/google-paligemma> paligemma </a>
- <a href=https://build.nvidia.com/microsoft/phi-3-vision-128k-instruct> phi-3-vision-128k-instruct </a>



To learn more about NIMs visit <a href=https://build.nvidia.com/explore/discover> ai.nvidia.com </a>

This workshop has four parts:

**Part 0.** Setup Environment  
**Part 1.** Text Chat  
**Part 2.** Image Chat  
**Part 3.** Video Understanding

## Part 0: Setup Environment 

1) This notebook requires the user to bring a video in mp4 format to run the streaming pipeline in Part 3. The notebook will assume there is a file named **test_video.mp4** in the same directory as this notebook. Adjust the cell below to define a path to a test video file. 

2) Fill in the *api_key* variable with your API key that was generated from the NIM website. 
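
As an alternative to pasting the key into the notebook, the key can be read from an environment variable. This is only a minimal sketch: `load_api_key` is a hypothetical helper and `NVIDIA_API_KEY` is an assumed variable name, neither of which is defined by the workshop.

```python
import os

def load_api_key(env=os.environ, var_name="NVIDIA_API_KEY"):
    """Return the API key stored in an environment variable.

    NVIDIA_API_KEY is an assumed name; use whatever name you exported.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your NIM API key.")
    return key
```

With the variable exported before the notebook starts, `api_key = load_api_key()` could replace the hard-coded assignment, which keeps the key out of saved notebook files.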

In [39]:
import requests

url = 'https://dev263135.service-now.com/x_146833_genaico_0.CatImage3.jpg'
testimage = '/opt/app-root/src/metropolis-nim-workflows/workflows/vlm_alerts/aniltestimage.jpg'

response = requests.get(url)
if response.status_code == 200:
    with open(testimage, 'wb') as file:
        file.write(response.content)
    print('File downloaded successfully')
else:
    print('Failed to download file')


File downloaded successfully


In [40]:
test_video = "test_video.mp4" #FIXME - point to a 1080p mp4 video file. 
api_key = "nvapi-***" #FIXME - replace with your own API key and keep it out of shared notebooks 

The next cell will install all the necessary Python packages for this notebook. 

In [41]:
import sys 
python_exe = sys.executable
!{python_exe} -m pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [42]:
import json, io, subprocess, base64
from time import time 

from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import numpy as np
import requests 
import cv2

Ensure that no errors occurred during the installation and import in the two cells above before continuing.

In [43]:
#Set up the VLM NIM URL 
neva_api_url = "https://ai.api.nvidia.com/v1/vlm/nvidia/neva-22b"

## Part 2: Image Chat

Unlike LLMs, VLMs can accept both visual and text input. This section covers the basics of how to provide images to a VLM.

### Part 2.1: Image Preprocessing

Images are often provided to multimodal models at a lower resolution, such as 224x224 or 336x336. The resolution is determined by the input size of the vision encoder used in the VLM. 
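
A plain resize to a square distorts images that are not already square. One common alternative is to scale the image to fit and pad the remainder. The sketch below illustrates that idea with Pillow; `pad_to_square` is a hypothetical helper that is not part of the workshop code, and the 336 pixel target and black fill are assumptions.

```python
from PIL import Image

def pad_to_square(image, size=336, fill=(0, 0, 0)):
    """Scale `image` to fit inside a size x size canvas and center it on a padded background."""
    image = image.convert("RGB")
    # Scale so the longer side matches the target size, preserving aspect ratio.
    scale = size / max(image.width, image.height)
    new_w = max(1, round(image.width * scale))
    new_h = max(1, round(image.height * scale))
    resized = image.resize((new_w, new_h))
    # Paste the resized image in the middle of a solid-color square canvas.
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

Padding keeps the whole scene visible at the cost of some unused pixels, while a center crop uses every pixel but may cut off content near the edges.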

To reduce our request size, we can preprocess the image to the input resolution used by the VLM. This is not strictly necessary, because the NIM itself will process the image to the correct input size. However, converting the image to JPEG and downsizing it before upload reduces latency and avoids the extra API calls needed for large uploads. 

After the image is processed, it is converted to a base64 string. Base64 encoding is a common way to serialize images that are included directly in a REST API request. 

The maximum image size supported when the image is included directly in the request is 180,000 bytes. For larger files, the <a href=https://docs.api.nvidia.com/cloud-functions/reference/createasset> large asset API </a> can be used. This requires a few more API calls but allows large files to be given to the NIMs.  
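
Base64 encoding emits 4 characters for every 3 raw bytes, so the encoded string is about a third larger than the JPEG itself. The sketch below estimates the encoded length before uploading. It assumes, as the `assert` in `process_image` does, that the 180,000 limit is applied to the length of the base64 string; `b64_length` and `fits_inline` are hypothetical helper names.

```python
import base64
import math

MAX_INLINE_B64_CHARS = 180_000  # assumed to apply to the base64 string, as in process_image

def b64_length(num_bytes):
    """Length of the padded base64 encoding of `num_bytes` raw bytes."""
    return 4 * math.ceil(num_bytes / 3)

def fits_inline(num_bytes, limit=MAX_INLINE_B64_CHARS):
    """True if raw data of this size encodes to fewer than `limit` characters."""
    return b64_length(num_bytes) < limit

# Round trip on placeholder bytes standing in for JPEG data.
raw = bytes(range(256)) * 4
encoded = base64.b64encode(raw).decode()
assert len(encoded) == b64_length(len(raw))
assert base64.b64decode(encoded) == raw
```

Under that assumption, a JPEG needs to stay below roughly 135,000 raw bytes to pass the check.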

For this notebook, our image sizes can be reduced sufficiently to allow for upload directly in the chat completion requests so the large asset API is not needed. 

We will use this image for our testing: 

![Test Image](test_image.png)

In [44]:
def process_image(image):
    """ Resize image, encode as jpeg to shrink size then convert to b64 for upload """
    if isinstance(image, str):
        image = Image.open(image).convert("RGB")
    elif isinstance(image, Image.Image):
        image = image.convert("RGB")
        
    image = image.resize((336,336)) #Plain resize; center cropping or padding to a square are common alternatives 
    buf = io.BytesIO() #temporary buffer to save processed image 
    image.save(buf, format="JPEG") #save as jpeg to reduce size
    image = buf.getvalue()
    image_b64 = base64.b64encode(image).decode() #convert to b64 string
    assert len(image_b64) < 180_000, "Image too large to upload." #ensure image is small enough
    return image_b64

After processing the image, this is what the VLM will see:

![Test Image](test_image_resized.jpeg)

In [45]:
headers = {
  "Authorization": f"Bearer {api_key}",
  "Accept": "application/json"
}

The headers are configured the same way as in Part 0. 

### Part 2.1.2 Exercise - Try on your own image with your own queries

In this part, try the API on your own images. Experiment with the four parameters marked `#TEST` (`max_tokens`, `temperature`, `top_p`, and `seed`) and see how the responses change.
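
To make the experiments easier to organize, the request body can be built by a small function that takes the four parameters as arguments. This is a sketch only: `build_payload` is a hypothetical helper, the field names mirror the request used in this section, and the image string is a placeholder rather than real base64 data.

```python
def build_payload(prompt, image_b64, max_tokens=1024, temperature=0.20, top_p=0.70, seed=0):
    """Build a chat request body with the image embedded after the prompt."""
    content = f'{prompt} <img src="data:image/jpeg;base64,{image_b64}" />'
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "seed": seed,
        "stream": False,
    }

# One payload per temperature; each could be sent with requests.post(...) in turn.
payloads = [
    build_payload("Describe what you see in this image.", "<base64 image>", temperature=t)
    for t in (0.0, 0.2, 0.8)
]
```

Changing one parameter at a time while keeping the seed fixed makes it easier to attribute differences in the replies to that parameter.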

In [46]:
image_b64 = process_image(testimage) #put the filepath to your own image
payload = {
  "messages": [
    {
      "role": "user",
      #"content": f'Describe what you see in this image. <img src="data:image/jpeg;base64,{image_b64}" />'
      "content": f'You are seeing brake pads in the image can you tell if they are having good grip on them <img src="data:image/jpeg;base64,{image_b64}" />'
    }
  ],
  "max_tokens": 1024, #TEST
  "temperature": 0.20, #TEST
  "top_p": 0.70, #TEST
  "seed": 0, #TEST
  "stream": False
}

response = requests.post(neva_api_url, headers=headers, json=payload)
response = response.json()

#print only the reply 
print(response["choices"][0]["message"]["content"])

The image shows a cat sitting on a bed. There are no brake pads in the image.
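
The cell above indexes straight into the response, which raises an unhelpful `KeyError` if the request failed (for example, because of an invalid key). A more defensive way to read the reply is sketched below. `extract_reply` is a hypothetical helper, and the shape of an error response is not documented here, so the sketch only checks whether `choices` is present.

```python
def extract_reply(response_json):
    """Return the first reply text, or raise a readable error if no choices are present."""
    choices = response_json.get("choices")
    if not choices:
        raise ValueError(f"Unexpected response: {response_json}")
    return choices[0]["message"]["content"]

# Usage in the notebook (assumes `response` is the object returned by requests.post):
# response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx status codes
# print(extract_reply(response.json()))
```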


### Part 2.2 Abstracting the REST API

For easier use, we can wrap the API requests and image processing into a simple callable class.

In [47]:
class VLM:
    def __init__(self, url, api_key):
        """ Provide NIM API URL and an API key"""
        self.api_key = api_key
        self.url = url 
        self.headers = {"Authorization": f"Bearer {self.api_key}", "Accept": "application/json"}

    def _encode_image(self, image):
        """ Resize image, encode as jpeg to shrink size then convert to b64 for upload """

        if isinstance(image, str): #file path
            image = Image.open(image).convert("RGB")
        elif isinstance(image, Image.Image): #pil image 
            image = image.convert("RGB")
        elif isinstance(image, np.ndarray): #cv2 / np array image 
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = Image.fromarray(image)
        else:
            raise TypeError(f"Unsupported image input: {type(image)}")
            
        image = image.resize((336,336))
        buf = io.BytesIO()
        image.save(buf, format="JPEG")
        image = buf.getvalue()
        image_b64 = base64.b64encode(image).decode()
        assert len(image_b64) < 180_000, "Image too large to upload."
        return image_b64

    def __call__(self, prompt, image):
        """ Call VLM object with the prompt and path to image """ 
        image_b64 = self._encode_image(image)

        #For simplicity, the image will be appended to the end of the prompt. 
        payload = {
              "messages": [
                {
                  "role": "user",
                  "content": f'{prompt} Here is the image: <img src="data:image/jpeg;base64,{image_b64}" />'
                }
              ],
              "max_tokens": 128,
              "temperature": 0.20,
              "top_p": 0.70,
              "stream": False
        }

        response = requests.post(self.url, headers=self.headers, json=payload)
        response = response.json()
        reply = response["choices"][0]["message"]["content"]
        return reply, response #return reply and the full response

Now we have an easy-to-use VLM class that wraps a multimodal NIM. 

Let's run some images against Neva, Kosmos and Fuyu to compare. 

### Part 2.3.1 Comparing VLMs

In [48]:
#Create a VLM object for each supported model 
neva = VLM(neva_api_url, api_key)

Adjust the *custom_prompt* variable if you want to try different prompts.   
Adjust the *image_path* variable if you want to try different images. 

In [49]:
custom_prompt = "You are seeing brake pads in the image can you tell if they are having good grip on them" #CHANGE ME
image_path = testimage #CHANGE ME

In [50]:
#NEVA
start_time = time()
response, _ = neva(custom_prompt, image_path)
print(f"Neva Response: {response}")
print(f"Neva Time: {time() - start_time} \n")

Neva Response: In the image you can see a cat sitting on a bed. There are no brake pads in the image.
Neva Time: 1.1273276805877686 
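
When more than one model is set up, the same prompt and image can be sent to each in a loop and timed. The sketch below assumes each model is a callable that returns `(reply, full_response)`, like the `VLM` class above. `compare_models` and `fake_vlm` are hypothetical names, and the stub stands in for real `VLM` objects so that the example runs without network access.

```python
from time import perf_counter

def compare_models(models, prompt, image):
    """Call each model with the same inputs and record its reply and elapsed seconds."""
    results = {}
    for name, model in models.items():
        start = perf_counter()
        reply, _ = model(prompt, image)
        results[name] = (reply, perf_counter() - start)
    return results

def fake_vlm(prompt, image):
    """Stub with the same call signature and return shape as a VLM object."""
    return f"stub reply to: {prompt}", {}

results = compare_models({"stub": fake_vlm}, "Describe the image.", "image.jpg")
for name, (reply, seconds) in results.items():
    print(f"{name}: {reply} ({seconds:.3f} s)")
```

In the notebook, the dictionary could instead hold `{"neva": neva}` plus any other `VLM` objects that have been created.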

