# Use GPT-4o with Images and Videos


GPT-4o is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It incorporates both natural language processing and visual understanding.

The GPT-4o model answers general questions about what's present in the images. You can also show it video if you break the video into individual frames.

To run this notebook in Codespaces, run the following commands to install OpenCV and the system libraries it depends on. If you are running this notebook locally, your installation steps may vary.

In [None]:
!sudo apt-get update; sudo apt-get install libgl1 -y
%pip install opencv-python

## Processing Images

The following code shows the most basic way to use the GPT-4o model programmatically. If this is your first time using these models from code, we recommend starting with our [OpenAI getting started notebook](../../../samples/python/openai/getting_started.ipynb).

We will start by creating a client object. The code below copies the `GITHUB_TOKEN` you have set up into the environment variable `OPENAI_API_KEY`, which the client reads when it is created.

In [None]:
# imports
import time  # for measuring time duration of API calls
from openai import OpenAI
import dotenv
import os

dotenv.load_dotenv()

if not os.getenv("GITHUB_TOKEN"):
    raise ValueError("GITHUB_TOKEN is not set")

os.environ["OPENAI_API_KEY"] = os.getenv("GITHUB_TOKEN")
os.environ["OPENAI_BASE_URL"] = "https://models.github.ai/inference"

GPT_MODEL = "openai/gpt-4o-mini"

client = OpenAI()

Then call the client's chat completions **create** method. The following code shows a sample request body. The format is the same as for other chat completions with GPT-4o, except that the message content contains an array of text and images (each image is either a valid HTTP or HTTPS URL, or a base64-encoded image).
    

In [None]:
def describe_image(image_url):
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": [  
                { 
                    "type": "text", 
                    "text": "Describe this picture:" 
                },
                { 
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    }
                }
            ] } 
        ],
        max_tokens=2000 
    )
    return response

image_url = "https://github.com/microsoft/assistant-pf-demo/blob/main/images/sad-puppy.png?raw=true"

print(describe_image(image_url))


### Use a local image

If you want to use a local image, you can use the following Python code to convert it to base64 so it can be passed to the API. Alternative file conversion tools are available online. Let's try asking gpt-4o-mini about the image below.

![](data/sad-puppy.png)

In [None]:
import base64, json
from mimetypes import guess_type

# Function to encode a local image into data URL 
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

# Example usage
image_path = 'data/sad-puppy.png'
data_url = local_image_to_data_url(image_path)
print(f"Data URL: {data_url[:100]}...")
response = describe_image(data_url)
print(json.dumps(response.model_dump(), indent=2))

### Output

The API response should look something like the following.

```json
{
  "id": "chatcmpl-9vT1HlFN3bzRXTbYQgtRwPLfnGMxk",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "The picture features a fluffy, light-colored puppy sitting on a rug. In front of the puppy is a dark-colored food bowl. The setting appears warm and cozy, with wooden furniture and a neutral-colored wall in the background. The overall mood conveys a sense of comfort and playfulness.",
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1723483275,
  "model": "gpt-4o-mini",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_276aa25277",
  "usage": {
    "completion_tokens": 57,
    "prompt_tokens": 262,
    "total_tokens": 319
  }
}
```

Each choice in the response includes a `finish_reason` field. It has the following possible values:
- `stop`: API returned complete model output.
- `length`: Incomplete model output due to the `max_tokens` input parameter or model's token limit.
- `content_filter`: Omitted content due to a flag from our content filters.

## Detail parameter settings in image processing: Low, High, Auto  

The _detail_ parameter in the model offers three choices: `low`, `high`, or `auto`, to adjust the way the model interprets and processes images. The default setting is auto, where the model decides between low or high based on the size of the image input. 
- `low` setting: the model does not activate the "high res" mode, instead processes a lower resolution 512x512 version, resulting in quicker responses and reduced token consumption for scenarios where fine detail isn't crucial.
- `high` setting: the model activates "high res" mode. Here, the model initially views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, allowing for a more detailed interpretation of the image.

For details on how the image parameters impact the tokens used, see the Image tokens section below.


In [None]:

def describe_image_detail(image_url, detail = "auto"):
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": [  
                { 
                    "type": "text", 
                    "text": "Describe this picture:" 
                },
                { 
                    "type": "image_url",
                    "image_url": {
                        "detail": detail,
                        "url": image_url
                    }
                }
            ] } 
        ],
        max_tokens=2000 
    )
    return response

for detail in ["auto", "low", "high"]:
    response = describe_image_detail(data_url, detail=detail)
    print(detail, response.usage.prompt_tokens, response.choices[0].message.content)

## Processing Videos

To process videos, you need to break them down into individual frames and then pass the sequence of frames to the model. To save on tokens, you can skip frames or only process a subset of the frames. It is also advisable to use `low` detail setting for videos to save on tokens and be able to process more frames.
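
To make the frame-skipping idea concrete, here is a minimal sketch that computes which frame indices to keep for a given sampling interval. The function name `sample_frame_indices` and its parameters are hypothetical; it only does index arithmetic and does not read a video itself.

```python
def sample_frame_indices(total_frames, frame_rate, seconds_between_samples=1.0):
    """Return the indices of the frames to keep when sampling at a fixed interval.

    Hypothetical helper: keeps one frame every `seconds_between_samples` seconds.
    """
    # Number of frames to advance between two kept frames (at least 1)
    step = max(1, int(frame_rate * seconds_between_samples))
    return list(range(0, total_frames, step))


# A 100-frame clip at 24 FPS, keeping one frame per second
print(sample_frame_indices(100, 24.0, 1.0))  # [0, 24, 48, 72, 96]
```

A larger interval keeps fewer frames and therefore uses fewer tokens, at the cost of possibly missing short events in the video.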

Below, we process the well-known Big Buck Bunny video.

[![Watch the video](https://img.youtube.com/vi/aqz-KE-bpKQ/hqdefault.jpg)](https://www.youtube.com/watch?v=aqz-KE-bpKQ)





Let's start by reading the frames into memory as base64 encoded images. We will then pass the frames to the model for processing.


In [None]:
from IPython.display import display, Image
import cv2

video = cv2.VideoCapture("https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4")

# Get the frame rate of the video
frame_rate = video.get(cv2.CAP_PROP_FPS)
print(f"Frame Rate: {frame_rate} FPS")


base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

In [None]:
# play the first 10s of the video
import time
display_handle = display(None, display_id=True)
for img in base64Frames[:int(frame_rate * 10)]:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(1/frame_rate)

Once we have the video frames, we craft our prompt and send a request to GPT:
> Note that we don't need to send every frame for GPT to understand what's going on in the video. In this case, we will send one frame every 4 seconds.

> Also, the current limit of the number of images that can be sent in one request is 250. If you have more than 250 frames, you will need to split them into multiple requests. Additional limits exist for the free tier of GitHub Models API.
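
The splitting into multiple requests can be sketched as follows. The helper name `split_into_requests` is hypothetical, and the default of 250 simply mirrors the limit quoted above; adjust it if the limit for your service differs.

```python
def split_into_requests(frames, max_images_per_request=250):
    """Split a list of frames into consecutive chunks that each fit in one request.

    Hypothetical helper: the default limit mirrors the figure quoted in the text.
    """
    if max_images_per_request < 1:
        raise ValueError("max_images_per_request must be at least 1")
    return [
        frames[start:start + max_images_per_request]
        for start in range(0, len(frames), max_images_per_request)
    ]


# 600 placeholder frames split into chunks of at most 250
batches = split_into_requests(list(range(600)))
print([len(batch) for batch in batches])  # [250, 250, 100]
```

Each chunk can then be sent as the image content of its own chat completion request.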



Next, we will process the video in batches and have the model describe the whole plot of the video. Due to the token limits of the free service, the video has to be processed in fairly small batches (50 frames here), and we sample only one frame every 4 seconds. You can play with those parameters to get the best results for your video.

In [None]:
n = 4 # seconds
one_frame_per_n_seconds = base64Frames[0::int(n*frame_rate)]
print(len(one_frame_per_n_seconds), "frames to describe.")
# process the sampled frames in batches of batch_size frames
batch_size = 50
plot = ""
for i in range(0, len(one_frame_per_n_seconds), batch_size):
    frame_batch = one_frame_per_n_seconds[i:i+batch_size]
    print(f"Processing frames {i} to {i+len(frame_batch)-1}")
    PROMPT_MESSAGES = [
        {
            "role": "system",
            "content": "You are being presented with video frames and are asked to generate the plot of the video from them. "
                f"You will be shown the video's plot so far and then {len(frame_batch)} frames, sampled one every {n} seconds from the next part of the video. "
                "Expand on the plot of the video based on these frames. "
                f"The plot so far is: \n {plot}"
        },
        {
            "role": "user",
            "content": [
                *map(lambda image_url: { 
                        "type": "image_url",
                        "image_url": {
                            "detail": "low",
                            "url": f"data:image/jpeg;base64,{image_url}"
                        }
                    }, frame_batch),
            ],
        },
    ]
    
    result = client.chat.completions.create(model=GPT_MODEL, messages=PROMPT_MESSAGES)
    print(f"used {result.usage.prompt_tokens} prompt tokens and {result.usage.completion_tokens} completion tokens.\n")
    print(result.choices[0].message.content)
    plot += result.choices[0].message.content + "\n"   



## Image tokens 

The token cost of an input image depends on two main factors: the size of the image and the detail setting (low or high) used for each image. Here's a breakdown of how it works:

- **Detail: Low resolution mode**
    - Low detail allows the API to return faster responses and consume fewer input tokens for use cases that don’t require high detail.
    - These images cost 85 tokens each, regardless of the image size.
    - **Example: 4096 x 8192 image (low detail)**: The cost is a fixed 85 tokens, because it's a low detail image, and the size doesn't affect the cost in this mode.
      
- **Detail: High resolution mode**
    - High detail lets the API see the image in more detail by cropping it into smaller squares. Each square uses more tokens to generate text.
    - The token cost is calculated by a series of scaling steps:
        1. The image is first scaled to fit within a 2048 x 2048 square while maintaining its aspect ratio.
        1. The image is then scaled down so that the shortest side is 768 pixels long.
        1. The image is divided into 512-pixel square tiles, and the number of these tiles (rounding up for partial tiles) determines the final cost. Each tile costs 170 tokens.
        1. An additional 85 tokens are added to the total cost.
    - **Example: 2048 x 4096 image (high detail)**
        1. Initially resized to 1024 x 2048 to fit in the 2048 square.
        1. Further resized to 768 x 1536.
        1. Requires six 512px tiles to cover.
        1. Total cost is `170 × 6 + 85 = 1105` tokens.
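
The calculation above can be sketched in a few lines of Python. The function name `image_token_cost` is hypothetical, and the sketch assumes that images are only ever scaled down, never up, and that intermediate sizes are rounded to whole pixels. The `auto` setting is not modeled, so pass `low` or `high` explicitly.

```python
import math


def image_token_cost(width, height, detail="high"):
    """Estimate the input token cost of one image, following the steps listed above.

    Hypothetical helper; assumes images are never upscaled.
    """
    if detail == "low":
        # Low detail is a flat cost, regardless of image size
        return 85
    if detail != "high":
        raise ValueError("detail must be 'low' or 'high'")

    # Step 1: scale to fit within a 2048 x 2048 square, keeping the aspect ratio
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale down so that the shortest side is 768 pixels
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count 512-pixel square tiles, rounding up for partial tiles
    tiles = math.ceil(width / 512) * math.ceil(height / 512)

    # Step 4: 170 tokens per tile plus a fixed 85 tokens
    return 170 * tiles + 85


print(image_token_cost(2048, 4096, "high"))  # 1105
print(image_token_cost(4096, 8192, "low"))   # 85
```

The two calls reproduce the worked examples above: six tiles for the 2048 x 4096 image in high detail, and the flat 85 tokens for the low detail image.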
