## Multimodal conversation - adding images/videos
With the GPT-4 vision API, you can attach images to your messages and the model will take them into account when it responds. Try out the code below:

### To get access to the GPT-4 vision preview, you need to buy at least $0.50 of credits on your OpenAI account. If you buy $6 of credits instead, you will be upgraded to tier 1, which gives you access to these models as well as higher rate limits.



In [None]:
%pip install openai
%pip install opencv-python

In [41]:
# Import the required modules
from openai import OpenAI
from IPython.display import display, Image, Audio, Markdown


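# Create the client. Paste your API key below, or omit the api_key argument
# to have the client read the OPENAI_API_KEY environment variable instead.

llm_client = OpenAI(
    api_key=''
)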

In [None]:
# Conversational demonstration using the new vision API (beta).
# Rather than a plain string, "content" can now be a list of dicts that mixes text parts with images, given either as URLs or as base64-encoded data that you supply yourself.
# Use this when the description or answer you want needs to take the contents of the image into account.

response = llm_client.chat.completions.create(
    model='gpt-4-vision-preview',
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # The documented form wraps the address in a {"url": ...} object
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    temperature=0.7,
    max_tokens=300
)

Markdown(response.choices[0].message.content)

As you can see, we get an accurate description of the image. The image can be supplied as a URL or as an uploaded image encoded in base64. Let us try another example, this time also using some behavioural prompts.
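To make the base64 option mentioned above concrete, here is a minimal sketch of how raw image bytes could be wrapped in a data URL and placed in a message part. The helper names `to_data_url` and `image_part` are hypothetical (they are not part of the OpenAI library), and the sketch assumes the documented `{"url": ...}` object form for `image_url`.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build an image content part for the "content" list of a user message."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, mime_type)},
    }
```

With a local file, you would read the bytes first, for example `image_part(open("photo.jpg", "rb").read())`, where `photo.jpg` is a placeholder path, and append the result next to the text part of the message.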

We will load up a video of a crosswalk and try to provide narration:

In [None]:

import cv2
import base64
import time

video = cv2.VideoCapture("./media/pexels-kelly-lacy-7409134 (720p).mp4")

# Read the video frame by frame and store each frame as a base64-encoded JPEG string
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break  # No more frames, or the frame could not be read
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")



In [None]:
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)
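
A clip like this can contain hundreds of frames, and only a handful are sent to the model. As a sketch of one way to choose them without hard-coding indices, the hypothetical helper below picks frames spread evenly across the list. The name `sample_evenly` and the choice of even spacing are assumptions for illustration, not something the notebook itself prescribes.

```python
def sample_evenly(frames, count):
    """Return up to `count` items spread evenly across `frames`."""
    if count <= 0 or not frames:
        return []
    if count >= len(frames):
        return list(frames)  # Fewer frames than requested: keep them all
    if count == 1:
        return [frames[0]]
    # Spacing between picks so that the first and last frames are both included
    step = (len(frames) - 1) / (count - 1)
    return [frames[round(i * step)] for i in range(count)]
```

For example, `sample_evenly(base64Frames, 4)` would return four frames including the first and the last, and it never indexes past the end of a short clip.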


In [45]:
PROMPT_MESSAGES = [
    {"role": "system", "content": "You're Spot, an application that helps people with vision difficulties to be aware of their surroundings."},
    {"role": "system", "content": "As Spot, your responsibility is the well-being of the user, and you hold their safety in the highest regard."},
    {"role": "system", "content": "As Spot, you do this by analysing the video frames provided to you to provide guidance and safety to the user."},
    {"role": "system", "content": "Remember Spot, the user is blind so you will need to explain any important details. Clearly indicate any dangers."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Spot, what do you see?"},
            # Four frames from the clip; these indices assume it has more than 200 frames
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64Frames[0]}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64Frames[20]}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64Frames[50]}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64Frames[200]}"}},
        ],
    },
]

result = llm_client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=PROMPT_MESSAGES,
    max_tokens=200,
)

Markdown(result.choices[0].message.content)



You're looking at a busy urban street scene during what appears to be daytime. There's moderate traffic with vehicles on the street, and people appear to be walking on the sidewalk. You can hear the hum of city noise, perhaps cars passing by, and people's voices.

Directly ahead, you'll notice a crosswalk. There is a pedestrian crossing signal showing a red hand, indicating that it is not safe to cross at the moment. There are several pedestrians waiting at the crossing, suggesting they are obeying the signal.

On the right-hand side of the image, there's an outdoor seating area with people which could indicate a restaurant or cafe. This area is blocked off by barriers, so there is no direct danger from vehicular traffic there.

To ensure your safety, I advise you to come to a stop if you are approaching the crosswalk and wait for the appropriate audible signals that indicate it is safe to cross. Always pay attention to any surrounding activity and potential hazards, and pause if
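
The narration above stops mid-sentence. A likely cause, though not confirmed here, is the `max_tokens=200` limit in the request. As a sketch, the `finish_reason` field of the returned choice can be checked for the value `"length"`, which the API uses when a reply is cut off by the token limit. The helper name `was_truncated` is hypothetical, and `fake_result` is a stand-in object used only so the snippet runs without calling the API.

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return completion.choices[0].finish_reason == "length"


# Stand-in for illustration only; with the real client you would pass the
# `result` returned by llm_client.chat.completions.create(...).
fake_result = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length")]
)

if was_truncated(fake_result):
    print("Reply was cut off; consider raising max_tokens.")
```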

In [None]:

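# Convert the narration to speech and save it as an MP3 file.
# stream_to_file() writes the audio to disk and returns nothing, so its result is not stored.
llm_client.audio.speech.create(
    model="tts-1",
    input=result.choices[0].message.content,
    voice="onyx"
).stream_to_file('output.mp3')

# Play the saved audio inline in the notebook
Audio('output.mp3')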
