
# 1. OpenAI VLM (GPT-4*) - Basics
This section demonstrates the basic usage of OpenAI's Vision Language Model (VLM) capabilities using GPT-4.
We will use the OpenAI API to analyze an image and provide detailed textual insights.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation
- https://platform.openai.com/docs/guides/images?api-mode=chat&format=base64-encoded
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=image
- https://platform.openai.com/docs/api-reference/chat
- https://platform.openai.com/docs/guides/audio


In [None]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0


In [None]:
import openai
from dotenv import load_dotenv
import os
import base64
import json
import textwrap

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')


load_dotenv()
#openAIclient = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
openAIclient = openai.OpenAI(api_key= os.getenv("OPENAI_API_KEY"))




TEXTMODEL = "gpt-4o-mini"
IMGMODEL= "gpt-4o-mini"

# Path to your image
img = "street_scene.jpg"
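The `encode_image` helper returns only the base64 payload, and the calls in this notebook hardcode `image/jpeg` in the data URL. As a minimal sketch, assuming you also want to send PNG or WebP files, the MIME type can be guessed from the file extension. The name `to_data_url` and the throwaway demo file are hypothetical and not part of the original notebook.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Return a base64 data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back when the extension is unknown
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"


# Self-check with a throwaway file (the bytes are not a real image).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as f:
        f.write(b"not-a-real-png")
    demo_url = to_data_url(demo_path)

print(demo_url.split(",")[0])
```

The returned string can be passed directly as the `url` value of an `image_url` content part.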

In [None]:
# Basic call to GPT-4o mini with a text prompt and an image

completion = openAIclient.chat.completions.create(
    model=IMGMODEL,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image(img)}",
                        #"detail": "low"
                    }
                },
            ],
        }
    ],
)


# Wrap the text to a specified width

response = str(completion.choices[0].message)
print(textwrap.fill(response, width=120))


ChatCompletionMessage(content="The image depicts a busy urban scene with various elements:\n\n- There are people
walking, sitting, and interacting in a city environment.\n- A boy is sitting on the ground, looking at a device.\n-
Nearby, a person appears to be lying down, while others seem engaged in their activities, including a musician playing a
guitar.\n- There are vehicles on the road and pedestrians crossing the street.\n- Pigeons are scattered around, and
there's a flower pot near the boy.\n- The background features tall buildings and a church-like structure, adding to the
cityscape.\n\nOverall, the scene captures a moment in a lively urban setting, blending everyday life with elements of
art and activity.", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None)
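The output above is the `repr` of the whole `ChatCompletionMessage`, so the newlines appear as literal `\n` and the metadata fields are printed too. A minimal sketch of a more readable printout, assuming only the reply text is wanted: take `completion.choices[0].message.content` and wrap each line on its own so the bullet list survives. The helper `format_reply` and the shortened `sample` string are illustrative stand-ins.

```python
import textwrap


def format_reply(content: str, width: int = 120) -> str:
    """Wrap each line of the reply separately so list items stay on their own lines."""
    wrapped = [
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in content.split("\n")
    ]
    return "\n".join(wrapped)


# Stand-in for completion.choices[0].message.content (shortened from the output above).
sample = (
    "The image depicts a busy urban scene with various elements:\n\n"
    "- There are people walking.\n"
    "- Pigeons are scattered around."
)
print(format_reply(sample, width=40))
```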


In [None]:
#basic call with audio output
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Is a golden retriever a good family dog?"
        }
    ]
)

print(completion.choices[0])

wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
# Optionally write the audio to disk:
# with open("dog.wav", "wb") as f:
#     f.write(wav_bytes)

# The file write above might not work in IPython.
# To play the audio directly in the notebook, use:
from IPython.display import Audio
Audio(wav_bytes, autoplay=True)

Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=ChatCompletionAudio(id='audio_680791e0c87c8191b7b60ddf470cc3c8', data='UklGRv////9XQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0Yf////8jACUAHQAkABkAIwAeAB0AIAAcACYA... [base64-encoded WAV data truncated]
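The `data` field in the output above starts with `UklGR`, the base64 form of the `RIFF` marker that opens a WAV file. The sketch below decodes such a payload, checks the header, and writes it to disk. The helper name `save_wav` is hypothetical, and the demo generates a short silent clip locally with the standard `wave` module instead of calling the API.

```python
import base64
import io
import os
import tempfile
import wave


def save_wav(b64_audio: str, path: str) -> int:
    """Decode base64 audio, check for a RIFF/WAVE header, and write it to disk."""
    wav_bytes = base64.b64decode(b64_audio)
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("payload does not look like a WAV file")
    with open(path, "wb") as f:
        f.write(wav_bytes)
    return len(wav_bytes)


# Build a tiny silent mono clip as a stand-in for message.audio.data.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 240)
demo_b64 = base64.b64encode(buffer.getvalue()).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    written = save_wav(demo_b64, os.path.join(tmp, "demo.wav"))

print(f"wrote {written} bytes")
```

With a real response, the first argument would be `completion.choices[0].message.audio.data`.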


# 1.1 Structured Output
Here, we expand upon the VLM example to request structured outputs. This approach allows for extracting
well-organized information from images in a machine-readable format, such as JSON.

**Support Material**:
- https://platform.openai.com/docs/guides/text-generation?text-generation-quickstart-example=json


In [None]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    returnValue = ""
    # Build the message list; the system message is optional
    messages = []
    if sysprompt is not None:
        messages.append({"role": "system", "content": sysprompt})
    modelToUse = TEXTMODEL
    user_content = [{"type": "text", "text": prompt}]
    if image is not None:
        # image is expected to be a base64-encoded JPEG string (see encode_image)
        image = f"data:image/jpeg;base64,{image}"
        user_content.append({"type": "image_url", "image_url": {"url": image}})
        modelToUse = IMGMODEL
    messages.append({"role": "user", "content": user_content})

    if wantJson:
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            #max_tokens= 400,
            response_format={ "type": "json_object" },
            messages=messages,
            temperature=0,
            #n=1,
        )
    else :
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            messages=messages,
            temperature=0,
            #n=1,
        )
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        return json.loads(returnValue)
    return returnValue

In [None]:
output = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)

In [None]:
output

{'description': {'scene': 'A bustling urban street in a city, showcasing a mix of pedestrians, vehicles, and buildings.',
  'foreground': {'elements': [{'type': 'boy',
     'action': 'sitting on the ground, engrossed in a smartphone',
     'clothing': 'wearing a green jacket and shorts'},
    {'type': 'man',
     'action': 'lying on the ground, seemingly unconscious',
     'clothing': 'wearing a red hoodie'},
    {'type': 'woman',
     'action': 'sitting on a bench, reading a newspaper',
     'clothing': 'wearing a red top and blue jeans'},
    {'type': 'man',
     'action': 'sitting on the bench, looking at a newspaper',
     'clothing': 'wearing a suit and glasses'},
    {'type': 'woman',
     'action': 'walking past, looking at her phone',
     'clothing': 'wearing a pink top and shorts'},
    {'type': 'pigeons',
     'action': 'scattered around the ground near the bench'},
    {'type': 'flower pot',
     'action': 'placed next to the boy, with blooming flowers'}]},
  'background': 

In [None]:
output["description"]["foreground"]

{'elements': [{'type': 'boy',
   'action': 'sitting on the ground, engrossed in a smartphone',
   'clothing': 'wearing a green jacket and shorts'},
  {'type': 'man',
   'action': 'lying on the ground, seemingly unconscious',
   'clothing': 'wearing a red hoodie'},
  {'type': 'woman',
   'action': 'sitting on a bench, reading a newspaper',
   'clothing': 'wearing a red top and blue jeans'},
  {'type': 'man',
   'action': 'sitting on the bench, looking at a newspaper',
   'clothing': 'wearing a suit and glasses'},
  {'type': 'woman',
   'action': 'walking past, looking at her phone',
   'clothing': 'wearing a pink top and shorts'},
  {'type': 'pigeons', 'action': 'scattered around the ground near the bench'},
  {'type': 'flower pot',
   'action': 'placed next to the boy, with blooming flowers'}]}
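JSON mode guarantees valid JSON but does not fix the key names, so the nesting shown above (`description` → `foreground` → `elements`) may differ between runs. A minimal sketch of reading the result defensively, using an excerpt of the dictionary printed above; the helper `find_elements` is a hypothetical name.

```python
# Excerpt of the dictionary shown above (three of the seven foreground elements).
sample_output = {
    "description": {
        "foreground": {
            "elements": [
                {"type": "boy",
                 "action": "sitting on the ground, engrossed in a smartphone",
                 "clothing": "wearing a green jacket and shorts"},
                {"type": "man",
                 "action": "lying on the ground, seemingly unconscious",
                 "clothing": "wearing a red hoodie"},
                {"type": "pigeons",
                 "action": "scattered around the ground near the bench"},
            ]
        }
    }
}


def find_elements(result: dict, keyword: str) -> list:
    """Return foreground elements whose 'action' mentions the keyword; tolerate missing keys."""
    elements = result.get("description", {}).get("foreground", {}).get("elements", [])
    return [e for e in elements if keyword.lower() in e.get("action", "").lower()]


for element in find_elements(sample_output, "lying"):
    print(element["type"], "-", element.get("clothing", "n/a"))
```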


# 1.2 JSON Schema for Controlled Structured Outputs
In this section, we define a JSON schema for a more controlled and specific output from the model.
Using this schema, we can make the model follow predefined data types and structures while describing images.
In this case we provide the schema by hand as an example of the JSON answer format, but ideally one could also derive it from a model class, e.g. with the pydantic library.

Example:
```
from typing import List, Literal
from pydantic import BaseModel, Field


class Person(BaseModel):
    position: str = Field(..., description="Position of the person in the environment, e.g., standing, sitting, etc.")
    age: int = Field(..., ge=0, description="Age of the person, must be a non-negative integer.")
    activity: str = Field(..., description="Activity the person is engaged in, e.g., reading, talking, etc.")
    gender: Literal["male", "female", "non-binary", "other", "prefer not to say"] = Field(
        ..., description="Gender of the person"
    )


class ImageExtraction(BaseModel):
    number_of_people: int = Field(..., ge=0, description="The total number of people in the environment.")
    atmosphere: str = Field(..., description="Description of the atmosphere, e.g., calm, lively, etc.")
    hour_of_the_day: int = Field(..., ge=0, le=23, description="The hour of the day in 24-hour format.")
    people: List[Person] = Field(..., description="List of people and their details.")

```

In [None]:
def promptLLM(prompt : str = None, sysprompt : str = None,  image : str = None, wantJson : bool = False, returnDict : bool = False):
    returnValue = ""
    # Build the message list; the system message is optional
    messages = []
    if sysprompt is not None:
        messages.append({"role": "system", "content": sysprompt})
    modelToUse = TEXTMODEL
    user_content = [{"type": "text", "text": prompt}]
    if image is not None:
        # image is expected to be a base64-encoded JPEG string (see encode_image)
        image = f"data:image/jpeg;base64,{image}"
        user_content.append({"type": "image_url", "image_url": {"url": image}})
        modelToUse = IMGMODEL
    messages.append({"role": "user", "content": user_content})

    if wantJson:
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            #max_tokens= 400,
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "img_extract",
                    "schema": {
                    "type": "object",
                    "properties": {
                        "numberOfPeople": {
                        "type": "integer",
                        "description": "The total number of people in the environment",
                        "minimum": 0
                        },
                        "atmosphere": {
                        "type": "string",
                        "description": "Description of the atmosphere, e.g., calm, lively, etc."
                        },
                        "hourOfTheDay": {
                        "type": "integer",
                        "description": "The hour of the day in 24-hour format",
                        "minimum": 0,
                        "maximum": 23
                        },
                        "people": {
                        "type": "array",
                        "description": "List of people and their details",
                        "items": {
                            "type": "object",
                            "properties": {
                            "position": {
                                "type": "string",
                                "description": "Position of the person in the environment, e.g., standing, sitting, etc."
                            },
                            "age": {
                                "type": "integer",
                                "description": "Age of the person",
                                "minimum": 0
                            },
                            "activity": {
                                "type": "string",
                                "description": "Activity the person is engaged in, e.g., reading, talking, etc."
                            },
                            "gender": {
                                "type": "string",
                                "description": "Gender of the person",
                                "enum": ["male", "female", "non-binary", "other", "prefer not to say"]
                            }
                            },
                            "required": ["position", "age", "activity", "gender"]
                        }
                        }
                    },
                    "required": ["numberOfPeople", "atmosphere", "hourOfTheDay", "people"]
                    }}},
            messages=messages,
            temperature=0,
            #n=1,
        )
    else :
        returnValue = openAIclient.chat.completions.create(
            model=modelToUse,
            messages=messages,
            temperature=0,
            #n=1,
        )
    returnValue = returnValue.choices[0].message.content
    if returnDict:
        return json.loads(returnValue)
    return returnValue

In [None]:
output_image_analysis = promptLLM(prompt = "describe the image in detail",sysprompt = "you are a careful observer. the response should be in json format", image = encode_image(img), wantJson=True, returnDict=True)
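The schema in `promptLLM` is sent without a `strict` flag, and as far as we understand the API, adherence is then best-effort rather than guaranteed. A minimal, standard-library sketch of checking the returned dictionary against the same required keys, types, and ranges before passing it on. The helper `check_extraction` and both sample dictionaries are hypothetical; the values are made up for illustration and are not real model output.

```python
REQUIRED_TOP = {"numberOfPeople": int, "atmosphere": str, "hourOfTheDay": int, "people": list}
REQUIRED_PERSON = {"position": str, "age": int, "activity": str, "gender": str}
GENDERS = ("male", "female", "non-binary", "other", "prefer not to say")


def check_extraction(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means the dict looks valid."""
    problems = []
    for key, expected in REQUIRED_TOP.items():
        if not isinstance(data.get(key), expected):
            problems.append(f"{key}: expected {expected.__name__}")
    hour = data.get("hourOfTheDay")
    if isinstance(hour, int) and not 0 <= hour <= 23:
        problems.append("hourOfTheDay: out of range 0-23")
    people = data.get("people")
    for i, person in enumerate(people if isinstance(people, list) else []):
        if not isinstance(person, dict):
            problems.append(f"people[{i}]: expected an object")
            continue
        for key, expected in REQUIRED_PERSON.items():
            if not isinstance(person.get(key), expected):
                problems.append(f"people[{i}].{key}: expected {expected.__name__}")
        if person.get("gender") not in GENDERS:
            problems.append(f"people[{i}].gender: not in enum")
    return problems


# Made-up sample values, for illustration only.
good = {
    "numberOfPeople": 1,
    "atmosphere": "lively",
    "hourOfTheDay": 17,
    "people": [{"position": "lying down", "age": 20, "activity": "unconscious", "gender": "male"}],
}
bad = dict(good, hourOfTheDay=25, people=[{"position": "sitting", "age": "ten"}])

print(check_extraction(good))
print(check_extraction(bad))
```

In the notebook, `check_extraction(output_image_analysis)` would be called right after the request, and a non-empty result could trigger a retry.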

In [None]:
#alert service prompt

alert_sys_prompt = "You are an experienced first-aid paramedic."
alert_prompt = """From the following scene analysis, given to you in JSON format, determine
whether anyone might be in danger and whether the Child Hospital or the normal Hospital should be alerted.
Give a concise answer.
The situation is given to you in this object: """ + str(output_image_analysis)


In [None]:
promptLLM(prompt = alert_prompt, sysprompt= alert_sys_prompt)

'In this scene, the individual who is 20 years old and lying down is unconscious, indicating a potential medical emergency. The normal hospital should be alerted for this situation, as it does not specifically involve a child.'

In [None]:
promptLLM(prompt = "Considering the image analysis given: " + str(output_image_analysis) + " give me back the coordinates of the 20-year-old. If these are not available, infer them from the picture.", sysprompt= alert_sys_prompt)

'Based on the provided information, there are no specific coordinates given for the individuals in the image. However, I can help you infer the potential coordinates for the 20-year-old male who is lying down and unconscious.\n\nIn a busy urban environment at 17:00, the 20-year-old male could likely be located in a public area such as a park, sidewalk, or plaza. If we assume a typical layout, we might place him in a central area where people are likely to gather.\n\nFor example, if we consider a coordinate system where:\n- The x-axis represents the width of the area (0 to 100),\n- The y-axis represents the length of the area (0 to 100),\n\nWe could infer coordinates such as (50, 50) for the 20-year-old male lying down, assuming he is in a central location.\n\nPlease note that this is a hypothetical inference and actual coordinates would depend on the specific layout of the environment depicted in the image.'

In [None]:
promptLLM(prompt = "Detect if there is a person who is under 25 years old on the floor and return their coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.", sysprompt= alert_sys_prompt, image = encode_image(img))

"I'm unable to assist with that."


# 2. Google VLM (Gemini)
This section demonstrates the use of Google's Vision Language Model, Gemini.
We explore basic text generation as well as its ability to analyze images and provide relevant outputs.

**Support Material**:
- https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing


In [None]:
%matplotlib inline
import os
from dotenv import load_dotenv
import google.generativeai as genai
from PIL import Image

load_dotenv()
#genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

In [None]:
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)


Artificial intelligence (AI) is a broad field encompassing various techniques that enable computers to mimic human intelligence.  There's no single "how it works" answer, as different AI approaches use different methods. However, we can break down the core principles:

**1. Data is Key:**  AI systems learn from data.  The more data they're trained on, the better they generally perform. This data can be anything from images and text to sensor readings and financial transactions.  The quality and quantity of data are crucial for success.

**2. Algorithms and Models:**  Algorithms are sets of rules and instructions that tell the computer how to process data.  These algorithms are used to build models.  Think of a model as a representation of the data that the AI system has learned.  There are many types of models, including:

* **Machine Learning (ML):** This is a subset of AI where the model learns patterns from data without explicit programming.  Instead of being explicitly programmed w

In [None]:
im = Image.open(img)

genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content([
    im,
    (
        "Detect if there is a person who is under 18 years old on the floor and return their coordinates as a list in the format '[ymin, xmin, ymax, xmax]'. Just output the list.\n "
    ),
])
response.resolve()
print(response.text)

[698,325,964,594]


Gemini can be used to predict bounding boxes based on free-form text queries.
The model can be prompted to return the boxes in a variety of formats (dictionary, list, etc.), so the response text usually needs to be parsed before use.
See: https://colab.research.google.com/drive/1eDvf_Ky9jLOZFShgHrm4GI-wkAaQnue6?usp=sharing#scrollTo=WFLDgSztv77H
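A minimal sketch of that parsing step, assuming the box follows the `[ymin, xmin, ymax, xmax]` order requested in the prompt and that the values are normalized to a 0-1000 range, which is our understanding of Gemini's bounding-box convention (worth confirming against the linked Colab). The helper `parse_box` is hypothetical, and the 1920x1080 image size is a made-up example.

```python
import json
import re


def parse_box(text: str, width: int, height: int, scale: int = 1000) -> tuple:
    """Parse '[ymin, xmin, ymax, xmax]' and convert it to pixel (left, top, right, bottom)."""
    match = re.search(r"\[\s*\d+(?:\s*,\s*\d+){3}\s*\]", text)
    if match is None:
        raise ValueError(f"no bounding box found in: {text!r}")
    ymin, xmin, ymax, xmax = json.loads(match.group(0))
    # Assumption: coordinates are normalized to 0..scale and must be rescaled to pixels.
    return (
        round(xmin * width / scale),
        round(ymin * height / scale),
        round(xmax * width / scale),
        round(ymax * height / scale),
    )


# Response text from the cell above; the image size is a made-up example.
box = parse_box("[698,325,964,594]", width=1920, height=1080)
print(box)
```

In the notebook, the real size is available as `im.size` (width, height), and the resulting tuple has the order expected by `im.crop(...)`.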
