# Video Question Answering Using the Gemini Pro Vision Model

Note: At the moment, only video frames are supported. You can ask questions about a video, and the model answers based on its frames alone. Audio is not supported.

For now, videos referenced by URI must be uploaded to Google Cloud Storage and be accessible through a publicly available link.

In [1]:
%cd ..

/Users/isham993/Desktop/Programming-Tutorials/decoding-data-science/getting-started-with-gemini-models-demo


### Importing Necessary Libraries

In [2]:
from vertexai.preview.generative_models import (
    GenerativeModel,
    Part,
)
from google.oauth2 import service_account  # importing auth using service_account
import json

import os
import base64

### Authenticate Google Service Account Credentials

In [3]:
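import vertexai

with open(os.getenv("GOOGLE_APPLICATION_CREDENTIALS"), "r") as source:
    info = json.load(source)

# Build the credentials and hand them to Vertex AI explicitly.
credentials = service_account.Credentials.from_service_account_info(info)
vertexai.init(project=info["project_id"], credentials=credentials)

GEMINI_PRO_VISION = os.getenv("GEMINI_PRO_VISION")  # model name, e.g. "gemini-pro-vision"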
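
The cell above depends on two environment variables, and `os.getenv` silently returns `None` when one is missing. A minimal sketch of a guard is shown below; `require_env` is a hypothetical helper name introduced here for illustration, not part of the notebook or of any Google library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and one set to an empty string.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Possible usage, assuming both variables are exported before the notebook starts:
# credentials_path = require_env("GOOGLE_APPLICATION_CREDENTIALS")
# model_name = require_env("GEMINI_PRO_VISION")
```

Failing early this way tends to give a more readable error than the `TypeError` that `open(None)` would raise.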

### Model Instantiation

In [4]:
multimodal_model = GenerativeModel(GEMINI_PRO_VISION)

### Helper Functions

In [5]:
def convert_file_to_data(file_path: str, mime_type: str) -> Part:
    """Read a local file and wrap its raw bytes in a `Part` for the model."""
    with open(file_path, "rb") as file:
        # Part.from_data takes raw bytes, so no base64 round trip is needed.
        return Part.from_data(data=file.read(), mime_type=mime_type)
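
The helper above expects the caller to supply the MIME type. As an optional convenience, the type could be inferred from the file extension with the standard-library `mimetypes` module. The function name `guess_video_mime_type` is an assumption made for this sketch.

```python
import mimetypes


def guess_video_mime_type(file_path: str) -> str:
    """Infer a video MIME type from the file extension, or raise if it is not a video."""
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None or not mime_type.startswith("video/"):
        raise ValueError(f"Cannot infer a video MIME type for {file_path!r}")
    return mime_type


# Possible usage with the helper defined above:
# path = "./artifacts/sample_videos/person_doing_something.mp4"
# video = convert_file_to_data(file_path=path, mime_type=guess_video_mime_type(path))
```

The guess relies on the extension only and does not inspect the file contents.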

### Getting JSON Response

Let's ask the model about the video: the profession of the person shown and a few other details.

<video width="500" height="500" controls>
  <source src="./person_doing_something.mp4" type="video/mp4">
</video>

In [6]:
prompt = """
Answer the following questions using the video only:
What is the profession of the main person?
What is he doing exactly?
Which city was this recorded in?
Provide the answer as JSON.
"""
video = convert_file_to_data(
    file_path="./artifacts/sample_videos/person_doing_something.mp4",
    mime_type="video/mp4",
)
contents = [prompt, video]

responses = multimodal_model.generate_content(contents, stream=True)


print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


-------Response--------
 ```json
{
  "profession": "detective",
  "action": "inspecting a document with a magnifying glass",
  "city": "not possible to tell"
}
```
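
The response above arrives as text wrapped in a Markdown `json` fence, so it cannot be passed to `json.loads` directly. One way to turn it into a Python object is sketched below. `extract_json` is a hypothetical helper, and it assumes the model returns at most one fenced block, which is not guaranteed.

```python
import json
import re
from typing import Any

# Three backticks, built programmatically so the marker does not appear literally here.
FENCE = "`" * 3


def extract_json(text: str) -> Any:
    """Parse JSON from model output, with or without a surrounding Markdown fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    # Fall back to the whole text when no fence is present.
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Possible usage with a streamed response, collecting the chunks first:
# full_text = "".join(chunk.text for chunk in responses)
# answer = extract_json(full_text)
```

If the model returns malformed JSON, `json.loads` raises `json.JSONDecodeError`, which the caller may want to catch.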

### Extracting tags of objects throughout the video

Example of a video URL in Google Cloud Storage:

- `file_path = "github-repo/img/gemini/multimodality_usecases_overview/pixel8.mp4"`
- `video_url = f"https://storage.googleapis.com/{file_path}"`

For videos hosted in Google Cloud Storage, the model only accepts a `gs://` URI at this point:

- `video_uri = f"gs://{file_path}"`

So `https://storage.googleapis.com/` must be replaced with `gs://` before the location is passed to the `Part.from_uri` method.
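
The prefix swap described above can be sketched as a small function. The name `to_gs_uri` is introduced here for illustration only, and the sketch handles just the `https://storage.googleapis.com/` URL form shown above.

```python
HTTPS_PREFIX = "https://storage.googleapis.com/"
GS_PREFIX = "gs://"


def to_gs_uri(video_url: str) -> str:
    """Convert a public Cloud Storage URL into the equivalent gs:// URI."""
    if video_url.startswith(GS_PREFIX):
        # Already a gs:// URI; return it unchanged.
        return video_url
    if not video_url.startswith(HTTPS_PREFIX):
        raise ValueError(f"Not a storage.googleapis.com URL: {video_url!r}")
    return GS_PREFIX + video_url[len(HTTPS_PREFIX):]


# Possible usage:
# video = Part.from_uri(uri=to_gs_uri(video_url), mime_type="video/mp4")
```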

The Gemini Pro Vision model can also extract tags from a video. Let's see how it does on the following video.

<video width="500" height="500" controls>
  <source src="https://storage.googleapis.com/github-repo/img/gemini/multimodality_usecases_overview/photography.mp4" type="video/mp4">
</video>

In [7]:
prompt = """
Answer the following questions using the video only:
- What is in the video?
- What is the action in the video?
- Provide the 10 best tags for this video.
"""
video = Part.from_uri(
    uri="gs://github-repo/img/gemini/multimodality_usecases_overview/photography.mp4",
    mime_type="video/mp4",
)
contents = [prompt, video]

responses = multimodal_model.generate_content(contents, stream=True)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


-------Response--------
 - The video shows a man in a hat taking pictures of some artifacts on a table.
- The man is taking pictures of the artifacts.
- #photography, #art, #travel, #vacation, #beach, #sun, #sand, #water, #nature, #explore
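
The tags come back as hashtags inside free text. Some of them, such as `#beach` and `#sand`, do not obviously match the description the model gave, so pulling them into a list makes them easier to review or filter. A minimal sketch follows; `extract_tags` is a hypothetical helper and assumes tags are written as `#word`.

```python
import re


def extract_tags(text: str) -> list[str]:
    """Return every hashtag in the text, without the leading '#', in order of appearance."""
    return re.findall(r"#(\w+)", text)


# Possible usage on the collected response text:
# tags = extract_tags(full_text)
```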

### Retrieving extra information beyond the video

Let's ask which nearby towns the car is heading toward.

<video width="500" height="500" controls>
  <source src="./car_in_mountainous_area.mp4" type="video/mp4">
</video>

In [8]:
prompt_1 = "Which location is this?"
prompt_2 = "Where will the lane lead the car?"
prompt_3 = "List all of the nearby towns that are ahead of the vehicle."

video = convert_file_to_data(
    file_path="./artifacts/sample_videos/car_in_mountainous_area.mp4",
    mime_type="video/mp4",
)
contents = [prompt_1, prompt_2, prompt_3, video]

responses = multimodal_model.generate_content(contents, stream=True)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")


-------Response--------
 The location is Norway. The road is surrounded by mountains and a lake. The car is driving on a road that is next to a lake. The road is narrow and winding. The car is driving in the direction of the arrow. The nearby towns ahead of the vehicle are:

- Geiranger
- Hellesylt
- Stranda
- Sykkylven
- Ålesund