# Prerequisites
## Set Up Google Cloud account (billing, service, credentials)
- If you are new to Google Cloud, create an account. 
- In the Google Cloud console, create a Google Cloud project. 
- In the Google Cloud dashboard for this project, open the IAM & Admin service, create a service account, and generate a service account key.

## [Set up a Python development environment that runs on Google Cloud](https://cloud.google.com/python/docs/setup)
- Install Python.
- Use `venv` to create an isolated Python environment.
- Install the Cloud Client Libraries for Python in that environment.
- [Install and initialize the gcloud CLI.](https://cloud.google.com/sdk/docs/install-sdk)

## [Install Vertex AI SDK for Python](https://cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk) 
- Vertex AI is a managed machine learning platform on Google Cloud that lets ML engineers develop and maintain ML/AI models.
- ```
  pip install --upgrade google-cloud-aiplatform
  ```
- Enable all recommended Vertex AI APIs from the Vertex AI dashboard, e.g. Gemini Pro via the Gemini API.
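Before moving on, it can help to confirm that the interpreter is running inside the virtual environment and that the SDK is importable. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not part of any Google library, and it assumes that the `google-cloud-aiplatform` package exposes the top-level `vertexai` module used later in this notebook.

```python
import importlib.util
import sys
from typing import Any, Dict


def check_environment() -> Dict[str, Any]:
    """Report basic facts about the current Python environment."""
    return {
        # Interpreter version as a (major, minor, micro) tuple.
        "python_version": tuple(sys.version_info[:3]),
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # True if the `vertexai` module can be found on the import path.
        "vertexai_installed": importlib.util.find_spec("vertexai") is not None,
    }


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `vertexai_installed` is `False`, the `pip install` command above probably ran in a different environment than the one executing the notebook.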

  
## [Quickstart with Gemini API with Python](https://ai.google.dev/tutorials/python_quickstart)
- You can run the Jupyter notebook either in your local editor or in Google Colab.
- Getting-started/intro_gemini_python.ipynb [GitHub](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/getting-started/intro_gemini_python.ipynb) | [More Ex1](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/send-multimodal-prompts) | [Ex2](https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart-multimodal)

# Import Libraries

In [106]:
import IPython
import time
import re 
import http.client
import typing
import urllib.request
import pathlib
import textwrap
import IPython.display
from IPython.display import display
from IPython.display import Markdown
from IPython.display import Video

from PIL import Image as PIL_Image
from PIL import ImageOps as PIL_ImageOps

from vertexai.generative_models import (
    GenerationConfig,
    GenerativeModel,
    Image,
    Part,
)
# Aliased so that it does not shadow IPython.display.Video imported above.
from vertexai.vision_models import Video as VertexVideo

# Load the Gemini 1.0 Pro Vision Model
`gemini-1.0-pro-vision` is a multimodal model: a prompt can include text, one or more images, and video, and the model returns a text or code response.


In [2]:
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")
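The cell above assumes the SDK already knows which project and region to use. When it does not, `vertexai.init(project=..., location=...)` is typically called before the model is created. The sketch below shows one way to do that; the helper names, the environment variable names `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION`, and the fallback region `us-central1` are assumptions made for illustration, not settings taken from this notebook.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_vertex_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Read the project ID and region from environment-style settings."""
    env = os.environ if env is None else env
    project = env.get("GOOGLE_CLOUD_PROJECT")  # assumed variable name
    if not project:
        raise RuntimeError("Set GOOGLE_CLOUD_PROJECT to your Google Cloud project ID.")
    location = env.get("GOOGLE_CLOUD_REGION", "us-central1")  # assumed name and default
    return project, location


def init_vertex() -> None:
    """Initialize the Vertex AI SDK; imported lazily so this file loads without the SDK."""
    import vertexai

    project, location = resolve_vertex_settings()
    vertexai.init(project=project, location=location)


# Example with explicit settings instead of the real environment:
print(resolve_vertex_settings({"GOOGLE_CLOUD_PROJECT": "my-demo-project"}))
```

In a notebook, `init_vertex()` would be called once, before `GenerativeModel(...)` is constructed.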

# Helper functions for handling video content
* Load a video from a Cloud Storage URI
* Display the video inline
* Generate a response from a prompt and a video
* Display the response as Markdown

In [127]:
def load_video_from_gcs(gcs_uri: str):
    '''
    Load a video stored in Google Cloud Storage.
    :param gcs_uri: must be a Google Cloud Storage URI (starting with 'gs://') because `Part.from_uri(uri="gs://....)` only takes Cloud Storage URIs. The bucket that stores the file must be in the same Google Cloud project that is sending the request.
    :return: Part (vertexai.generative_models.Part, a part of a multi-part Content message).
    '''
    # Use the URI passed in by the caller rather than a hard-coded path.
    return Part.from_uri(gcs_uri, mime_type="video/mp4")
    

def display_video_from_gcs(gcs_uri: str):
    # Display the video inline in the Jupyter notebook for interactive use.
    video_url = "https://storage.cloud.google.com/" + gcs_uri.replace("gs://", "") + "?authuser=1"
    # Return the Video object: the notebook display hook only renders the last expression of a cell.
    return IPython.display.Video(video_url)
    
    
def generate_response_from_video(prompt: str, video: Part):
    content = [prompt, video]
    return multimodal_model.generate_content(content)


def to_markdown(text):
    # Turn bullet characters into Markdown list markers, then render the text as a block quote.
    text = text.replace('•', '  *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
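The display helper builds a browser URL by string replacement, which silently accepts inputs that are not `gs://` URIs. As a minimal sketch of a stricter conversion, the hypothetical function `gcs_uri_to_browser_url` below validates the URI first; it reuses the `storage.cloud.google.com` host from the helper above and leaves out the `authuser` query parameter, which depends on the accounts signed in to the browser.

```python
from urllib.parse import quote


def gcs_uri_to_browser_url(gcs_uri: str) -> str:
    """Convert 'gs://bucket/object' into a storage.cloud.google.com URL."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {gcs_uri!r}")
    # Split 'bucket/path/to/object' into the bucket and the object name.
    bucket, _, object_name = gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must include both a bucket and an object name: {gcs_uri!r}")
    # quote() percent-encodes unsafe characters but keeps '/' separators.
    return f"https://storage.cloud.google.com/{bucket}/{quote(object_name)}"


print(gcs_uri_to_browser_url("gs://vlm-videos/dump_trash.mp4"))
# prints: https://storage.cloud.google.com/vlm-videos/dump_trash.mp4
```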

# Load and display video in-line

In [123]:
gcsURI = "gs://vlm-videos/dump_trash.mp4" 
video = load_video_from_gcs(gcsURI)
display_video_from_gcs(gcsURI)


# Generate text response from a video file 
### Insight 1: The response varies between runs. Here are some example responses:
* `['A hand came into the scene and threw the tomatoes into the trash one by one',
 'Then the hand threw the empty bottle into the trash',
 'Finally, the hand balled up the paper towel and threw it into the trash']`

* `A person is throwing 2 tomatoes into a trash can.`
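One way to see how much the responses vary is to repeat the same request several times and tally the distinct answers. The sketch below does this with a stand-in generator so that it runs without calling the model; `tally_responses` is a hypothetical helper, and with the real model the `generate` argument could be something like `lambda: generate_response_from_video(prompt, video).text`.

```python
import itertools
from collections import Counter
from typing import Callable


def tally_responses(generate: Callable[[], str], runs: int = 5) -> Counter:
    """Call `generate` `runs` times and count each distinct response."""
    return Counter(generate().strip() for _ in range(runs))


# Stand-in for the model: alternates between two responses quoted above.
canned = itertools.cycle([
    "A person is throwing 2 tomatoes into a trash can.",
    "A hand came into the scene and threw the tomatoes into the trash one by one.",
])

tally = tally_responses(lambda: next(canned), runs=5)
for text, count in tally.most_common():
    print(count, text)
```

`GenerationConfig`, imported earlier, accepts a `temperature` setting; lower values typically reduce this variation, although they do not necessarily eliminate it.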

### Insight 2: Prompt engineering by breaking the task into sub-questions

In [128]:
prompt = """
Answer the following questions using the video only:
What action was performed? 
"""
response = generate_response_from_video(prompt, video)
to_markdown(response.text)

>  A person is throwing 2 tomatoes into a trash can.

In [129]:
prompt = """
Answer the following questions using the video only:
What action was performed?
How many objects are there in the video?
Which objects stay stationary in the video?
Which objects change their positions in the video? 
"""
contents = [prompt, video]
responses = multimodal_model.generate_content(contents)

In [154]:
to_markdown(responses.text)

>  1. A hand enters the scene and picks up a clear plastic bottle.
> 2. There are 5 objects in the scene: 
>  - a clear plastic bottle
>  - a can
>  - 2 tomatoes
>  - a paper towel
> 3. The clear plastic bottle, the can, and the paper towel stay stationary.
> 4. The hand and the tomatoes change their positions.
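When the list of sub-questions changes often, the prompt can be assembled from a Python list instead of being edited by hand. The following is a minimal sketch; `build_prompt` is a hypothetical helper, and numbering the questions in the prompt is a suggestion of this sketch rather than something the prompts above do. The model is not guaranteed to mirror that numbering in its answer.

```python
from typing import Sequence

PREAMBLE = "Answer the following questions using the video only:"


def build_prompt(questions: Sequence[str], preamble: str = PREAMBLE) -> str:
    """Join a preamble and numbered sub-questions into one prompt string."""
    numbered = [f"{i}. {q.strip()}" for i, q in enumerate(questions, start=1)]
    return "\n".join([preamble, *numbered]) + "\n"


sub_questions = [
    "What action was performed?",
    "How many objects are there in the video?",
    "Which objects stay stationary in the video?",
    "Which objects change their positions in the video?",
]
prompt_text = build_prompt(sub_questions)
print(prompt_text)
```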

In [156]:
parts = [r.strip() for r in re.split(r'\n?\d\.\s', responses.text) if r.strip()]
parts

['A hand enters the scene and picks up a clear plastic bottle.',
 'There are 5 objects in the scene: \n - a clear plastic bottle\n - a can\n - 2 tomatoes\n - a paper towel',
 'The clear plastic bottle, the can, and the paper towel stay stationary.',
 'The hand and the tomatoes change their positions.']
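The pattern `\n?\d\.\s` works for this response, but it matches a single digit anywhere in the text, so it would split on a sentence that ends in a number ("see step 2. Done") and would mishandle answers numbered 10 or higher. A minimal sketch of a stricter parser, which only treats `N.` at the start of a line as a marker and keeps the answer numbers, could look like this; `split_numbered_answers` is a hypothetical helper name, and `sample` reproduces the response shown above.

```python
import re
from typing import Dict


def split_numbered_answers(text: str) -> Dict[int, str]:
    """Map each answer number to its text, splitting only at line-initial 'N.' markers."""
    marker = re.compile(r"^[ \t]*(\d+)\.\s+", flags=re.MULTILINE)
    matches = list(marker.finditer(text))
    answers = {}
    for i, match in enumerate(matches):
        # Each answer runs from the end of its marker to the start of the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        answers[int(match.group(1))] = text[match.end():end].strip()
    return answers


sample = (
    " 1. A hand enters the scene and picks up a clear plastic bottle.\n"
    "2. There are 5 objects in the scene: \n"
    " - a clear plastic bottle\n - a can\n - 2 tomatoes\n - a paper towel\n"
    "3. The clear plastic bottle, the can, and the paper towel stay stationary.\n"
    "4. The hand and the tomatoes change their positions."
)
answers = split_numbered_answers(sample)
print(answers[3])
```

Keying the result by question number makes it possible to look up the answer to a specific sub-question even if the model skips one.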