# Multimodal Prompt Engineering with Google Gemini

## Objectives

In this notebook we will:
- Analyze images for insights
- Analyze audio for insights
- Understand videos including their audio components
- Extract relevant information from PDF documents
- Process images, video, audio, and text simultaneously

### Install Google GenAI Library for Python

In [None]:
!pip install google-generativeai==0.8.3

### Enter Gemini API Key

In [None]:
from getpass import getpass

GOOGLE_KEY = getpass('Enter Gemini API Key: ')
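
If the notebook is re-run often, typing the key each time can get tedious. The following is a minimal sketch, assuming the key may already be stored in an environment variable. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are illustrative choices, not something the library or this notebook requires.

```python
import os
from getpass import getpass


def load_api_key(env_var="GEMINI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is absent.

    `env_var` and `prompt_fn` are hypothetical parameters for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the interactive prompt used above
        key = prompt_fn("Enter Gemini API Key: ").strip()
    if not key:
        raise ValueError("No API key provided")
    return key
```

With this helper, the cell above could be written as `GOOGLE_KEY = load_api_key()`.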

### Import Libraries

In [None]:
import google.generativeai as genai

genai.configure(api_key=GOOGLE_KEY)

for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

### Load the Gemini 2.5 Flash Model

In [None]:
generation_config = genai.types.GenerationConfig(
    temperature=0
)
gemini = genai.GenerativeModel(model_name='gemini-2.5-flash',
                               generation_config=generation_config)

### Testing the LLM for Basic Usage

In [None]:
from IPython.display import Markdown, display

prompt = """
Explain the difference between Generative AI and Agentic AI in 3 bullet points
"""

response = gemini.generate_content(contents=prompt)
display(Markdown(response.text))
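
In `google-generativeai`, the `response.text` accessor raises a `ValueError` when the response contains no text parts, for example when a prompt is blocked. Here is a small sketch of a wrapper that returns a diagnostic string instead of raising. The helper name `text_or_reason` is hypothetical, and it only assumes the response object exposes `text` and, optionally, `prompt_feedback`.

```python
def text_or_reason(response):
    """Return response.text, or a short diagnostic if no text is available."""
    try:
        return response.text
    except ValueError as err:
        # prompt_feedback usually explains why a prompt was blocked
        feedback = getattr(response, "prompt_feedback", None)
        return f"[no text returned] {err} (prompt_feedback: {feedback})"
```

It can stand in wherever `response.text` is displayed, e.g. `display(Markdown(text_or_reason(response)))`.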

### Image Analysis

In [None]:
# Download images using curl
!curl https://i.imgur.com/6b9jwkk.png -o image1.png
!curl https://i.imgur.com/9CWuU2q.png -o image2.png

In [None]:
from IPython.display import Image as ImageDisp, display

display(ImageDisp('image1.png'))

In [None]:
display(ImageDisp('image2.png'))

In [None]:
from PIL import Image

image1 = Image.open('image1.png')
image2 = Image.open('image2.png')
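
PIL images are one way to pass image data. The library also accepts a dict with `mime_type` and `data` keys holding the raw bytes, which avoids decoding the image locally. Below is a minimal sketch of that alternative. The helper name `as_inline_part` is an assumption for illustration, and the MIME type is guessed from the file extension.

```python
import mimetypes
from pathlib import Path


def as_inline_part(path):
    """Build a {'mime_type', 'data'} dict from a local file."""
    path = Path(path)
    # Guess the MIME type from the extension, e.g. .png -> image/png
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        raise ValueError(f"Cannot infer MIME type for {path.name}")
    return {"mime_type": mime_type, "data": path.read_bytes()}
```

For example, `contents = [as_inline_part('image1.png'), as_inline_part('image2.png'), prompt]` should be usable in place of the PIL objects. Inline bytes count toward the request size, so this suits small files.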

In [None]:
# Using multimodal task-oriented prompting

prompt = """
Given the following images, which contain graphs, tables, and text, analyze
all of them to answer the following question:

Tell me about the top 5 years with the largest wildfires
"""

contents = [image1, image2, prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))

In [None]:
# Another example of multimodal task-oriented prompting

prompt = """
Given the following images, which contain graphs, tables, and text, analyze
all of them to answer the following question:

Tell me about the trend of wildfires in terms of acreage burned by region and
ownership
"""

contents = [image1, image2, prompt]
response = gemini.generate_content(contents)
display(Markdown(response.text))

### PDF Document Analysis

In [None]:
!wget https://arxiv.org/pdf/1706.03762.pdf

In [None]:
pdf_reference = genai.upload_file(path='./1706.03762.pdf')
pdf_reference

In [None]:
prompt = """
Given the PDF file, use it to answer the following question:

Tell me about the research paper mentioned here
"""

contents = [pdf_reference, prompt]
response = gemini.generate_content(contents)
Markdown(response.text)

### Audio Understanding

In [None]:
!wget "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/pixel.mp3"

In [None]:
audio_file = genai.upload_file(path='./pixel.mp3')

In [None]:
import IPython

IPython.display.Audio('./pixel.mp3')

In [None]:
audio_file

### Title Generation

In [None]:
prompt = """
Please provide a summary for the audio. Provide chapter titles with timestamps,
be concise and to the point, no need to provide chapter summaries. Do not make
up any information that is not part of the audio and do not be verbose.
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
print(response.text)

### Audio Transcriptions

In [None]:
prompt = """
Can you transcribe this interview in the format [timecode] - [speaker] : caption?
Use speaker A, speaker B, etc. to identify the speakers. Map each speaker to their
real name at the start of the output. Each speaker should have a single caption
based on their timestamp. Do not break up the transcript into multiple timestamps
for the same speaker. Show the output only for the part of the conversation about
the Pixel Watch and follow the format mentioned.
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
print(response.text)
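
Because the prompt requests a `[timecode] - [speaker] : caption` layout, the transcript can be turned into structured rows for further processing. This is a rough sketch, assuming the model follows that layout with `MM:SS` or `H:MM:SS` timecodes. The model's output format is not guaranteed, so lines that do not match are simply skipped. `parse_transcript` and `LINE_RE` are hypothetical names.

```python
import re

# Matches e.g. "[00:15] - [Speaker A] : Hello" (the brackets are optional)
LINE_RE = re.compile(
    r"^\[?(?P<timecode>\d{1,2}:\d{2}(?::\d{2})?)\]?\s*-\s*"
    r"\[?(?P<speaker>[^\]:]+?)\]?\s*:\s*(?P<caption>.+)$"
)


def parse_transcript(text):
    """Return a list of {'timecode', 'speaker', 'caption'} dicts."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # ignore headers, speaker maps, and blank lines
            rows.append(match.groupdict())
    return rows
```

Calling `parse_transcript(response.text)` on the output above returns one dict per caption line that matches the pattern.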

### Audio Summarization

In [None]:
prompt = """
Given the audio file, generate a comprehensive summary of:
- Key speakers
- Key products and features discussed
- Any other notable discussions
"""

contents = [audio_file, prompt]
response = gemini.generate_content(contents=contents)
display(Markdown(response.text))

### Video with Audio Understanding

In [None]:
!wget "https://storage.googleapis.com/cloud-samples-data/generative-ai/video/pixel8.mp4"

In [None]:
IPython.display.Video('pixel8.mp4', embed=True, width=450)

In [None]:
video_file = genai.upload_file(path='./pixel8.mp4')
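
Uploaded videos are processed by the File API before they can be used in a prompt, and a request sent while the file is still in the `PROCESSING` state may fail. Here is a minimal polling sketch. The helper `wait_until_active` and its `fetch`, `poll_seconds`, and `timeout` parameters are assumptions for illustration. `fetch` is any callable that returns refreshed file metadata by name, such as `genai.get_file`.

```python
import time


def wait_until_active(file, fetch, poll_seconds=5.0, timeout=600.0):
    """Poll until the uploaded file leaves the PROCESSING state."""
    deadline = time.monotonic() + timeout
    while file.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"{file.name} still processing after {timeout}s")
        time.sleep(poll_seconds)
        file = fetch(file.name)  # refresh the file metadata
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"{file.name} ended in state {file.state.name}")
    return file
```

In this notebook it would be called as `video_file = wait_until_active(video_file, genai.get_file)` before the video is passed to `generate_content`.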

In [None]:
prompt = """
Provide a comprehensive summary of the video. The summary should also include
anything important which people discuss in the video.
"""

contents = [video_file, prompt]
response = gemini.generate_content(contents=contents)
display(Markdown(response.text))

### All modalities at once (images, video, audio, and text)

In [None]:
!wget 'https://storage.googleapis.com/cloud-samples-data/generative-ai/video/behind_the_scenes_pixel.mp4'

In [None]:
IPython.display.Video('behind_the_scenes_pixel.mp4', embed=True, width=450)

In [None]:
!wget 'https://storage.googleapis.com/cloud-samples-data/generative-ai/image/a-man-and-a-dog.png'

In [None]:
IPython.display.Image('a-man-and-a-dog.png', width=450)

In [None]:
video_file = genai.upload_file(path='./behind_the_scenes_pixel.mp4')
image_file = genai.upload_file(path='./a-man-and-a-dog.png')

In [None]:
prompt = """
Look through each frame of the video carefully and answer the following
questions. Base your answers only on the information available in the
attached video. Do not make up any information that is not part of the
video, and summarize your answer in three bullets max.

Questions:
- At what point in the video does the provided image occur? Provide a timestamp.
- What is the context of this moment and what does the narrator say about it? Be
very specific.
"""

contents = [video_file, image_file, prompt]
response = gemini.generate_content(contents=contents)
display(Markdown(response.text))