
##### Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Video understanding with Gemini

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

Gemini has been a multimodal model from the beginning, capable of analyzing all sorts of media using its [long context window](https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/).

[Gemini 2.0](https://ai.google.dev/gemini-api/docs/models/gemini-v2) and later bring video analysis to a whole new level as illustrated in [this video](https://www.youtube.com/watch?v=Mot-JEU26GQ):


In [None]:
#@title Building with Gemini 2.0: Video understanding
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/Mot-JEU26GQ?si=pcb7-_MZTSi_1Zkw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

This notebook shows you how to use Gemini to perform the same kinds of video analysis. Each example offers several prompts that you can select from a dropdown; feel free to experiment with your own as well.

You can also try the [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) with your own videos.

## Setup

This section installs the SDK, sets it up using your [API key](../quickstarts/Authentication.ipynb), imports the relevant libraries, downloads the sample videos, and uploads them to Gemini.

Expand the section if you are curious, or just run it (it should take a couple of minutes, since the files are large) and go straight to the examples.

### Install SDK

The new **[Google Gen AI SDK](https://ai.google.dev/gemini-api/docs/sdks)** provides programmatic access to Gemini 2.0 (and previous models) using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

You can find more details about the new SDK in the [documentation](https://ai.google.dev/gemini-api/docs/sdks) or in the [Getting started](../quickstarts/Get_started.ipynb) notebook.

In [16]:
%pip install -U -q 'google-genai'

### Setup your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [17]:
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
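The cell above only works inside Colab, because `google.colab` is not importable elsewhere. If you run the notebook locally, one option is to fall back to an environment variable with the same name. A minimal sketch, assuming you have exported `GOOGLE_API_KEY` in your shell; `get_api_key` is a hypothetical helper, not part of the SDK:

```python
import os


def get_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Hypothetical helper: read the key from a Colab Secret, else from an environment variable."""
    try:
        from google.colab import userdata  # Only importable inside Colab.
    except ImportError:
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"Set the {name} environment variable or Colab Secret.")
        return key
    return userdata.get(name)
```

You would then create the client with `genai.Client(api_key=get_api_key())`.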

### Initialize SDK client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using [Vertex AI](https://cloud.google.com/vertex-ai)). The model is now set in each call.

In [18]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

### Select the Gemini model

Video understanding works best with the Gemini 2.5 Pro model. You can also select earlier models to compare their behavior, but using at least a Gemini 2.0 model is recommended.

For more information about each Gemini model, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).


In [20]:
model_name = "gemini-2.5-flash-preview-04-17" # @param ["gemini-1.5-flash-latest","gemini-2.0-flash-lite","gemini-2.0-flash","gemini-2.5-flash-preview-04-17","gemini-2.5-pro-exp-05-06"] {"allow-input":true, isTemplate: true}

### Get sample videos

You will start with uploaded videos, as that is the more common use case, but later you will see that you can also use YouTube videos.

In [21]:
# Download the sample videos
!wget https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4 -O Pottery.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4 -O Trailcam.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4 -O User_study.mp4 -q

### Upload the videos

Upload all the videos using the File API. You can find more details about how to use it in the [Get Started](../quickstarts/Get_started.ipynb#scrollTo=KdUjkIQP-G_i) notebook.

This can take a couple of minutes as the videos will need to be processed and tokenized.

In [22]:
import time

def upload_video(video_file_name):
  # Upload the file, then poll the File API until processing has finished.
  video_file = client.files.upload(file=video_file_name)

  while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name)

  if video_file.state == "FAILED":
    raise ValueError(f'Processing failed for {video_file_name}')
  print(f'Video processing complete: {video_file.uri}')

  return video_file

# Upload the four sample videos downloaded above.
pottery_video = upload_video('Pottery.mp4')
trailcam_video = upload_video('Trailcam.mp4')
post_its_video = upload_video('Post_its.mp4')
user_study_video = upload_video('User_study.mp4')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/ndvzj4blgxo3
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/21b05ndbv8mg
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/rt570xd5e1v2
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/mt29kj60n8m1
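`upload_video` polls until processing finishes, but it would wait forever if a file stayed in `PROCESSING`. A minimal sketch of the same polling loop with a timeout; `wait_until_ready` is a hypothetical helper that takes any zero-argument function returning the current state, so it can be tried without calling the API:

```python
import time
from typing import Callable


def wait_until_ready(get_state: Callable[[], str],
                     poll_seconds: float = 10.0,
                     timeout_seconds: float = 600.0) -> str:
    """Poll get_state() until it leaves PROCESSING; give up after timeout_seconds."""
    deadline = time.monotonic() + timeout_seconds
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_seconds} s")
        time.sleep(poll_seconds)
        state = get_state()
    if state == "FAILED":
        raise ValueError("File processing failed")
    return state
```

With the SDK you might call it as `wait_until_ready(lambda: client.files.get(name=video_file.name).state)`, which relies on the same comparison against "PROCESSING" and "FAILED" that `upload_video` already uses.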


### Imports

In [23]:
import json
from PIL import Image
from IPython.display import display, Markdown, HTML

# Search within videos

First, try using the model to search within your videos and describe all the animal sightings in the trailcam video.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4" type="video/mp4"></video>

In [24]:
prompt = "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video."  # @param ["For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.", "Organize all scenes from this video in a table, along with timecode, a short description, a list of objects visible in the scene (with representative emojis) and an estimation of the level of excitement on a scale of 1 to 10"] {"allow-input":true}

video = trailcam_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

```json
[
  {
    "timecode": "00:00",
    "caption": "A camera view of the ground and leaves."
  },
  {
    "timecode": "00:01",
    "caption": "Two gray foxes are on the forest floor near rocks, sniffing the ground."
  },
  {
    "timecode": "00:17",
    "caption": "A mountain lion walks through the forest at night, sniffing the ground."
  },
  {
    "timecode": "00:28",
    "caption": "The mountain lion lifts its head and looks around."
  },
  {
    "timecode": "00:33",
    "caption": "The mountain lion walks away."
  },
  {
    "timecode": "00:35",
    "caption": "Two gray foxes are on the ground at night. One suddenly jumps up."
  },
  {
    "timecode": "00:45",
    "caption": "The two foxes interact briefly and then one runs off."
  },
  {
    "timecode": "00:50",
    "caption": "A loud noise is heard, and the camera lens flashes."
  },
  {
    "timecode": "00:51",
    "caption": "Several gray foxes are seen reacting to the noise, some running up rocky terrain."
  },
  {
    "timecode": "00:57",
    "caption": "A fox looks towards the camera with glowing eyes."
  },
  {
    "timecode": "01:00",
    "caption": "Foxes continue moving around on the rocks."
  },
  {
    "timecode": "01:05",
    "caption": "A mountain lion walks into view at night and looks around."
  },
  {
    "timecode": "01:10",
    "caption": "The mountain lion walks away."
  },
  {
    "timecode": "01:18",
    "caption": "Two mountain lions are walking on rocks at night, with their eyes glowing."
  },
  {
    "timecode": "01:24",
    "caption": "One mountain lion walks off a rock and past the camera."
  },
  {
    "timecode": "01:29",
    "caption": "A bobcat stands and looks around at night, then sniffs the ground."
  },
  {
    "timecode": "01:41",
    "caption": "The bobcat sits down briefly then stands back up, looking around."
  },
  {
    "timecode": "01:45",
    "caption": "The bobcat sniffs the ground again."
  },
  {
    "timecode": "01:51",
    "caption": "A black bear walks into view during the day, sniffing the ground."
  },
  {
    "timecode": "01:54",
    "caption": "The bear walks away."
  },
  {
    "timecode": "01:57",
    "caption": "A mountain lion walks through the forest at dusk/night, with glowing eyes."
  },
  {
    "timecode": "02:04",
    "caption": "The mountain lion walks out of frame."
  },
  {
    "timecode": "02:05",
    "caption": "A close view of a bear's fur as it walks past the camera."
  },
  {
    "timecode": "02:08",
    "caption": "The bear continues walking away."
  },
  {
    "timecode": "02:12",
    "caption": "Two young bears (cubs) are on the ground, sniffing."
  },
  {
    "timecode": "02:18",
    "caption": "The two cubs continue sniffing the ground."
  },
  {
    "timecode": "02:23",
    "caption": "A gray fox is on a hillside at night with city lights in the distance, sniffing the ground."
  },
  {
    "timecode": "02:31",
    "caption": "The fox sits and looks out at the city lights."
  },
  {
    "timecode": "02:35",
    "caption": "A black bear walks past the camera on the hillside at night."
  },
  {
    "timecode": "02:42",
    "caption": "A mountain lion walks into view on the hillside at night, sniffing the ground."
  },
  {
    "timecode": "02:49",
    "caption": "The mountain lion walks out of frame."
  },
  {
    "timecode": "02:52",
    "caption": "A mountain lion is sniffing the ground near a tree at night."
  },
  {
    "timecode": "03:00",
    "caption": "The mountain lion stands up and looks around."
  },
  {
    "timecode": "03:04",
    "caption": "The mountain lion walks away."
  },
  {
    "timecode": "03:05",
    "caption": "A black bear stands on the forest floor during the day, looking around and making sounds."
  },
  {
    "timecode": "03:13",
    "caption": "The bear looks to the side."
  },
  {
    "timecode": "03:22",
    "caption": "A black bear walks into view and sniffs the ground. Another bear joins it."
  },
  {
    "timecode": "03:30",
    "caption": "Both bears are sniffing the ground."
  },
  {
    "timecode": "03:39",
    "caption": "One bear walks close to the camera."
  },
  {
    "timecode": "03:41",
    "caption": "The two bears continue sniffing the ground."
  },
  {
    "timecode": "03:51",
    "caption": "One bear sits down and scratches."
  },
  {
    "timecode": "03:58",
    "caption": "The sitting bear stands up and walks away, followed by the other bear."
  },
  {
    "timecode": "04:02",
    "caption": "Two lighter colored bears walk towards the camera, sniffing the ground."
  },
  {
    "timecode": "04:12",
    "caption": "One lighter colored bear walks past the camera."
  },
  {
    "timecode": "04:22",
    "caption": "A bobcat sits and looks at the camera at night."
  },
  {
    "timecode": "04:25",
    "caption": "The bobcat walks off along a log."
  },
  {
    "timecode": "04:31",
    "caption": "A gray fox walks into view at night, looking at the camera and sniffing."
  },
  {
    "timecode": "04:40",
    "caption": "The fox walks away."
  },
  {
    "timecode": "04:44",
    "caption": "Another gray fox walks into view at night, looking at the camera."
  },
  {
    "timecode": "04:48",
    "caption": "The fox walks away."
  },
  {
    "timecode": "04:57",
    "caption": "A mountain lion is sniffing the ground near a tree at night."
  },
  {
    "timecode": "05:03",
    "caption": "The mountain lion looks around."
  },
  {
    "timecode": "05:06",
    "caption": "The mountain lion walks away."
  }
]
```

The prompt used here is quite generic, but you can get even better results if you customize it to your needs (for example, by asking specifically about foxes).

The [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows how you can postprocess this output so that clicking a timecode jumps directly to that part of the video. If you are interested, you can check the [code of that demo on GitHub](https://github.com/google-gemini/starter-applets/tree/main/video).
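Because the captions come back as a JSON list wrapped in a Markdown fence, you can also postprocess them in Python, for example to keep only the fox sightings and turn each timecode into a seek position. A minimal sketch, assuming the model keeps the `timecode`/`caption` keys shown above (the prompt does not enforce a schema). The helper names are hypothetical, and `seek_url` uses the standard HTML5 media-fragment `#t=` syntax, not necessarily what the AI Studio demo does internally:

```python
import json
import re

# Three backticks, built programmatically so they don't close this Markdown code block.
FENCE = "`" * 3


def parse_captions(response_text: str) -> list[dict]:
    """Parse a JSON list of {"timecode", "caption"} objects, with or without a Markdown fence."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    return json.loads(payload)


def timecode_to_seconds(timecode: str) -> int:
    """Convert "MM:SS" or "HH:MM:SS" into seconds."""
    seconds = 0
    for part in timecode.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def find_scenes(captions: list[dict], keyword: str) -> list[tuple[int, str]]:
    """Return (seconds, caption) for every caption that mentions keyword, ignoring case."""
    return [
        (timecode_to_seconds(c["timecode"]), c["caption"])
        for c in captions
        if keyword.lower() in c["caption"].lower()
    ]


def seek_url(video_url: str, seconds: int) -> str:
    """Build an HTML5 media-fragment URL that starts playback at `seconds`."""
    return f"{video_url}#t={seconds}"
```

For example, `find_scenes(parse_captions(response.text), "fox")` would list each fox caption with its start time in seconds.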

# Extract and organize text

Gemini can also read the text that appears in a video and extract it in an organized way. You can even use Gemini's reasoning capabilities to generate new ideas for you.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4" type="video/mp4"></video>

In [25]:
prompt = "Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?" # @param ["Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?", "Which of those names would fit an AI product that can resolve complex questions using its thinking abilities?"] {"allow-input":true}

video = post_its_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Okay, here are the transcribed project name ideas from the sticky notes, organized in a table, along with a few additional ideas based on similar themes.

**Brainstorm: Project Name Ideas from Sticky Notes**

| Project Name        | Potential Category/Theme        |
| :------------------ | :------------------------------ |
| Aether              | Mythology, Space                |
| Andromeda's Reach   | Space, Astronomy                |
| Astral Forge        | Space, Abstract, Fantasy        |
| Athena              | Mythology                       |
| Athena's Eye        | Mythology, Abstract             |
| Bayes Theorem       | Math, Science                   |
| Canis Major         | Space, Astronomy (Constellation) |
| Celestial Drift     | Space, Abstract                 |
| Centaurus           | Mythology, Astronomy            |
| Cerberus            | Mythology                       |
| Chaos Field         | Science, Abstract               |
| Chaos Theory        | Math, Science                   |
| Chimera Dream       | Mythology, Abstract, Fantasy    |
| Comets Tail         | Space, Astronomy                |
| Convergence         | Math, Science, Abstract         |
| Delphinus           | Mythology, Astronomy            |
| Draco               | Mythology, Astronomy            |
| Echo                | Mythology, Abstract             |
| Equilibrium         | Math, Science, Abstract         |
| Euler's Path        | Math, Science                   |
| Fractal             | Math, Science, Abstract         |
| Galactic Core       | Space, Astronomy                |
| Golden Ratio        | Math, Science, Abstract         |
| Hera                | Mythology                       |
| Infinity Loop       | Math, Science, Abstract         |
| Leo Minor           | Space, Astronomy (Constellation) |
| Lunar Eclipse       | Space, Astronomy                |
| Lyra                | Mythology, Astronomy            |
| Lynx                | Mythology, Astronomy            |
| Medusa              | Mythology                       |
| Odin                | Mythology                       |
| Orion's Belt        | Space, Astronomy (Constellation) |
| Orion's Sword       | Space, Astronomy (Constellation) |
| Pandora's Box       | Mythology, Abstract             |
| Persius Shield      | Mythology, Abstract             |
| Phoenix             | Mythology, Abstract             |
| Prometheus Rising   | Mythology, Abstract             |
| Riemann's Hypothesis| Math, Science                   |
| Sagitta             | Mythology, Astronomy            |
| Serpens             | Mythology, Astronomy            |
| Stellar Nexus       | Space, Astronomy                |
| Stokes Theorem      | Math, Science                   |
| Supernova Echo      | Space, Astronomy, Abstract      |
| Symmetry            | Math, Science, Abstract         |
| Taylor Series       | Math, Science                   |
| Titan               | Mythology, Space                |
| Vector              | Math, Science, Abstract         |
| Zephyr              | Mythology, Abstract             |

**A Few More Project Name Ideas (Based on similar themes):**

1.  **Singularity:** (Space, Math, Abstract - suggests a point of focus or origin)
2.  **Quantum Leap:** (Science, Abstract - suggests progress, advancement)
3.  **Aegis:** (Mythology, Abstract - suggests protection, defense)
4.  **Pulsar:** (Space, Astronomy, Science - suggests rhythm, power)
5.  **Chronos Engine:** (Mythology, Science, Abstract - suggests time, power, mechanics)
6.  **Valkyrie:** (Mythology - Norse, suggests strength, selection)
7.  **Horizon:** (Abstract, Space - suggests a boundary or future)
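The transcription comes back as a Markdown table, which is easy to read but not directly usable as data. If you want to filter or sort the names, for instance by theme, here is a minimal parser sketch, assuming a simple pipe table with a header row, an alignment row, and no escaped `|` characters inside cells (`parse_markdown_table` is a hypothetical helper):

```python
def _is_separator(cells: list[str]) -> bool:
    """True for the alignment row, e.g. | :---- | ---: |."""
    return all(cell and set(cell) <= set(":- ") for cell in cells)


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the first pipe table in `markdown` as a list of {header: cell} dicts."""
    rows = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            rows.append([cell.strip() for cell in stripped.strip("|").split("|")])
        elif rows:
            break  # The first table has ended.
    if not rows:
        return []
    header = rows[0]
    body = [row for row in rows[1:] if not _is_separator(row)]
    return [dict(zip(header, row)) for row in body]
```

For example, `[row["Project Name"] for row in parse_markdown_table(response.text) if "Space" in row["Potential Category/Theme"]]` would list the space-themed names, provided the model reuses the same column headers.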

# Structure information

Gemini 2.0 can not only read text but also reason about real-world objects and structure information about them, as in this video of a ceramics display with handwritten prices and notes.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4" type="video/mp4"></video>

In [None]:
prompt = "Give me a table of my items and notes" # @param ["Give me a table of my items and notes", "Help me come up with a selling pitch for my potteries"] {"allow-input":true}

video = pottery_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ],
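    config=types.GenerateContentConfig(
        # IPython's Markdown display can render text between $ signs as LaTeX math,
        # so ask the model to escape the dollar signs in prices.
        system_instruction="Don't forget to escape the dollar signs",
    )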
)

Markdown(response.text)

Okay, here is a table summarizing the items and notes shown in the image:

| Item          | Description / Notes                       | Price |
| :------------ | :---------------------------------------- | :---- |
| Tumblers      | Glaze: #5 Artichoke double dip<br>4"h x 3"d (-ish) | \$20  |
| Small bowls   | 3.5"h x 6.5"d                             | \$35  |
| Med bowls     | 4"h x 7"d                                 | \$40  |
| *Glaze Info*  | #5 Artichoke double dip (Test tile shown) | N/A   |
| *Glaze Info*  | #6 Gemini double dip, SLOW COOL (Test tile shown, marked 6rb) | N/A   |

**Note:** The glaze for the small and medium bowls appears to be the "#6 Gemini double dip" based on visual similarity to the test tile, although the notes next to the bowls don't explicitly state the glaze name.

As you can see, Gemini is able to work out which item each note corresponds to, including the last one.
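The system instruction in the previous cell is there because the `Markdown` display can interpret text between two `$` signs as LaTeX math, which garbles prices. Rather than relying on the model to follow that instruction, you could escape the dollar signs yourself before rendering. A minimal sketch (the helper name is illustrative):

```python
import re


def escape_dollars(markdown: str) -> str:
    """Prefix every $ not already preceded by a backslash, so it renders literally instead of as math."""
    return re.sub(r"(?<!\\)\$", r"\\$", markdown)
```

`Markdown(escape_dollars(response.text))` would then render the prices as plain text; dollar signs the model has already escaped are left alone.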

# Analyze screen recordings for key moments

You can also use the model to analyze screen recordings. Let's say you're doing user studies on how people use your product, so you end up with lots of screen recordings, like this one, that you have to manually comb through.
With just one prompt, the model can describe all the actions in your video.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4" type="video/mp4"></video>

In [None]:
prompt = "Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes." # @param ["Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes.", "Choose 5 key shots from this video and put them in a table with the timecode, text description of 10 words or less, and a list of objects visible in the scene (with representative emojis).", "Generate bullet points for the video. Place each bullet point into an object with the timecode of the bullet point in the video."] {"allow-input":true}

video = user_study_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Here is a summary of the video:

(00:00-00:10) The video displays a mobile application called "My Garden App" showcasing various plants available for purchase.
(00:10-00:17) The user interacts with the app by clicking the "Like" button for the Rose Plant, Fern, and Cactus, turning the buttons red.
(00:13-00:25) They proceed to add the Fern, Cactus, and Hibiscus plants to the shopping cart, indicated by the "Add to Cart" button briefly changing to "Added!".
(00:29-00:34) The user navigates to the "Cart" tab, showing the three selected items and the total price, and then briefly views the "Profile" tab showing counts for liked plants and cart items.
(00:37-00:45) After returning to the home screen, the user unlikes the Hibiscus, likes the Snake Plant, and adds the Orchid to their cart.
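Each sentence of the summary starts with a `(MM:SS-MM:SS)` range, so across many screen recordings you could extract those ranges and compare how long participants spend on each step. A minimal sketch, assuming the model keeps exactly this format (the prompt asks for timecodes but not for a fixed syntax); `parse_moments` is a hypothetical helper:

```python
import re

# Matches lines such as "(00:10-00:17) The user interacts with the app ..."
RANGE_RE = re.compile(r"^\((\d{1,2}:\d{2})-(\d{1,2}:\d{2})\)\s*(.+)$")


def mmss_to_seconds(timecode: str) -> int:
    """Convert "MM:SS" into seconds."""
    minutes, seconds = timecode.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_moments(summary: str) -> list[tuple[int, int, str]]:
    """Extract (start_s, end_s, text) tuples from a timecoded summary."""
    moments = []
    for line in summary.splitlines():
        match = RANGE_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            moments.append((mmss_to_seconds(start), mmss_to_seconds(end), text))
    return moments
```

Note that ranges can overlap, as 00:10-00:17 and 00:13-00:25 do above, so summing their lengths overestimates the total time covered.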

# Analyze YouTube videos

On top of using your own videos, you can also ask Gemini to fetch a video from YouTube and analyze it. Here's an example: can you guess what the video is about?


In [26]:
response = client.models.generate_content(
    model=model_name,
    contents=types.Content(
        parts=[
            types.Part(text="What is the content of this video?"),
            types.Part(
                file_data=types.FileData(file_uri='https://youtu.be/trgzYWgTvuY')
            )
        ]
    )
)

Markdown(response.text)

Based on the video content from 00:00 onwards:

The video shows a player named David in the game **Digimon Card Battle (PS1)**. He is in the **Battle Arena**, specifically at the **Jungle City** location.

The player chooses to battle against an opponent named **Ninjamon**. Ninjamon uses a special "Switching Deck" which allows it to quickly switch between Digimon levels R and C.

The video then shows the battle:
1.  **Preparation Phase:** Both players draw cards and put Digimon into play (Ninjamon starts with Tentomon, David starts with Agumon).
2.  **Digivolve Phase:** Both players gain Digivolve Points (DP).
3.  **Battle Phase:**
    *   David uses a Support Card to boost Agumon's attack power (+300).
    *   Agumon attacks Tentomon with a "Deadly Attack".
    *   Ninjamon's Tentomon is defeated.
    *   Ninjamon brings out a new Digimon (Ninjamon C).
    *   David Digivolves Agumon into Flarelizamon (recovers HP).
    *   Ninjamon Digivolves into Palmon (R level).
    *   There is no Battle Phase as Ninjamon has no Digimon in play (Palmon was defeated).
    *   Ninjamon brings out Ninjamon (C level) again.
    *   David uses a Support Card ("Attack Chip") to boost Flarelizamon's attack power (+300).
    *   Flarelizamon attacks Ninjamon (C) with a "Deadly Attack".
    *   Ninjamon's Digimon is defeated.

The battle concludes with **David winning**. The post-battle screen shows David's record (7 Wins, 0 Losses) and Ninjamon's (0 Wins, 1 Loss). David earns experience points (with bonuses for winning using only Circle attacks, not discarding, and not losing) and receives new cards (BomberNanmon, Tsukaomon, Dokunemmon) and a Digi-Part (Counterattack).

The person in the picture-in-picture window is the player, providing commentary in Indonesian throughout the match, explaining his strategy and reacting to the opponent's moves and the outcome.

Once again, the [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows an example of how to postprocess this output. Check the [code of that demo](https://github.com/google-gemini/starter-applets/tree/main/video) for more details.

# Next Steps

Try your own videos using the [AI Studio live demo](https://aistudio.google.com/starter-apps/video), or play with the examples from this notebook (in case you missed them, there are other prompts you can try in the dropdowns).

For more examples of Gemini's capabilities, check the other guides in the [Cookbook](https://github.com/google-gemini/cookbook/). You'll learn how to use the [Live API](../quickstarts/Get_started_LiveAPI.ipynb), juggle [multiple tools](../quickstarts/Get_started_LiveAPI_tools.ipynb), or use Gemini 2.0's [spatial understanding](../quickstarts/Spatial_understanding.ipynb) abilities.

The [examples](https://github.com/google-gemini/cookbook/tree/main/examples/) folder of the Cookbook is also full of nice code samples illustrating creative ways to use Gemini's multimodal capabilities and long context.