<a href="https://colab.research.google.com/github/felizzi/Video_Gemini/blob/main/Solving_Video_Transcription_with_Gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Solving Video Transcription with Gemini

|           |                                                  |
| --------- | ------------------------------------------------ |
| Author(s) | [Laurent Picard](https://github.com/PicardParis) |

> ![Lab Preview](https://github.com/PicardParis/cherry-on-py-pics/raw/main/misc/Solving-Video-Transcription-with-Gemini.gif)

---

## 🔥 Challenge

To fully transcribe a video, we're looking to answer the following questions:

- 1️⃣ What was said and when?
- 2️⃣ Who are the speakers?
- 3️⃣ Who said what?

Can we solve this problem in a straightforward and efficient way?

In other words, consider this challenge: Can we transcribe any video with just the following?

- 1 video
- 1 prompt
- 1 request

Let's try with Gemini…


---

## 🌟 State of the art

### 1️⃣ What was said and when?

This is a known problem with a known solution:

- **Speech-to-text** (STT) is a process that takes an audio input and transforms speech into text. STT can provide timestamps at the word level. It's also known as automatic speech recognition (ASR).

In the last decade, it's been best addressed by task-specific machine learning (ML) models.

### 2️⃣ Who are the speakers?

We can retrieve speaker names in a video from two sources:

- **What's written** (e.g., speakers can be introduced with on-screen information when they first speak)
- **What's spoken** (e.g., "Hello Bob! Alice, how are you doing?")

Vision and natural-language-processing (NLP) models can help with the following features:

- Vision: **Optical character recognition** (OCR), also called text detection, extracts the text visible in images.
- Vision: **Person detection** lets you know if and where there are persons in an image.
- NLP: **Entity extraction** can identify named entities in text.

### 3️⃣ Who said what?

This is another known problem with a partial solution (complementary to speech-to-text):

- **Speaker diarization** (also known as speaker turn segmentation) is a process that splits an audio stream into segments for the different detected speakers ("Speaker A", "Speaker B", etc.).

Researchers have worked in this field for decades and, though machine learning (ML) models have brought significant progress in recent years, this is still a very active field of research. Existing solutions come with some shortcomings, as they generally require hints about the audio inputs (notably the language spoken and the number of speakers).
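
To make the complementarity between speech-to-text and speaker diarization concrete, here is a minimal sketch of how word-level timestamps and diarization segments could be merged to answer "who said what". The data, the `Word` and `Turn` types, and the `attribute_words` helper are all hypothetical; real STT and diarization outputs vary from one product to another.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # Seconds


@dataclass
class Turn:
    speaker: str
    start: float  # Seconds (inclusive)
    end: float  # Seconds (exclusive)


def attribute_words(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each word with the speaker whose turn contains the word's start time."""
    attributed = []
    for word in words:
        speaker = next(
            (t.speaker for t in turns if t.start <= word.start < t.end),
            "?",  # No matching turn
        )
        attributed.append((speaker, word.text))
    return attributed


# Hypothetical outputs from an STT model and from a diarization model
words = [Word("Hello", 0.2), Word("Bob!", 0.6), Word("Hi", 1.4), Word("Alice!", 1.8)]
turns = [Turn("Speaker A", 0.0, 1.0), Turn("Speaker B", 1.2, 2.5)]

for speaker, text in attribute_words(words, turns):
    print(f"{speaker}: {text}")
```

Note that the labels remain anonymous ("Speaker A", "Speaker B"): such a merge addresses 3️⃣ but still says nothing about 2️⃣.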

---

## 💡 A new problem-solving tool

Solving all of 1️⃣, 2️⃣, and 3️⃣ is far from obvious. This would probably involve setting up an elaborate supervised processing pipeline with a few state-of-the-art ML models. Additionally, as of early 2025, our challenge (advanced video transcription) doesn't look like a solved problem, so we may need days or weeks to set up such a pipeline, with no certainty of reaching a viable solution.

### 🎬 Multimodal

Gemini is natively multimodal, which means it can process the following inputs:

- text
- audio
- images
- videos
- documents

### 🌐 Multilingual

Gemini is also [multilingual](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#languages-gemini):

- It can process inputs and generate outputs in 100+ languages
- If we manage to reach a solution, this should also work for videos in different languages

### 🧰 A natural-language toolbox

Gemini allows for rapid prompt-based problem solving. With just text instructions, we can extract information and transform it into new information, in a straightforward and automated workflow. This lets us shift from relying on task-specific ML models to using a versatile large language model (LLM).

Being both multimodal and multilingual, Gemini lets us solve complex problems using natural language.

---

## 🏁 Setup


### 🐍 Python packages

We'll use the following packages:

- `google-genai`: the [Google Gen AI Python SDK](https://pypi.org/project/google-genai) lets us call Gemini with a few lines of code
- `pandas` for data visualization
- `tenacity` for request management

In [None]:
%pip install --quiet "google-genai>=1.2.0" "pandas[output-formatting]" tenacity

### 🔑 Authentication (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth as colab_auth  # type: ignore

    colab_auth.authenticate_user()

### ⚙️ Google Cloud settings

In this notebook, we'll use Vertex AI to send requests to Gemini.

To get started using Vertex AI, here are the requirements:
- An existing Google Cloud project
- The [Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com) must be enabled

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import os

# ⚙️ Project
GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}

if not GOOGLE_CLOUD_PROJECT:
    # Retrieve from environment (Colab Enterprise or Vertex AI Workbench)
    GOOGLE_CLOUD_PROJECT = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
assert GOOGLE_CLOUD_PROJECT, "GOOGLE_CLOUD_PROJECT is not defined"

# ⚙️ API location
# Update if needed, see https://cloud.google.com/vertex-ai/generative-ai/docs/learn/locations
GOOGLE_CLOUD_REGION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

### 🤖 Gen AI SDK client


In [None]:
from google import genai

genai_client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_REGION,
)

### 🧠 Gemini model & configuration


Gemini comes in different [versions](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

Let's pick Gemini 2.0 Flash, which offers both high performance and low latency:

```python
GEMINI_2_0_FLASH = "gemini-2.0-flash-001"
```

Gemini can be used in different ways, ranging from factual to creative mode. The problem we're trying to solve is a **data extraction** use case. We want results as factual and deterministic as possible. For this, we can change the [content generation parameters](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/content-generation-parameters).

Let's set the `temperature` and `seed` parameters to minimize randomness:

```python
from google.genai.types import GenerateContentConfig

DEFAULT_CONFIG = GenerateContentConfig(
    temperature=0.0,
    seed=42,
)
```
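
To build some intuition about why a low `temperature` reduces randomness, here is a toy sketch of temperature-scaled softmax, a common way to turn next-token scores into probabilities. It is an illustration only, not a description of Gemini's internals; the `token_probabilities` helper and the scores are hypothetical.

```python
import math


def token_probabilities(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities with a temperature-scaled softmax."""
    if temperature == 0.0:
        # Limit case: all the probability mass goes to the top-scoring token
        best = max(logits, key=logits.__getitem__)
        return {token: float(token == best) for token in logits}
    scaled = {token: score / temperature for token, score in logits.items()}
    max_scaled = max(scaled.values())  # Subtracted for numerical stability
    exps = {token: math.exp(s - max_scaled) for token, s in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Hypothetical scores for three candidate next tokens
logits = {"Hello": 2.0, "Hi": 1.5, "Hey": 0.5}

for temperature in (1.0, 0.5, 0.0):
    probs = token_probabilities(logits, temperature)
    print(temperature, {token: round(p, 2) for token, p in probs.items()})
```

As the temperature decreases, the top-scoring token takes a larger share of the probability mass. The `seed` parameter complements this by aiming to make any remaining random draw repeatable.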

### 🛠️ Helper functions

In [None]:
import enum
from dataclasses import dataclass
from datetime import timedelta

import IPython.display
import tenacity
from google.genai.errors import ClientError
from google.genai.types import (
    FileData,
    FinishReason,
    GenerateContentConfig,
    GenerateContentResponse,
    Part,
    VideoMetadata,
)


class Model(enum.Enum):
    # Generally Available (GA)
    GEMINI_2_0_FLASH = "gemini-2.0-flash-001"
    GEMINI_2_0_FLASH_LITE = "gemini-2.0-flash-lite-001"
    GEMINI_1_5_PRO = "gemini-1.5-pro-002"
    # Preview or Experimental
    GEMINI_2_0_PRO = "gemini-2.0-pro-exp-02-05"
    # Default model
    DEFAULT = GEMINI_2_0_FLASH


# Default configuration for more deterministic outputs
DEFAULT_CONFIG = GenerateContentConfig(
    temperature=0.0,
    seed=42,
)

YOUTUBE_URL_PREFIX = "https://www.youtube.com/watch?v="


def youtube_url_from_id(youtube_id: str) -> str:
    return f"{YOUTUBE_URL_PREFIX}{youtube_id}"


class Video(enum.Enum):
    pass


class TestVideo(Video):
    GDM_PODCAST_TRAILER_0MIN_59S = youtube_url_from_id("0pJn3g8dfwk")
    JANE_GOODALL_2MIN_42S = "gs://cloud-samples-data/video/JaneGoodall.mp4"
    GDM_ALPHAFOLD_7MIN_54S = youtube_url_from_id("gg7WjuFs8F4")
    BRUT_FR_DOGS_WATER_LEAK_8MIN_28S = youtube_url_from_id("U_yYkb-ureI")
    GDM_VIEW_FROM_FRONTIER_34MIN_38S = youtube_url_from_id("SM_vpRtg2Ac")


class ShowResponseAs(enum.Enum):
    DONT_SHOW = enum.auto()
    TEXT = enum.auto()
    MARDKDOWN = enum.auto()


@dataclass
class VideoSegment:
    start: timedelta
    end: timedelta


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ClientError),
    wait=tenacity.wait_fixed(30),
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
def generate_content(
    prompt: str,
    video: Video | None = None,
    video_segment: VideoSegment | None = None,
    model: Model = Model.DEFAULT,
    config: GenerateContentConfig = DEFAULT_CONFIG,
    show_as: ShowResponseAs = ShowResponseAs.TEXT,
) -> None:
    model_id = model.value
    prompt = prompt.strip()
    if video:
        contents = [content_part_from_video(video, video_segment), prompt]
        caption = f"{video.name} / {model_id}"
    else:
        contents = prompt
        caption = f"{model_id}"
    print(f" {caption} ".center(80, "-"))

    response = genai_client.models.generate_content(
        model=model_id,
        contents=contents,
        config=config,
    )
    show_response(response, show_as)


def content_part_from_video(
    video: Video,
    video_segment: VideoSegment | None = None,
) -> Part:
    def str_offset(offset: timedelta) -> str:
        return f"{offset.total_seconds():.0f}s"

    file_data = FileData(file_uri=video.value, mime_type="video/*")
    video_metadata = (
        None
        if video_segment is None
        else VideoMetadata(
            start_offset=str_offset(video_segment.start),
            end_offset=str_offset(video_segment.end),
        )
    )

    return Part(file_data=file_data, video_metadata=video_metadata)


def show_response(response: GenerateContentResponse, show_as: ShowResponseAs) -> None:
    if show_as == ShowResponseAs.DONT_SHOW:
        return
    if not response.candidates:
        print("❌ No `response.candidates`")
        return
    if response.candidates[0].finish_reason != FinishReason.STOP:
        print(f"❌ {response.candidates[0].finish_reason = }")
    if not (response_text := response.text):
        print("❌ No `response.text`")
        return
    response_text = response_text.strip()

    match show_as:
        case ShowResponseAs.TEXT:
            print(response_text)
        case ShowResponseAs.MARDKDOWN:
            display_markdown(response_text)


def display_markdown(markdown: str) -> None:
    IPython.display.display(IPython.display.Markdown(markdown))


def display_video(video: Video) -> None:
    video_url = video.value
    if video_url.startswith("gs://"):
        cloud_storage_path = video_url.removeprefix("gs://")
        video_url = f"https://storage.googleapis.com/{cloud_storage_path}"
    assert video_url.startswith("https://")

    video_width = 600
    if video_url.startswith(YOUTUBE_URL_PREFIX):
        youtube_id = video_url.removeprefix(YOUTUBE_URL_PREFIX)
        ipython_video = IPython.display.YouTubeVideo(youtube_id, width=video_width)
    else:
        ipython_video = IPython.display.Video(video_url, width=video_width)

    display_markdown(f"## Video ([source]({video_url}))")
    IPython.display.display(ipython_video)
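
The `@tenacity.retry` decorator on `generate_content` retries the request when a `ClientError` is raised, waits a fixed 30 seconds between attempts, stops after 5 attempts, and then re-raises the last exception. As a rough standard-library sketch of the same policy (the `call_with_retries` helper and its injectable `sleep` parameter are hypothetical, added here so the behavior can be observed without waiting):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: type[Exception],
    wait_seconds: float = 30.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on `retry_on` with a fixed wait between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the original exception
            sleep(wait_seconds)
    raise AssertionError("unreachable")


# Hypothetical flaky call: fails twice, then succeeds
attempts: list[int] = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits: list[float] = []  # Records the waits instead of actually sleeping
result = call_with_retries(flaky, ConnectionError, sleep=waits.append)
print(result, len(attempts), waits)
```

Other exception types are not retried and propagate immediately, which matches restricting the decorator to `ClientError`.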

---

## 🧪 Prototyping

### 🌱 Natural behavior

Before diving any deeper, it's interesting to see how Gemini responds to simple instructions, to develop some intuition about its natural behavior.

Let's first check what we get with minimalistic prompts and a short English video.

In [None]:
video = TestVideo.GDM_PODCAST_TRAILER_0MIN_59S
display_video(video)

In [None]:
prompt = "Transcribe the video"
generate_content(prompt, video)

Results:
- Gemini naturally outputs a list of `[timecode] transcript` lines.
- That's speech-to-text in a one-liner!
- It looks like we can answer 1️⃣ "What was said and when?".

Now, what about 2️⃣ "Who are the speakers?"

In [None]:
prompt = "List the people visible in the video"
generate_content(prompt, video)

Results:

- Gemini is able to consolidate the names visible on title cards during the video.
- That's OCR + entity extraction in a one-liner!
- 2️⃣ "Who are the speakers?" looks to be solved too!

### ⏩ Not so fast!

Then, the next natural reflex is to jump to final instructions to solve our problem once and for all.

In [None]:
prompt = """
Transcribe the video, including speaker names ("?" if not found).

Format example:
[00:02] John Doe: Hello Alice!
"""
generate_content(prompt, video)

This is almost fully correct: The first transcript is not attributed to the host, but everything else looks correct.

Nonetheless, we're not in real conditions:

- The video is very short (less than a minute)
- The video is also very simple (speakers alternate and are introduced by title cards)

Let's try with this 8-minute (and more complex) video:


In [None]:
generate_content(prompt, TestVideo.GDM_ALPHAFOLD_7MIN_54S)

This falls apart: Most transcripts have no speaker!

At this stage:

- We might conclude that we can't solve the problem with real-life videos.
- Persevering in trying more and more elaborate prompts for this unsolved problem may result in a waste of time.

Let's take a step back and think about what happens under the hood…


---

## ⚛️ Under the hood

### 🪙 Tokens

Tokens are the LLMs' building blocks. A token represents a piece of information.

Examples of Gemini multimodal tokens:

| content            | #tokens        | details                                |
| ------------------ | -------------- | -------------------------------------- |
| `hello`            | 1              | 1 token for common words/sequences     |
| `enthusiastic`     | 2              | `enthusi•astic`                        |
| `enthusiastically` | 3              | `enthusi•astic•ally`                   |
| image              | 258            |                                        |
| audio              | 32 per second  | Managed by the audio tokenizer         |
| video              | 263 per second | Sampled by the video tokenizer (1 fps) |
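
Using the per-second rates from the table, we can get a rough idea of how many input tokens a video represents. This is a back-of-the-envelope sketch: the `estimate_video_tokens` helper is hypothetical, and it assumes that the audio track is tokenized in addition to the video frames, which the table does not state. Actual counts depend on the model and on the media resolution.

```python
VIDEO_TOKENS_PER_SECOND = 263  # From the table above (1 fps sampling)
AUDIO_TOKENS_PER_SECOND = 32  # From the table above


def estimate_video_tokens(minutes: int, seconds: int, with_audio: bool = True) -> int:
    """Rough input token estimate for a video of the given duration."""
    duration = 60 * minutes + seconds
    rate = VIDEO_TOKENS_PER_SECOND
    if with_audio:
        # Assumption: the audio track is tokenized in addition to the frames
        rate += AUDIO_TOKENS_PER_SECOND
    return duration * rate


# Durations of two of the test videos used in this notebook
print(estimate_video_tokens(0, 59))
print(estimate_video_tokens(7, 54))
```

Even a short video weighs far more than a typical text prompt, which helps explain why video tokens can dominate the context for longer videos.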


### 🧮 Probabilities all the way down

The ability of LLMs to converse in fluent natural language is very impressive, but it's easy to get carried away and make wrong assumptions.

Keep in mind how LLMs work:

- LLMs are trained on massive tokenized datasets: this represents the LLM knowledge
- During the training, their neural network learns token patterns
- When you send a request to an LLM, your inputs are transformed into tokens
- To answer your request, the LLM predicts, token by token, the next likely tokens
- Overall, LLMs are exceptional statistical token prediction machines, but nothing more

This has a few consequences:

- LLM outputs are just a logical follow-up to your inputs (based on the LLM knowledge)
- LLMs seem to be able to reason but it's just an appearance; they have no real understanding
- LLMs have no awareness: they learnt patterns but are completely ignorant of their inner workings
- LLMs have no conscience: they are designed to generate tokens and will do so based on your instructions
- Order matters: Tokens that are generated first will influence tokens that are generated next
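
The "order matters" point can be illustrated with a deliberately simplistic toy generator. Here, the next token depends only on the previous one, whereas a real LLM conditions on the whole preceding context; the lookup table and the `generate` helper are hypothetical.

```python
# Toy "model": the most likely next token given the previous one (hypothetical)
NEXT_TOKEN = {
    "<start>": "[00:02]",
    "[00:02]": "voice_1:",
    "voice_1:": "Hello",
    "?:": "Hello",
    "Hello": "<end>",
}


def generate(prefix: list[str], max_tokens: int = 10) -> list[str]:
    """Greedy token-by-token generation: each token depends on what precedes it."""
    tokens = list(prefix)
    for _ in range(max_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<end>")
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["<start>"]))
print(generate(["<start>", "[00:02]", "?:"]))
```

In the second call, the unknown-speaker token is already part of the sequence: generation simply continues from it and never revisits it.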

For the next step, some methodical prompt engineering might help…

---

## 🏗️ Prompt engineering

### 🪜 Methodology

Prompt engineering is still a pretty recent field. It involves designing and refining text instructions to guide LLMs towards generating desired outputs. Like writing, it is both art and science, a skill everyone can develop with practice and discipline.

We can find countless reference materials about prompt engineering. In some of them, the prompts are very long, complex, and scary. Crafting prompts with a highly performant LLM like Gemini is much simpler. Here are key adjectives we can keep in mind:

- iterative
- precise
- concise

**Iterative**

Prompt engineering is typically a very iterative process. Here are some recommendations:

- Craft your prompt step by step
- Keep track of your successive iterations
- At every iteration, make sure to measure what's working vs. not working
- If you reach a regression, backtrack to a successful iteration

**Precise**

Precision is key:

- Use words as specific as possible
- Words with different meanings can introduce variability, so use precise expressions
- Precision will influence probabilities in your favor

**Concise**

Concision has additional advantages:

- A short prompt is more straightforward to understand (and maintain!) for us
- The longer your prompt is, the more likely you are to introduce inconsistencies or even contradictions, resulting in variable interpretations of your instructions
- Test and trust the LLM knowledge: it's part of your context and can make your prompt shorter

Overall, this may seem contradictory but, if you take the time to be iterative, precise, and concise, you are likely to save a lot of time.

### 📚 Terminology

We're not experts in video transcription but we want Gemini to behave as one. Consequently, we'd like to write prompts that are as specific as possible to this use case. While LLMs can only understand instructions that are part of their training knowledge, they can also share this knowledge with us.

We can learn a lot by directly asking Gemini:

In [None]:
prompt = """
What is the terminology used for video transcriptions?
Please show a typical output example.
"""
generate_content(prompt)

### 📝 Strategy

So far, we've seen the following:

- We did not manage to get the full transcription with identified speakers all at once
- Order matters (because a generated token will influence the probabilities for the next tokens)

To tackle our challenge, we need Gemini to infer from the following multimodal information:

- text (our instructions + what may be written in the video)
- audio (what's heard in the video)
- visual (what's visible in the video)
- time (when things happen)

That is quite a bunch of mixed types of information!

Video transcription is a data extraction use case, which can be seen as creating a database. If we follow this logic, prompt engineering can then be seen as creating a database with related tables. In addition, our final goal is to have an automated workflow, so we can start reasoning in terms of JSON fields.

Let's split our instructions into steps (tables) and in a meaningful order…


### 💬 Transcripts

First of all, let's focus on getting the transcripts:

- It is central and independent information.
- Gemini has proven to be natively good at it.

We've also seen what a typical transcription entry can look like:

```
00:02 speaker_1: Welcome!
```

But, right away, there can be some ambiguities in our multimodal use case:

- What is a speaker?
- Is it someone we see/hear?
- What if the person visible in the video is not the one speaking?
- What if the person speaking is never seen in the video?

How do we unconsciously identify who's speaking in a video?

- Probably first by identifying the different voices on the fly
- Then probably by consolidating additional audio and visual cues

Is Gemini able to understand voice characteristics?

In [None]:
prompt = """
List the following characteristics audible in the provided video:
- Voice pitches
- Accents
- Languages
"""
generate_content(prompt, TestVideo.GDM_PODCAST_TRAILER_0MIN_59S)

What about a French video?

In [None]:
generate_content(prompt, TestVideo.BRUT_FR_DOGS_WATER_LEAK_8MIN_28S)

⚠️ We have to be cautious about how we interpret the responses: they can consolidate multimodal info or even common knowledge. For example, if a person is famously known to be from the UK, a possible inference can be that they have a British accent.

Nonetheless, if you do more tests, especially on private content (not part of common knowledge), it looks like Gemini's audio tokenizer does wonders and extracts semantic information from speech!

After a few iterations, we can reach a transcription prompt focusing on the audio and on voices:

In [None]:
prompt = """
Transcripts
- Transcribe the video's audio.
  - Split overlapping speech into different transcript entries.
  - Include start timecodes (MM:SS) and verbatim transcripts.
  - Identify matching voices and label each voice with a unique ID (voice_1, voice_2…).
- Output a JSON array where each object has the following fields:
  - `timecode`
  - `verbatim`
  - `voice_id`
"""
generate_content(prompt, TestVideo.GDM_PODCAST_TRAILER_0MIN_59S)

This is looking good! And if you test the instructions on more complex videos, you'll get similarly promising results.

Notice how the prompt reuses cherry-picked terms from the previously requested terminology, while aiming for precision and concision:

- `timecode` is specific (`timestamp` has more meanings)
- `MM:SS` clarifies the timecode format
- `verbatim` is unambiguous ("spoken words" has more meanings)
- `voice_1, voice_2…` is an ellipsis but we're trusting Gemini's pattern abilities
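
Requesting an explicit `MM:SS` format also makes the timecodes easy to process downstream. As a minimal sketch (the `timedelta_from_timecode` helper is hypothetical and not used elsewhere in this notebook), a timecode can be converted into a `timedelta` as follows:

```python
import re
from datetime import timedelta

TIMECODE_PATTERN = re.compile(r"(\d+):([0-5]\d)")


def timedelta_from_timecode(timecode: str) -> timedelta:
    """Convert an MM:SS timecode into a timedelta (raises ValueError if malformed)."""
    if not (match := TIMECODE_PATTERN.fullmatch(timecode.strip())):
        raise ValueError(f"Unexpected timecode format: {timecode!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return timedelta(minutes=minutes, seconds=seconds)


print(timedelta_from_timecode("00:02"))
print(timedelta_from_timecode("34:38").total_seconds())
```

Raising on unexpected formats is a design choice: it surfaces any drift from the requested format early instead of silently producing wrong offsets.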

We're halfway. Let's complete our database generation with a second step…

### 🧑 Speakers


The second step is pretty straightforward. We want to extract speaker information in a second table. The two tables are logically linked by the voice ID.

After a few iterations, we can reach a two-step prompt such as the following:

In [None]:
prompt = """
Step 1 - Transcripts
- Transcribe the video's audio.
  - Split overlapping speech into different transcript entries.
  - Include start timecodes (MM:SS) and verbatim transcripts.
  - Identify matching voices and label each voice with a unique ID (voice_1, voice_2…).
- Output a JSON array where each object has the following fields:
  - `timecode`
  - `verbatim`
  - `voice_id`

Step 2 - Speakers
- For each `voice_id` from Step 1, extract information about the speaker.
  - Only use information explicitly stated in the video (use "n/a" otherwise).
- Output a JSON array where each object has the following fields:
  - `voice_id`
  - `name`
  - `role`
"""
generate_content(prompt, TestVideo.GDM_PODCAST_TRAILER_0MIN_59S)

Test the prompt on more complex videos: This keeps looking good!
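
To see how the two "tables" relate, here is a minimal sketch that joins them on `voice_id`. The JSON strings are hypothetical examples shaped like the two arrays requested in the prompt; they are not actual Gemini outputs.

```python
import json

# Hypothetical response contents, shaped like the two JSON arrays requested above
transcripts_json = """
[
  {"timecode": "00:00", "verbatim": "Welcome to the show!", "voice_id": "voice_1"},
  {"timecode": "00:04", "verbatim": "Thanks for having me.", "voice_id": "voice_2"}
]
"""
speakers_json = """
[
  {"voice_id": "voice_1", "name": "Alice", "role": "Host"},
  {"voice_id": "voice_2", "name": "n/a", "role": "n/a"}
]
"""

transcripts = json.loads(transcripts_json)
speakers_by_voice_id = {s["voice_id"]: s for s in json.loads(speakers_json)}

lines = []
for transcript in transcripts:
    speaker = speakers_by_voice_id.get(transcript["voice_id"], {})
    name = speaker.get("name", "n/a")
    # Fall back to the voice ID when the speaker name is unknown
    label = transcript["voice_id"] if name == "n/a" else name
    lines.append(f"[{transcript['timecode']}] {label}: {transcript['verbatim']}")

print("\n".join(lines))
```

The voice ID acts as the key linking both tables, so an unidentified speaker still gets a consistent label across the transcript.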

---

## 🧩 Structured output


We've iterated towards a precise and concise prompt. At this stage, we can focus on Gemini's response:

- It is a plain text response with fenced code blocks
- Instead, we'd like to get a structured output, so we receive consistently formatted responses
- Ideally, we'd also like to avoid having to parse the response (a maintenance burden)

Getting structured outputs is a feature also called "controlled generation". As we've already crafted our prompt in terms of data tables and JSON fields, this is now a formality. In our request, we can add the following parameters:

- `response_mime_type="application/json"`
- `response_schema="YOUR_JSON_SCHEMA"` ([see doc](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output#fields))

In Python, this gets even easier:

- Use the `pydantic` library
- Reflect your prompt structure with classes derived from `pydantic.BaseModel`

Our prompt (unchanged):

```txt
Step 1 - Transcripts
…
- Output a JSON array where each object has the following fields:
  - `timecode`
  - `verbatim`
  - `voice_id`

Step 2 - Speakers
…
- Output a JSON array where each object has the following fields:
  - `voice_id`
  - `name`
  - `role`
```

The corresponding Python classes:

```python
import pydantic

class Transcript(pydantic.BaseModel):
    timecode: str
    verbatim: str
    voice_id: str

class Speaker(pydantic.BaseModel):
    voice_id: str
    name: str
    role: str

class VideoTranscription(pydantic.BaseModel):
    step1_transcripts: list[Transcript] = pydantic.Field(default_factory=list)
    step2_speakers: list[Speaker] = pydantic.Field(default_factory=list)
```

Sending a request to Gemini:

```python
response = genai_client.models.generate_content(
    # …
    config=GenerateContentConfig(
        # …
        response_mime_type="application/json",
        response_schema=VideoTranscription,
        # …
    ),
)
```

Using the response from Gemini:

```python
if isinstance(response.parsed, VideoTranscription):
    video_transcription = response.parsed
else:
    video_transcription = VideoTranscription()  # Empty transcription
```

What's interesting with this approach:

- We don't need to change our prompt when using a response schema
- It's easy to change/maintain both prompt and classes in the same location
- The JSON schema is automatically generated from the class hierarchy and dispatched to Gemini
- The response is automatically parsed and deserialized into the corresponding Python objects
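
For contrast, here is a rough sketch of the manual parsing we'd otherwise have to write and maintain, using only the standard library. The class names, the `parse_transcription` helper, and the JSON text are hypothetical (the names differ on purpose from the `pydantic` classes above).

```python
import json
from dataclasses import dataclass, field


@dataclass
class TranscriptEntry:
    timecode: str
    verbatim: str
    voice_id: str


@dataclass
class SpeakerEntry:
    voice_id: str
    name: str
    role: str


@dataclass
class ManualTranscription:
    step1_transcripts: list[TranscriptEntry] = field(default_factory=list)
    step2_speakers: list[SpeakerEntry] = field(default_factory=list)


def parse_transcription(raw_json: str) -> ManualTranscription:
    """Manually map a JSON string onto the dataclasses (empty result on failure)."""
    try:
        data = json.loads(raw_json)
        return ManualTranscription(
            step1_transcripts=[TranscriptEntry(**t) for t in data["step1_transcripts"]],
            step2_speakers=[SpeakerEntry(**s) for s in data["step2_speakers"]],
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # Invalid JSON, missing array, or unexpected/missing fields
        return ManualTranscription()


# Hypothetical JSON text, shaped like the schema above
raw_json = """
{
  "step1_transcripts": [
    {"timecode": "00:00", "verbatim": "Welcome!", "voice_id": "voice_1"}
  ],
  "step2_speakers": [
    {"voice_id": "voice_1", "name": "Alice", "role": "Host"}
  ]
}
"""
manual = parse_transcription(raw_json)
print(manual.step2_speakers[0].name)
print(parse_transcription("not json"))
```

Unlike `pydantic`, this sketch doesn't validate field types, and every schema change has to be mirrored by hand.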

To put words into action, let's add a `company` field for the speakers and finalize our code…


In [None]:
import re

import pydantic
from google.genai.types import MediaResolution


UNKNOWN_DATA = "unknown"

VIDEO_TRANSCRIPTION_PROMPT = f"""
Step 1 - Transcripts
- Task:
  - Transcribe the video's audio verbatim, including all spoken words.
  - Include start timecodes for each turn in MM:SS format.
  - A "turn" is defined as everything heard from a single voice before another voice starts speaking.
  - Separate overlapping speech into distinct turns.
  - Assign a unique identifier to each distinct voice (voice_1, voice_2…).
- Output a JSON array where each object has the following fields:
  - `timecode`
  - `verbatim`
  - `voice_id`

Step 2 - Speakers
- Task:
  - For each `voice_id` from Step 1, extract information about the speaker.
  - Only use information explicitly stated in the video (use "{UNKNOWN_DATA}" otherwise).
- Output a JSON array where each object has the following fields:
  - `voice_id`
  - `name`
  - `role`
  - `company`
"""


class Transcript(pydantic.BaseModel):
    timecode: str
    verbatim: str
    voice_id: str


class Speaker(pydantic.BaseModel):
    voice_id: str
    name: str
    role: str
    company: str


class VideoTranscription(pydantic.BaseModel):
    step1_transcripts: list[Transcript] = pydantic.Field(default_factory=list)
    step2_speakers: list[Speaker] = pydantic.Field(default_factory=list)


def video_transcription_from_response(
    response: GenerateContentResponse,
) -> VideoTranscription:
    empty_transcription = VideoTranscription()

    if not response.candidates:
        print("❌ No `response.candidates`")
        return empty_transcription

    if response.candidates[0].finish_reason != FinishReason.STOP:
        print(f"❌ {response.candidates[0].finish_reason = }")
        return empty_transcription

    if not isinstance(response.parsed, VideoTranscription):
        print("❌ Could not parse the JSON response")
        return empty_transcription

    return response.parsed


def media_resolution_for_video(video: Video) -> MediaResolution:
    if not (match := re.search(r"_(\d+)MIN_(\d+)S$", video.name)):
        print(
            f"⚠️ No duration info in video enum: {video.name} (expected end: _?MIN_?S)"
        )
        return MediaResolution.MEDIA_RESOLUTION_MEDIUM

    # Arbitrary heuristic: reduce prevalence of video tokens for 5min+ videos
    if 60 * 5 <= 60 * int(match.group(1)) + int(match.group(2)):
        return MediaResolution.MEDIA_RESOLUTION_LOW
    else:
        return MediaResolution.MEDIA_RESOLUTION_MEDIUM


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ClientError),
    wait=tenacity.wait_fixed(30),
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
def get_video_transcription(
    video: Video,
    video_segment: VideoSegment | None = None,
    prompt: str = VIDEO_TRANSCRIPTION_PROMPT,
    model: Model = Model.DEFAULT,
) -> VideoTranscription:
    model_name = model.value
    contents = [content_part_from_video(video, video_segment), prompt.strip()]
    config = GenerateContentConfig(
        temperature=0.0,
        seed=42,
        response_mime_type="application/json",
        response_schema=VideoTranscription,
        media_resolution=media_resolution_for_video(video),
    )

    print(f" {video.name} / {model_name} ".center(80, "-"))
    response = genai_client.models.generate_content(
        model=model_name,
        contents=contents,
        config=config,
    )

    return video_transcription_from_response(response)


print(VIDEO_TRANSCRIPTION_PROMPT)

Let's test `get_video_transcription`:

In [None]:
transcription = get_video_transcription(TestVideo.GDM_PODCAST_TRAILER_0MIN_59S)

print(f"# Transcripts: {len(transcription.step1_transcripts)}")
print(f"# Speakers:    {len(transcription.step2_speakers)}")
for speaker in transcription.step2_speakers:
    print(f"  - {speaker}")

---

## 📊 Data visualization


We started prototyping in natural language, crafted a prompt, and generated a structured output. As reading raw data can be painful, we can now present video transcriptions in a more pleasant manner.

Here's a possible orchestrator function:

```python
def transcribe_video(video: Video):
    display_video(video)
    transcription = get_video_transcription(video)
    display_speakers(transcription)
    display_transcripts(transcription)
```

Let's add some data visualization functions…

In [None]:
from typing import Iterator

from pandas import DataFrame, Series
from pandas.io.formats.style import Styler
from pandas.io.formats.style_render import CSSDict


def yield_known_speaker_color() -> Iterator[str]:
    COLS_40 = ("#669DF6", "#EE675C", "#FCC934", "#5BB974")
    COLS_30 = ("#8AB4F8", "#F28B82", "#FDD663", "#81C995")
    COLS_20 = ("#AECBFA", "#F6AEA9", "#FDE293", "#A8DAB5")
    COLS_10 = ("#D2E3FC", "#FAD2CF", "#FEEFC3", "#CEEAD6")
    COLS_05 = ("#E8F0FE", "#FCE8E6", "#FEF7E0", "#E6F4EA")
    while True:
        yield from [*COLS_40, *COLS_30, *COLS_20, *COLS_10, *COLS_05]


def yield_unknown_speaker_color() -> Iterator[str]:
    GRAYS = ["#80868B", "#9AA0A6", "#BDC1C6", "#DADCE0", "#E8EAED", "#F1F3F4"]
    while True:
        yield from GRAYS


def color_for_voice_id_mapping(speakers: list[Speaker]) -> dict[str, str]:
    known_speaker_color = yield_known_speaker_color()
    unknown_speaker_color = yield_unknown_speaker_color()

    mapping: dict[str, str] = {}
    for speaker in speakers:
        if speaker.name != UNKNOWN_DATA or speaker.role != UNKNOWN_DATA:
            mapping[speaker.voice_id] = next(known_speaker_color)
        else:
            mapping[speaker.voice_id] = next(unknown_speaker_color)

    return mapping


def get_table_styler(df: DataFrame) -> Styler:
    def join_styles(styles: list[str]) -> str:
        return ";".join(styles)

    table_css = [
        "color: #202124",
        "border-collapse: collapse",
        "border: solid 0.25em #BDC1C6",
        "border-radius: 0.25em",
        "outline: 0.25em solid #BDC1C6",
    ]
    th_css = ["background-color:#E8EAED", "text-align:center"]
    th_td_css = ["padding:0.5ex 1ex", "text-align:left"]
    table_styles = [
        CSSDict(selector="", props=join_styles(table_css)),
        CSSDict(selector="th", props=join_styles(th_css)),
        CSSDict(selector="th,td", props=join_styles(th_td_css)),
    ]

    return df.style.set_table_styles(table_styles).hide()


def display_speakers(transcription: VideoTranscription) -> None:
    def sanitize_field(s: str, symbol_if_unknown: str) -> str:
        return symbol_if_unknown if s == UNKNOWN_DATA else s

    def yield_row() -> Iterator[list[str]]:
        yield ["voice_id", "name", "role", "company"]
        for speaker in transcription.step2_speakers:
            name = sanitize_field(speaker.name, "?")
            role = sanitize_field(speaker.role, "?")
            company = sanitize_field(speaker.company, "")
            yield [speaker.voice_id, name, role, company]

    def speaker_bgcolor(row: Series) -> list[str]:
        color = color_for_voice_id[row["voice_id"]]
        return [f"background-color:{color}"] * len(row)

    data = yield_row()
    color_for_voice_id = color_for_voice_id_mapping(transcription.step2_speakers)
    columns = next(data)

    df = DataFrame(columns=columns, data=data)
    styler = get_table_styler(df)
    styler.apply(speaker_bgcolor, axis=1)

    display_markdown(f"## Speakers ({len(transcription.step2_speakers)})")
    IPython.display.display(styler)


def display_transcripts(transcription: VideoTranscription) -> None:
    def yield_row() -> Iterator[list[str]]:
        yield ["voice_id", "timecode", "speaker", "transcript"]

        speaker_by_id = {
            speaker.voice_id: speaker for speaker in transcription.step2_speakers
        }
        previous_voice_id = None
        for transcript in transcription.step1_transcripts:
            speaker = speaker_by_id.get(transcript.voice_id, None)
            speaker_label = ""
            if speaker:
                if speaker.name != UNKNOWN_DATA:
                    speaker_label = speaker.name
                elif speaker.role != UNKNOWN_DATA:
                    speaker_label = f"[{speaker.role}]"
            if not speaker_label:
                speaker_label = f"[{transcript.voice_id}]"
            yield [
                transcript.voice_id,
                transcript.timecode,
                speaker_label if transcript.voice_id != previous_voice_id else '"',
                transcript.verbatim,
            ]
            previous_voice_id = transcript.voice_id

    def speaker_bgcolor(row: Series) -> list[str]:
        color = color_for_voice_id[row["voice_id"]]
        speaker_bgcolor = f"background-color:{color}"
        return [speaker_bgcolor] * len(row)

    data = yield_row()
    color_for_voice_id = color_for_voice_id_mapping(transcription.step2_speakers)
    df = DataFrame(columns=next(data), data=data)

    styler = get_table_styler(df)
    styler.apply(speaker_bgcolor, axis=1)
    styler.hide(["voice_id"], axis="columns")

    display_markdown(f"## Transcripts ({len(transcription.step1_transcripts)})")
    IPython.display.display(styler)


def transcribe_video(video: Video, video_segment: VideoSegment | None = None) -> None:
    display_video(video)
    transcription = get_video_transcription(video, video_segment)
    display_speakers(transcription)
    display_transcripts(transcription)

---

## ✅ Tests

### 🎞️ Short video with 6 speakers

In [None]:
transcribe_video(TestVideo.GDM_PODCAST_TRAILER_0MIN_59S)

### 🎞️ Video with no visible speaker

In [None]:
transcribe_video(TestVideo.JANE_GOODALL_2MIN_42S)

### 🎞️ French video with many speakers

In [None]:
transcribe_video(TestVideo.BRUT_FR_DOGS_WATER_LEAK_8MIN_28S)

### 🎞️ English video with many speakers

In [None]:
transcribe_video(TestVideo.GDM_ALPHAFOLD_7MIN_54S)

### 🎞️ Long video

In [None]:
transcribe_video(
    TestVideo.GDM_VIEW_FROM_FRONTIER_34MIN_38S,
    VideoSegment(
        start=timedelta(seconds=0),
        end=timedelta(minutes=19),
    ),
)

### 🎞️ Test your own videos

In [None]:
class MyVideo(Video):
    A_xMIN_yS = youtube_url_from_id("")
    B_xMIN_yS = "gs://bucket/path/to/video.*"
    C_xMIN_yS = "https://path/to/video.*"

# transcribe_video(MyVideo.)