In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to Long Context Window with Gemini on Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/long-context/intro_long_context.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Flong-context%2Fintro_long_context.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/long-context/intro_long_context.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/long-context/intro_long_context.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>


| | |
|-|-|
|Author(s) | [Holt Skinner](https://github.com/holtskinner) |

## Overview

Historically, large language models (LLMs) were significantly limited by the amount of text (or tokens) that could be passed to the model at one time. Gemini 1.5 Flash and Gemini 1.5 Pro support a 1 million token context window, with [near-perfect retrieval (>99%)](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf), which unlocks many new use cases and developer paradigms.

In practice, 1 million tokens would look like:

-   50,000 lines of code (with the standard 80 characters per line)
-   All the text messages you have sent in the last 5 years
-   8 average-length English novels
-   Transcripts of over 200 average-length podcast episodes
-   1 hour of video
-   ~45 minutes of video with audio
-   9.5 hours of audio

While the standard use case for most generative models is still text input, the Gemini 1.5 model family enables a new paradigm of multimodal use cases. These models can natively understand text, video, audio, and images.

In this notebook, we'll explore multimodal use cases of the long context window.

For more information, refer to the [Gemini documentation about long context](https://ai.google.dev/gemini-api/docs/long-context).

## Tokens

Tokens can be single characters like `z` or whole words like `cat`. Long words
are broken up into several tokens. The set of all tokens used by the model is
called the vocabulary, and the process of splitting text into tokens is called
_tokenization_.

> **Important:** For Gemini models, a token is equivalent to about 4 characters. 100 tokens is equal to about 60-80 English words.

For multimodal input, this is how tokens are calculated regardless of display or file size:

* Images: `258` tokens
* Video: `263` tokens per second
* Audio: `32` tokens per second
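
As a rough illustration of the figures in this section, the sketch below estimates a prompt's token count from its parts. The helper name `estimate_tokens` and its parameters are hypothetical and not part of the Vertex AI SDK; the rates are the approximate ones quoted above, so treat the result as a ballpark figure and rely on `count_tokens()` for real numbers.

```python
# Approximate rates quoted in this notebook (not exact tokenizer behavior).
CHARS_PER_TOKEN = 4
TOKENS_PER_IMAGE = 258
VIDEO_TOKENS_PER_SECOND = 263
AUDIO_TOKENS_PER_SECOND = 32


def estimate_tokens(text_chars=0, images=0, video_seconds=0, audio_seconds=0):
    """Return a rough token estimate for a mixed prompt (hypothetical helper)."""
    text_tokens = text_chars // CHARS_PER_TOKEN
    return (
        text_tokens
        + images * TOKENS_PER_IMAGE
        + video_seconds * VIDEO_TOKENS_PER_SECOND
        + audio_seconds * AUDIO_TOKENS_PER_SECOND
    )


# 50,000 lines of code at 80 characters per line
print(estimate_tokens(text_chars=50_000 * 80))

# One hour of video, counting only the video rate
print(estimate_tokens(video_seconds=60 * 60))
```

Under these rates, the first call gives 1,000,000 tokens and the second gives 946,800, which is in line with the "50,000 lines of code" and "1 hour of video" examples in the overview.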

## Why is the long context window useful?

The basic way you use the Gemini models is by passing information (context)
to the model, which will subsequently generate a response. An analogy for the
context window is short term memory. There is a limited amount of information
that can be stored in someone's short term memory, and the same is true for
generative models.

You can read more about how models work under the hood in our [generative models guide](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/overview).

Even though the models can take in more and more context, much of the
conventional wisdom about using large language models assumes this inherent
limitation on the model, which, as of 2024, is no longer the case.

Some common strategies to handle the limitation of small context windows
included:

-   Arbitrarily dropping old messages / text from the context window as new text
    comes in
-   Summarizing previous content and replacing it with the summary when the
    context window gets close to being full
-   Using RAG with semantic search to move data out of the context window and
    into a vector database
-   Using deterministic or generative filters to remove certain text /
    characters from prompts to save tokens

While many of these are still relevant in certain cases, the default place to start is now just putting all of the tokens into the context window. Because Gemini 1.5 models were purpose-built with a long context window, they are much more capable of in-context learning. This means that instructional materials provided in context can be highly effective for handling inputs that are not covered by the model's training data.
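
For comparison, the first of the older strategies listed above (dropping old messages as new text comes in) can be sketched as follows. This is a minimal illustration, not a recommended implementation: `trim_history` is a hypothetical helper, and it estimates tokens with the rough 4-characters-per-token rule rather than a real tokenizer.

```python
def trim_history(messages, max_tokens, chars_per_token=4):
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so the latest context survives.
    for message in reversed(messages):
        cost = max(1, len(message) // chars_per_token)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    # Restore chronological order before returning.
    kept.reverse()
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(trim_history(history, max_tokens=20)))
```

With a 20-token budget only the two newest messages are kept. With a long context window, the budget can often be large enough that nothing needs to be dropped at all.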

## Getting Started

### Install Vertex AI SDK for Python


In [2]:
# %pip install --upgrade --user --quiet google-cloud-aiplatform

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

In [3]:
# import sys

# if "google.colab" in sys.modules:
#     import IPython

#     app = IPython.Application.instance()
#     app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the cell below to authenticate your environment.


In [4]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [5]:
import os

from dotenv import load_dotenv

# Load PROJECT_ID from a local config.env file into the environment
load_dotenv("config.env")

# Define project information
PROJECT_ID = os.environ["PROJECT_ID"]  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries


In [6]:
from IPython.display import Markdown, display
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

### Load the Gemini 1.5 Flash model

To learn more, see all [Gemini API models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).


In [7]:
MODEL_ID = "gemini-1.5-flash-001"  # @param {type:"string"}

model = GenerativeModel(
    MODEL_ID, generation_config=GenerationConfig(max_output_tokens=8192)
)

## Long-form text

Text has proved to be the layer of intelligence underpinning much of the momentum around LLMs. As mentioned earlier, much of the practical limitation of LLMs came from not having a large enough context window for certain tasks. This led to the rapid adoption of retrieval augmented generation (RAG) and other techniques which dynamically provide the model with relevant
contextual information.

Some emerging and standard use cases for text-based long context include:

-   Summarizing large corpuses of text
    -   Previous summarization options with smaller context models would require
        a sliding window or another technique to keep state of previous sections
        as new tokens are passed to the model
-   Question and answering
    -   Historically this was only possible with RAG given the limited amount of
        context and models' factual recall being low
-   Agentic workflows
    -   Text is the underpinning of how agents keep state of what they have done
        and what they need to do; not having enough information about the world
        and the agent's goal is a limitation on the reliability of agents

[War and Peace by Leo Tolstoy](https://en.wikipedia.org/wiki/War_and_Peace) is considered one of the greatest literary works of all time; however, it is over 1,225 pages long, and the average reader will spend 37 hours and 48 minutes reading it at 250 WPM (words per minute). 😵‍💫 The text alone takes up 3.4 MB of storage space. However, the entire novel consists of fewer than 900,000 tokens, so it will fit within the Gemini context window.
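
The figures in the paragraph above can be sanity-checked with a little arithmetic. The sketch below assumes the text is mostly one byte per character and reuses the rough 4-characters-per-token rule from the Tokens section, so the token figure is only an estimate.

```python
# Figures quoted in the paragraph above
reading_minutes = 37 * 60 + 48
words_per_minute = 250
text_bytes = 3_400_000  # 3.4 MB, assuming roughly one byte per character
chars_per_token = 4  # rule of thumb from the Tokens section

estimated_words = reading_minutes * words_per_minute
estimated_tokens = text_bytes // chars_per_token

print(f"Estimated words: {estimated_words:,}")
print(f"Estimated tokens: {estimated_tokens:,}")
```

This works out to about 567,000 words and roughly 850,000 tokens, which is consistent with the novel fitting under 900,000 tokens.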

We are going to pass in the entire text into Gemini 1.5 Flash and get a detailed summary of the plot. For this example, we have the text of the novel from [Project Gutenberg](https://www.gutenberg.org/ebooks/2600) stored in a public Google Cloud Storage bucket.

First, we will use the `count_tokens()` method to examine the token count of the full prompt, then send the prompt to Gemini.

In [8]:
# Set contents to send to the model
contents = [
    "Provide a detailed summary of the following novel.",
    Part.from_uri(
        "gs://github-repo/generative-ai/gemini/long-context/WarAndPeace.txt",
        mime_type="text/plain",
    ),
]

# Counts tokens
print(model.count_tokens(contents))

# Prompt the model to generate content
response = model.generate_content(
    contents,
)

# Print the model response
print(f"\nUsage metadata:\n{response.usage_metadata}")

display(Markdown(response.text))

total_tokens: 839583
total_billable_characters: 43


Usage metadata:
prompt_token_count: 839583
candidates_token_count: 1335
total_token_count: 840918



## War and Peace: A Detailed Summary

Leo Tolstoy's epic novel, *War and Peace*, is a sprawling tale encompassing the lives of five aristocratic Russian families during the Napoleonic Wars. It explores themes of love, marriage, war, peace, and the individual's role within history. 

**The Story:**

* **Book One (1805):** The story begins in St. Petersburg where we meet Anna Pávlovna Schérer, a socialite hosting a gathering of high society. Among the guests are Prince Vasíli Kurágin, a scheming nobleman seeking a political appointment for his son; the beautiful Hélène, his daughter; and Prince Andrew Bolkónski, a disillusioned and cynical young aristocrat. 
* **Book Two (1805):** We witness the Russian army's arrival in Austria and the burgeoning romance between young Natásha Rostóva, a lively and vibrant girl, and Borís Drubetskóy, an officer seeking a position on Kutúzov’s staff. 
* **Book Three (1805):** The story shifts to the battle of Austerlitz, where Prince Andrew witnesses the brutality of war and his hero, Napoleon, in a less-than-ideal light. He suffers a serious wound and nearly dies, undergoing a spiritual transformation.
* **Book Four (1806):** Pierre Bezúkhov, an illegitimate son, inherits a vast fortune after his father's death, leading to a whirlwind of social expectations and manipulation by Prince Vasíli, who orchestrates his marriage to Hélène. 
* **Book Five (1806-07):** Pierre becomes a Freemason and embarks on a journey of self-improvement, ultimately seeking a more virtuous and meaningful life. He also divorces Hélène and distances himself from the Kurágins.
* **Book Six (1808-10):** Prince Andrew lives a more introspective and detached life in the countryside, focusing on his estates and writing. He encounters Pierre again, who tries to introduce him to Freemasonry, but Prince Andrew is still grappling with his disillusionment.
* **Book Seven (1810-11):** As Napoleon’s armies prepare for the invasion of Russia, we see both the political machinations and the social anxieties leading up to the war. Prince Andrew returns to the military, seeking a way to make a difference.
* **Book Eight (1811-12):** The invasion of Russia begins. We witness the battles of Austerlitz, Schön Grabern, and the capture of the Thabor Bridge at Vienna. Rostóv joins the army and experiences the horrors of war firsthand. 
* **Book Nine (1812):** The narrative centers on the Battle of Borodinó, a turning point in the war.  Prince Andrew is mortally wounded, and Pierre witnesses the horrors of the battlefield, leading him to a desire for change. 
* **Book Ten (1812):** Moscow is abandoned and burned, and Napoleon’s army retreats through Russia. The novel explores the Russian people’s response to the invasion and the impact of the war on both individuals and society.
* **Book Eleven (1812):** Pierre becomes a "citizen-soldier" after experiencing a profound spiritual transformation in captivity. He seeks to make a difference in the world.
* **Book Twelve (1812):** The French army flees from Russia in a disastrous retreat, with Napoleon making various misguided decisions. The novel highlights the importance of the guerrilla warfare carried out by the Russians.
* **Book Thirteen (1812):** The retreat continues, with the Russian army pursuing the weakened French forces. Prince Andrew dies in the arms of Natásha, who is deeply affected by his passing.
* **Book Fourteen (1812):** The novel explores the impact of the war on Russian society and the conflicting views about how to best defend the country. 
* **Book Fifteen (1812-13):** Natásha and Princess Mary find solace and support in each other after Prince Andrew’s death. Nicholas, after being freed from his commitment to Sónya, finds himself drawn to Princess Mary. Pierre's journey of self-discovery continues. 
* **Epilogue (1813-20):**  We see the long-term consequences of the war on the characters and their families. Natásha marries Pierre, and Nicholas marries Princess Mary. They find love and happiness, but the scars of war and the loss of loved ones are still felt deeply. The novel concludes with the lives of the characters evolving, embracing peace and new beginnings, but also acknowledging the enduring impact of the events they lived through.

**Themes:**

* **War and Peace:** The novel explores the profound impact of war on both individuals and society, contrasting the brutality of battle with the yearning for peace.
* **Love and Family:** Love, in all its forms (romantic, familial, and spiritual), is a central theme. The novel explores the complexities of marriage, the bond between siblings, and the importance of family.
* **Fate vs. Free Will:** The novel grapples with the interplay between fate and free will, considering how individual choices intersect with historical forces beyond our control.
* **Faith and Spirituality:**  The novel examines the role of faith in individual lives and the search for meaning in a chaotic and often unjust world.
* **Social Change:** The novel reflects the social upheaval of the early 19th century, exploring the dynamics of power, class, and the evolving roles of women in society.

**Significance:**

* War and Peace is considered one of the greatest novels ever written, lauded for its scope, realism, character development, and philosophical depth.
* Its exploration of universal themes, combined with its historical context, makes it a timeless work of literature relevant to readers today.
* The novel challenges traditional conceptions of heroism and war, questioning the role of individual actors in history and emphasizing the importance of the collective experience of humanity.

**In Conclusion:** 

War and Peace is a profound and moving exploration of the human condition, offering a rich tapestry of individual lives interwoven with the tapestry of history itself. Its insights into the nature of love, war, and human destiny continue to resonate with readers today. 


## Long-form video

Video content has been difficult to process due to constraints of the format itself.
It was hard to skim the content, transcripts often failed to capture the nuance of a video, and most tools didn't process images, text, and audio together.
The Gemini 1.5 long context window makes it possible to reason and answer questions about multimodal inputs with
sustained performance.

When tested on the needle in a video haystack problem with 1M tokens, Gemini 1.5 Flash obtained >99.8% recall of the video in the context window, and Gemini 1.5 Pro reached state of the art performance on the [Video-MME benchmark](https://video-mme.github.io/home_page.html).

Some emerging and standard use cases for video long context include:

-   Video question and answering
-   Video memory, as shown with [Google's Project Astra](https://deepmind.google/technologies/gemini/project-astra/)
-   Video captioning
-   Video recommendation systems, by enriching existing metadata with new
    multimodal understanding
-   Video customization, by looking at a corpus of data and associated video
    metadata and then removing parts of videos that are not relevant to the
    viewer
-   Video content moderation
-   Real-time video processing

[Google I/O](https://io.google/) is one of the major events when Google's developer tools are announced. Workshop sessions are filled with a lot of material, so it can be difficult to keep track of all that is discussed.

We are going to use a video of a session from Google I/O 2024 focused on [Grounding for Gemini](https://www.youtube.com/watch?v=v4s5eU2tfd4) to calculate tokens and process the information presented. We will ask a specific question about a point in the video and ask for a general summary.

In [9]:
# Set contents to send to the model
video = Part.from_uri(
    "gs://github-repo/generative-ai/gemini/long-context/GoogleIOGroundingRAG.mp4",
    mime_type="video/mp4",
)

contents = ["At what time in the following video is the Cymbal Starlight demo?", video]

# Counts tokens
print(model.count_tokens(contents))

# Prompt the model to generate content
response = model.generate_content(
    contents,
)

# Print the model response
print(f"\nUsage metadata:\n{response.usage_metadata}")

display(Markdown(response.text))

total_tokens: 628364
total_billable_characters: 54


Usage metadata:
prompt_token_count: 628364
candidates_token_count: 16
total_token_count: 628380



The Cymbal Starlight demo begins at 24:54. 
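
A timestamp in a free-text answer like the one above is easier to act on (for example, to seek a video player) once it is converted to seconds. The sketch below is one possible way to do that with a regular expression. `find_timestamps` is a hypothetical helper, and it assumes the model answers in `MM:SS` or `H:MM:SS` form, which is not guaranteed.

```python
import re

# Matches MM:SS or H:MM:SS; the hours group is optional.
TIMESTAMP_PATTERN = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b")


def find_timestamps(text):
    """Return (timestamp, total_seconds) pairs found in the text."""
    results = []
    for match in TIMESTAMP_PATTERN.finditer(text):
        hours, minutes, seconds = match.groups()
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        results.append((match.group(0), total))
    return results


print(find_timestamps("The Cymbal Starlight demo begins at 24:54."))
```

For the answer above, this yields a single match, `24:54`, which corresponds to 1,494 seconds into the video.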


In [10]:
contents = [
    "Provide an enthusiastic summary of the video, tailored for software developers.",
    video,
]

# Counts tokens
print(model.count_tokens(contents))

# Prompt the model to generate content
response = model.generate_content(contents)

# Print the model response
print(f"\nUsage metadata:\n{response.usage_metadata}")

display(Markdown(response.text))

total_tokens: 628363
total_billable_characters: 69


Usage metadata:
prompt_token_count: 628363
candidates_token_count: 113
total_token_count: 628476



This video is super exciting for any software developer interested in using Google Cloud's Vertex AI! The speaker dives into the fascinating world of Grounding for Gemini, exploring the capabilities and benefits of using Vertex AI Search and DIY RAG. Get ready to learn how to create custom search engines, build chatbots with real-world information, and even dive into the advanced world of multimodal retrieval augmented generation (RAG) using LangChain. This video is full of practical examples and valuable insights, making it a must-watch for anyone looking to enhance their AI development skills! 

## Long-form audio

To process audio, developers have typically needed to string together multiple models, such as a speech-to-text model and a text-to-text model. This added latency due to multiple round-trip requests, and the context of the audio itself could be lost.

The Gemini 1.5 models were the first natively multimodal large language models that could understand audio.

On standard audio-haystack evaluations, Gemini 1.5 Pro is able to find the hidden audio in 100% of the tests and Gemini 1.5 Flash is able to find it in 98.7% [of the tests](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf). Further, on a test set of 15-minute audio clips, Gemini 1.5 Pro achieves a word error rate (WER) of ~5.5%, much lower than even specialized speech-to-text models, without the added complexity of extra input segmentation and pre-processing.

The long context window accepts up to 9.5 hours of audio in a single request.
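
Before sending several audio files in one request, it can help to check their combined length against that limit. The sketch below is a minimal illustration: `audio_budget` is a hypothetical helper, the episode durations are made up, and the token estimate uses the approximate 32 tokens per second quoted in the Tokens section.

```python
MAX_AUDIO_SECONDS = int(9.5 * 3600)  # 9.5 hours, as stated above
AUDIO_TOKENS_PER_SECOND = 32  # approximate rate from the Tokens section


def audio_budget(durations_seconds):
    """Summarize total length and estimated tokens for a batch of audio files."""
    total_seconds = sum(durations_seconds)
    return {
        "total_seconds": total_seconds,
        "estimated_tokens": total_seconds * AUDIO_TOKENS_PER_SECOND,
        "within_limit": total_seconds <= MAX_AUDIO_SECONDS,
    }


# Nine hypothetical one-hour episodes
print(audio_budget([3600] * 9))
```

The estimate is only a planning aid; `count_tokens()` on the actual request remains the reliable way to confirm the size of a prompt.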

Some emerging and standard use cases for audio context include:

-   Real-time transcription and translation
-   Podcast / video question and answering
-   Meeting transcription and summarization
-   Voice assistants

Podcasts are a great way to learn about the latest news in technology, but there are so many out there that it can be difficult to follow them all. It's also challenging to find a specific episode with a given topic or a quote.

In this example, we will process 9 episodes of the [Google Kubernetes Podcast](https://cloud.google.com/podcasts/kubernetespodcast) and ask specific questions about the content.

In [11]:
# Set contents to send to the model
# Podcast episodes stored in a public Google Cloud Storage bucket
AUDIO_URI_PREFIX = "gs://github-repo/generative-ai/gemini/long-context"
EPISODE_FILES = [
    "20240417-kpod223.mp3",
    "20240430-kpod224.mp3",
    "20240515-kpod225.mp3",
    "20240529-kpod226.mp3",
    "20240606-kpod227.mp3",
    "20240611-kpod228.mp3",
    "20240625-kpod229.mp3",
    "20240709-kpod230.mp3",
    "20240723-kpod231.mp3",
]

contents = [
    "According to the following podcasts, what can you tell me about AI/ML workloads on Kubernetes?",
    # One audio Part per episode, in chronological order
    *(
        Part.from_uri(f"{AUDIO_URI_PREFIX}/{name}", mime_type="audio/mpeg")
        for name in EPISODE_FILES
    ),
]

# Counts tokens
print(model.count_tokens(contents))

# Prompt the model to generate content
response = model.generate_content(
    contents,
)

# Print the model response
print(f"\nUsage metadata:\n{response.usage_metadata}")

display(Markdown(response.text))

total_tokens: 1012279
total_billable_characters: 80


Usage metadata:
prompt_token_count: 1012279
candidates_token_count: 110
total_token_count: 1012389



The podcasts you provided do not mention anything about AI/ML workloads on Kubernetes, but they do discuss other related topics such as:

- Kubernetes as a platform for building platforms
- The use of operators for managing Kubernetes workloads
- The importance of observability for complex systems
- The challenges of scaling Kubernetes and the need for new solutions
- The role of the community in driving innovation in Kubernetes

The podcasts also discuss the importance of finding a way to monetize open source projects so that they can continue to be developed and maintained. 


## Code

For a long context window use case involving ingesting an entire GitHub repository, check out [Analyze a codebase with Vertex AI Gemini 1.5 Pro](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/analyze_codebase_with_gemini_1_5_pro.ipynb).

## Context caching

[Context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview) allows developers to reduce the time and cost of repeated requests using the large context window.
For examples of how to use context caching with Gemini on Vertex AI, refer to [Intro to Context Caching with Gemini on Vertex AI](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/context-caching/intro_context_caching.ipynb).