##### Copyright 2025 Google LLC.

In [None]:
# Upload the file using the API
file_upload = client.files.upload(file=text_path)

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        file_upload,
        "Can you give me a summary of this information please?",
    ]
)

Markdown(response.text)



In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Upload a PDF file

This PDF is an article from the Google Research Blog titled [Smoothly editing material properties of objects with text-to-image models and synthetic data](https://research.google/blog/smoothly-editing-material-properties-of-objects-with-text-to-image-models-and-synthetic-data/).

In [None]:
# Prepare the file to be uploaded
PDF = "https://storage.googleapis.com/generativeai-downloads/data/Smoothly%20editing%20material%20properties%20of%20objects%20with%20text-to-image%20models%20and%20synthetic%20data.pdf"  # @param {type: "string"}
pdf_bytes = requests.get(PDF).content

pdf_path = pathlib.Path('article.pdf')
pdf_path.write_bytes(pdf_bytes)

6695391
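
The number printed under the cell is the return value of `Path.write_bytes`, which reports how many bytes were written to disk. As a minimal sketch, the hypothetical helper below (`save_download` is not part of the notebook or the SDK) uses a placeholder payload and a temporary directory instead of the real PDF, and shows how that return value can double as a quick check that a download was not empty:

```python
import pathlib
import tempfile


def save_download(data: bytes, path: pathlib.Path) -> int:
    """Write downloaded bytes to disk and return the byte count (hypothetical helper)."""
    if not data:
        raise ValueError("Download returned no data")
    # write_bytes returns the number of bytes written
    return path.write_bytes(data)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = pathlib.Path(tmp) / "demo.pdf"
    demo_size = save_download(b"%PDF-1.4 placeholder", demo_path)
    size_on_disk = demo_path.stat().st_size
    print(demo_size, size_on_disk)
```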

In [None]:
# Upload the file using the API
file_upload = client.files.upload(file=pdf_path)

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        file_upload,
        "Can you summarize this file as a bulleted list?",
    ]
)

Markdown(response.text)

Here is a summary of the article as a bulleted list:

*   The article presents a method called "Alchemist" for smoothly and parametrically editing the material properties (like color, shininess, or transparency) of objects in photographs.
*   The goal is to achieve photorealistic edits while preserving the object's shape and the original scene lighting.
*   Existing methods, such as intrinsic image decomposition or direct text-to-image model editing, struggle with the ambiguity of material properties or fail to disentangle material from shape.
*   The proposed method leverages the photorealistic capabilities of generative text-to-image (T2I) models by fine-tuning them on a large synthetic dataset.
*   The synthetic dataset is created by rendering 3D models of objects with varying material attributes and systematically changing one attribute at a time (e.g., roughness, transparency) according to a scalar "edit strength" value.
*   A modified Stable Diffusion 1.5 model is trained to accept an input image, an edit instruction, and the desired edit strength, learning to translate these inputs into an output image with the edited material property.
*   The model successfully generalizes to real-world images, producing photorealistic material changes while largely maintaining the original object's geometry and lighting.
*   It can realistically render complex effects like backgrounds visible through transparent objects and caustic lighting effects.
*   A user study showed that the method's edits were significantly more photorealistic and preferred over a baseline method (InstructPix2Pix).
*   Potential applications include creating product mock-ups and enabling 3D consistent material editing when combined with techniques like NeRF.
*   The research was presented in a paper at CVPR 2024.

### Upload an audio file

In this case, you'll use a [sound recording](https://www.jfklibrary.org/asset-viewer/archives/jfkwha-006) of President John F. Kennedy’s 1961 State of the Union address.

In [None]:
# Prepare the file to be uploaded
AUDIO = "https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3"  # @param {type: "string"}
audio_bytes = requests.get(AUDIO).content

audio_path = pathlib.Path('audio.mp3')
audio_path.write_bytes(audio_bytes)

41762063

In [None]:
# Upload the file using the API
file_upload = client.files.upload(file=audio_path)

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        file_upload,
        "Listen carefully to the following audio file. Provide a brief summary",
    ]
)

Markdown(response.text)

This audio is President John F. Kennedy's first State of the Union address, delivered on January 30, 1961.

In the speech, Kennedy provides a frank assessment of the nation's situation, highlighting both domestic and international challenges. Domestically, he details a struggling economy with high unemployment, low growth, and issues in housing, education, and healthcare, proposing immediate actions to address them. Internationally, he discusses the concerning balance of payments deficit and the growing threats posed by the Cold War and communism in various regions (Asia, Africa, Latin America). He calls for strengthening military, economic, and diplomatic capabilities, emphasizing the need for robust alliances, international cooperation (including in science and space), and a reformed, more decisive public service. The speech stresses the importance of facing difficulties realistically, preparing for future challenges, and requires dedication from all citizens to secure freedom and progress worldwide.

### Upload a video file

In this case, you'll use a short clip of [Big Buck Bunny](https://peach.blender.org/about/).

In [None]:
# Download the video file
VIDEO_URL = "https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4"  # @param {type: "string"}
video_file_name = "BigBuckBunny_320x180.mp4"
!wget -O {video_file_name} $VIDEO_URL

--2025-04-18 12:09:07--  https://download.blender.org/peach/bigbuckbunny_movies/BigBuckBunny_320x180.mp4
Resolving download.blender.org (download.blender.org)... 172.67.14.163, 104.22.65.163, 104.22.64.163, ...
Connecting to download.blender.org (download.blender.org)|172.67.14.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64657027 (62M) [video/mp4]
Saving to: ‘BigBuckBunny_320x180.mp4’


2025-04-18 12:09:08 (141 MB/s) - ‘BigBuckBunny_320x180.mp4’ saved [64657027/64657027]



Let's start by uploading the video file.

In [None]:
# Upload the file using the API
video_file = client.files.upload(file=video_file_name)
print(f"Completed upload: {video_file.uri}")

Completed upload: https://generativelanguage.googleapis.com/v1beta/files/prqn913jn9t8


The video must finish processing before you can use it, so check its state first. Once the state of the video is `ACTIVE`, you can pass it into `generate_content`.

In [None]:
import time

# Check the file processing state
while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name)

if video_file.state == "FAILED":
    raise ValueError(video_file.state)
print(f'Video processing complete: {video_file.uri}')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/prqn913jn9t8
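
The polling loop above waits indefinitely if processing never finishes. A possible variation, sketched here under stated assumptions, adds a timeout. `wait_until_ready`, its parameters, and the fake state source are all hypothetical names for illustration, and the sketch assumes the state compares equal to plain strings such as `"PROCESSING"`, as it does in the loop above:

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
) -> str:
    """Poll get_state() until it stops returning "PROCESSING" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        time.sleep(interval_s)
        state = get_state()
    if state == "FAILED":
        raise ValueError("File processing failed")
    return state


# Demo with a fake state source standing in for the real file lookup
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_ready(lambda: next(fake_states), interval_s=0.0)
print(final_state)
```

With the notebook's objects, the state source could be something like `lambda: client.files.get(name=video_file.name).state`, which re-fetches the file on every poll just as the loop above does.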


In [None]:
print(video_file.state)

FileState.ACTIVE


In [None]:
# Ask Gemini about the video
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        video_file,
        "Describe this video.",
    ]
)

Markdown(response.text)

The video is a clip from the open-source animated short film "Big Buck Bunny" (produced by the Blender Foundation). It opens with a peaceful pastoral scene: rolling green hills, scattered trees (including pine and deciduous), rocks, flowers, and a stream, under a bright sky with fluffy pink clouds.

A small, plump, grey bird is perched on a branch, yawning and stretching, but is soon knocked off.

The camera then focuses on a large burrow entrance under a tree root, where a very large, fluffy, grey rabbit is sleeping. It wakes up, stretches, emerges from the burrow, and smiles contentedly at the sunny morning.

The rabbit enjoys the day, sniffing large white flowers and watching a beautiful pink butterfly land on its head. An apple falls from a tree, but the rabbit's attention is drawn back to the butterfly.

Hiding behind a tree root are three smaller rodent characters: two squirrels (one brown, one reddish-brown and spikier) and a grey chinchilla/hamster, all looking mischievous. The chinchilla holds a nut.

The squirrels begin to torment the rabbit by throwing small objects at it (rocks, nuts, and spiky chestnuts). The rabbit is initially startled and confused, but quickly becomes annoyed and then angry.

Driven by vengeance, Big Buck Bunny decides to retaliate. He prepares by sharpening a stick with a rock and creating a large spear using a vine as a bowstring. He takes aim at the squirrels hiding behind a tree and shoots the spear, which punctures the tree trunk.

Undeterred, the squirrels continue their harassment. Big Buck then sets up a trap: a series of sharpened sticks concealed under leaves on the ground, connected by a vine which he pulls taut like a tripwire.

The angry flying squirrel tries to knock a peach from a tree but ends up knocking it towards the stakes, where it gets impaled. Big Buck then catches the flying squirrel.

In the final scene before the credits, Big Buck Bunny is seen happily flying the terrified flying squirrel like a kite.

The credits roll, featuring brief animated appearances of the chinchilla and the red squirrel interacting with the text, and finally the little bird flying the flying squirrel (still as a kite) past the credits.

### Process a YouTube link

For YouTube links, you don't need to upload the video file content, but you do need to explicitly declare the video URL you want the model to process as part of the `contents` of the request. For more information, including features and limits, see the [vision](https://ai.google.dev/gemini-api/docs/vision?lang=python#youtube) documentation.

> **Note:** You can submit at most one YouTube link per `generate_content` request.

> **Note:** YouTube links included as part of the text input won't be processed in the request and can lead to incorrect responses. You must explicitly pass the URL using the `file_uri` argument of `FileData`.

The following example shows how you can use the model to summarize a video. In this case, you'll use a summary video of [Google I/O 2024](https://www.youtube.com/watch?v=WsEQjeZoEng).

In [None]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(text="Summarize this video."),
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=WsEQjeZoEng')
            )
        ]
    )
)

Markdown(response.text)

Based on the video, here is a summary of the Google I/O 2024 keynote:

The keynote highlights Google's progress in the "Gemini era," integrating their multimodal AI models across their products and introducing new capabilities and models. Key announcements and features include:

1.  **Gemini Integration:** Gemini is now integrated into all of Google's 2 billion user products, enhancing existing features.
2.  **Gemini 1.5 Pro in Workspace:** Available today in Workspace Labs, it can summarize long emails and potentially other documents. It can also summarize recorded Google Meet meetings.
3.  **Gemini in Google Photos:** Enables deeper search capabilities, allowing users to find specific memories or track progress over time by understanding the content within photos and videos.
4.  **Expanded Context Window:** Gemini 1.5 Pro's context window is expanded to 2 million tokens, allowing it to process much larger amounts of information simultaneously (e.g., summarizing very long documents or videos).
5.  **Project Astra:** A prototype for a universal AI agent that is truly helpful in everyday life. Demos show the agent understanding real-time visual and audio input to explain code, remember object locations, and even suggest creative ideas (like a band name for a dog and a toy).
6.  **Gemini 1.5 Flash:** A new, lighter-weight, faster, and more cost-efficient multimodal model designed for scaling, while still retaining strong reasoning and long-context capabilities.
7.  **Veo:** A new, highly capable generative video model that creates high-quality 1080p videos from text, image, and video prompts.
8.  **Trillium TPUs:** The 6th generation of Google's custom chips for AI/ML, delivering a 4.7x improvement in compute performance per chip over the previous generation.
9.  **Generative AI in Google Search:** AI Overviews are becoming more powerful, able to handle complex, multi-part questions and provide quick answers and summaries. This is coming to over 1 billion people by the end of the year.
10. **Google Lens Integration:** Soon, users can ask questions about a video by pointing Google Lens at it, getting relevant information instantly (e.g., troubleshooting a turntable based on visual input).
11. **Gems:** Customizable personal AI experts within Gemini, available for Gemini Advanced subscribers. Users can create specific assistants for their needs by providing instructions, which can then handle complex tasks and answer questions across multiple uploaded files (up to 1500 pages per PDF or multiple files for project insights). Gemini Advanced offers a 1 million token context window for this.
12. **AI in Android:** Gemini is being reimagined at the core of Android to be more context-aware, anticipating user needs and providing helpful suggestions in the moment. Gemini Nano with Multimodality will enable the phone to understand the world through sight, sound, and spoken language.
13. **Gemma & PaliGemma:** Expansion of the open model family. PaliGemma is the first vision-language open model, available now.
14. **Gemma 2:** The next generation of Gemma, including a new 27 billion parameter model, will be available in June for driving AI innovation responsibly.
15. **LearnLM:** A new family of models based on Gemini and fine-tuned for learning. A new feature in YouTube uses LearnLM to make educational videos more interactive, allowing users to ask clarifying questions, get explanations, and take quizzes.
16. **Responsible AI:** Google emphasizes its commitment to building AI responsibly through practices like red teaming to identify and address potential risks while maximizing benefits for society.

Overall, the keynote showcases Google's focus on making AI, particularly through the Gemini family of models, more powerful, multimodal, context-aware, and helpful across its platforms and products, while also emphasizing responsible development.
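
Because a YouTube link left inside the prompt text is not processed, and only one link is allowed per request, it can help to separate the link from the text before building the request. The sketch below is one possible way to do that. `extract_single_youtube_url` is a hypothetical helper, and the regular expression only covers the common `youtube.com/watch?v=` and `youtu.be/` URL shapes, which is an assumption rather than an official specification:

```python
import re
from typing import Optional, Tuple

# Rough pattern for common YouTube URL shapes (an assumption, not an official spec)
YOUTUBE_URL = re.compile(
    r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+"
)


def extract_single_youtube_url(prompt: str) -> Tuple[str, Optional[str]]:
    """Split a prompt into (text without the link, YouTube URL or None).

    Raises ValueError if the prompt contains more than one YouTube link.
    """
    urls = YOUTUBE_URL.findall(prompt)
    if len(urls) > 1:
        raise ValueError("Only one YouTube link is allowed per request")
    if not urls:
        return prompt, None
    # Remove the link and collapse any leftover whitespace
    text = " ".join(YOUTUBE_URL.sub("", prompt).split())
    return text, urls[0]


text, url = extract_single_youtube_url(
    "Summarize this video. https://www.youtube.com/watch?v=WsEQjeZoEng"
)
print(text)
print(url)
```

The returned `text` and `url` would then go into separate parts of the request, with the URL passed through `types.FileData(file_uri=url)` as in the example above.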

## Use context caching

[Context caching](https://ai.google.dev/gemini-api/docs/caching?lang=python) lets you store frequently used input tokens in a dedicated cache and reference them in subsequent requests, eliminating the need to repeatedly pass the same set of tokens to a model.

Context caching is only available for stable models with fixed versions (for example, `gemini-1.5-flash-002`). You must include the version suffix (for example, the `-002` in `gemini-1.5-flash-002`). You can find more caching examples [here](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Caching.ipynb).

#### Create a cache

In [None]:
system_instruction = """
  You are an expert researcher who has years of experience in conducting systematic literature surveys and meta-analyses of different topics.
  You pride yourself on incredible accuracy and attention to detail. You always stick to the facts in the sources provided, and never make up new facts.
  Now look at the research paper below, and answer the following questions in 1-2 sentences.
"""

urls = [
    'https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf',
    "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
]

In [None]:
# Download files
pdf_bytes = requests.get(urls[0]).content
pdf_path = pathlib.Path('2312.11805v3.pdf')
pdf_path.write_bytes(pdf_bytes)

pdf_bytes = requests.get(urls[1]).content
pdf_path = pathlib.Path('2403.05530.pdf')
pdf_path.write_bytes(pdf_bytes)

7228817

In [None]:
# Upload the PDFs using the File API
uploaded_pdfs = []
uploaded_pdfs.append(client.files.upload(file='2312.11805v3.pdf'))
uploaded_pdfs.append(client.files.upload(file='2403.05530.pdf'))

In [None]:
# Create a cache with a 60 minute TTL
cached_content = client.caches.create(
    model=MODEL_ID,
    config=types.CreateCachedContentConfig(
        display_name='research papers',  # used to identify the cache
        system_instruction=system_instruction,
        contents=uploaded_pdfs,
        ttl="3600s",
    )
)

cached_content

CachedContent(name='cachedContents/ql5fbzexj5rl', display_name='research papers', model='models/gemini-2.5-flash-preview-04-17', create_time=datetime.datetime(2025, 4, 18, 12, 10, 43, 598484, tzinfo=TzInfo(UTC)), update_time=datetime.datetime(2025, 4, 18, 12, 10, 43, 598484, tzinfo=TzInfo(UTC)), expire_time=datetime.datetime(2025, 4, 18, 13, 10, 42, 12326, tzinfo=TzInfo(UTC)), usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=None, text_count=None, total_token_count=43167, video_duration_seconds=None))

#### Listing available cache objects

In [None]:
for cache in client.caches.list():
  print(cache)

name='cachedContents/ql5fbzexj5rl' display_name='research papers' model='models/gemini-2.5-flash-preview-04-17' create_time=datetime.datetime(2025, 4, 18, 12, 10, 43, 598484, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 4, 18, 12, 10, 43, 598484, tzinfo=TzInfo(UTC)) expire_time=datetime.datetime(2025, 4, 18, 13, 10, 42, 12326, tzinfo=TzInfo(UTC)) usage_metadata=CachedContentUsageMetadata(audio_duration_seconds=None, image_count=None, text_count=None, total_token_count=43167, video_duration_seconds=None)
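
Each cache object in the listing reports an `expire_time` roughly one hour after its `create_time`, which matches the `ttl="3600s"` used when the cache was created. As a minimal sketch, the hypothetical helper below (`seconds_until_expiry` is not part of the SDK) computes how long a cache has left, using the timestamps copied from the output above:

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_expiry(expire_time: datetime, now: Optional[datetime] = None) -> float:
    """Return the seconds left before a cache expires (negative if already expired)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (expire_time - now).total_seconds()


# Timestamps copied from the cache listing above
created = datetime(2025, 4, 18, 12, 10, 43, 598484, tzinfo=timezone.utc)
expires = datetime(2025, 4, 18, 13, 10, 42, 12326, tzinfo=timezone.utc)

remaining = seconds_until_expiry(expires, now=created)
print(f"{remaining:.0f} seconds left at creation time")
```

Since the listed `expire_time` values are timezone-aware, a call such as `seconds_until_expiry(cache.expire_time)` inside the loop above should work without any conversion.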


#### Use a cache

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="What is the research goal shared by these research papers?",
  config=types.GenerateContentConfig(cached_content=cached_content.name)
)

Markdown(response.text)

Both research papers share the goal of developing and presenting the Gemini family of highly capable multimodal models. These models aim to understand and reason across image, audio, video, and text data.

#### Delete a cache

In [None]:
result = client.caches.delete(name=cached_content.name)

## Get text embeddings

You can get text embeddings for a snippet of text by calling the `embed_content` method with the `gemini-embedding-exp-03-07` model.



The Gemini Embeddings model produces an output with 3072 dimensions by default. However, you have the option to choose an output dimensionality between 1 and 3072. See the [embeddings guide](https://ai.google.dev/gemini-api/docs/embeddings) for more details.

In [None]:
TEXT_EMBEDDING_MODEL_ID = "gemini-embedding-exp-03-07"

In [None]:
response = client.models.embed_content(
    model=TEXT_EMBEDDING_MODEL_ID,
    contents=[
        "How do I get a driver's license/learner's permit?",
        "How do I renew my driver's license?",
        "How do I change my address on my driver's license?"
        ],
    config=types.EmbedContentConfig(output_dimensionality=512)
)

print(response.embeddings)

[ContentEmbedding(values=[-0.0010864572, 0.0069392114, 0.017009795, -0.010305981, -0.009999484, -0.0064486223, 0.0041451487, -0.005906698, 0.022229617, -0.018305639, -0.018174557, 0.022160593, -0.013604425, -0.0027964567, 0.12966625, 0.028866312, 0.0014726851, 0.03537643, -0.015166075, -0.013479812, -0.019288255, 0.010106378, -0.0043296088, 0.018035924, 0.00295039, -0.007934979, -0.005416007, -0.0095809875, 0.040398005, -0.0020784356, 0.011551388, 0.009726445, 0.006670387, 0.020050988, -0.00747873, -0.0012074928, 0.0047189263, -0.006359583, -0.01718203, -0.023562348, -0.0051814457, 0.023801394, -0.004928927, -0.016113443, 0.01672777, -0.0069929743, -0.012722719, -0.0137646515, -0.041852377, -0.0011546672, 0.017030545, -0.0022786013, 0.011707037, -0.18675306, -0.035211734, -0.011472648, 0.01970727, 0.0012368832, -0.020796346, -0.018513134, -0.006821043, -0.01843726, -0.00827558, -0.042159837, 0.0038724025, 0.01933339, 0.0139452815, 0.025059255, 0.0015087503, -0.016094029, -0.0035785383,

You will get a set of three embeddings, one for each piece of text you passed in:

In [None]:
len(response.embeddings)

3

You can also see the length of each embedding is 512, as per the `output_dimensionality` you specified.

In [None]:
print(len(response.embeddings[0].values))
print((response.embeddings[0].values[:4], '...'))

512
([-0.0010864572, 0.0069392114, 0.017009795, -0.010305981], '...')
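
A common next step with embeddings is to compare them, for example to find which of the three questions are closest in meaning. The sketch below uses cosine similarity, which is a standard choice but not something this notebook prescribes. `cosine_similarity` is a hypothetical helper written with the standard library only, and the toy vectors stand in for the `response.embeddings[i].values` lists:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for response.embeddings[i].values
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

With the real response, a call such as `cosine_similarity(response.embeddings[0].values, response.embeddings[1].values)` would score the first two questions against each other. Values closer to 1 indicate more similar text.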


## Next Steps

### Useful API references:

Check out the [Google GenAI SDK](https://github.com/googleapis/python-genai) for more details on the new SDK.

### Related examples

For more detailed examples using Gemini 2.0, check the [Gemini 2.0 folder of the cookbook](https://github.com/google-gemini/cookbook/tree/main/gemini-2/). You'll learn how to use the [Live API](./Get_started_LiveAPI.ipynb), juggle with [multiple tools](../examples/LiveAPI_plotting_and_mapping.ipynb) or use Gemini 2.0 [spatial understanding](./Spatial_understanding.ipynb) abilities.

Also check the [experimental Gemini 2.0 Flash Thinking](./Get_started_thinking.ipynb) model, which explicitly shows its thoughts and can handle more complex reasoning.