# File Search Store Management

A notebook to demonstrate how to create and manage a Gemini FileSearchStore.

The best way to run this notebook is from Google Colab.

<a target="_blank" href="https://colab.research.google.com/github/derailed-dash/gemini-file-search-demo/blob/main/notebooks/file_search_store.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

## Pre-Reqs and Notes

- `file_search_stores` is exclusive to the Gemini Developer API.
  - It does not work with the Vertex AI API or the Gen AI SDK in Vertex AI mode.
  - Therefore: don't set env vars for `GOOGLE_CLOUD_LOCATION` or `GOOGLE_GENAI_USE_VERTEXAI` and do not initialise Vertex AI.
- Make sure you have an up-to-date version of the `google-genai` package installed. 
  - Versions older than 1.49.0 do not support the File Search Tool.
  - You can upgrade the package using `pip install --upgrade google-genai`.
  - You can also add it to your `pyproject.toml` file; since it isn't explicitly needed outside of this notebook, it can go in the `[jupyter]` section.
- Add your Gemini API Key to Colab as a secret. You can then retrieve it using `userdata.get("GEMINI_API_KEY")`.
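
The checks in the list above can be sketched as a small pre-flight helper. This is a minimal sketch, not part of the SDK: `parse_version`, `meets_minimum`, `vertex_env_conflicts` and `preflight` are hypothetical names introduced here, and the version comparison is deliberately simplified (it ignores pre-release suffixes).

```python
import os
from importlib import metadata

MIN_GENAI_VERSION = "1.49.0"  # minimum version stated in the notes above
VERTEX_ENV_VARS = ("GOOGLE_CLOUD_LOCATION", "GOOGLE_GENAI_USE_VERTEXAI")


def parse_version(version: str) -> tuple:
    """Turn '1.49.0' into (1, 49, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_GENAI_VERSION) -> bool:
    """Compare two dotted versions numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def vertex_env_conflicts(env=None) -> list:
    """Return the Vertex AI related variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in VERTEX_ENV_VARS if env.get(name)]


def preflight() -> None:
    """Print the installed google-genai version and any conflicting env vars."""
    try:
        installed = metadata.version("google-genai")
    except metadata.PackageNotFoundError:
        print("google-genai is not installed.")
    else:
        ok = meets_minimum(installed)
        status = "OK" if ok else f"too old, need >= {MIN_GENAI_VERSION}"
        print(f"google-genai {installed}: {status}")
    for name in vertex_env_conflicts():
        print(f"Warning: {name} is set; unset it to use the Gemini Developer API.")
```

Calling `preflight()` in the first cell gives an early warning before any client call fails.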

## Setup

In [None]:
import glob
import os
import time

from dotenv import load_dotenv
from google import genai
from google.genai import types
from pydantic import BaseModel


### Local Only

If running locally, set up the Google Cloud environment:

```bash
source scripts/setup-env.sh
```

Then to install the package dependencies into the virtual environment, use the `uv` tool:

1. From your agent's root directory, run `make install` to set up the virtual environment (`.venv`).
2. In this Jupyter notebook, select the kernel from the `.venv` folder to ensure all dependencies are available.

In [None]:
print(f"Current CWD: {os.getcwd()}")

# Load env vars
load_dotenv()

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    print("Warning: GEMINI_API_KEY environment variable not set. Client may fail.")
else:
    print("Successfully loaded Gemini API key.")

STORE_NAME = os.getenv("STORE_NAME")
if not STORE_NAME:
    print("Warning: STORE_NAME environment variable not set. You will not be able to work with the store.")
else:
    print(f"Successfully loaded Gemini File Search Store: {STORE_NAME}")


### Or In Colab

In [None]:
%pip install -q -U "google-genai>=1.49.0"

In [None]:
from google.colab import userdata

os.environ["GEMINI_API_KEY"] = userdata.get("GEMINI_API_KEY")

### Client Initialisation

In [None]:
client = genai.Client()

## Store Management

### View All Stores

In [None]:
try:
    for a_store in client.file_search_stores.list():
        print(a_store)
except Exception as e:
    print(f"Error listing stores (check creds?): {e}")

### Retrieve the Store

Here's a utility function to retrieve a store by its display name. Display names are not unique, so the function returns only the first matching store.

In [None]:
def get_store(store_name: str):
    """Retrieve a store by display name"""
    try:
        for a_store in client.file_search_stores.list():
            if a_store.display_name == store_name:
                return a_store
    except Exception as e:
        print(f"Error in get_store path: {e}")
    return None
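
Because display names are not unique, it can be useful to see every match rather than only the first. The following is a minimal sketch under that assumption: `find_stores` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the store objects returned by `client.file_search_stores.list()`.

```python
from types import SimpleNamespace


def find_stores(stores, display_name: str) -> list:
    """Return every store whose display_name matches, in listing order."""
    return [s for s in stores if getattr(s, "display_name", None) == display_name]


# Stand-in objects for illustration only; the names are placeholders.
fake_stores = [
    SimpleNamespace(name="fileSearchStores/a", display_name="demo"),
    SimpleNamespace(name="fileSearchStores/b", display_name="other"),
    SimpleNamespace(name="fileSearchStores/c", display_name="demo"),
]

matches = find_stores(fake_stores, "demo")
if len(matches) > 1:
    print(f"Warning: {len(matches)} stores share the display name 'demo'")
```

In the notebook, the first argument would be `client.file_search_stores.list()`.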

### Create the Store (One Time)

Once you've created the store, save the store ID for use in your application.

In [None]:
if not get_store(STORE_NAME):
    file_search_store = client.file_search_stores.create(config={"display_name": STORE_NAME})
    print(f"Created store: {file_search_store.name}")
else:
    print(f"Store {STORE_NAME} already exists.")
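
One way to save the store ID is to write it to the `.env` file that `load_dotenv()` already reads. This is only a sketch: `upsert_env_value` is a hypothetical helper, `FILE_SEARCH_STORE_ID` is an assumed variable name, and the store names below are placeholders rather than real resources.

```python
import tempfile
from pathlib import Path


def upsert_env_value(env_path, key: str, value: str) -> None:
    """Set key=value in a .env style file, replacing an existing entry if present."""
    path = Path(env_path)
    lines = path.read_text().splitlines() if path.exists() else []
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line  # overwrite the existing entry in place
            break
    else:
        lines.append(new_line)  # key not found, so append it
    path.write_text("\n".join(lines) + "\n")


# Demonstration against a throwaway file rather than the real .env
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("GEMINI_API_KEY=placeholder\n")
    upsert_env_value(env_file, "FILE_SEARCH_STORE_ID", "fileSearchStores/example-1")
    upsert_env_value(env_file, "FILE_SEARCH_STORE_ID", "fileSearchStores/example-2")
    demo_contents = env_file.read_text()

print(demo_contents)
```

In the notebook, the value to save would be `file_search_store.name`.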

### View the Store

We can interrogate a store and see what files have been uploaded to it.

In [None]:
file_search_store = get_store(STORE_NAME)
if not file_search_store:
    print(f"Store {STORE_NAME} not found.")
else:
    print(file_search_store)

    # List all documents in the store
    # The 'parent' argument is the resource name of the store
    try:
        docs = client.file_search_stores.documents.list(parent=file_search_store.name)
        doc_list = list(docs)
        print(f"Docs in {STORE_NAME}: {len(doc_list)}")

        if not doc_list:
            print("No documents found in the store.")
        else:
            for i, doc in enumerate(doc_list):
                section_heading = f"Document {i}:"
                print("-" * len(section_heading))
                print(section_heading)
                print("-" * len(section_heading))
                print(f"  Display name: {doc.display_name}")
                print(f"  ID: {doc.name}")
                print(f"  Metadata: {doc.custom_metadata}")
    except Exception as e:
        print(f"Error listing docs (might be empty): {e}")
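
The custom metadata prints as a list of key/value entries, which is hard to scan. Here is a minimal sketch that flattens it into a dictionary, assuming each entry exposes `key` and `string_value` attributes as used elsewhere in this notebook. `metadata_to_dict` is a hypothetical helper and the sample document is a stand-in.

```python
from types import SimpleNamespace


def metadata_to_dict(custom_metadata) -> dict:
    """Flatten a list of key/string_value entries into a plain dict."""
    result = {}
    for item in custom_metadata or []:
        result[item.key] = getattr(item, "string_value", None)
    return result


# Stand-in document for illustration only
fake_doc = SimpleNamespace(
    display_name="Example title",
    custom_metadata=[
        SimpleNamespace(key="file_name", string_value="example.md"),
        SimpleNamespace(key="author", string_value="Unknown"),
    ],
)

print(metadata_to_dict(fake_doc.custom_metadata))
```

A document with no metadata (`None` or an empty list) yields an empty dictionary.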

### Delete Store(s)


In [None]:
# First, point to the right store. For example:
file_search_store = get_store(STORE_NAME)

# Delete the store
if file_search_store:
    print(f"Store found: {file_search_store.name}")
    # print(f"Deleting {file_search_store}")
    # Uncomment to delete
    # client.file_search_stores.delete(name=file_search_store.name, config={'force': True})

## Upload and Process Files

Place the files you want to upload in a local folder, and set `UPLOAD_PATH` to point to it.

In [None]:
UPLOAD_PATH = "../data"

Create some utility classes and functions:

In [None]:
class DocumentMetadata(BaseModel):
    title: str
    author: str
    abstract: str


def delete_doc(doc, file_search_store):
    """Delete document(s) from the file search store"""
    print(f"♻️  Deleting duplicate: '{doc.display_name}' (ID: {doc.name})")
    client.file_search_stores.documents.delete(name=doc.name, config={"force": True})
    time.sleep(2)  # small throttle and allow propagation


def generate_metadata(file_name: str, temp_file) -> DocumentMetadata:
    """Generate metadata for a document"""

    print(f"Extracting metadata from {file_name}...")
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            """Please extract title, author, and short abstract from this document. 
            Each value should be under 200 characters.

            Abstracts should be succinct and NOT include preamble text like `This document describes...`

            Example bad abstract: 
            Now I want to cover a key consideration that can potentially 
            save you more in future IT spend than any other decision you can make: 
            embracing open source as a core element of your cloud strategy.

            Example good abstract:
            How you can significantly reduce IT spend by embracing open source
            as a core component of your cloud strategy.

            Example bad abstract:
            This article discusses how you can design your cloud landing zone.

            Example good abstract:
            How to design your cloud landing zone according to best practices.
            """,
            temp_file,
        ],
        config={
            "response_mime_type": "application/json",
            "response_schema": DocumentMetadata,
        },
    )

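    metadata: DocumentMetadata = response.parsed
    if metadata is None:
        # The response could not be parsed into the schema
        raise ValueError(f"Could not parse metadata for {file_name}: {response.text}")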
    print(f"Title: {metadata.title}")
    print(f"Author: {metadata.author}")
    print(f"Abstract: {metadata.abstract}")

    return metadata


def upload_doc(file_path, file_search_store):
    """Upload a document to the file search store"""

    file_name = os.path.basename(file_path)

    print(f"Uploading {file_name} for metadata extraction...")
    temp_file = client.files.upload(file=file_path)

    # Verify file is active (ready for inference)
    while temp_file.state.name == "PROCESSING":
        print("Still processing...", end="\r")
        time.sleep(2)
        temp_file = client.files.get(name=temp_file.name)

    if temp_file.state.name != "ACTIVE":
        raise RuntimeError(f"File upload failed with state: {temp_file.state.name}")

    # Now let's check if this is a replacement of an existing file
    # If so, we should delete the existing entry first
    # Iterate through all docs in the store
    for doc in client.file_search_stores.documents.list(parent=file_search_store.name):
        should_delete = False

        # Match by Display Name
        if doc.display_name == file_name:
            should_delete = True

        # Match by Custom Metadata (Robust Match)
        # This catches docs where display_name was set to the Title
        elif doc.custom_metadata:
            for meta in doc.custom_metadata:
                if meta.key == "file_name" and meta.string_value == file_name:
                    should_delete = True
                    break

        if should_delete:
            delete_doc(doc, file_search_store)

    metadata = generate_metadata(file_name, temp_file)

    # Import the file into the file search store with custom metadata
    operation = client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=file_search_store.name,
        file=file_path,
        config={
            "display_name": metadata.title,
            "custom_metadata": [
                {"key": "title", "string_value": metadata.title},
                {"key": "file_name", "string_value": file_name},
                {"key": "author", "string_value": metadata.author},
                {"key": "abstract", "string_value": metadata.abstract},
            ],
        },
    )

    # Wait until import is complete
    while not operation.done:
        time.sleep(5)
        print("Still importing...")
        operation = client.operations.get(operation)

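    # Assumption: a failed operation exposes an `error` attribute
    error = getattr(operation, "error", None)
    if error:
        raise RuntimeError(f"Import of {file_name} failed: {error}")

    print(f"{file_name} successfully uploaded and indexed")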
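
Both wait loops in `upload_doc` poll indefinitely. The following is a minimal sketch of the same polling pattern with a timeout. `poll_until` is a hypothetical helper, not part of the SDK, and the demonstration uses a fake sequence of states instead of real API calls.

```python
import time


def poll_until(fetch, is_done, interval: float = 2.0, timeout: float = 300.0):
    """Call fetch() until is_done(result) is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    state = fetch()
    while not is_done(state):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Gave up after {timeout} seconds")
        time.sleep(interval)
        state = fetch()
    return state


# Demonstration with a fake sequence of states
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = poll_until(
    fetch=lambda: next(states),
    is_done=lambda s: s != "PROCESSING",
    interval=0.01,
    timeout=1.0,
)
print(final_state)
```

In the notebook, `fetch` could be `lambda: client.files.get(name=temp_file.name)` with `is_done` checking `state.name != "PROCESSING"`, or `lambda: client.operations.get(operation)` with `is_done` checking `done`.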

Now actually **upload and process our documents**:

In [14]:
file_search_store = get_store(STORE_NAME)
if file_search_store is None:
    print(f"Store {STORE_NAME} not found.")
else:
    print(f"Uploading files to {file_search_store.name}...")
    files_to_upload = glob.glob(f"{UPLOAD_PATH}/*")
    if files_to_upload:
        for file_path in files_to_upload:
            print(f"Uploading {file_path}")
            upload_doc(file_path, file_search_store)
        print("Upload complete.")
    else:
        print(f"No files found in {UPLOAD_PATH}")

Uploading files to fileSearchStores/demofilestore-rsf3q46u6fcu...
Uploading ../data/story.md
Uploading story.md for metadata extraction...
♻️  Deleting duplicate: 'The Wormhole Incursion' (ID: fileSearchStores/demofilestore-rsf3q46u6fcu/documents/the-wormhole-incursion-o25713mzzb51)
Extracting metadata from story.md...
Title: The Wormhole Incursion: A Squadron Chronicle
Author: Unknown
Abstract: Commander Dazbo leads his unique AI-piloted squadron to defend a mining colony from a Krellon wormhole invasion, culminating in the destruction of the alien mothership, the Star-Eater.
Still importing...
story.md successfully uploaded and indexed
Upload complete.
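
Note that `glob.glob(f"{UPLOAD_PATH}/*")` also returns subdirectories, which cannot be uploaded as files. Here is a minimal sketch of a stricter listing that keeps only regular, non-hidden files. `list_uploadable_files` is a hypothetical helper, and the demonstration runs against a temporary folder rather than `../data`.

```python
import tempfile
from pathlib import Path


def list_uploadable_files(folder) -> list:
    """Return sorted paths of regular, non-hidden files directly inside folder."""
    return sorted(
        str(p)
        for p in Path(folder).glob("*")
        if p.is_file() and not p.name.startswith(".")
    )


# Demonstration against a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "story.md").write_text("# A story\n")
    (root / "notes.txt").write_text("notes\n")
    (root / ".DS_Store").write_text("")  # hidden file, skipped
    (root / "archive").mkdir()  # directory, skipped
    demo_names = [Path(p).name for p in list_uploadable_files(root)]

print(demo_names)
```

Swapping this in for the `glob.glob` call would leave the rest of the upload loop unchanged.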


## Verify with Query

Now that the data is uploaded, let's verify we can retrieve it using the File Search Tool.

In [15]:
# Retrieve the store again to be sure
store = get_store(STORE_NAME)

if store:
    print(f"Querying store: {store.name} ({store.display_name})")

    try:
        # Use the File Search Tool
        if hasattr(types, "FileSearch"):
            print("FileSearch tool config...")
            response = client.models.generate_content(
                model="gemini-2.5-flash",
                contents="What is the cargo capacity of the 'Too Many Pies' Anaconda?",
                config=types.GenerateContentConfig(
                    tools=[types.Tool(file_search=types.FileSearch(file_search_store_names=[store.name]))]
                ),
            )
            print("\nResponse:")
            print(response.text)
        else:
            print("types.FileSearch not found. Skipping in-notebook query verification.")

    except Exception as e:
        print(f"Query failed: {e}")

else:
    print("Store not found, cannot verify.")

Querying store: fileSearchStores/demofilestore-rsf3q46u6fcu (demo-file-store)
FileSearch tool config...

Response:
The 'Too Many Pies' Anaconda has a cargo capacity of 258 tons of munitions.
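
To see which documents an answer was grounded in, you can inspect the grounding metadata on the response. The sketch below assumes the response exposes `candidates[].grounding_metadata.grounding_chunks[].retrieved_context.title`; check these attribute names against your SDK version. `source_titles` is a hypothetical helper and the response object here is a stand-in.

```python
from types import SimpleNamespace


def source_titles(response) -> list:
    """Collect unique titles of retrieved chunks, tolerating missing fields."""
    titles = []
    for candidate in getattr(response, "candidates", None) or []:
        grounding = getattr(candidate, "grounding_metadata", None)
        for chunk in getattr(grounding, "grounding_chunks", None) or []:
            context = getattr(chunk, "retrieved_context", None)
            title = getattr(context, "title", None)
            if title and title not in titles:
                titles.append(title)
    return titles


# Stand-in response for illustration only (assumed attribute layout)
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                grounding_chunks=[
                    SimpleNamespace(retrieved_context=SimpleNamespace(title="Example doc")),
                    SimpleNamespace(retrieved_context=SimpleNamespace(title="Example doc")),
                    SimpleNamespace(retrieved_context=None),
                ]
            )
        )
    ]
)

print(source_titles(fake_response))
```

An empty list suggests the answer was not grounded in the store, which is worth checking before trusting the response.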
