# Vertex AI Search with Filters & Metadata




## Objective

This notebook shows how to use [filters and metadata](https://cloud.google.com/generative-ai-app-builder/docs/filter-search-metadata) in search requests to [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/introduction).

This works with unstructured apps that contain metadata. You can use metadata fields to restrict your search to a specific set of documents.


Services used in the notebook:

- ✅ Vertex AI Search for document search and retrieval

## Install pre-requisites

If you are running in Colab, install the prerequisites into the runtime. Otherwise, the notebook is assumed to be running in Vertex AI Workbench, where it is recommended to install the prerequisites from a terminal using the `--user` option.


In [1]:
%pip install -q google-cloud-discoveryengine==0.11.2 --upgrade --user

Note: you may need to restart the kernel to use updated packages.


### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. To do this, uncomment the lines in the cell below and run it; the current kernel will then restart.

In [2]:
# Restart kernel after installs so that your environment can access the new packages

# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


## Authenticate

If running in Colab, authenticate with `google.colab.auth`; otherwise the notebook is assumed to be running on Vertex AI Workbench.


In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth as google_auth

    google_auth.authenticate_user()

from google.auth import default

creds, _ = default()

## Configure notebook environment


## Data store metadata

Metadata for the data store `<DATA_STORE_ID>` looks like this:

```json
{
  "id": "1",
  "structData": {
    "title": "Document1",
    "category": [
      "PersonaA"
    ],
    "name": "Document1"
  },
  "content": {
    "mimeType": "application/pdf",
    "uri": "gs://<BUCKETNAME>/data/Document1"
  }
}
```

```json
{
  "id": "2",
  "structData": {
    "title": "Document2",
    "category": [
      "PersonaA",
      "PersonaB"
    ],
    "name": "Document2"
  },
  "content": {
    "mimeType": "application/pdf",
    "uri": "gs://<BUCKETNAME>/data/Document2"
  }
}
```
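As a minimal sketch, records shaped like the two examples above could be generated programmatically. The helper names `make_metadata_record` and `to_json_lines` are hypothetical, and the sketch assumes the records are written as JSON Lines (one JSON object per line) and that `name` simply mirrors `title`, as it does in the examples.

```python
import json


def make_metadata_record(doc_id, title, categories, uri, mime_type="application/pdf"):
    """Build one metadata record shaped like the examples above (hypothetical helper)."""
    return {
        "id": doc_id,
        "structData": {
            "title": title,
            "category": list(categories),
            # Assumption: `name` mirrors `title`, as in the sample records.
            "name": title,
        },
        "content": {"mimeType": mime_type, "uri": uri},
    }


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


records = [
    make_metadata_record("1", "Document1", ["PersonaA"], "gs://<BUCKETNAME>/data/Document1"),
    make_metadata_record(
        "2", "Document2", ["PersonaA", "PersonaB"], "gs://<BUCKETNAME>/data/Document2"
    ),
]
print(to_json_lines(records))
```

Keeping `category` as a list lets a single document belong to several personas, which is what makes `ANY(...)` filters on that field useful.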

### Set the following constants to reflect your environment

In [4]:
PROJECT_ID = "my-project-0004-346516"
LOCATION = "global"  # Replace with your data store location
DATA_STORE_ID = "cymbal-financial-datastore_1727706682106"

### REST API examples

The filter `name: ANY("Document1")` restricts the query to documents whose `name` field matches `Document1`. The request below applies the same `ANY` syntax to the `id` field.

In [5]:
%%bash -s "$PROJECT_ID" "$LOCATION" "$DATA_STORE_ID"

project_id=$1
location=$2
data_store_id=$3


curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1beta/projects/$project_id/locations/$location/collections/default_collection/dataStores/$data_store_id/servingConfigs/default_search:search" \
-d '{
"query": "claim",
"filter": "id: ANY(\"doc-25\")"
}'




{
  "attributionToken": "_gHw_QoMCM6BkroGELbYmqoDEiQ2NzQzNDZkMC0wMDAwLTIxOGUtODZiZC0xNDIyM2JjOWFmMGUiB0dFTkVSSUMqvAG3kq4wlJLFMM7mtS-OkckwxPi8MPn2sy2b1rct6d3EMK7Eii3bj5oiq8SKLfz2sy2NpLQwoImzLcTGsTDFy_MXzpq0MMH8yzDej5oimNa3LYCymiKOvp0V5-2ILcfGsTCjgJci1LKdFZbeqC_C8J4Vt7eMLeiCsS2Q97Iwy5q0MJCktDDm3cQwnN3YMKr4sy3R5rUvmd6oL8T8yzCt-LMttJKuMIOymiLrgrEtmd3YMKOJsy3B-Lww5O2ILTAB",
  "guidedSearchResult": {},
  "summary": {},
  "queryExpansionInfo": {}
}
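The curl call above escapes the inner quotes of the filter by hand. As a sketch of an alternative, the same URL and body can be assembled in Python, where `json.dumps` takes care of the escaping. The function name `build_search_request` and the sample argument values are hypothetical, and the sketch assumes the global endpoint and the `default_search` serving config used in the curl example.

```python
import json


def build_search_request(project_id, location, data_store_id, query, filter_expr):
    """Return the (url, body) pair for a v1beta search call like the curl example.

    Assumption: the global endpoint is used, as in the curl command above.
    """
    url = (
        "https://discoveryengine.googleapis.com/v1beta/"
        f"projects/{project_id}/locations/{location}/"
        "collections/default_collection/"
        f"dataStores/{data_store_id}/servingConfigs/default_search:search"
    )
    # json.dumps escapes the double quotes inside the filter expression.
    body = json.dumps({"query": query, "filter": filter_expr})
    return url, body


url, body = build_search_request(
    "my-project", "global", "my-data-store", "claim", 'id: ANY("doc-25")'
)
print(url)
print(body)
```

The returned pair can be passed to any HTTP client together with a bearer token; only the request construction is shown here.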


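The filter `category: ANY("PersonaB")` restricts the query to documents whose `category` field contains `PersonaB`. The filter must name a field that supports the `:` operator: the request below filters on `title` instead of `category` and is rejected with `INVALID_ARGUMENT`.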

In [7]:
%%bash -s "$PROJECT_ID" "$LOCATION" "$DATA_STORE_ID"

project_id=$1
location=$2
data_store_id=$3

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1beta/projects/$project_id/locations/$location/collections/default_collection/dataStores/$data_store_id/servingConfigs/default_search:search" \
-d '{
"query": "claims",
"filter": "title: ANY(\"PersonaB\")"
}'




{
  "error": {
    "code": 400,
    "message": "Invalid filter syntax 'title: ANY(\"PersonaB\")'. Parsing filter failed with error: Unsupported field \"title\" on \":\" operator..",
    "status": "INVALID_ARGUMENT"
  }
}
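A failed search returns an `error` object instead of results, as in the 400 response above. The following sketch shows one way to surface such a payload as an exception once the response has been parsed into a dictionary. The function name `raise_for_search_error` is hypothetical, and the sample payload is copied from the response above.

```python
def raise_for_search_error(payload):
    """Return the payload unchanged, or raise ValueError if it carries an `error` object."""
    error = payload.get("error")
    if error is None:
        return payload
    raise ValueError(
        f"{error.get('status')} ({error.get('code')}): {error.get('message')}"
    )


# Payload copied from the rejected request above.
failed = {
    "error": {
        "code": 400,
        "message": (
            "Invalid filter syntax 'title: ANY(\"PersonaB\")'. Parsing filter failed "
            "with error: Unsupported field \"title\" on \":\" operator.."
        ),
        "status": "INVALID_ARGUMENT",
    }
}

try:
    raise_for_search_error(failed)
except ValueError as exc:
    print(f"Search failed: {exc}")
```

Failing early on the error payload avoids treating a rejected filter as a search with zero results.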


### Python code equivalent

In [8]:
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine_v1beta as discoveryengine


def search_data_store(
    project_id: str,
    location: str,
    data_store_id: str,
    search_query: str,
    filter_str: str,
) -> discoveryengine.SearchResponse:
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.SearchServiceClient(client_options=client_options)

    # The full resource name of the search engine serving config
    # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}
    serving_config = client.serving_config_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        serving_config="default_config",
    )

    # Optional: Configuration options for search
    # Refer to the `ContentSearchSpec` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        # For information about snippets, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/snippets
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True
        ),
        extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
            max_extractive_answer_count=5,
            max_extractive_segment_count=1,
        ),
        # For information about search summaries, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=5,
            include_citations=True,
            ignore_adversarial_query=False,
            ignore_non_summary_seeking_query=False,
        ),
    )

    # Refer to the `SearchRequest` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=search_query,
        filter=filter_str,
        page_size=5,
        content_search_spec=content_search_spec,
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        ),
    )

    response = client.search(request)
    return response

In [9]:
search_query = "what are google financial results"
filter_str = 'id: ANY("doc-25")'

results = search_data_store(
    PROJECT_ID, LOCATION, DATA_STORE_ID, search_query, filter_str
)

print(f"\nQuestion: '{search_query}'\n\n")
print("Summary" + "-" * 40)
print(results.summary.summary_text)

print("Raw Results" + "-" * 40)
print(results)


Question: 'what are google financial results'


Summary----------------------------------------
Google's financial results for the third quarter of 2009 are available in a PDF document. [1] The document is titled "20090630_google_10Q.pdf" and is categorized as a financial document. [1] The document is located in the "gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs" directory. [1] The document is associated with Alphabet Inc., the parent company of Google. [1] The document is for the third quarter of 2009. [1] 

Raw Results----------------------------------------
SearchPager<results {
  id: "00f88c892eea9b99d85efa30a90c105a"
  document {
    name: "projects/255766800726/locations/global/collections/default_collection/dataStores/cymbal-financial-datastore_1727706682106/branches/0/documents/00f88c892eea9b99d85efa30a90c105a"
    id: "00f88c892eea9b99d85efa30a90c105a"
    struct_data {
      fields {
        key: "content"
        value {
          struct_value {
    

Here is a slightly more complex filter that combines two `ANY` conditions with `OR`:

In [10]:
search_query = "how to understand google financial results of the quarter"
filter_str = 'id: ANY("doc-25") OR id: ANY("doc-170")'

results = search_data_store(
    PROJECT_ID, LOCATION, DATA_STORE_ID, search_query, filter_str
)

print(f"\nQuestion: '{search_query}'\n\n")
print("Summary" + "-" * 40)
print(results.summary.summary_text)

print("Raw Results" + "-" * 40)
print(results)


Question: 'how to understand google financial results of the quarter'


Summary----------------------------------------
Google's financial results for a quarter are typically presented in a 10-Q report. These reports are publicly available and can be found on the Google Investor Relations website. The 10-Q report provides a detailed overview of Google's financial performance for the quarter, including revenue, expenses, and net income. It also includes information about Google's business operations and future outlook. To understand Google's financial results, it is important to review the 10-Q report and any accompanying press releases. 

Raw Results----------------------------------------
SearchPager<results {
  id: "1a24a714d7b0951995ab19e5d5934559"
  document {
    name: "projects/255766800726/locations/global/collections/default_collection/dataStores/cymbal-financial-datastore_1727706682106/branches/0/documents/1a24a714d7b0951995ab19e5d5934559"
    id: "1a24a714d7b0951995ab19e5d59
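Filter strings like the ones used above can also be assembled from values rather than typed by hand. This is a minimal sketch: `any_filter` and `or_filter` are hypothetical helpers, and rejecting values that contain quotes or backslashes is a conservative assumption made here rather than a documented rule.

```python
def any_filter(field, values):
    """Build a `field: ANY("v1", "v2")` expression from a list of string values."""
    values = list(values)
    if not values:
        raise ValueError("ANY() needs at least one value")
    for value in values:
        # Assumption: refuse characters that would need escaping inside the quotes.
        if '"' in value or "\\" in value:
            raise ValueError(f"unsupported character in filter value: {value!r}")
    quoted = ", ".join(f'"{value}"' for value in values)
    return f"{field}: ANY({quoted})"


def or_filter(*expressions):
    """Join filter expressions with OR; mixing AND and OR would need parentheses."""
    return " OR ".join(expressions)


# Reproduces the filter string used in the cell above.
print(or_filter(any_filter("id", ["doc-25"]), any_filter("id", ["doc-170"])))

# A single ANY with several values on the same field.
print(any_filter("id", ["doc-25", "doc-170"]))
```

The resulting strings can be passed as `filter_str` to `search_data_store`.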

### LangChain retrieval example

In [25]:
!pip install --upgrade google-cloud-discoveryengine

Collecting google-cloud-discoveryengine
  Downloading google_cloud_discoveryengine-0.13.4-py3-none-any.whl.metadata (5.3 kB)
Downloading google_cloud_discoveryengine-0.13.4-py3-none-any.whl (2.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 51.7 MB/s eta 0:00:00
Installing collected packages: google-cloud-discoveryengine
  Attempting uninstall: google-cloud-discoveryengine
    Found existing installation: google-cloud-discoveryengine 0.11.2
    Uninstalling google-cloud-discoveryengine-0.11.2:
      Successfully uninstalled google-cloud-discoveryengine-0.11.2
Successfully installed google-cloud-discoveryengine-0.13.4


In [29]:
from langchain.chains import (
    ConversationalRetrievalChain,
    RetrievalQA,
    RetrievalQAWithSourcesChain,
)
from langchain_google_vertexai import VertexAI
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain_google_community import (
    VertexAIMultiTurnSearchRetriever,
    VertexAISearchRetriever,
)

MODEL = "gemini-1.5-pro"  # @param {type:"string"}
DATA_STORE_LOCATION = "global"  # @param {type:"string"}

llm = VertexAI(model_name=MODEL)

retriever = VertexAISearchRetriever(
    project_id=PROJECT_ID,
    location_id=DATA_STORE_LOCATION,
    data_store_id=DATA_STORE_ID,
    get_extractive_answers=True,
    max_documents=10,
    max_extractive_segment_count=1,
    max_extractive_answer_count=5,
)

In [32]:
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)

print(qa.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:


In [33]:
prompt_template = """Use the context to answer the question at the end.
You must always use the context and context only to answer the question. Never try to make up an answer. If the context is empty or you do not know the answer, just say "I don't know".
The answer should consist of only 1 word and not a sentence.

Context: {context}

Question: {question}
Helpful Answer:
"""
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_llm(
    llm=llm, prompt=prompt, retriever=retriever, return_source_documents=True
)
     


In [34]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)


Use the context to answer the question at the end.
You must always use the context and context only to answer the question. Never try to make up an answer. If the context is empty or you do not know the answer, just say "I don't know".
The answer should consist of only 1 word and not a sentence.

Context: {context}

Question: {question}
Helpful Answer:
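To make the role of the two placeholders concrete, the sketch below fills the same template with plain `str.format`. It only approximates what a "stuff" chain does, assuming the retrieved passages are joined with blank lines before substitution. The function `render_prompt` and the sample passages are hypothetical.

```python
PROMPT_TEMPLATE = """Use the context to answer the question at the end.
You must always use the context and context only to answer the question. Never try to make up an answer. If the context is empty or you do not know the answer, just say "I don't know".
The answer should consist of only 1 word and not a sentence.

Context: {context}

Question: {question}
Helpful Answer:
"""


def render_prompt(passages, question):
    """Fill the template; passages are joined with blank lines (an assumption)."""
    context = "\n\n".join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical passages standing in for retrieved documents.
rendered = render_prompt(
    ["First retrieved passage.", "Second retrieved passage."],
    "What are Alphabet's Other Bets?",
)
print(rendered)
```

Printing the rendered prompt is a quick way to check how much context the model actually receives for a given query.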



In [36]:
retriever = VertexAISearchRetriever(
    project_id=PROJECT_ID,
    location_id=DATA_STORE_LOCATION,
    data_store_id=DATA_STORE_ID,
    max_documents=3,
)

query = "What are Alphabet's Other Bets?"

result = retriever.invoke(query)
for doc in result:
    print(doc)

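The retriever above sends no filter, so it searches the whole data store. As a sketch, a metadata filter can be supplied through the retriever's `filter` field, which is assumed here to accept the same expression syntax as the search requests earlier in the notebook. The helpers `filtered_retriever_kwargs` and `build_filtered_retriever` and the sample argument values are hypothetical.

```python
def filtered_retriever_kwargs(
    project_id, location_id, data_store_id, filter_expr, max_documents=3
):
    """Collect keyword arguments for a retriever restricted by a metadata filter."""
    return {
        "project_id": project_id,
        "location_id": location_id,
        "data_store_id": data_store_id,
        "max_documents": max_documents,
        # Assumption: same filter syntax as the REST and Python search requests.
        "filter": filter_expr,
    }


def build_filtered_retriever(**kwargs):
    """Create the retriever; imported lazily so the helper above works without LangChain."""
    from langchain_google_community import VertexAISearchRetriever

    return VertexAISearchRetriever(**kwargs)


kwargs = filtered_retriever_kwargs(
    "my-project", "global", "my-data-store", 'id: ANY("doc-25")'
)
print(kwargs)
# retriever = build_filtered_retriever(**kwargs)  # needs credentials and a real data store
```

Building the arguments separately keeps the filter visible and easy to log before any request is made.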