# Bulk Question Answering with Vertex AI Search

Answer questions from a CSV file using a Vertex AI Search data store.

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/doggy8088/generative-ai/blob/main/search/bulk-question-answering/bulk_question_answering.zh.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Google Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/doggy8088/generative-ai/blob/main/search/bulk-question-answering/bulk_question_answering.zh.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/doggy8088/generative-ai/blob/main/search/bulk-question-answering/bulk_question_answering.zh.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>


| | |
|-|-|
|Authors | [Ruchika Kharwar](https://github.com/rasalt), [Holt Skinner](https://github.com/holtskinner) |


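## Install prerequisites

If running in Colab, install the prerequisites into the runtime. Otherwise, the notebook is assumed to be running in Vertex AI Workbench, in which case it is recommended to install the prerequisites from a terminal with the `--user` option.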


In [None]:
%pip install google-cloud-discoveryengine google-auth pandas --upgrade --user -q

### Restart the current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. Running the cell below restarts the current kernel.


In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it has finished before continuing to the next step. ⚠️</b>
</div>


## Authenticate

If running in Colab, authenticate with `google.colab.auth`. Otherwise, the notebook is assumed to be running in Vertex AI Workbench.


In [2]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth as google_auth

    google_auth.authenticate_user()

## Column definitions

- "Query"
- "Golden Doc"
- "Golden Doc Page Number"
- "Golden Answer"
- "Top 5 Docs"
- "Top 5 extractive answers"
- "Top 5 extractive segments"
- "Answer / Summary"
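
As a minimal sketch, an input file with this layout could be produced with the standard library `csv` module. The column names are the English ones that appear in the notebook's output table, and the two queries are taken from that same output. Writing to an in-memory buffer instead of a real file is a choice made here so that the example has no side effects.

```python
import csv
import io

# Column names as they appear in the notebook's output table.
# Only "Query" is read by the notebook; the "Top 5 ..." and
# "Answer / Summary" columns are filled in during the run.
COLUMNS = [
    "Query",
    "Golden Doc",
    "Golden Doc Page Number",
    "Golden Answer",
    "Top 5 Docs",
    "Top 5 extractive answers",
    "Top 5 extractive segments",
    "Answer / Summary",
]

queries = [
    "What was Google's revenue in 2021?",
    "What was Google's revenue in 2022?",
]

# Write to an in-memory buffer. To create a real file, swap in
# open("your_input.csv", "w", newline="") (a hypothetical file name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
writer.writeheader()
for query in queries:
    writer.writerow({"Query": query})  # all other columns stay empty

input_csv_text = buffer.getvalue()
print(input_csv_text)
```

The "Golden" columns are left empty here; they can hold reference documents and answers if you have them.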


## Import libraries


In [3]:
import pandas as pd
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine_v1beta as discoveryengine

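### Set the following constants to reflect your environment
* The queries used in this example relate to a GCS bucket containing Alphabet investor PDFs, but you should customize them for your own data.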


In [7]:
PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
LOCATION = "global"  # @param {type:"string"}
DATA_STORE_ID = "YOUR_DATA_STORE_ID"  # @param {type:"string"}

## Function to search the Vertex AI Search data store


In [4]:
def search_data_store(
    project_id: str,
    location: str,
    data_store_id: str,
    search_query: str,
) -> discoveryengine.SearchResponse:
    #  For more information, refer to:
    # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client
    client = discoveryengine.SearchServiceClient(client_options=client_options)

    # The full resource name of the search engine serving config
    # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}
    serving_config = client.serving_config_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        serving_config="default_config",
    )

    # Optional: Configuration options for search
    # Refer to the `ContentSearchSpec` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        # For information about snippets, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/snippets
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True
        ),
        extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
            max_extractive_answer_count=5,
            max_extractive_segment_count=1,
        ),
        # For information about search summaries, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=5,
            include_citations=True,
            ignore_adversarial_query=False,
            ignore_non_summary_seeking_query=False,
        ),
    )

    # Refer to the `SearchRequest` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=search_query,
        page_size=5,
        content_search_spec=content_search_spec,
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        ),
    )

    response = client.search(request)
    return response
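
The two pieces of string handling in `search_data_store` can be checked without credentials. The sketch below mirrors the endpoint selection and the resource-name pattern quoted in the code comment. The helper names `api_endpoint_for` and `serving_config_name` and the sample IDs are hypothetical and not part of the client library.

```python
from typing import Optional


def api_endpoint_for(location: str) -> Optional[str]:
    """Return the regional endpoint, or None to keep the client's default."""
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


def serving_config_name(
    project_id: str,
    location: str,
    data_store_id: str,
    serving_config_id: str = "default_config",
) -> str:
    """Format the resource name following the pattern in the code comment."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/dataStores/{data_store_id}/servingConfigs/{serving_config_id}"
    )


# "my-project" and "my-data-store" are placeholder IDs for illustration.
print(api_endpoint_for("global"))
print(api_endpoint_for("eu"))
print(serving_config_name("my-project", "global", "my-data-store"))
```

In the notebook itself, `client.serving_config_path(...)` builds the resource name, so this helper is only useful for inspecting what is sent.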

## Function to load results into the DataFrame


In [49]:
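def answer_questions(
    row: pd.Series, project_id: str, location: str, data_store_id: str, top_n: int = 5
) -> None:
    """Fill `row` in place with the summary, top docs, extractive answers, and segments.

    Results are scanned until at least `top_n` extractive answers and `top_n`
    extractive segments have been collected, or until the results run out.
    """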
    # Perform search with Query
    response = search_data_store(project_id, location, data_store_id, row["Query"])

    row["Answer / Summary"] = response.summary.summary_text

    top5docs, top5answers, top5segments = [], [], []
    ext_ans_cnt, ext_seg_cnt = 0, 0

    for result in response.results:
        doc_data = getattr(result.document, "derived_struct_data", None)
        if not doc_data:
            continue

        # Process extractive answers
        for chunk in doc_data.get("extractive_answers", []):
            content = chunk.get("content", "").replace("\n", "")
            top5answers.append(content)
            top5docs.append(
                f"Doc: {doc_data.get('link', '')}  Page: {chunk.get('pageNumber', '')}"
            )
            ext_ans_cnt += 1

        # Process extractive segments
        for chunk in doc_data.get("extractive_segments", []):
            data = chunk.get("content", "").replace("\n", "")
            top5segments.append(data)
            ext_seg_cnt += 1

        if ext_ans_cnt >= top_n and ext_seg_cnt >= top_n:
            break

    row["Top 5 Docs"] = "\n\n".join(top5docs)
    row["Top 5 extractive answers"] = "\n\n".join(top5answers)
    row["Top 5 extractive segments"] = "\n\n".join(top5segments)
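
To see what the per-result extraction produces, the same logic can be run on plain dictionaries. This is a sketch under stated assumptions: `mock_doc_data` is a hypothetical stand-in for one result's `derived_struct_data`, using only the keys the function above reads, and `extract_chunks` is an illustrative helper rather than part of the notebook.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for one result's `derived_struct_data`.
# Real responses come from the API and may carry more fields.
mock_doc_data = {
    "link": "gs://example-bucket/report.pdf",
    "extractive_answers": [
        {"pageNumber": "3", "content": "Revenue grew\nyear over year."},
        {"pageNumber": "7", "content": "Operating margin improved."},
    ],
    "extractive_segments": [
        {"content": "Segment text\nspanning lines."},
    ],
}


def extract_chunks(doc_data: Dict) -> Tuple[List[str], List[str], List[str]]:
    """Collect doc references, answers, and segments from one result."""
    docs, answers, segments = [], [], []
    for chunk in doc_data.get("extractive_answers", []):
        answers.append(chunk.get("content", "").replace("\n", ""))
        docs.append(
            f"Doc: {doc_data.get('link', '')}  Page: {chunk.get('pageNumber', '')}"
        )
    for chunk in doc_data.get("extractive_segments", []):
        segments.append(chunk.get("content", "").replace("\n", ""))
    return docs, answers, segments


docs, answers, segments = extract_chunks(mock_doc_data)
print("\n\n".join(docs))
print("\n\n".join(answers))
print("\n\n".join(segments))
```

Note that `replace("\n", "")` removes newlines without inserting a space, so words separated only by a line break are joined. Replacing with a space instead is one alternative if that matters for your data.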

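### Collect all results from Vertex AI Search

- Read the CSV into a pandas DataFrame
- Send each question to Vertex AI Search
- Load the summary, top 5 documents, extractive answers, and extractive segments into the DataFrame
- Write the DataFrame to a TSV file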


In [53]:
# Open the CSV file and read column values
df = pd.read_csv("bulk_question_answering_input.csv", header=0, dtype=str)

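# Make a Vertex AI Search request for each question.
# `answer_questions` fills the row in place and returns None, so return the
# row explicitly and keep the DataFrame that `apply` builds from the results
# instead of relying on in-place mutation reaching `df`.
def answer_row(row: pd.Series) -> pd.Series:
    answer_questions(row, PROJECT_ID, LOCATION, DATA_STORE_ID, top_n=5)
    return row


df = df.apply(answer_row, axis=1)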

# Output results to new TSV file
df.to_csv("bulk_question_answering_output.tsv", index=False, sep="\t")

df

| | Query | Golden Doc | Golden Doc Page Number | Golden Answer | Top 5 Docs | Top 5 extractive answers | Top 5 extractive segments | Answer / Summary | Feedback from customer / account team about returned docs and answer |
|-|-|-|-|-|-|-|-|-|-|
| 0 | What was Google's revenue in 2021? | | | | Doc: gs://cloud-samples-data/gen-app-builder/s... | Google Cloud had an Operating Loss of $890 mil... | Within Other Revenues, we are pleased with the... | Google's revenue for the full year 2021 was $5... | |
| 1 | What was Google's revenue in 2022? | | | | Doc: gs://cloud-samples-data/gen-app-builder/s... | Other Revenues were $8.2 billion, up 22%, driv... | Let me now turn to our segment financial resul... | Google's total revenue was $282.8 billion in 2... | |
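
Before running against the API, the read, apply, and write steps can be exercised with a stand-in for the search call. The sketch below assumes a hypothetical `stub_answer` function in place of `answer_questions`, uses a two-column input for brevity, and reads and writes in-memory text rather than files.

```python
import io

import pandas as pd

# A small in-memory input with only the columns this sketch needs.
csv_text = (
    "Query,Answer / Summary\n"
    "What was Google's revenue in 2021?,\n"
    "What was Google's revenue in 2022?,\n"
)
df = pd.read_csv(io.StringIO(csv_text), header=0, dtype=str)


def stub_answer(row: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the search call: no network access."""
    row = row.copy()  # work on a copy so the input frame is left untouched
    row["Answer / Summary"] = f"stub answer for: {row['Query']}"
    return row


# Returning the row and assigning the result keeps the filled-in values
# without depending on in-place mutation reaching the original frame.
result = df.apply(stub_answer, axis=1)

# With no path argument, to_csv returns the text instead of writing a file.
tsv_text = result.to_csv(index=False, sep="\t")
print(tsv_text)
```

Once the stub run looks right, the same structure applies with the real search function and the file names used in the cell above.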
