##### Copyright 2024 Google LLC.


In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - Minimal RAG

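This cookbook demonstrates how to build a minimal retrieval-augmented generation (RAG) system without any orchestration tool (such as LangChain or LlamaIndex) or any vector database. The only dependencies needed for retrieval are Google's [UniSim](https://github.com/google/unisim) project, used as the embedding model, and [HtmlChunker](https://github.com/google/labs-prototypes/tree/main/seeds/chunker-python).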

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/doggy8088/gemma-cookbook/blob/zh-tw/Gemma/Minimal_RAG.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>


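## Setup

### Select the Colab runtime
To complete this guide, you need a Colab runtime with enough resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right corner of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.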

### Set up Gemma on Hugging Face
This cookbook uses the Gemma 7B instruction-tuned model through Hugging Face. So you need to:

* Get access to Gemma on [huggingface.co](https://huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 7B IT](https://huggingface.co/google/gemma-7b-it).
* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as the Colab secret 'HF_TOKEN'.


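## Retrieval-Augmented Generation (RAG)

Large language models (LLMs) can learn new abilities without being directly trained on them. However, LLMs are known to "hallucinate" when asked about things they were not trained on. This is partly because LLMs have no knowledge of events that happened after training. It is also difficult to trace the sources from which LLMs draw their responses. For reliable and scalable applications, it is important that an LLM gives responses grounded in facts and can cite its information sources.

A common approach to overcoming these limitations is retrieval-augmented generation (RAG), which augments the prompt sent to an LLM with relevant data retrieved from an external knowledge base through an information retrieval (IR) mechanism. The knowledge base can be your own documents, databases, or APIs.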

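### Chunk the data

To improve the relevance of the content returned during retrieval (by a vector database or, as in this cookbook, an in-memory search), break large documents into smaller pieces, or chunks, when ingesting them.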
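
To make the idea concrete, here is a minimal sketch of word-budget chunking in plain Python. It is not how HtmlChunker works internally (HtmlChunker operates on the HTML structure); the function name `chunk_by_words` and the `max_words` parameter are hypothetical and chosen only for illustration.

```python
def chunk_by_words(paragraphs, max_words=200):
    """Greedily merge consecutive paragraphs into chunks of at most max_words words.

    A paragraph that is longer than max_words on its own is kept as a single chunk.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


paragraphs = ["one two three", "four five", "six seven eight nine"]
print(chunk_by_words(paragraphs, max_words=5))
# ['one two three four five', 'six seven eight nine']
```

Smaller budgets give more focused chunks but less context per chunk, so the right value depends on the documents and the questions you expect.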

In this cookbook, you will use the [Google I/O 2024 Gemma family expansion launch blog](https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/) as the sample document and Google's [Open Source HtmlChunker](https://github.com/google/labs-prototypes/tree/main/seeds/chunker-python) to chunk it into passages.


In [None]:
!pip install google-labs-html-chunker

from google_labs_html_chunker.html_chunker import HtmlChunker

from urllib.request import urlopen

with urlopen(
    "https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/"
) as f:
    html = f.read().decode("utf-8")

# Chunk the file using HtmlChunker
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
    html_tags_to_exclude={"noscript", "script", "style"},
)
passages = chunker.chunk(html)



Take a look at what the chunked text looks like.


In [None]:
for passage in passages:
    print(passage)

Introducing PaliGemma, Gemma 2, and an Upgraded Responsible AI Toolkit
            
            
            
            - Google Developers Blog
Products Develop Android Chrome ChromeOS Cloud Firebase Flutter Google Assistant Google Maps Platform Google Workspace TensorFlow YouTube Grow Firebase Google Ads Google Analytics Google Play Search Web Push and Notification APIs Earn AdMob Google Ads API Google Pay Google Play Billing Interactive Media Ads Solutions Events Learn Community Groups Google Developer Groups Google Developer Student Clubs Woman Techmakers Google Developer Experts Tech Equity Collective Programs Accelerator Solution Challenge DevFest Stories All Stories Developer Profile Blog Search English English Español (Latam) Bahasa Indonesia 日本語 한국어 Português (Brasil) 简体中文
Products More Solutions Events Learn Community More Developer Profile Blog Develop Android Chrome ChromeOS Cloud Firebase Flutter Google Assistant Google Maps Platform Google Workspace TensorFlow YouTube G

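## Get the relevant chunks


Given the user question 'where can I find PaliGemma?', you will use UniSim to retrieve the relevant chunks.

First, compute the similarity between the user question and every text chunk (passage).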
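
As a rough mental model, embedding-based similarity usually means mapping each text to a vector and then comparing the vectors, commonly with cosine similarity. The sketch below shows only the comparison step on made-up 3-dimensional vectors. It illustrates the general technique and is an assumption, not a description of UniSim's internals; `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for text embeddings.
question_vec = [1.0, 0.0, 1.0]
passage_vecs = {"close": [2.0, 0.0, 2.0], "unrelated": [0.0, 3.0, 0.0]}

for name, vec in passage_vecs.items():
    print(name, round(cosine_similarity(question_vec, vec), 3))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.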


In [None]:
!pip install unisim
from unisim import TextSim

user_question = "where can I find PaliGemma?"

text_sim = TextSim()

similarities = []
for passage in passages:
    similarities.append(text_sim.similarity(user_question, passage))

INFO: Loaded backend
INFO: Using TF with GPU




INFO: UniSim is storing a copy of the indexed data
INFO: If you are using large data corpus, consider disabling this behavior using store_data=False


Put the passages and their similarities into a DataFrame.


In [None]:
import pandas as pd

results_df = pd.DataFrame({"passage": passages, "similarity": similarities})
results_df

| | passage | similarity |
|---|---|---|
| 0 | Introducing PaliGemma, Gemma 2, and an Upgrade... | 0.517319 |
| 1 | Products Develop Android Chrome ChromeOS Cloud... | 0.299514 |
| 2 | Products More Solutions Events Learn Community... | 0.296253 |
| 3 | Gemini Introducing PaliGemma, Gemma 2, and an ... | 0.508258 |
| 4 | At Google, we believe in the power of collabor... | 0.369846 |
| 5 | Link to Youtube Video (visible only when JS is... | 0.33353 |
| 6 | Gemma is a family of lightweight, state-of-the... | 0.386614 |
| 7 | Introducing PaliGemma: Open Vision-Language Mo... | 0.57323 |
| 8 | Screenshot from the HuggingFace Space running ... | 0.530472 |
| 9 | Announcing Gemma 2: Next-Gen Performance and E... | 0.460508 |


Identify the three most relevant passages.


In [None]:
top_3_similarities = results_df.nlargest(3, "similarity")
top_3_targets = top_3_similarities["passage"]
top_3_targets

7    Introducing PaliGemma: Open Vision-Language Mo...
8    Screenshot from the HuggingFace Space running ...
0    Introducing PaliGemma, Gemma 2, and an Upgrade...
Name: passage, dtype: object
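
`nlargest` performs a plain top-k selection. The same step without pandas can be sketched with the standard library's `heapq.nlargest`. The scores below are copied from the similarity output above for four of the passages, the passage labels are shortened stand-ins, and `top_k` is a hypothetical helper.

```python
import heapq


def top_k(passages, similarities, k=3):
    """Return the k (similarity, passage) pairs with the highest similarity."""
    return heapq.nlargest(k, zip(similarities, passages), key=lambda pair: pair[0])


passages = ["title", "nav menu", "PaliGemma intro", "screenshot"]
similarities = [0.517319, 0.299514, 0.57323, 0.530472]

for score, passage in top_k(passages, similarities):
    print(f"{score:.3f}  {passage}")
# 0.573  PaliGemma intro
# 0.530  screenshot
# 0.517  title
```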

Next, assemble a prompt from the user question and the retrieved context.


In [None]:
prompt_template = """You are an expert in answering user questions. You always understand user questions well, and then provide high-quality answers based on the information provided in the context.

If the provided context does not contain relevant information, just respond "I could not find the answer based on the context you provided."

User question: {}

Context:
{}
"""

# Number the retrieved passages and join them into one context string.
context = "\n".join(
    f"{i + 1}. {passage}" for i, passage in enumerate(top_3_targets.tolist())
)
prompt = prompt_template.format(user_question, context)

Here is the final prompt that will be sent to Gemma.


In [None]:
print(prompt)

You are an expert in answering user questions. You always understand user questions well, and then provide high-quality answers based on the information provided in the context.

If the provided context does not contain relevant information, just respond "I could not find the answer based on the context you provided."

User question: where can I find PaliGemma?

Context:
1. Introducing PaliGemma: Open Vision-Language Model PaliGemma is a powerful open VLM inspired by PaLI-3 . Built on open components including the SigLIP vision model and the Gemma language model, PaliGemma is designed for class-leading fine-tune performance on a wide range of vision-language tasks. This includes image and short video captioning, visual question answering, understanding text in images, object detection, and object segmentation. We're providing both pretrained and fine-tuned checkpoints at multiple resolutions, as well as checkpoints specifically tuned to a mixture of tasks for immediate exploration. To 

### Generate the answer


Now load the Gemma model in quantized 4-bit mode using Hugging Face.
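
A back-of-the-envelope estimate shows why 4-bit loading matters on a T4. The numbers are assumptions for illustration: roughly 8.5 billion parameters for Gemma 7B and 16 GB of memory on a T4. The estimate counts the weights only and ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8.5e9   # assumed parameter count for Gemma 7B
T4_MEMORY_GB = 16  # assumed T4 GPU memory

for label, bits in [("float16", 16), ("4-bit", 4)]:
    needed = weight_memory_gb(N_PARAMS, bits)
    print(f"{label}: ~{needed:.2f} GB, fits on T4: {needed < T4_MEMORY_GB}")
```

Under these assumptions the float16 weights alone would not fit, while the 4-bit weights leave room for activations and generation.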


In [None]:
!pip install bitsandbytes accelerate
from transformers import AutoTokenizer
import transformers
import torch
import bitsandbytes, accelerate  # not used directly; importing them surfaces install problems early

model = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    model_kwargs={
        "torch_dtype": torch.float16,
        # Load the weights in 4-bit precision through bitsandbytes.
        "quantization_config": {"load_in_4bit": True},
    },
)



`low_cpu_mem_usage` was None, now set to True since model is quantized.
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Finally, generate the answer.


In [None]:
messages = [
    {"role": "user", "content": prompt},
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.1)
print(outputs[0]["generated_text"][len(prompt) :])



Sure, here is the answer to the user question:

You can find PaliGemma on GitHub, Hugging Face models, Kaggle, Vertex AI Model Garden, and ai.nvidia.com (accelerated with TensoRT-LLM) with easy integration through JAX and Hugging Face Transformers.


Gemma is able to give the correct answer based on the retrieved context.
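
The generation cell above wraps the prompt with `apply_chat_template` and then slices the prompt off the returned text. The sketch below approximates what that formatted string looks like for a single user turn with Gemma's turn markers, and why the slice leaves only the new text. The helper `gemma_chat_prompt` is hypothetical and the exact string is an assumption; the authoritative template is the one shipped with the tokenizer.

```python
def gemma_chat_prompt(user_message):
    """Approximate the formatted string for one user turn plus a generation prompt."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


formatted = gemma_chat_prompt("where can I find PaliGemma?")
print(formatted)

# With default settings the pipeline returns prompt + completion, so dropping
# the first len(prompt) characters leaves only the newly generated text.
fake_output = formatted + "GENERATED ANSWER"  # stand-in for real model output
print(fake_output[len(formatted):])
```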

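In this cookbook, the sample document, the [Google I/O 2024 Gemma family expansion launch blog](https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/), is fairly short, so there are not many passages to search after chunking. To keep the cookbook minimal, we ran an exhaustive search over all passages to find the relevant ones.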

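In real-world use cases, a single query may need to search a very large number of chunks, in which case you should use approximate nearest neighbor (ANN) search for efficiency. Vector databases usually support this directly. UniSim also supports ANN; see the UniSim documentation and its [Colab](https://github.com/google/unisim/blob/main/notebooks/unisim_text_demo.ipynb) for indexing and search.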
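
To give a feel for what ANN trades away, here is a minimal random-hyperplane hashing sketch. It is one classic ANN technique, not necessarily what UniSim or any particular vector database uses. Vectors that fall on the same side of every random hyperplane share a bucket, so a query scans only its own bucket instead of every chunk. All function names are hypothetical and the 3-dimensional vectors are made up.

```python
import random


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw n_planes random normal vectors that define the hash hyperplanes."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def build_index(vectors, planes):
    """Group vector indices by their hash signature."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, planes), []).append(idx)
    return buckets


def candidates(query, planes, buckets):
    """Indices sharing the query's bucket; true neighbors can be missed."""
    return buckets.get(signature(query, planes), [])


vectors = [[1.0, 0.2, 0.0], [-1.0, 0.1, 0.3], [0.0, -1.0, 0.5]]
planes = make_hyperplanes(dim=3, n_planes=4)
index = build_index(vectors, planes)

# A positively scaled copy of vectors[0] points in the same direction,
# so it hashes to the same bucket as vectors[0].
query = [2.0, 0.4, 0.0]
print(candidates(query, planes, index))
```

Only the candidates would then be scored exactly, which is where the speed-up comes from; the cost is that a relevant chunk in a neighboring bucket is never examined.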

The UniSim team has also created a separate [RAG demo](https://github.com/google/unisim/blob/main/notebooks/unisim-gemma-text_rag_demo.ipynb). Feel free to check it out.
