# Rerank
This notebook shows how to get started with the Chroma vector store and how to rerank its retrieval results.
- https://python.langchain.com/docs/integrations/document_transformers/rankllm-reranker/

In [1]:
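# Install packages (the cells below also import from langchain, langchain-community, and the NVIDIA provider)
!uv pip install -qU "langchain-chroma" "langchain-community" "langchain-nvidia-ai-endpoints" "langchain"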

In [2]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [3]:
# LLM model setup
# https://build.nvidia.com/deepseek-ai/deepseek-r1
# API key format: nvapi-xxx
import getpass
import os
if not os.environ.get("NVIDIA_API_KEY"):
  os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter API key for NVIDIA: ")

from langchain.chat_models import init_chat_model
llm = init_chat_model("meta/llama-4-maverick-17b-128e-instruct", model_provider="nvidia")

Enter API key for NVIDIA:  ········




In [4]:
# Embedding and rerank models
# https://jina.ai/
# API key format: jina_xxx
import getpass
import os

if not os.environ.get("JINA_API_KEY"):
    os.environ["JINA_API_KEY"] = getpass.getpass("Enter API key for Jina AI: ")

from langchain_community.embeddings import JinaEmbeddings
embeddings = JinaEmbeddings(
    jina_api_key=os.environ["JINA_API_KEY"], model_name="jina-embeddings-v3"
)


from langchain_community.document_compressors import JinaRerank

# Multilingual Jina reranker; the API key is read from JINA_API_KEY
rerank = JinaRerank(model="jina-reranker-v2-base-multilingual")

Enter API key for Jina AI:  ········


In [6]:
from langchain_chroma import Chroma

vector_store = Chroma(
    #collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db30",  # Where to save data locally, remove if not necessary
)

In [7]:
# Fetch every stored document (ids, page contents, metadata)
all_data = vector_store.get()

# Print each record
for i in range(len(all_data["ids"])):
    print(f"ID: {all_data['ids'][i]}")
    print(f"Document: {all_data['documents'][i]}")
    print(f"Metadata: {all_data['metadatas'][i]}")
    print("="*40)

ID: 1aa3d78d-c1c4-4f8f-b9c9-f83d4d5c9884
Document: I had chocolate chip pancakes and scrambled eggs for breakfast this morning.
Metadata: {'source': 'tweet'}
ID: f5030079-91cc-4652-9243-e1910da2ae6c
Document: The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.
Metadata: {'source': 'news'}
ID: 13268674-6721-46a6-ae7d-b409c71f6b55
Document: Building an exciting new project with LangChain - come check it out!
Metadata: {'source': 'tweet'}
ID: 7d929b87-a1c1-4de6-9584-844f09986730
Document: Robbers broke into the city bank and stole $1 million in cash.
Metadata: {'source': 'news'}
ID: a3789186-b526-4cad-b320-dd81309ff9c7
Document: Wow! That was an amazing movie. I can't wait to see it again.
Metadata: {'source': 'tweet'}
ID: 2b69b396-2195-44ec-bd11-c23f1d8affae
Document: Is the new iPhone worth the price? Read this review to find out.
Metadata: {'source': 'website'}
ID: d93f6e7f-e17c-4953-bdba-a906d6c11afe
Document: The top 10 soccer players in the world rig

In [9]:
# Query by turning into retriever

retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"k": 5, "fetch_k": 10}
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

[Document(id='7d929b87-a1c1-4de6-9584-844f09986730', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.'),
 Document(id='d99c3215-94fb-4e33-9831-3cde708b739e', metadata={'source': 'news'}, page_content='The stock market is down 500 points today due to fears of a recession.'),
 Document(id='f5030079-91cc-4652-9243-e1910da2ae6c', metadata={'source': 'news'}, page_content='The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.')]
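
The `search_type="mmr"` setting above requests maximal marginal relevance: fetch `fetch_k` candidates, then greedily keep `k` of them, balancing similarity to the query against redundancy with documents already picked. The following is a minimal pure-Python sketch of that selection rule. The toy 2-D vectors, the `lambda_mult` values, and the helper names `cosine` and `mmr_select` are illustrative assumptions, not Chroma's internal implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (assumed): docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]

diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and keeps both near-duplicates; lowering it makes the second pick favor the document that adds something new.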

In [12]:
# Step 2: Turn the vector store into a retriever and set the retrieval parameters
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
query = "Stealing from the bank is a crime"
docs = retriever.invoke(query)
pretty_print_docs(docs)

Document 1:

Robbers broke into the city bank and stole $1 million in cash.
----------------------------------------------------------------------------------------------------
Document 2:

The stock market is down 500 points today due to fears of a recession.
----------------------------------------------------------------------------------------------------
Document 3:

I have a bad feeling I am going to get deleted :(
----------------------------------------------------------------------------------------------------
Document 4:

Is the new iPhone worth the price? Read this review to find out.
----------------------------------------------------------------------------------------------------
Document 5:

The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.


In [13]:
query = "Stealing from the bank is a crime"
from langchain.retrievers import ContextualCompressionRetriever
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=rerank, base_retriever=retriever
)

rerank_docs = rerank_retriever.invoke(query)
#print(rerank_docs)
pretty_print_docs(rerank_docs)

Document 1:

Robbers broke into the city bank and stole $1 million in cash.
----------------------------------------------------------------------------------------------------
Document 2:

Is the new iPhone worth the price? Read this review to find out.
----------------------------------------------------------------------------------------------------
Document 3:

The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.


In [14]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate



system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise. "
    "Context: {context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(rerank_retriever, question_answer_chain)
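query = "Stealing from the bank is a crime"
# Note: a `filter` keyword passed to chain.invoke() does not reach the vector store
# (the recorded context below still contains a 'website' document). To restrict
# results to news, set it on the base retriever instead, e.g.
# vector_store.as_retriever(search_kwargs={"k": 5, "filter": {"source": "news"}}).
chain.invoke({"input": query})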


{'input': 'Stealing from the bank is a crime',
 'context': [Document(metadata={'source': 'news', 'relevance_score': 0.2751297354698181}, page_content='Robbers broke into the city bank and stole $1 million in cash.'),
  Document(metadata={'source': 'website', 'relevance_score': 0.04603390023112297}, page_content='Is the new iPhone worth the price? Read this review to find out.'),
  Document(metadata={'source': 'news', 'relevance_score': 0.044680867344141006}, page_content='The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.')],
 'answer': 'Yes, stealing from the bank is a crime, specifically known as bank robbery or theft. The robbers broke the law by stealing $1 million in cash. This act is punishable under the law.'}
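
The `relevance_score` values in the output above drop sharply after the first document (about 0.28 versus about 0.05), yet all three documents still reach the LLM as context. One option is to discard low-scoring documents before they are stuffed into the prompt. The sketch below uses plain dictionaries as stand-ins for `Document` objects; the `filter_by_score` helper and the 0.1 cutoff are hypothetical choices for illustration, not part of LangChain or Jina.

```python
def filter_by_score(docs, min_score=0.1):
    """Keep only documents whose reranker score reaches min_score."""
    return [
        d for d in docs if d["metadata"].get("relevance_score", 0.0) >= min_score
    ]


# Scores copied from the chain output above; dicts stand in for Document objects.
context = [
    {
        "page_content": "Robbers broke into the city bank and stole $1 million in cash.",
        "metadata": {"source": "news", "relevance_score": 0.2751297354698181},
    },
    {
        "page_content": "Is the new iPhone worth the price? Read this review to find out.",
        "metadata": {"source": "website", "relevance_score": 0.04603390023112297},
    },
    {
        "page_content": "The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
        "metadata": {"source": "news", "relevance_score": 0.044680867344141006},
    },
]

kept = filter_by_score(context, min_score=0.1)
for d in kept:
    print(d["metadata"]["relevance_score"], d["page_content"])
```

A sensible cutoff depends on the reranker model and the corpus, so it is worth checking the score distribution on real queries before fixing a value.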

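## Assignment

**Improve the Q&A system with Rerank**

The notebook builds a basic question-answering system that uses Chroma to retrieve relevant documents and an LLM to answer questions.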

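**Tasks:**

1.  **Extend the Q&A system:**
    * Building on this notebook, add Rerank to improve the accuracy of the system's answers.
2.  **Choose a document source:**
    * Choose a document source suited to a Q&A system, for example:
        * Web articles
        * Product manuals
        * Legal documents
        * Academic papers
    * In your report, describe the document source you chose and the scenarios it suits.
3.  **Implement Rerank:**
    * Implement Rerank to reorder the documents retrieved from the vector database.
    * You may choose from the following Rerank methods:
        * A custom Rerank algorithm
    * In your report, explain which Rerank method you chose and why.
4.  **Answer user questions:**
    * The system must answer user questions based on the reranked documents.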
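
For the custom Rerank option in step 3, one possible starting point is a lexical reranker that scores each retrieved document by token overlap with the query. The sketch below is deliberately simple and pure Python. The names `tokenize`, `overlap_score`, and `rerank_by_overlap`, the tiny stop-word list, and the choice of Jaccard overlap are all assumptions made for illustration; they are not a LangChain or Jina API.

```python
import re

# Small illustrative stop-word list (assumed); a real system would use a fuller one.
STOP_WORDS = {"the", "is", "a", "from", "to", "of", "in", "and", "this"}


def tokenize(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS


def overlap_score(query, text):
    """Jaccard overlap between query tokens and document tokens."""
    q, d = tokenize(query), tokenize(text)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank_by_overlap(query, texts, top_n=3):
    """Return up to top_n (score, text) pairs, highest score first."""
    scored = [(overlap_score(query, t), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


retrieved = [
    "The stock market is down 500 points today due to fears of a recession.",
    "Is the new iPhone worth the price? Read this review to find out.",
    "Robbers broke into the city bank and stole $1 million in cash.",
]
ranked = rerank_by_overlap("Stealing from the bank is a crime", retrieved, top_n=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Pure lexical overlap cannot connect "stealing" with "stole", which is one reason to compare a custom method against a model-based reranker in the report.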

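**Grading criteria:**

* Does the system answer user questions correctly? (40%)
* Does Rerank improve the relevance of the retrieval results? (30%)
* How does Rerank affect system efficiency? (30%)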
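
The relevance and efficiency criteria can be made measurable. A rough sketch: precision@k compares the top results before and after reranking, and a wall-clock timer captures the added latency. The document ids and relevance labels below are hand-assigned for illustration, and `precision_at_k` and `timed` are hypothetical helper names, so treat this as a template and not as a measured result.

```python
import time


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are labelled relevant."""
    if k <= 0:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical labels and rankings for one query.
relevant = {"bank"}
before_rerank = ["stock", "bank", "weather"]
after_rerank = ["bank", "weather", "stock"]

p_before = precision_at_k(before_rerank, relevant, k=1)
p_after = precision_at_k(after_rerank, relevant, k=1)

# Stand-in workload; in the notebook this would wrap a retriever call.
result, elapsed = timed(sorted, [3, 1, 2])
print(p_before, p_after, result, elapsed >= 0)
```

In the notebook itself, wrapping `retriever.invoke(query)` and `rerank_retriever.invoke(query)` with `timed` and averaging over several queries would give a before-and-after latency comparison.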
