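This notebook demonstrates the difference between similarity and relevance.

Content retrieved through vector embeddings can score high on similarity without being the most relevant to the question, so a reranker model can be used to reorder the results by relevance.

Embedding-based retrieval is fast, but it is not the most accurate. Two-stage retrieval is recommended: retrieve candidates with embeddings first, then rerank them.

More reranker evaluations: https://ihower.tw/blog/archives/12227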

In [1]:
# from google.colab import userdata
# openai_api_key = userdata.get('openai_api_key')

In [1]:
import requests
import json
from pprint import pp

In [12]:
# Load the OpenAI and Cohere API keys from environment variables
from dotenv import load_dotenv
import os

# Load the environment variables from .env file
load_dotenv()

# Access the API keys
openai_api_key = os.getenv('OPENAI_API_KEY')
cohere_api_key = os.getenv('COHERE_API_KEY')

In [3]:
def get_embeddings(text, dimensions=1536, model="text-embedding-3-small"):
  # Request an embedding vector for `text` from the OpenAI embeddings endpoint
  payload = { "input": text, "model": model, "dimensions": dimensions }
  headers = { "Authorization": f'Bearer {openai_api_key}', "Content-Type": "application/json" }
  response = requests.post('https://api.openai.com/v1/embeddings', headers=headers, data=json.dumps(payload))
  obj = json.loads(response.text)
  if response.status_code == 200:
    return obj["data"][0]["embedding"]
  else:
    # Raise instead of returning the error object, so a failed call is never mistaken for an embedding
    raise RuntimeError(obj["error"])

In [4]:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
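
As a quick sanity check of what this score measures, the sketch below applies the same formula to small hand-picked vectors. The vectors are illustrative assumptions, not real embeddings. Cosine similarity depends only on direction, so scaling a vector leaves the score unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
same_direction = 10 * v                   # same direction, different length
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with v is 3 + 0 - 3 = 0
opposite = -v                             # exactly the opposite direction

sim_same = cosine_similarity(v, same_direction)
sim_orth = cosine_similarity(v, orthogonal)
sim_opp = cosine_similarity(v, opposite)
print(sim_same, sim_orth, sim_opp)
```

The three values land at (approximately) 1, 0, and -1, the top, middle, and bottom of the cosine range.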

## How relevant are chunk_a, chunk_b, and chunk_c to the user's question?

In [5]:
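# question: "Why is good sleep essential for staying healthy?"
question = "為什麼良好的睡眠對於保持健康至關重要?"

# chunk_a: "A well-balanced diet is essential for staying healthy because it provides the various nutrients the body needs."
chunk_a = "均衡良好的飲食對於保持健康至關重要，因為它提供身體所需的各種營養素。" # High similarity, but actually low relevance
# chunk_b: "During deep sleep, the brain carries out memory consolidation and bodily repair work."
chunk_b = "深度睡眠期間，大腦會進行記憶固化和身體修復工作" # Low similarity, but actually high relevance
# chunk_c: "Regular outdoor exercise is essential for maintaining good social relationships because it provides opportunities to interact with others."
chunk_c = "定期進行戶外運動對於保持良好的社交關係至關重要，因為它提供了與他人互動的機會。" # Also similar in wording, but very irrelevant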

## Method 1: Use an embedding model to compute similarity scores between the question and chunks a, b, c

In [6]:
question_embedding = get_embeddings(question)
chunk_a_embedding = get_embeddings(chunk_a)
chunk_b_embedding = get_embeddings(chunk_b)
chunk_c_embedding = get_embeddings(chunk_c)

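Chunk A scores high on similarity (because its sentence structure is very close to the question's), but its relevance is actually fairly low. The question asks about sleep, while this sentence is about diet.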

In [7]:
cosine_similarity(question_embedding, chunk_a_embedding)

0.5500915565834913

Chunk B is actually the most relevant, since it is also about sleep, yet its score is lower than chunk A's.

In [8]:
cosine_similarity(question_embedding, chunk_b_embedding)

0.49571930969692773

Chunk C is even less relevant.

In [9]:
cosine_similarity(question_embedding, chunk_c_embedding)

0.34479794841217687
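
Putting the three scores above side by side makes the problem concrete. The sketch below simply sorts the values reported in this notebook (hard-coded here, so no API call is needed); the dictionary name `similarity_scores` is an illustrative choice.

```python
# Cosine similarity scores copied from the cell outputs above
similarity_scores = {
    "chunk_a": 0.5500915565834913,
    "chunk_b": 0.49571930969692773,
    "chunk_c": 0.34479794841217687,
}

# Rank chunk names from most to least similar
ranking = sorted(similarity_scores, key=similarity_scores.get, reverse=True)
print(ranking)
```

Ranking purely by similarity puts chunk_a (diet) ahead of chunk_b (sleep), even though chunk_b is the one that answers the question.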

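## Method 2: Use the Cohere Reranking API

* https://cohere.com/rerank
* https://docs.cohere.com/docs/reranking

This requires applying for a Cohere API key.

* An embedding model first computes a vector for each piece of content, then the similarity between the vectors is computed. This suits large-scale computation: with a vector database, millions or tens of millions of records are not a problem.
* A reranking model takes two pieces of content as input and outputs a relevance score. It is more accurate but slower, so it is practical for tens to a few hundred candidates.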
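
The trade-off in the two bullets above is what motivates two-stage retrieval: a cheap vector search narrows the collection to a handful of candidates, and only those are passed to the slower scorer. The following is a minimal sketch under stated assumptions: the 2-d vectors are made-up stand-ins for embeddings, and `toy_rerank` is a hypothetical placeholder for a real reranking model, not the Cohere API.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k):
    """Stage 1: cheap vector search. Returns indices of the k most similar documents."""
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    sims = doc_vecs @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    return [int(i) for i in np.argsort(-sims)[:k]]

def two_stage_search(query, query_vec, docs, doc_vecs, rerank_fn, k=2, top_n=1):
    """Stage 2: rerank only the k candidates with the slower, more accurate scorer."""
    candidates = top_k_by_cosine(query_vec, doc_vecs, k)
    scored = [(i, rerank_fn(query, docs[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy data: hypothetical 2-d "embeddings" for three short documents
docs = ["diet and health", "deep sleep repairs the body", "outdoor exercise and friends"]
doc_vecs = np.array([[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]])
query, query_vec = "why is sleep important", np.array([1.0, 0.0])

def toy_rerank(query, doc):
    # Placeholder scorer: 1.0 if the document mentions sleep, else 0.0
    return 1.0 if "sleep" in doc else 0.0

result = two_stage_search(query, query_vec, docs, doc_vecs, toy_rerank)
print(result)
```

In this toy setup the vector search ranks document 0 first, but the reranking step promotes document 1, mirroring the chunk A versus chunk B behavior in this notebook.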

In [10]:
!pip install cohere

In [15]:
import cohere

co = cohere.ClientV2(api_key=cohere_api_key)


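## Score and rank with Cohere Reranking

https://docs.cohere.com/docs/rerank-2

Chunk B is more relevant than chunk A, and its score does indeed become the highest. Chunk C's relevance score is extremely low.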

In [16]:
result2 = co.rerank(query=question, documents=[chunk_a, chunk_b, chunk_c], top_n=3, model='rerank-multilingual-v3.0')
result2.results

[V2RerankResponseResultsItem(document=None, index=1, relevance_score=0.7285552),
 V2RerankResponseResultsItem(document=None, index=0, relevance_score=0.641647),
 V2RerankResponseResultsItem(document=None, index=2, relevance_score=0.007695647)]
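
Each result carries an `index` into the `documents` list rather than the text itself (`document=None` in the output above), so the ranked text has to be looked up by index. A minimal sketch, using the values printed above as plain tuples instead of calling the API; the names `documents`, `reranked`, and `ordered` are illustrative.

```python
# Stand-ins for the actual chunk texts, in the order they were sent
documents = ["chunk_a", "chunk_b", "chunk_c"]

# (index, relevance_score) pairs copied from the rerank output above
reranked = [(1, 0.7285552), (0, 0.641647), (2, 0.007695647)]

# Map each index back to its document, keeping the reranked order
ordered = [(documents[i], score) for i, score in reranked]
for name, score in ordered:
    print(f"{name}: {score:.4f}")
```

With the live response, the same lookup would read `index` and `relevance_score` from each item in `result2.results`.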

## Ranking with RankGPT

LLMs can also be used to rank passages by relevance, though this costs more and is slower.

In [17]:
def get_completion(messages, model="gpt-4.1-nano", temperature=0, format_type=None):
  # Send the chat messages to the OpenAI chat completions endpoint and return the reply text
  payload = { "model": model, "temperature": temperature, "messages": messages }
  if format_type:
    payload["response_format"] = { "type": format_type }

  headers = { "Authorization": f'Bearer {openai_api_key}', "Content-Type": "application/json" }
  response = requests.post('https://api.openai.com/v1/chat/completions', headers=headers, data=json.dumps(payload))
  obj = json.loads(response.text)
  if response.status_code == 200:
    return obj["choices"][0]["message"]["content"]
  else:
    # Raise instead of returning the error object, so a failed call is never mistaken for a ranking
    raise RuntimeError(obj["error"])

In [18]:
messages=[
    {"role": "system", "content": "You are RankGPT, an intelligent assistant that can rank passages based on their relevancy to the query."},
    {"role": "user", "content": f"I will provide you with 3 passages, each indicated by number identifier []. Rank the passages based on their relevance to query: {question}" },
    {"role": "assistant", "content": "Okay, please provide the passages."},
    {"role": "user", "content": f"[0] {chunk_a}" },
    {"role": "assistant", "content": "Received passage [0]"},
    {"role": "user", "content": f"[1] {chunk_b}" },
    {"role": "assistant", "content": "Received passage [1]"},
    {"role": "user", "content": f"[2] {chunk_c}" },
    {"role": "assistant", "content": "Received passage [2]"},
    {"role": "user", "content": f"Search Query: {question}. Rank the 3 passages above based on their relevance to the search query. The passages should be listed in descending order using identifiers. The most relevant passages should be listed first. The output format should be [] > [], e.g., [1] > [2]. Only response the ranking results, do not say any word or explain." },
]

result = get_completion(messages)
result

'[1] > [0] > [2]'

The result matches the Cohere reranker :)
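
The model returns its ranking as a plain string, so it needs to be parsed before it can be used to reorder the chunks. The helper below is a hypothetical sketch (`parse_ranking` is not part of RankGPT or any library): it extracts the bracketed identifiers, drops duplicates and out-of-range numbers, and appends any passages the model omitted.

```python
import re

def parse_ranking(text, n_passages):
    """Extract passage identifiers like [1] from the model reply, in order."""
    ids = [int(m) for m in re.findall(r"\[(\d+)\]", text)]
    # Keep only valid, first-seen identifiers; the model may repeat or invent some
    seen, order = set(), []
    for i in ids:
        if 0 <= i < n_passages and i not in seen:
            seen.add(i)
            order.append(i)
    # Append any passages the model left out, preserving their original order
    order += [i for i in range(n_passages) if i not in seen]
    return order

order = parse_ranking('[1] > [0] > [2]', 3)
print(order)
```

The defensive handling is a design choice for this sketch: an LLM reply is free text, so the parser always returns a complete permutation even when the reply is malformed.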

Reranker model evaluation: https://ihower.tw/blog/archives/12227