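# Search reranking with cross-encoders

This notebook walks through an example of using a cross-encoder to rerank search results.

This is a common use case with our customers: you have implemented semantic search using embeddings (produced by a [bi-encoder](https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieval-bi-encoder)), but the results are not as accurate as your use case requires. Sometimes there is a business rule you can use to rerank the documents, such as how recent or how popular a document is.

Often, however, relevance is determined by subtle domain-specific rules, and this is where a cross-encoder can help. Cross-encoders are more accurate than bi-encoders but do not scale well, so using them to reorder the shortened list returned by semantic search is the ideal use case.

### Example

Consider a search task with D documents and Q queries.

The brute-force approach of computing the relevance of every pair is expensive; its cost scales as ```D * Q```. This is known as **cross-encoding**.

A faster approach is **embeddings-based search**, in which an embedding is computed once for each document and query and then reused many times to cheaply compute pairwise relevance. Because each embedding is computed only once, its cost scales as ```D + Q```. This is known as **bi-encoding**.

Although embeddings-based search is faster, its quality can be worse. To get the best of both, a common approach is to use embeddings (or another bi-encoder) to cheaply identify the top candidates, and then use GPT (or another cross-encoder) to expensively rerank those top candidates. The cost of this hybrid approach scales as ```(D + Q) * cost of embedding + (N * Q) * cost of re-ranking```, where ```N``` is the number of candidates reranked.

### Walkthrough

To illustrate this approach we'll use ```text-davinci-003``` with ```logprobs``` enabled to build a GPT-powered cross-encoder. Our GPT models have strong general language understanding which, when tuned with a few few-shot examples, provides a simple and effective cross-encoding option.

This notebook draws on this [great article](https://weaviate.io/blog/cross-encoders-as-reranker) by Weaviate, and this [excellent explanation](https://www.sbert.net/examples/applications/cross-encoder/README.html) of bi-encoders vs. cross-encoders from Sentence Transformers.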
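To make the cost comparison from the Example section concrete, here is a minimal sketch that evaluates both scaling formulas. The function names and the sample values for D, Q and N are illustrative assumptions, not figures from this notebook, and the per-call costs default to 1 so the results read as call counts.

```python
def brute_force_cost(num_docs, num_queries, rerank_cost=1.0):
    """Cross-encode every (query, document) pair: cost scales as D * Q."""
    return num_docs * num_queries * rerank_cost


def hybrid_cost(num_docs, num_queries, num_candidates,
                embedding_cost=1.0, rerank_cost=1.0):
    """(D + Q) * cost of embedding + (N * Q) * cost of re-ranking."""
    embedding_total = (num_docs + num_queries) * embedding_cost
    rerank_total = num_candidates * num_queries * rerank_cost
    return embedding_total + rerank_total


# Hypothetical workload: 10,000 documents, 100 queries, 20 candidates reranked per query
D, Q, N = 10_000, 100, 20
print(brute_force_cost(D, Q))  # 1000000.0
print(hybrid_cost(D, Q, N))    # 12100.0
```

With unit costs the hybrid approach needs roughly 12 thousand model calls instead of a million. In practice a reranking call is usually far more expensive than an embedding call, so the real saving depends on the two cost parameters.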

In [None]:
!pip install openai
!pip install arxiv
!pip install tenacity
!pip install pandas
!pip install tiktoken

In [1]:
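import os
from math import exp

import arxiv
import pandas as pd
import tiktoken
from openai import OpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# The walkthrough and its recorded outputs use text-davinci-003 on the Completions endpoint.
# That model has since been retired; to run the notebook today, substitute a model that the
# Completions endpoint still serves (for example gpt-3.5-turbo-instruct).
OPENAI_MODEL = "text-davinci-003"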

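## Search

In this example we'll use the arXiv search service, but this step could be performed by any search service you have. The key point is to over-fetch slightly so that all potentially relevant documents are captured before they are reranked.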

In [2]:
query = "how do bi-encoders work for sentence embeddings"
search = arxiv.Search(
    query=query, max_results=20, sort_by=arxiv.SortCriterion.Relevance
)

In [3]:
result_list = []

for result in search.results():
    result_dict = {}

    result_dict.update({"title": result.title})
    result_dict.update({"summary": result.summary})

    # Take the first URL provided
    result_dict.update({"article_url": [x.href for x in result.links][0]})
    result_dict.update({"pdf_url": [x.href for x in result.links][1]})
    result_list.append(result_dict)

In [4]:
result_list[0]

{'title': 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features',
 'summary': 'Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation, or\nquantification). We show 

In [5]:
for i, result in enumerate(result_list):
    print(f"{i + 1}: {result['title']}")

1: SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
2: Are Classes Clusters?
3: Semantic Composition in Visually Grounded Language Models
4: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
5: Learning Probabilistic Sentence Representations from Paraphrases
6: Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings
7: How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
8: Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences
9: Vec2Sent: Probing Sentence Embeddings with Natural Language Generation
10: Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
11: SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding
12: Learning Joint Representations of Videos and Sentences with Web Image Search

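## Cross-encoder

We'll create a cross-encoder using the ```Completions``` endpoint. The key factors to consider here are:
- Make your examples domain-specific: the strength of a cross-encoder shows when you tailor it to your domain.
- There is a trade-off between how many candidates you rerank and processing speed. Consider batching and parallelising cross-encoder requests to process them more quickly.

The steps are:
- Build a prompt to assess relevance, and provide a few examples to tune it to your domain.
- Add a ```logit bias``` for the ```Yes``` and ```No``` tokens to decrease the likelihood of any other token being returned.
- Return the yes/no classification together with the ```logprobs```.
- Rerank the results by the ```logprobs``` keyed on ```Yes```.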
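The suggestion to parallelise cross-encoder requests can be sketched with a thread pool, which suits calls that spend most of their time waiting on the network. This is only a sketch under stated assumptions: ```rerank_in_parallel``` and ```toy_score``` are hypothetical names, and ```toy_score``` is a word-overlap stand-in for a real scorer that would call the model and return the probability of ```Yes```.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def rerank_in_parallel(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
    max_workers: int = 4,
) -> List[Tuple[str, float]]:
    """Score every document against the query concurrently, best first."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with documents
        scores = list(pool.map(lambda doc: score_fn(query, doc), documents))
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


def toy_score(query: str, document: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the document."""
    query_words = set(query.lower().split())
    return len(query_words & set(document.lower().split())) / len(query_words)


docs = [
    "sentence embeddings with bi-encoders",
    "planting trees",
    "how bi-encoders work",
]
ranked = rerank_in_parallel("how do bi-encoders work", docs, toy_score)
for document, score in ranked:
    print(f"{score:.2f}  {document}")
```

Keep ```max_workers``` modest when the scorer calls a rate-limited API; retries with backoff, as used elsewhere in this notebook, still apply per request.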

In [6]:
tokens = [" Yes", " No"]
tokenizer = tiktoken.encoding_for_model(OPENAI_MODEL)
ids = [tokenizer.encode(token) for token in tokens]
ids[0], ids[1]

([3363], [1400])

In [7]:
prompt = '''You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. For a given input, you need to output a single token: "Yes" or "No" indicating the retrieved document is relevant to the query.

Query: How to plant a tree?
Document: """Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy."""
Relevant: No

Query: Has the coronavirus vaccine been approved?
Document: """The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020."""
Relevant: Yes

Query: What is the capital of France?
Document: """Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré."""
Relevant: Yes

Query: What are some papers to learn about PPO reinforcement learning?
Document: """Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning.
In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance."""
Relevant: Yes

Query: Explain sentence embeddings
Document: """Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.8<z<8) identified by JWST/NIRSpec as part of the JWST Advanced Deep Extragalactic Survey (JADES). Unless situated in sufficiently (re)ionised regions, Lyman-α emission from these galaxies would be strongly absorbed by neutral gas in the intergalactic medium (IGM). We conservatively estimate sizes of the ionised regions required to reconcile the relatively low Lyman-α velocity offsets (ΔvLyα<300kms−1) with moderately high Lyman-α escape fractions (fesc,Lyα>5%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles.
Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs."""
Relevant: No

Query: {query}
Document: """{document}"""
Relevant:'''


@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def document_relevance(query, document):
    # Call the Completions endpoint, which accepts a plain-text prompt
    response = client.completions.create(
        model=OPENAI_MODEL,
        prompt=prompt.format(query=query, document=document),
        temperature=0,
        # Return the log probability of the most likely token
        logprobs=1,
        # Nudge the model towards the " Yes" and " No" token IDs computed above
        logit_bias={ids[0][0]: 1, ids[1][0]: 1},
        # A single token is enough for a Yes/No answer
        max_tokens=1,
    )

    return (
        query,
        document,
        # Strip whitespace so the prediction compares cleanly against "Yes" / "No"
        response.choices[0].text.strip(),
        response.choices[0].logprobs.token_logprobs[0],
    )
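If you want to swap the few-shot examples above for ones from your own domain, one option is to assemble the prompt from a list of labelled examples rather than editing one long string. The sketch below is an assumption about how that could look: ```build_prompt```, ```INSTRUCTIONS``` and the example tuples are hypothetical names, and the layout simply mirrors the ```Query``` / ```Document``` / ```Relevant``` pattern of the prompt in this notebook.

```python
from typing import List, Tuple

INSTRUCTIONS = (
    "You are an Assistant responsible for helping detect whether the retrieved "
    "document is relevant to the query. For a given input, you need to output a "
    'single token: "Yes" or "No" indicating the retrieved document is relevant '
    "to the query."
)


def build_prompt(examples: List[Tuple[str, str, bool]], query: str, document: str) -> str:
    """Join instructions, labelled few-shot examples, and the pair to be scored."""
    blocks = [INSTRUCTIONS]
    for example_query, example_document, is_relevant in examples:
        label = "Yes" if is_relevant else "No"
        blocks.append(
            f'Query: {example_query}\nDocument: """{example_document}"""\nRelevant: {label}'
        )
    # The final block is left open so the model completes it with Yes or No
    blocks.append(f'Query: {query}\nDocument: """{document}"""\nRelevant:')
    return "\n\n".join(blocks)


# Hypothetical few-shot examples; replace them with pairs from your own domain
few_shot_examples = [
    ("How to plant a tree?", "Cars were invented in 1886.", False),
    ("What is the capital of France?", "Paris, France's capital, is a major European city.", True),
]
example_prompt = build_prompt(
    few_shot_examples, "Explain sentence embeddings", "A paper abstract goes here."
)
print(example_prompt)
```

Keeping the examples in a list makes it straightforward to load them from a file or to test different example sets against a small labelled sample.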

In [8]:
content = result_list[0]["title"] + ": " + result_list[0]["summary"]

# Set logprobs to 1 so the response includes the most probable token the model identified
response = client.completions.create(
    model=OPENAI_MODEL,
    prompt=prompt.format(query=query, document=content),
    temperature=0,
    logprobs=1,
    # Token IDs for " Yes" and " No" computed above
    logit_bias={ids[0][0]: 1, ids[1][0]: 1},
    max_tokens=1,
)

In [9]:
result = response.choices[0]
print(f"Result was {result.text}")
print(f"Logprobs was {result.logprobs.token_logprobs[0]}")
print("\nBelow is the full logprobs object\n\n")
# Response objects are not subscriptable, so read the attribute and dump it as JSON
print(result.logprobs.model_dump_json(indent=2))

Result was Yes
Logprobs was -0.05869877

Below is the full logprobs object


{
  "tokens": [
    "Yes"
  ],
  "token_logprobs": [
    -0.05869877
  ],
  "top_logprobs": [
    {
      "Yes": -0.05869877
    }
  ],
  "text_offset": [
    5764
  ]
}


In [10]:
output_list = []
for x in result_list:
    content = x["title"] + ": " + x["summary"]

    try:
        output_list.append(document_relevance(query, document=content))

    except Exception as e:
        print(e)

In [11]:
output_list[:10]

[('how do bi-encoders work for sentence embeddings',
  'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features: Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation

In [12]:
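output_df = pd.DataFrame(
    output_list, columns=["query", "document", "prediction", "logprobs"]
).reset_index()

# Use exp() to convert the log probability into a probability
output_df["probability"] = output_df["logprobs"].apply(exp)

# Express every row as the probability of "Yes" so the rows can be ranked on one scale:
# a "No" predicted with probability p is treated as a "Yes" with probability 1 - p
output_df["yes_probability"] = output_df.apply(
    lambda x: 1 - x["probability"]
    if x["prediction"] == "No"
    else x["probability"],
    axis=1,
)
output_df.head()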

| | index | query | document | prediction | logprobs | probability | yes_probability |
|---|---|---|---|---|---|---|---|
| 0 | 0 | how do bi-encoders work for sentence embeddings | SBERT studies Meaning Representations: Decompo... | Yes | -0.053264 | 0.94813 | 0.94813 |
| 1 | 1 | how do bi-encoders work for sentence embeddings | Are Classes Clusters?: Sentence embedding mode... | No | -0.009535 | 0.99051 | 0.00949 |
| 2 | 2 | how do bi-encoders work for sentence embeddings | Semantic Composition in Visually Grounded Lang... | No | -0.008887 | 0.991152 | 0.008848 |
| 3 | 3 | how do bi-encoders work for sentence embeddings | Evaluating the Construct Validity of Text Embe... | No | -0.008584 | 0.991453 | 0.008547 |
| 4 | 4 | how do bi-encoders work for sentence embeddings | Learning Probabilistic Sentence Representation... | No | -0.011976 | 0.988096 | 0.011904 |
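The conversion from a log probability to a ```Yes``` probability can be checked by hand against the first two rows of the table. The helper below is a minimal sketch; the name ```yes_probability``` is reused from the column for illustration. It assumes the model puts essentially all of its probability mass on ```Yes``` and ```No```, so that the probability of ```Yes``` is one minus the probability of ```No```.

```python
from math import exp


def yes_probability(prediction: str, logprob: float) -> float:
    """Map a Yes/No prediction and its log probability onto P(Yes)."""
    probability = exp(logprob)  # log probability -> probability
    # Assumption: Yes and No are the only likely tokens, so P(Yes) = 1 - P(No)
    return 1 - probability if prediction.strip() == "No" else probability


print(round(yes_probability("Yes", -0.053264), 5))  # 0.94813
print(round(yes_probability("No", -0.009535), 5))   # 0.00949
```

Ranking on this single scale means a confident ```No``` sinks to the bottom, while an uncertain ```No``` can still sit above it.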


In [13]:
# Return the reranked results
reranked_df = output_df.sort_values(
    by=["yes_probability"], ascending=False
).reset_index()
reranked_df.head(10)

| | level_0 | index | query | document | prediction | logprobs | probability | yes_probability |
|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 16 | how do bi-encoders work for sentence embeddings | In Search for Linear Relations in Sentence Emb... | Yes | -0.004824 | 0.995187 | 0.995187 |
| 1 | 8 | 8 | how do bi-encoders work for sentence embeddings | Vec2Sent: Probing Sentence Embeddings with Nat... | Yes | -0.004863 | 0.995149 | 0.995149 |
| 2 | 19 | 19 | how do bi-encoders work for sentence embeddings | Relational Sentence Embedding for Flexible Sem... | Yes | -0.038814 | 0.96193 | 0.96193 |
| 3 | 0 | 0 | how do bi-encoders work for sentence embeddings | SBERT studies Meaning Representations: Decompo... | Yes | -0.053264 | 0.94813 | 0.94813 |
| 4 | 15 | 15 | how do bi-encoders work for sentence embeddings | Sentence-T5: Scalable Sentence Encoders from P... | No | -0.291893 | 0.746849 | 0.253151 |
| 5 | 6 | 6 | how do bi-encoders work for sentence embeddings | How to Probe Sentence Embeddings in Low-Resour... | No | -0.015551 | 0.98457 | 0.01543 |
| 6 | 18 | 18 | how do bi-encoders work for sentence embeddings | Efficient and Flexible Topic Modeling using Pr... | No | -0.015296 | 0.98482 | 0.01518 |
| 7 | 9 | 9 | how do bi-encoders work for sentence embeddings | Non-Linguistic Supervision for Contrastive Lea... | No | -0.013869 | 0.986227 | 0.013773 |
| 8 | 12 | 12 | how do bi-encoders work for sentence embeddings | Character-based Neural Networks for Sentence P... | No | -0.012866 | 0.987216 | 0.012784 |
| 9 | 7 | 7 | how do bi-encoders work for sentence embeddings | Clustering and Network Analysis for the Embedd... | No | -0.012663 | 0.987417 | 0.012583 |


In [14]:
# Inspect the new top document after reranking
reranked_df["document"][0]

'In Search for Linear Relations in Sentence Embedding Spaces: We present an introductory investigation into continuous-space vector\nrepresentations of sentences. We acquire pairs of very similar sentences\ndiffering only by a small alterations (such as change of a noun, adding an\nadjective, noun or punctuation) from datasets for natural language inference\nusing a simple pattern method. We look into how such a small change within the\nsentence text affects its representation in the continuous space and how such\nalterations are reflected by some of the popular sentence embedding models. We\nfound that vector differences of some embeddings actually reflect small changes\nwithin a sentence.'

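## Conclusion

We've shown how to create a tailored cross-encoder to rerank academic papers. This approach works best where domain-specific nuances can be used to pick out the most relevant documents for your users, and where some pre-filtering has already limited the amount of data the cross-encoder needs to process.

A few typical use cases we've seen are:
- Returning a list of the 100 most relevant stock reports, then reordering them into a top 5 or top 10 based on the detailed context of a particular customer's portfolio.
- Running after a classic rules-based search that fetches the 100 or 1000 most relevant results, then pruning them according to a specific user's context.

### Taking this forward

The few-shot approach we've used here works well when the domain is general enough that a small number of examples covers most reranking cases. However, as the differences between documents become more specific, you may want to consider the ```Fine-tuning``` endpoint to build a more elaborate cross-encoder with a wider variety of examples.

Using ```text-davinci-003``` also has a latency impact, with even our few examples above taking a couple of seconds each. Here again the ```Fine-tuning``` endpoint may help, if you can get decent results from a fine-tuned ```ada``` or ```babbage``` model.

We've used OpenAI's ```Completions``` endpoint to build our cross-encoder, but this area is also well served by the open-source community. [Here](https://huggingface.co/jeffwan/mmarco-mMiniLMv2-L12-H384-v1) is an example from HuggingFace.

We hope you find this useful for tuning your search use cases, and look forward to seeing what you build.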