<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/cross_encoder_finetuning/cross_encoder_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# How to Finetune a Cross-Encoder Using LlamaIndex


If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.


In [None]:
%pip install llama-index-finetuning-cross-encoders
%pip install llama-index-llms-openai

In [None]:
!pip install llama-index

In [None]:
# Download requirements
!pip install datasets --quiet
!pip install sentence-transformers --quiet
!pip install openai --quiet

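## Process

- Download the QASPER dataset from the HuggingFace Hub using the Datasets library (https://huggingface.co/datasets/allenai/qasper)

- Extract 800 samples from the train split and 80 samples from the test split of the dataset

- Use the 800 samples collected from the training data, each of which pairs a research paper with questions asked about it, to generate a dataset in the format required for CrossEncoder finetuning. In the format used at present, a single finetuning sample consists of two sentences (a question and a context) and a score of either 0 or 1, where 1 means the question and the context are relevant to each other and 0 means they are not.

- Use the 80 samples from the test split to extract two kinds of evaluation datasets
  * RAG eval dataset: a dataset in which a single sample consists of the research paper content, a list of questions on the research paper, and the answers to those questions. While forming this dataset we keep only questions that have long, free-form answers, so that they compare better with RAG-generated answers.

  * Reranking eval dataset: a dataset in which a single sample consists of the research paper content, a list of questions on the research paper, and the list of research paper contexts relevant to each question

- Finetune the cross-encoder using the helper utilities written in LlamaIndex and push it to the HuggingFace Hub using the huggingface cli tokens login, which can be found here:- https://huggingface.co/settings/tokens

- Evaluate on both datasets using two metrics and three cases
     1. Using only OpenAI embeddings, without any reranker
     2. Combining OpenAI embeddings with cross-encoder/ms-marco-MiniLM-L-12-v2 as the reranker
     3. Combining OpenAI embeddings with our finetuned cross-encoder model as the reranker

* Evaluation criteria for each eval dataset
  - Hits metric: for evaluating the reranking eval dataset we simply use the retriever + post-processor functionality of LlamaIndex to count how many times the relevant context gets retrieved in each case, and call this the hits metric.

  - Pairwise Comparison Evaluator: we use the Pairwise Comparison Evaluator provided by LlamaIndex (https://github.com/run-llama/llama_index/blob/main/llama_index/evaluation/pairwise.py) to compare the responses of the query engine created in each case against the reference free-form answers provided.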
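
As a rough illustration of the finetuning format described in the process above, the sketch below builds one relevant and one irrelevant sample. `FinetuningSample` is a hypothetical stand-in written for this example, not the library's own sample class, and the query and context strings are made up.

```python
from dataclasses import dataclass


@dataclass
class FinetuningSample:
    """Hypothetical stand-in for one cross-encoder finetuning sample."""

    query: str  # a question asked about a paper
    context: str  # a chunk of text from that paper
    score: int  # 1 = chunk is relevant to the question, 0 = not relevant


# Made-up examples: one relevant pair and one irrelevant pair
samples = [
    FinetuningSample(
        query="Which dataset is used for evaluation?",
        context="We evaluate our model on the QASPER dataset.",
        score=1,
    ),
    FinetuningSample(
        query="Which dataset is used for evaluation?",
        context="We thank the anonymous reviewers for their feedback.",
        score=0,
    ),
]

# Keep only the contexts labelled as relevant
relevant_contexts = [s.context for s in samples if s.score == 1]
print(relevant_contexts)
```

The same question can appear with several contexts, so a single paper usually contributes many samples.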


## Load the Dataset


In [None]:

from datasets import load_dataset
import random

# Download the QASPER dataset from HuggingFace https://huggingface.co/datasets/allenai/qasper
dataset = load_dataset("allenai/qasper")

# Split the dataset into train, validation, and test splits
train_dataset = dataset["train"]
validation_dataset = dataset["validation"]
test_dataset = dataset["test"]

random.seed(42)  # Set a random seed for reproducibility

# Randomly sample 800 rows from the training split
train_sampled_indices = random.sample(range(len(train_dataset)), 800)
train_samples = [train_dataset[i] for i in train_sampled_indices]

# Randomly sample 80 rows from the test split
test_sampled_indices = random.sample(range(len(test_dataset)), 80)
test_samples = [test_dataset[i] for i in test_sampled_indices]

# Now we have 800 research papers for training and 80 research papers for evaluation

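## QASPER Dataset
* Each row has the following 6 columns
    - id: unique identifier of the research paper

    - title: title of the research paper

    - abstract: abstract of the research paper

    - full_text: full text of the research paper

    - qas: questions and answers pertaining to each research paper

    - figures_and_tables: figures and tables of each research paper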
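
The nested layout of a row can be sketched with a toy dictionary. All values below are invented; only the field names that the code in this notebook reads (`title`, `abstract`, `full_text`, `qas`) are mirrored, and a real row carries more fields than shown here.

```python
# A toy row that mimics the fields this notebook reads from QASPER.
# Every value is made up for illustration.
toy_row = {
    "id": "0000.00000",
    "title": "A Toy Paper",
    "abstract": "We study a toy problem.",
    "full_text": {
        "section_name": ["Introduction", "Method"],
        # One list of paragraphs per section
        "paragraphs": [["First para.", "Second para."], ["Third para."]],
    },
    "qas": {
        "question": ["What problem is studied?"],
        "answers": [
            {
                "answer": [
                    {
                        "unanswerable": False,
                        "extractive_spans": [],
                        "yes_no": None,
                        "free_form_answer": "A toy problem.",
                    }
                ]
            }
        ],
    },
}

# Section names and paragraph lists are parallel lists of equal length
sections = toy_row["full_text"]["section_name"]
paragraphs = toy_row["full_text"]["paragraphs"]
for name, paras in zip(sections, paragraphs):
    print(name, "->", len(paras), "paragraph(s)")

# Questions and answers are parallel lists as well
first_question = toy_row["qas"]["question"][0]
first_answer = toy_row["qas"]["answers"][0]["answer"][0]["free_form_answer"]
print(first_question, "=>", first_answer)
```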


In [None]:
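# Get the full paper text and the questions on each paper from the QASPER training samples, to generate the training dataset for cross-encoder finetuning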
from typing import List


# Utility function to get the full text of a research paper from the dataset
def get_full_text(sample: dict) -> str:
    """
    :param dict sample: the row sample from QASPER
    """
    title = sample["title"]
    abstract = sample["abstract"]
    sections_list = sample["full_text"]["section_name"]
    paragraph_list = sample["full_text"]["paragraphs"]
    combined_sections_with_paras = ""
    if len(sections_list) == len(paragraph_list):
        combined_sections_with_paras += title + "\t"
        combined_sections_with_paras += abstract + "\t"
        for index in range(0, len(sections_list)):
            combined_sections_with_paras += str(sections_list[index]) + "\t"
            combined_sections_with_paras += "".join(paragraph_list[index])
        return combined_sections_with_paras

    else:
        print("The number of sections and the number of paragraph lists differ")


# Utility function to extract the list of questions from the dataset
def get_questions(sample: dict) -> List[str]:
    """
    :param dict sample: the row sample from QASPER
    """
    questions_list = sample["qas"]["question"]
    return questions_list


doc_qa_dict_list = []

for train_sample in train_samples:
    full_text = get_full_text(train_sample)
    questions_list = get_questions(train_sample)
    local_dict = {"paper": full_text, "questions": questions_list}
    doc_qa_dict_list.append(local_dict)

In [None]:
len(doc_qa_dict_list)

800

In [None]:
# Save the training data as a CSV file
import pandas as pd

df_train = pd.DataFrame(doc_qa_dict_list)
df_train.to_csv("train.csv")

### Generate RAG Eval Test Data


In [None]:
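# Get the papers, questions, and answers for the eval data
"""
The answers field in the dataset follows the format below:-
Unanswerable answers have "unanswerable" set to true.

For the remaining answers, exactly one of the following fields is non-empty.

"extractive_spans" are spans in the paper which serve as the answer.
"free_form_answer" is a written-out answer.
"yes_no" is true if the answer is Yes, and false if the answer is No.

We accept only free-form answers, and for all other answer types we set the value to 'Unacceptable',
to better evaluate the performance of the query engine with the pairwise comparison evaluator, since it uses GPT-4, which is biased towards preferring long answers.
https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1

So in the case of 'yes_no' answers, it can favour the query engine's answer over the reference answer.
Likewise, in the case of extracted spans, it can favour the reference answer over the answer generated by the query engine.

"""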


eval_doc_qa_answer_list = []


# Utility function to extract the answers from the dataset
def get_answers(sample: dict) -> List[str]:
    """
    :param dict sample: the row sample from QASPER
    """
    final_answers_list = []
    answers = sample["qas"]["answers"]
    for answer in answers:
        local_answer = ""
        types_of_answers = answer["answer"][0]
        if types_of_answers["unanswerable"] == False:
            if types_of_answers["free_form_answer"] != "":
                local_answer = types_of_answers["free_form_answer"]
            else:
                local_answer = "Unacceptable"
        else:
            local_answer = "Unacceptable"

        final_answers_list.append(local_answer)

    return final_answers_list


for test_sample in test_samples:
    full_text = get_full_text(test_sample)
    questions_list = get_questions(test_sample)
    answers_list = get_answers(test_sample)
    local_dict = {
        "paper": full_text,
        "questions": questions_list,
        "answers": answers_list,
    }
    eval_doc_qa_answer_list.append(local_dict)

In [None]:
len(eval_doc_qa_answer_list)

80
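
To make the filtering rule concrete, here is a small sketch that applies the same rule to hand-written answer records. The helper `keep_free_form` and the records are hypothetical and exist only for this example; they are not part of the notebook or of QASPER.

```python
def keep_free_form(answer_record: dict) -> str:
    """Return the free-form answer, or 'Unacceptable' for every other case."""
    if answer_record["unanswerable"]:
        return "Unacceptable"
    if answer_record["free_form_answer"] != "":
        return answer_record["free_form_answer"]
    return "Unacceptable"


# Hand-written records covering the four answer types
records = [
    # Unanswerable question
    {"unanswerable": True, "free_form_answer": "",
     "yes_no": None, "extractive_spans": []},
    # Free-form answer: the only type that is kept
    {"unanswerable": False, "free_form_answer": "A toy answer.",
     "yes_no": None, "extractive_spans": []},
    # Yes/no answer
    {"unanswerable": False, "free_form_answer": "",
     "yes_no": True, "extractive_spans": []},
    # Extractive-span answer
    {"unanswerable": False, "free_form_answer": "",
     "yes_no": None, "extractive_spans": ["a span from the paper"]},
]

filtered = [keep_free_form(record) for record in records]
print(filtered)
```

Only the second record survives; the other three become the `"Unacceptable"` placeholder and are skipped during the RAG evaluation.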


In [None]:
# Save the eval data as a CSV file
import pandas as pd

df_test = pd.DataFrame(eval_doc_qa_answer_list)
df_test.to_csv("test.csv")

# The RAG eval test data can be found at the Dropbox link below
# https://www.dropbox.com/scl/fi/3lmzn6714oy358mq0vawm/test.csv?rlkey=yz16080te4van7fvnksi9kaed&dl=0

### Generate the Finetuning Dataset


In [None]:
# Download the latest version of llama-index
!pip install llama-index --quiet

In [None]:
# Generate the training dataset in the required format from the initial training data collected from QASPER
import os
from llama_index.core import SimpleDirectoryReader
import openai
from llama_index.finetuning.cross_encoders.dataset_gen import (
    generate_ce_fine_tuning_dataset,
    generate_synthetic_queries_over_documents,
)

from llama_index.finetuning.cross_encoders import CrossEncoderFinetuneEngine

os.environ["OPENAI_API_KEY"] = "sk-"
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from llama_index.core import Document

final_finetuning_data_list = []
for paper in doc_qa_dict_list:
    questions_list = paper["questions"]
    documents = [Document(text=paper["paper"])]
    local_finetuning_dataset = generate_ce_fine_tuning_dataset(
        documents=documents,
        questions_list=questions_list,
        max_chunk_length=256,
        top_k=5,
    )
    final_finetuning_data_list.extend(local_finetuning_dataset)

In [None]:
# Total number of samples in the final finetuning dataset
len(final_finetuning_data_list)

11674

In [None]:
# Save the final finetuning dataset
import pandas as pd

df_finetuning_dataset = pd.DataFrame(final_finetuning_data_list)
df_finetuning_dataset.to_csv("fine_tuning.csv")

# The finetuning dataset can be found at the Dropbox link below:-
# https://www.dropbox.com/scl/fi/zu6vtisp1j3wg2hbje5xv/fine_tuning.csv?rlkey=0jr6fud8sqk342agfjbzvwr9x&dl=0

In [None]:
# Load the finetuning dataset

finetuning_dataset = final_finetuning_data_list

In [None]:
finetuning_dataset[0]

CrossEncoderFinetuningDatasetSample(query='Do they repot results only on English data?', context='addition to precision, recall, and F1 scores for both tasks, we show the average of the F1 scores across both tasks. On the ADE dataset, we achieve SOTA results for both the NER and RE tasks. On the CoNLL04 dataset, we achieve SOTA results on the NER task, while our performance on the RE task is competitive with other recent models. On both datasets, we achieve SOTA results when considering the average F1 score across both tasks. The largest gain relative to the previous SOTA performance is on the RE task of the ADE dataset, where we see an absolute improvement of 4.5 on the macro-average F1 score.While the model of Eberts and Ulges eberts2019span outperforms our proposed architecture on the CoNLL04 RE task, their results come at the cost of greater model complexity. As mentioned above, Eberts and Ulges fine-tune the BERTBASE model, which has 110 million trainable parameters. In contrast, 

### Generate Reranking Eval Test Data


In [None]:
# Download the RAG eval test data
!wget -O test.csv https://www.dropbox.com/scl/fi/3lmzn6714oy358mq0vawm/test.csv?rlkey=yz16080te4van7fvnksi9kaed&dl=0

In [None]:
# Generate the reranking eval dataset from the eval data
import pandas as pd
import ast  # Used to safely evaluate strings as lists

# Load the eval data
df_test = pd.read_csv("/content/test.csv", index_col=0)

df_test["questions"] = df_test["questions"].apply(ast.literal_eval)
df_test["answers"] = df_test["answers"].apply(ast.literal_eval)
print(f"Number of papers in the test sample:- {len(df_test)}")

Number of papers in the test sample:- 80
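
The `ast.literal_eval` calls are needed because a list-valued cell is written to CSV as its string representation and comes back as a plain string. The sketch below reproduces that round trip with the standard library's `csv` module standing in for pandas; the row contents are made up.

```python
import ast
import csv
import io

# Write one row whose "questions" cell holds a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["paper", "questions"])
writer.writerow(
    ["Some paper text", ["What is studied?", "Which data is used?"]]
)

# Read it back: the list comes back as its string representation
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][1]
print(type(raw_cell).__name__)

# ast.literal_eval parses Python literals without executing arbitrary code
questions = ast.literal_eval(raw_cell)
print(questions)
```

Unlike `eval`, `ast.literal_eval` accepts only literals such as strings, numbers, lists, and dicts, which makes it the safer choice for data loaded from a file.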


In [None]:
from llama_index.core import Document

final_eval_data_list = []
for index, row in df_test.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    local_eval_dataset = generate_ce_fine_tuning_dataset(
        documents=documents,
        questions_list=query_list,
        max_chunk_length=256,
        top_k=5,
    )
    relevant_query_list = []
    relevant_context_list = []

    for item in local_eval_dataset:
        if item.score == 1:
            relevant_query_list.append(item.query)
            relevant_context_list.append(item.context)

    if len(relevant_query_list) > 0:
        final_eval_data_list.append(
            {
                "paper": row["paper"],
                "questions": relevant_query_list,
                "context": relevant_context_list,
            }
        )

In [None]:
# Length of the reranking eval dataset
len(final_eval_data_list)

38

In [None]:
# Save the reranking eval dataset
import pandas as pd

df_finetuning_dataset = pd.DataFrame(final_eval_data_list)
df_finetuning_dataset.to_csv("reranking_test.csv")

# The reranking dataset can be found at the Dropbox link below
# https://www.dropbox.com/scl/fi/mruo5rm46k1acm1xnecev/reranking_test.csv?rlkey=hkniwowq0xrc3m0ywjhb2gf26&dl=0

## Finetune the Cross-Encoder


In [None]:
!pip install huggingface_hub --quiet

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from sentence_transformers import SentenceTransformer

# Initialise the cross-encoder finetuning engine
finetuning_engine = CrossEncoderFinetuneEngine(
    dataset=finetuning_dataset, epochs=2, batch_size=8
)

# Finetune the cross-encoder model
finetuning_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1460 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1460 [00:00<?, ?it/s]
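
The 1460 iterations per epoch shown in the progress bar are consistent with the dataset size and batch size used above, assuming the final, smaller batch is kept rather than dropped (an assumption about the training loop, not something this notebook states):

```python
import math

num_samples = 11674  # size of the finetuning dataset reported above
batch_size = 8  # batch size passed to CrossEncoderFinetuneEngine

# Number of batches per epoch when the last, smaller batch is kept
iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)  # 1460
```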

In [None]:
# Push the model to the HuggingFace Hub
finetuning_engine.push_to_hub(
    repo_id="bpHigh/Cross-Encoder-LLamaIndex-Demo-v2"
)

pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

## Reranking Evaluation


In [None]:
!pip install nest-asyncio --quiet

In [None]:
# Attach to the same event loop
import nest_asyncio

nest_asyncio.apply()

In [None]:
# Download the reranking test data
!wget -O reranking_test.csv https://www.dropbox.com/scl/fi/mruo5rm46k1acm1xnecev/reranking_test.csv?rlkey=hkniwowq0xrc3m0ywjhb2gf26&dl=0

--2023-10-12 04:47:18--  https://www.dropbox.com/scl/fi/mruo5rm46k1acm1xnecev/reranking_test.csv?rlkey=hkniwowq0xrc3m0ywjhb2gf26
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc414efe80c7598407c86166866d.dl.dropboxusercontent.com/cd/0/inline/CFcxAwrNZkpcZLmEipK-DxnJF6BKMu8rKmoRp-FUoqRF83K1t0kG0OzBliY-8E7EmbRqkkRZENO4ayEUPgul8lzY7iyARc7kauQ4iHdGps9_Y4jHyuLstzxbVT1TDQyhotVUYWZ9uHNmDHI9UFWAKBVm/file# [following]
--2023-10-12 04:47:18--  https://uc414efe80c7598407c86166866d.dl.dropboxusercontent.com/cd/0/inline/CFcxAwrNZkpcZLmEipK-DxnJF6BKMu8rKmoRp-FUoqRF83K1t0kG0OzBliY-8E7EmbRqkkRZENO4ayEUPgul8lzY7iyARc7kauQ4iHdGps9_Y4jHyuLstzxbVT1TDQyhotVUYWZ9uHNmDHI9UFWAKBVm/file
Resolving uc414efe80c7598407c86166866d.dl.dropboxusercontent.com (uc414efe80c7598407c86166866d.dl.dropboxusercontent.com)... 162.125.80.

In [None]:
# Load the reranking dataset
import pandas as pd
import ast

df_reranking = pd.read_csv("/content/reranking_test.csv", index_col=0)
df_reranking["questions"] = df_reranking["questions"].apply(ast.literal_eval)
df_reranking["context"] = df_reranking["context"].apply(ast.literal_eval)
print(f"Number of papers in the reranking eval dataset:- {len(df_reranking)}")

Number of papers in the reranking eval dataset:- 38


In [None]:
df_reranking.head(1)

Unnamed: 0,paper,questions,context
0,Identifying Condition-Action Statements in Med...,[What supervised machine learning models do th...,[Identifying Condition-Action Statements in Me...


In [None]:
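# The evaluation is done by counting hits for each (question, context) pair:
# we retrieve the top-k documents with the question,
# and it counts as a hit if the results contain the context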
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
from llama_index.core import Settings

import os
import openai
import pandas as pd

os.environ["OPENAI_API_KEY"] = "sk-"
openai.api_key = os.environ["OPENAI_API_KEY"]

Settings.chunk_size = 256

rerank_base = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=3
)

rerank_finetuned = SentenceTransformerRerank(
    model="bpHigh/Cross-Encoder-LLamaIndex-Demo-v2", top_n=3
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/854 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [None]:
without_reranker_hits = 0
base_reranker_hits = 0
finetuned_reranker_hits = 0
total_number_of_context = 0
for index, row in df_reranking.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    context_list = row["context"]

    assert len(query_list) == len(context_list)
    vector_index = VectorStoreIndex.from_documents(documents)

    retriever_without_reranker = vector_index.as_query_engine(
        similarity_top_k=3, response_mode="no_text"
    )
    retriever_with_base_reranker = vector_index.as_query_engine(
        similarity_top_k=8,
        response_mode="no_text",
        node_postprocessors=[rerank_base],
    )
    retriever_with_finetuned_reranker = vector_index.as_query_engine(
        similarity_top_k=8,
        response_mode="no_text",
        node_postprocessors=[rerank_finetuned],
    )

    for index in range(0, len(query_list)):
        query = query_list[index]
        context = context_list[index]
        total_number_of_context += 1

        response_without_reranker = retriever_without_reranker.query(query)
        without_reranker_nodes = response_without_reranker.source_nodes

        for node in without_reranker_nodes:
            if context in node.node.text or node.node.text in context:
                without_reranker_hits += 1

        response_with_base_reranker = retriever_with_base_reranker.query(query)
        with_base_reranker_nodes = response_with_base_reranker.source_nodes

        for node in with_base_reranker_nodes:
            if context in node.node.text or node.node.text in context:
                base_reranker_hits += 1

        response_with_finetuned_reranker = (
            retriever_with_finetuned_reranker.query(query)
        )
        with_finetuned_reranker_nodes = (
            response_with_finetuned_reranker.source_nodes
        )

        for node in with_finetuned_reranker_nodes:
            if context in node.node.text or node.node.text in context:
                finetuned_reranker_hits += 1

        assert (
            len(with_finetuned_reranker_nodes)
            == len(with_base_reranker_nodes)
            == len(without_reranker_nodes)
            == 3
        )

### Results

As shown below, we get more hits with finetuned_cross_encoder than with the other options.


In [None]:
without_reranker_scores = [without_reranker_hits]
base_reranker_scores = [base_reranker_hits]
finetuned_reranker_scores = [finetuned_reranker_hits]
reranker_eval_dict = {
    "Metric": "Hits",
    "OpenAI_Embeddings": without_reranker_scores,
    "Base_cross_encoder": base_reranker_scores,
    "Finetuned_cross_encoder": finetuned_reranker_scores,
    "Total Relevant Context": total_number_of_context,
}
df_reranker_eval_results = pd.DataFrame(reranker_eval_dict)
display(df_reranker_eval_results)

Unnamed: 0,Metric,OpenAI_Embeddings,Base_cross_encoder,Finetuned_cross_encoder,Total Relevant Context
0,Hits,30,34,37,85
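
For easier comparison, the raw hit counts in the table can be expressed relative to the 85 relevant contexts. This is a small sketch using only the numbers reported above; note that the evaluation loop counts a hit per matching node, so the ratio is hits per relevant context rather than a strict recall.

```python
# Hit counts and total taken from the results table above
hits = {
    "OpenAI_Embeddings": 30,
    "Base_cross_encoder": 34,
    "Finetuned_cross_encoder": 37,
}
total_relevant_context = 85

# Hits per relevant context for each setup
hit_rates = {
    name: count / total_relevant_context for name, count in hits.items()
}
for name, rate in hit_rates.items():
    print(f"{name}: {rate:.1%}")
```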


## RAG Evaluation


In [None]:
# Download the RAG eval test data
!wget -O test.csv https://www.dropbox.com/scl/fi/3lmzn6714oy358mq0vawm/test.csv?rlkey=yz16080te4van7fvnksi9kaed&dl=0

--2023-10-12 04:47:36--  https://www.dropbox.com/scl/fi/3lmzn6714oy358mq0vawm/test.csv?rlkey=yz16080te4van7fvnksi9kaed
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb6087b1b853dad24e8201987fc.dl.dropboxusercontent.com/cd/0/inline/CFfI9UezsVwFpN4CHgYrSFveuNE01DfczDaeFGZO-Ud5VdDRff1LNG7hEhkBZwVljuRde-EZU336ASpnZs32qVePvpQEFnKB2SeplFpMt50G0m5IZepyV6pYPbNAhm0muYE_rjhlolHxRUQP_iaJBX9z/file# [following]
--2023-10-12 04:47:38--  https://ucb6087b1b853dad24e8201987fc.dl.dropboxusercontent.com/cd/0/inline/CFfI9UezsVwFpN4CHgYrSFveuNE01DfczDaeFGZO-Ud5VdDRff1LNG7hEhkBZwVljuRde-EZU336ASpnZs32qVePvpQEFnKB2SeplFpMt50G0m5IZepyV6pYPbNAhm0muYE_rjhlolHxRUQP_iaJBX9z/file
Resolving ucb6087b1b853dad24e8201987fc.dl.dropboxusercontent.com (ucb6087b1b853dad24e8201987fc.dl.dropboxusercontent.com)... 162.125.80.15, 2620:1

In [None]:
import pandas as pd
import ast  # Used to safely evaluate strings as lists

# Load the eval data
df_test = pd.read_csv("/content/test.csv", index_col=0)

df_test["questions"] = df_test["questions"].apply(ast.literal_eval)
df_test["answers"] = df_test["answers"].apply(ast.literal_eval)
print(f"Number of papers in the test sample:- {len(df_test)}")

Number of papers in the test sample:- 80


In [None]:
# Look at one sample of the eval data, which has a research paper, the questions on it, and the reference answers
df_test.head(1)

Unnamed: 0,paper,questions,answers
0,Identifying Condition-Action Statements in Med...,[What supervised machine learning models do th...,"[Unacceptable, Unacceptable, 1470 sentences, U..."


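### Baseline Evaluation

Retrieval using only OpenAI Embeddings, without any reranker.


#### Eval Method:-
1. Iterate over each row of the test dataset:-
    1. For the row currently being iterated, create a vector index from the paper document provided in the paper column of the dataset
    2. Query the vector index with a top_k value of 3, without any reranker
    3. Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list
2. Repeat step 1 until all the rows have been iterated
3. Calculate the average score over all the samples/rows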
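
The averaging in the evaluation code is done in two stages: scores are first averaged per paper over the accepted questions, and those per-paper averages are then averaged across papers, so every paper with at least one accepted question carries equal weight. A minimal sketch with made-up scores, where `None` marks a question whose reference is "Unacceptable":

```python
# Hypothetical pairwise scores per paper; None marks a skipped question
scores_per_paper = [
    [1.0, 0.5, None],
    [None, None],  # no accepted question: the paper is skipped entirely
    [0.0, 1.0],
]

paper_averages = []
for scores in scores_per_paper:
    accepted = [s for s in scores if s is not None]
    if accepted:
        paper_averages.append(sum(accepted) / len(accepted))

# Average of the per-paper averages
overall_average = sum(paper_averages) / len(paper_averages)
print(paper_averages)
print(overall_average)
```

With this scheme a paper with one accepted question influences the final score as much as a paper with ten, which differs from averaging over all questions directly.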


In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)

import os
import openai
import pandas as pd

os.environ["OPENAI_API_KEY"] = "sk-"
openai.api_key = os.environ["OPENAI_API_KEY"]

gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4_pairwise = PairwiseComparisonEvaluator(llm=gpt4)

In [None]:
pairwise_scores_list = []

no_reranker_dict_list = []


# Iterate over the rows of the dataset
for index, row in df_test.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    reference_answers_list = row["answers"]
    number_of_accepted_queries = 0
    # Create a vector index for the row currently being iterated
    vector_index = VectorStoreIndex.from_documents(documents)

    # Query the vector index with a top_k value of 3, without any reranker
    query_engine = vector_index.as_query_engine(similarity_top_k=3)

    assert len(query_list) == len(reference_answers_list)
    pairwise_local_score = 0

    for index in range(0, len(query_list)):
        query = query_list[index]
        reference = reference_answers_list[index]

        if reference != "Unacceptable":
            number_of_accepted_queries += 1

            response = str(query_engine.query(query))

            no_reranker_dict = {
                "query": query,
                "response": response,
                "reference": reference,
            }
            no_reranker_dict_list.append(no_reranker_dict)

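            # Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list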

            pairwise_eval_result = await evaluator_gpt4_pairwise.aevaluate(
                query, response=response, reference=reference
            )

            pairwise_score = pairwise_eval_result.score

            pairwise_local_score += pairwise_score

        else:
            pass

    if number_of_accepted_queries > 0:
        avg_pairwise_local_score = (
            pairwise_local_score / number_of_accepted_queries
        )
        pairwise_scores_list.append(avg_pairwise_local_score)


overal_pairwise_average_score = sum(pairwise_scores_list) / len(
    pairwise_scores_list
)

df_responses = pd.DataFrame(no_reranker_dict_list)
df_responses.to_csv("No_Reranker_Responses.csv")

In [None]:
results_dict = {
    "name": ["Without Reranker"],
    "pairwise score": [overal_pairwise_average_score],
}
results_df = pd.DataFrame(results_dict)
display(results_df)

Unnamed: 0,name,pairwise score
0,Without Reranker,0.553788


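### Evaluate with the Base Reranker

OpenAI Embeddings + `cross-encoder/ms-marco-MiniLM-L-12-v2` as the reranker


#### Eval Method:-
1. Iterate over each row of the test dataset:-
    1. For the row currently being iterated, create a vector index from the paper document provided in the paper column of the dataset
    2. Query the vector index with a top_k value of 8.
    3. Use cross-encoder/ms-marco-MiniLM-L-12-v2 as a reranker, applied as a NodePostprocessor, to keep the top 3 of the 8 retrieved nodes
    4. Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list
2. Repeat step 1 until all the rows have been iterated
3. Calculate the average score over all the samples/rows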
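
The two-stage pattern used here, where the code retrieves 8 candidates (`similarity_top_k=8`) and the reranker keeps 3 (`top_n=3`), can be sketched in plain Python. The scoring functions below are toy word-overlap stand-ins for the embedding similarity and the cross-encoder, and `retrieve_then_rerank` is a hypothetical helper for this example, not a LlamaIndex API.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    similarity_top_k: int = 8,
    top_n: int = 3,
) -> List[str]:
    """Keep similarity_top_k chunks by the first score, then top_n by the second."""
    candidates = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:similarity_top_k]
    return sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for embedding similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def overlap_ratio(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: shared words divided by chunk length."""
    words = chunk.lower().split()
    return word_overlap(query, chunk) / max(len(words), 1)


# Ten irrelevant filler chunks plus two chunks that mention the dataset
chunks = [f"filler sentence number {i}" for i in range(10)] + [
    "the model is trained on the qasper dataset",
    "qasper dataset",
]
top_chunks = retrieve_then_rerank(
    "which dataset is the model trained on", chunks, word_overlap, overlap_ratio
)
print(top_chunks)
```

The first stage is cheap and casts a wide net; the second stage scores each (query, chunk) pair jointly, which is more expensive and is therefore applied only to the shortlisted candidates.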


In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
from llama_index.core.evaluation import PairwiseComparisonEvaluator
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-"
openai.api_key = os.environ["OPENAI_API_KEY"]

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=3
)

gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4_pairwise = PairwiseComparisonEvaluator(llm=gpt4)

Downloading (…)lve/main/config.json:   0%|          | 0.00/791 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [None]:
pairwise_scores_list = []

base_reranker_dict_list = []


# Iterate over the rows of the dataset
for index, row in df_test.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    reference_answers_list = row["answers"]

    number_of_accepted_queries = 0
    # Create a vector index for the row currently being iterated
    vector_index = VectorStoreIndex.from_documents(documents)

    # Query the vector index with a top_k value of 8 nodes, using cross-encoder/ms-marco-MiniLM-L-12-v2 as the reranker
    query_engine = vector_index.as_query_engine(
        similarity_top_k=8, node_postprocessors=[rerank]
    )

    assert len(query_list) == len(reference_answers_list)
    pairwise_local_score = 0

    for index in range(0, len(query_list)):
        query = query_list[index]
        reference = reference_answers_list[index]

        if reference != "Unacceptable":
            number_of_accepted_queries += 1

            response = str(query_engine.query(query))

            base_reranker_dict = {
                "query": query,
                "response": response,
                "reference": reference,
            }
            base_reranker_dict_list.append(base_reranker_dict)

            # Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list

            pairwise_eval_result = await evaluator_gpt4_pairwise.aevaluate(
                query=query, response=response, reference=reference
            )

            pairwise_score = pairwise_eval_result.score

            pairwise_local_score += pairwise_score

        else:
            pass

    if number_of_accepted_queries > 0:
        avg_pairwise_local_score = (
            pairwise_local_score / number_of_accepted_queries
        )
        pairwise_scores_list.append(avg_pairwise_local_score)

overal_pairwise_average_score = sum(pairwise_scores_list) / len(
    pairwise_scores_list
)

df_responses = pd.DataFrame(base_reranker_dict_list)
df_responses.to_csv("Base_Reranker_Responses.csv")


In [None]:
results_dict = {
    "name": ["With base cross-encoder/ms-marco-MiniLM-L-12-v2 as Reranker"],
    "pairwise score": [overal_pairwise_average_score],
}
results_df = pd.DataFrame(results_dict)
display(results_df)

Unnamed: 0,name,pairwise score
0,With base cross-encoder/ms-marco-MiniLM-L-12-v...,0.556818


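### Evaluate with the Finetuned Reranker

OpenAI Embeddings + `bpHigh/Cross-Encoder-LLamaIndex-Demo-v2` as the reranker


#### Eval Method:-
1. Iterate over each row of the test dataset:-
    1. For the row currently being iterated, create a vector index from the paper document provided in the paper column of the dataset
    2. Query the vector index with a top_k value of 8.
    3. Use the finetuned version of cross-encoder/ms-marco-MiniLM-L-12-v2, saved as bpHigh/Cross-Encoder-LLamaIndex-Demo-v2, as a reranker, applied as a NodePostprocessor, to keep the top 3 of the 8 retrieved nodes
    4. Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list
2. Repeat step 1 until all the rows have been iterated
3. Calculate the average score over all the samples/rows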


In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Response
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
from llama_index.core.evaluation import PairwiseComparisonEvaluator
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-"
openai.api_key = os.environ["OPENAI_API_KEY"]

rerank = SentenceTransformerRerank(
    model="bpHigh/Cross-Encoder-LLamaIndex-Demo-v2", top_n=3
)


gpt4 = OpenAI(temperature=0, model="gpt-4")

evaluator_gpt4_pairwise = PairwiseComparisonEvaluator(llm=gpt4)

In [None]:
pairwise_scores_list = []

finetuned_reranker_dict_list = []

# Iterate over the rows of the dataset
for index, row in df_test.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    reference_answers_list = row["answers"]

    number_of_accepted_queries = 0
    # Create a vector index for the row currently being iterated
    vector_index = VectorStoreIndex.from_documents(documents)

    # Query the vector index with a top_k value of 8 nodes, using the finetuned bpHigh/Cross-Encoder-LLamaIndex-Demo-v2 as the reranker
    query_engine = vector_index.as_query_engine(
        similarity_top_k=8, node_postprocessors=[rerank]
    )

    assert len(query_list) == len(reference_answers_list)
    pairwise_local_score = 0

    for index in range(0, len(query_list)):
        query = query_list[index]
        reference = reference_answers_list[index]

        if reference != "Unacceptable":
            number_of_accepted_queries += 1

            response = str(query_engine.query(query))

            finetuned_reranker_dict = {
                "query": query,
                "response": response,
                "reference": reference,
            }
            finetuned_reranker_dict_list.append(finetuned_reranker_dict)

            # Compare the generated answer with the reference answer of the respective sample using the Pairwise Comparison Evaluator, and add the score to a list
            pairwise_eval_result = await evaluator_gpt4_pairwise.aevaluate(
                query, response=response, reference=reference
            )

            pairwise_score = pairwise_eval_result.score

            pairwise_local_score += pairwise_score

        else:
            pass

    if number_of_accepted_queries > 0:
        avg_pairwise_local_score = (
            pairwise_local_score / number_of_accepted_queries
        )
        pairwise_scores_list.append(avg_pairwise_local_score)

overal_pairwise_average_score = sum(pairwise_scores_list) / len(
    pairwise_scores_list
)
df_responses = pd.DataFrame(finetuned_reranker_dict_list)
df_responses.to_csv("Finetuned_Reranker_Responses.csv")

In [None]:
results_dict = {
    "name": ["With fine-tuned cross-encoder/ms-marco-MiniLM-L-12-v2"],
    "pairwise score": [overal_pairwise_average_score],
}
results_df = pd.DataFrame(results_dict)
display(results_df)

Unnamed: 0,name,pairwise score
0,With fine-tuned cross-encoder/ms-marco-MiniLM-...,0.6


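### Results

As we can see, the finetuned cross-encoder achieves the highest pairwise score.

I would point out, though, that the hits-based reranking evaluation is a more robust metric than the pairwise comparison evaluation, because I have seen inconsistencies in the scores and there are also many inherent biases when evaluating with GPT-4.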
