In this notebook, we generate a dataset of ground-truth documents for evaluating retrievers, starting from pairs of ground-truth questions and answers. Specifically, we will compare the following 3 retrieval techniques:
- a retriever based on embedding vector similarity using a sentence transformer model from HuggingFace: [msmarco model](https://huggingface.co/sentence-transformers/msmarco-distilroberta-base-v2),
- a retriever similar to the above but using a different model: [mpnet model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) (abbreviated as `mnet` in the code),
- a BM25 keyword-based [retriever](https://docs.haystack.deepset.ai/docs/inmemorybm25retriever).

In order to generate the dataset of ground-truth documents, these are the steps:
- we first load the dataset (we use the ARAGOG dataset from this [repository](https://github.com/deepset-ai/haystack-evaluation)),
- we then index the files from the dataset using the 2 embedding models above,
- we define 3 retriever Haystack pipelines for the 3 retrieval techniques above,
- we run the pipelines using _both_ the query and the ground-truth answer to generate a list of candidate documents,
- we then use an LLM to evaluate each candidate document with respect to the question and answer, and assign it a label: full match, partial match or no match,
- we do some processing of the data and save it.

In [1]:
from getpass import getpass
from IPython.display import display
import json
import os
import pandas as pd
from pathlib import Path
import sys
from typing import Any, Dict, List, Optional

from haystack import component, Pipeline, Document
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.evaluators import SASEvaluator
from haystack.components.evaluators.llm_evaluator import LLMEvaluator
from haystack.components.generators import OpenAIGenerator
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.retrievers import InMemoryEmbeddingRetriever, InMemoryBM25Retriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import Secret

# Configs

We set up some configuration parameters here, including the path to the evaluation repository (https://github.com/deepset-ai/haystack-evaluation) with the dataset that we will use.

In [2]:
pd.set_option('display.max_colwidth', 0)

In [3]:
os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY: ")

OPENAI_API_KEY:  ········


In [4]:
EVALUATION_REPO = Path(f"{os.environ['HOME']}/Developer/haystack-evaluation/")
FILEPATHS = EVALUATION_REPO / "datasets/ARAGOG/"
QA_PATH = FILEPATHS / "eval_questions_relevant_doc.json"

In [5]:
# We will keep the number of documents retrieved fixed in this notebook.
TOP_K = 3

# Load the data

In [6]:
# Use a context manager so the file handle is closed after reading.
with open(QA_PATH, "r") as f:
    question_answers = json.load(f)
QUESTIONS = question_answers['questions']
ANSWERS = question_answers['ground_truths']
GROUND_TRUTH_FILENAMES = question_answers['filepaths']
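
The three lists loaded above are iterated in parallel with `zip` in later cells, and `zip` silently stops at the shortest list. A minimal sketch of a sanity check follows; the helper `check_parallel_lists` and the toy data are hypothetical and only mirror the three keys read from the JSON file.

```python
from typing import Dict, List, Sequence


def check_parallel_lists(
    qa: Dict[str, List[str]],
    keys: Sequence[str] = ("questions", "ground_truths", "filepaths"),
) -> int:
    """Return the common length of the lists under `keys`, or raise if they differ."""
    lengths = {key: len(qa[key]) for key in keys}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Lists have different lengths: {lengths}")
    return next(iter(lengths.values()))


# Toy stand-in for the content of the QA file (not the real data).
toy_qa = {
    "questions": ["q1", "q2"],
    "ground_truths": ["a1", "a2"],
    "filepaths": ["bert.pdf", "bert.pdf"],
}
n_pairs = check_parallel_lists(toy_qa)
print(n_pairs)
```

In the notebook, the same check could be applied to `question_answers` right after loading it.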

# Retrieve documents using different retrieval techniques

## Define the indexing pipeline

The function below is based on a version from [here](https://github.com/deepset-ai/haystack-evaluation/blob/6db15f828628a1f31cd54d8657345aaa870cf40e/evaluations/evaluation_aragog.py#L26).

This function creates an indexing pipeline which reads the input PDF files, splits them into documents of size 3 sentences with an overlap of 1 sentence, and embeds them with a [sentence transformer](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder).

In [7]:
def embedding_indexing(embedding_model: str):
    full_path = Path(FILEPATHS)
    files_path = full_path / "papers_for_questions"
    document_store = InMemoryDocumentStore()
    pipeline = Pipeline()
    pipeline.add_component("converter", PyPDFToDocument())
    pipeline.add_component("cleaner", DocumentCleaner())
    pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=3, split_overlap=1))  # chunks of 3 sentences with a 1-sentence overlap
    pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))
    pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(embedding_model))
    pipeline.connect("converter", "cleaner")
    pipeline.connect("cleaner", "splitter")
    pipeline.connect("splitter", "embedder")
    pipeline.connect("embedder", "writer")
    pdf_files = [full_path / "papers_for_questions" / f_name for f_name in os.listdir(files_path)]
    pipeline.run({"converter": {"sources": pdf_files}})

    return document_store
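
To make the splitting parameters concrete, here is a minimal pure-Python sketch of a sliding window of 3 sentences with an overlap of 1 sentence. The helper `window_sentences` is hypothetical and works on a list of sentences that has already been split; `DocumentSplitter` performs its own sentence detection and may treat edge cases differently.

```python
from typing import List


def window_sentences(sentences: List[str], length: int = 3, overlap: int = 1) -> List[List[str]]:
    """Group sentences into windows of `length` that share `overlap` sentences with the previous window."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + length])
        if start + length >= len(sentences):
            break  # the last window already reaches the end of the text
    return windows


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
chunks = window_sentences(sentences)
print(chunks)
```

With these settings, the last sentence of each chunk is repeated as the first sentence of the next chunk.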

## Define retriever pipelines

The function below takes an embedding model and a document store and defines a retriever pipeline, which will take a user query, embed it and use the query embedding to retrieve relevant documents from the document store.

In [8]:
def get_retriever_embedding_pipeline(embedding_model, doc_store):
    retriever_embedding_pipeline = Pipeline()
    retriever_embedding_pipeline.add_component("query_embedder", SentenceTransformersTextEmbedder(
        model=embedding_model, progress_bar=False
    ))
    
    retriever_embedding_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(doc_store, top_k=TOP_K))
    
    retriever_embedding_pipeline.connect("query_embedder", "retriever.query_embedding")
    return retriever_embedding_pipeline

We now run indexing of documents using two different sentence transformer models from HuggingFace:
- the [msmarco model](https://huggingface.co/sentence-transformers/msmarco-distilroberta-base-v2)
- the [mpnet model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

We also define the retriever pipelines for the document stores created with the two models above.

In [9]:
doc_store_msmarco = embedding_indexing("sentence-transformers/msmarco-distilroberta-base-v2")
retriever_msmarco_pipeline = get_retriever_embedding_pipeline("sentence-transformers/msmarco-distilroberta-base-v2", doc_store_msmarco)

doc_store_mnet = embedding_indexing("sentence-transformers/all-mpnet-base-v2")
retriever_mnet_pipeline = get_retriever_embedding_pipeline("sentence-transformers/all-mpnet-base-v2", doc_store_mnet)


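We also define a keyword-based BM25 retriever. It is used as a standalone component rather than inside a pipeline. BM25 scores documents on their text rather than on their embeddings, so it can reuse the document store built for the msmarco model.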

In [10]:
retriever_bm25 = InMemoryBM25Retriever(doc_store_msmarco, top_k=TOP_K)

In [11]:
for doc in doc_store_msmarco.filter_documents()[:2]:
    print(doc.content)

Published as a conference paper at ICLR 2021
MEASURING MASSIVE MULTITASK
LANGUAGE UNDERSTANDING
Dan Hendrycks
UC BerkeleyCollin Burns
Columbia UniversitySteven Basart
UChicagoAndy Zou
UC Berkeley
Mantas Mazeika
UIUCDawn Song
UC BerkeleyJacob Steinhardt
UC Berkeley
ABSTRACT
We propose a new test to measure a text model’s multitask accuracy. The test
covers 57 tasks including elementary mathematics, US history, computer science,
law, and more. To attain high accuracy on this test, models must possess extensive
world knowledge and problem solving ability.
 To attain high accuracy on this test, models must possess extensive
world knowledge and problem solving ability. We ﬁnd that while most recent
models have near random-chance accuracy, the very largest GPT-3 model improves
over random chance by almost 20 percentage points on average. However, on every
one of the 57 tasks, the best models still need substantial improvements before
they can reach expert-level accuracy.


In [12]:
for doc in doc_store_mnet.filter_documents()[:2]:
    print(doc.content)

Published as a conference paper at ICLR 2021
MEASURING MASSIVE MULTITASK
LANGUAGE UNDERSTANDING
Dan Hendrycks
UC BerkeleyCollin Burns
Columbia UniversitySteven Basart
UChicagoAndy Zou
UC Berkeley
Mantas Mazeika
UIUCDawn Song
UC BerkeleyJacob Steinhardt
UC Berkeley
ABSTRACT
We propose a new test to measure a text model’s multitask accuracy. The test
covers 57 tasks including elementary mathematics, US history, computer science,
law, and more. To attain high accuracy on this test, models must possess extensive
world knowledge and problem solving ability.
 To attain high accuracy on this test, models must possess extensive
world knowledge and problem solving ability. We ﬁnd that while most recent
models have near random-chance accuracy, the very largest GPT-3 model improves
over random chance by almost 20 percentage points on average. However, on every
one of the 57 tasks, the best models still need substantial improvements before
they can reach expert-level accuracy.


## Get candidate retrieved documents

Here we add two functions: `get_responses`, which retrieves documents given a retriever pipeline that works with an embedding model, and `get_responses_bm25`, which retrieves documents using the BM25 technique.

We run the retriever on two queries, one that only contains the original question, and a second query that is a concatenation of the question and ground-truth answer.

In [13]:
def get_filters_dict(filename):
    filters = {'operator': 'AND', 
                'conditions': [{'field': 'meta.file_path',
                               'value': str(FILEPATHS / f'papers_for_questions/{filename}'),
                               'operator': '=='}]}
    return filters


def get_answer_query(question, answer):
    return question + " " + answer

def get_responses(retriever_pipeline):
    responses = []
    for ind, (question, answer, ground_truth_filename) in enumerate(zip(QUESTIONS, ANSWERS, GROUND_TRUTH_FILENAMES)):

        filters = get_filters_dict(ground_truth_filename)
        response_with_answer = retriever_pipeline.run({"query_embedder": {"text": get_answer_query(question, answer)},
                                                             "retriever": {"filters": filters}})
        
        response_without_answer = retriever_pipeline.run({"query_embedder": {"text": question},
                                                             "retriever": {"filters": filters}})

        combined_docs = []
        doc_ids = set()
        for doc in response_with_answer['retriever']['documents']:
            if doc.id not in doc_ids:
                doc_ids.add(doc.id)
                combined_docs.append(doc)

        for doc in response_without_answer['retriever']['documents']:
            if doc.id not in doc_ids:
                doc_ids.add(doc.id)
                combined_docs.append(doc)
            
        responses.append(combined_docs)
    return responses

def get_responses_bm25(retriever):
    responses = []
    for ind, (question, answer, ground_truth_filename) in enumerate(zip(QUESTIONS, ANSWERS, GROUND_TRUTH_FILENAMES)):

        filters = get_filters_dict(ground_truth_filename)
        response_with_answer = retriever.run(query=get_answer_query(question, answer), filters=filters)
        response_without_answer = retriever.run(query=question, filters=filters)
        
        combined_docs = []
        doc_ids = set()
        for doc in response_with_answer['documents']:
            if doc.id not in doc_ids:
                doc_ids.add(doc.id)
                combined_docs.append(doc)

        for doc in response_without_answer['documents']:
            if doc.id not in doc_ids:
                doc_ids.add(doc.id)
                combined_docs.append(doc)
            
        responses.append(combined_docs)
        
    return responses
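
Both functions above merge the results of the two queries and keep only the first occurrence of each document. A minimal sketch of that merge step follows, assuming toy dictionaries in place of Haystack documents; the helper `merge_unique` is hypothetical and not part of the notebook.

```python
from typing import Dict, Iterable, List


def merge_unique(*result_lists: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Concatenate result lists, keeping only the first occurrence of each document id."""
    seen_ids = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc["id"] not in seen_ids:
                seen_ids.add(doc["id"])
                merged.append(doc)
    return merged


# Toy stand-ins for retrieved documents (real Haystack documents expose `doc.id`).
with_answer = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
without_answer = [{"id": "b"}, {"id": "d"}, {"id": "a"}]
combined = merge_unique(with_answer, without_answer)
print([doc["id"] for doc in combined])
```

Since `TOP_K` is 3 and each retriever is run on two queries, each retriever contributes at most 6 candidate documents per question.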

We run the 3 retrievers over all QA pairs.

In [14]:
doc_store_msmarco.filter_documents()[0]

Document(id=769942adbb20e0177327a764835dd625852faee2d62cefd0bafe61714a7fb89e, content: 'Published as a conference paper at ICLR 2021
MEASURING MASSIVE MULTITASK
LANGUAGE UNDERSTANDING
Dan ...', meta: {'file_path': '/Users/maria/Developer/haystack-evaluation/datasets/ARAGOG/papers_for_questions/MMLU_measure.pdf', 'source_id': '102c5853f251e4158b8acc51d566c5e82e9ac8437952bc02d13f0225d7581063', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '965b199619695ee640c80b6e44ec3a2443834f60b1111522bc18d569acc47bcf', 'range': (0, 112)}]}, embedding: vector of size 768)

In [15]:
responses_msmarco = get_responses(retriever_msmarco_pipeline)

In [16]:
len(responses_msmarco)

107

In [17]:
responses_msmarco[0]

[Document(id=ea14f3061ed516b1ffddd6b9c5f0176755797ed1cc46756c19adc7142b923c49, content: '
 • We show that pre-trained representations reduce
 the need for many heavily-engineered task-
 speciﬁ...', meta: {'file_path': '/Users/maria/Developer/haystack-evaluation/datasets/ARAGOG/papers_for_questions/bert.pdf', 'source_id': '416e31eda451ea7ae9495e88fdc271e6ef2c17fe5bf09a89c0bca54efe029bc3', 'page_number': 2, 'split_id': 24, 'split_idx_start': 4824, '_split_overlap': [{'doc_id': 'b48a07c84d9048e9b2ee23d63f2aa445d02dee758a75e4d1941c7e6d38c2ddc7', 'range': (148, 263)}, {'doc_id': '1152da2a6dea5a4a994a0f18e45a3fc30dc9ef9cd41e14a71c6420cf88e14de2', 'range': (0, 58)}]}, score: 155.0593423059841),
 Document(id=55137fcdaaac7cd1c82eb33c64ce609d6f4915a2a037f2f9d72aa015a45dcc8f, content: ', 2018), BERT is designed to pre-
 train deep bidirectional representations from
 unlabeled text by jo...', meta: {'file_path': '/Users/maria/Developer/haystack-evaluation/datasets/ARAGOG/papers_for_questions/bert.

In [18]:
responses_msmarco[0][0].content

'\n• We show that pre-trained representations reduce\nthe need for many heavily-engineered task-\nspeciﬁc architectures. BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level andtoken-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks.'

In [19]:
responses_mnet = get_responses(retriever_mnet_pipeline)

In [20]:
responses_mnet[0][0].content

'6 84.9\nTable 5: Ablation over the pre-training tasks using the\nBERT BASE architecture. “No NSP” is trained without\nthe next sentence prediction task.'

In [21]:
responses_bm25 = get_responses_bm25(retriever_bm25)

In [22]:
responses_bm25[0][0].content

'\n5.1 Effect of Pre-training Tasks\nWe demonstrate the importance of the deep bidi-\nrectionality of BERT by evaluating two pre-\ntraining objectives using exactly the same pre-\ntraining data, ﬁne-tuning scheme, and hyperpa-\nrameters as BERT BASE :\nNo NSP : A bidirectional model which is trained\nusing the “masked LM” (MLM) but without the\n“next sentence prediction” (NSP) task.\nLTR & No NSP : A left-context-only model which\nis trained using a standard Left-to-Right (LTR)\nLM, rather than an MLM.'

Now we will create a final list of candidate relevant documents by adding all the documents retrieved above and making sure the list does not contain any duplicates.

In [23]:
candidate_docs = []

for ind, (question, answer, ground_truth_filename) in enumerate(zip(QUESTIONS, ANSWERS, GROUND_TRUTH_FILENAMES)):
    
    resp_msmarco = responses_msmarco[ind]
    resp_mnet = responses_mnet[ind]
    resp_bm25 = responses_bm25[ind]

    new_row = {"question": question,
              "answer": answer,
              "filename": ground_truth_filename}

    doc_ids = set([])
    new_docs = []
    
    for doc in resp_msmarco:
        doc_ids.add(doc.id)
        new_docs.append({"content": doc.content, 
                         "score": doc.score, 
                         "method": "embedding msmarco"})

    for doc in resp_mnet:
        if not doc.id in doc_ids:
            doc_ids.add(doc.id)
            new_docs.append({"content": doc.content, 
                             "score": doc.score, 
                             "method": "embedding mnet"})

    for doc in resp_bm25:
        if not doc.id in doc_ids:
            doc_ids.add(doc.id)
            new_docs.append({"content": doc.content, 
                             "score": doc.score, 
                             "method": "BM25"})

    new_row['docs'] = new_docs
    candidate_docs.append(new_row)        

In [24]:
candidate_docs[0]

{'question': 'What are the two main tasks BERT is pre-trained on?',
 'answer': 'Masked LM (MLM) and Next Sentence Prediction (NSP).',
 'filename': 'bert.pdf',
 'docs': [{'content': '\n• We show that pre-trained representations reduce\nthe need for many heavily-engineered task-\nspeciﬁc architectures. BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level andtoken-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks.',
   'score': 155.0593423059841,
   'method': 'embedding msmarco'},
  {'content': ', 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. As a re-\nsult, the pre-trained BERT model can be ﬁne-\ntuned with just one additional output layer\nto create state-of-the-art models for a wide\nrange of tasks, such as questi

In [25]:
candidate_docs[1]

{'question': 'What model sizes are reported for BERT, and what are their specifications?',
 'answer': 'BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).',
 'filename': 'bert.pdf',
 'docs': [{'content': '3\nWe primarily report results on two model sizes:\nBERT BASE (L=12, H=768, A=12, Total Param-\neters=110M) and BERT LARGE (L=24, H=1024,\nA=16, Total Parameters=340M).\nBERT BASE was chosen to have the same model\nsize as OpenAI GPT for comparison purposes.\nCritically, however, the BERT Transformer uses\nbidirectional self-attention, while the GPT Trans-\nformer uses constrained self-attention where every\ntoken can only attend to context to its left.',
   'score': 194.82163561646738,
   'method': 'embedding msmarco'},
  {'content': ', 2018). By contrast, BERT BASE\ncontains 110M parameters and BERT LARGE con-\ntains 340M parameters.\nIt has long been known that increasing the\nmodel size will lead to continual improvements\

In [26]:
candidate_docs[2]

{'question': "How does BERT's architecture facilitate the use of a unified model across diverse NLP tasks?",
 'answer': 'BERT uses a multi-layer bidirectional Transformer encoder architecture, allowing for minimal task-specific architecture modifications in fine-tuning.',
 'filename': 'bert.pdf',
 'docs': [{'content': '\n• We show that pre-trained representations reduce\nthe need for many heavily-engineered task-\nspeciﬁc architectures. BERT is the ﬁrst ﬁne-\ntuning based representation model that achieves\nstate-of-the-art performance on a large suite\nof sentence-level andtoken-level tasks, outper-\nforming many task-speciﬁc architectures.\n• BERT advances the state of the art for eleven\nNLP tasks.',
   'score': 178.45485510132914,
   'method': 'embedding msmarco'},
  {'content': ', 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. As a re-\nsult, the pre-trained BERT 

We extract the questions, answers and contexts (content of the documents) from the list of candidate documents.

In [27]:
questions = [row['question'] for row in candidate_docs]
answers = [row['answer'] for row in candidate_docs]
contexts = [row['docs'] for row in candidate_docs]
contexts_contents = [[doc['content'] for doc in docs] for docs in contexts]

# Score the documents with an LLM

In this section, we go through the list of candidate documents and score them with an LLM. To this end, we create a custom metric `SingleContextRelevanceEvaluator` that inherits from the Haystack `LLMEvaluator` class. This metric labels every document using the prompt below:

In [28]:
PROMPT_TEMPLATE = """
Your task is to judge how relevant the provided context is for answering a question. We will provide both the question and the true answer.

Please return one of the following categories:
- full: if the provided context has all the information in the answer to address the question entirely,
- partial: if the provided context has some information in the answer necessary to address the question,
- no: if the provided context has no information absolutely necessary to answer the question. 

The response should be in json format and one of the following:
{"response": "full"} or {"response": "partial"} or {"response": "no"}
"""

Note that the `LLMEvaluator` class will add the question and answer to the prompt before passing it to the LLM.

In [29]:
@component
class SingleContextRelevanceEvaluator(LLMEvaluator):
    """
    Evaluator that checks if a provided context is relevant to the question.
    """

    def __init__(
            self,
            examples: Optional[List[Dict[str, Any]]] = None,
            progress_bar: bool = True,
            api: str = "openai",
            api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
            raise_on_failure: bool = True,
    ):
        """
        Creates an instance of SingleContextRelevanceEvaluator.
        """
        self.instructions = PROMPT_TEMPLATE
        self.inputs = [("questions", List[str]), ("contexts", List[List[str]]), ("answers", List[str])]
        self.outputs = ["response"]
        self.examples = [{"inputs": {}, "outputs": {}}]
        self.api = api
        self.api_key = api_key

        super(SingleContextRelevanceEvaluator, self).__init__(
            instructions=self.instructions,
            inputs=self.inputs,
            outputs=self.outputs,
            examples=self.examples,
            api=self.api,
            api_key=self.api_key,
            raise_on_failure=raise_on_failure,
            progress_bar=progress_bar,
        )

    @component.output_types(results=List[List[str]])
    def run(self, questions: List[str],
            contexts: List[List[str]],
            answers: List[str]) -> Dict[str, Any]:
        """
        Run the LLM evaluator on the contexts.
        """
        all_results = []
        for question, answer, retrieved_contexts in zip(questions, answers, contexts):
            question_results = []
            for retrieved_context in retrieved_contexts:
                result = super(SingleContextRelevanceEvaluator, self).run(questions=[question],
                                                                        contexts=[retrieved_context],
                                                                        answers=[answer])
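                # Use a separate name for the label so that the ground-truth `answer`
                # is not overwritten before the next context is evaluated.
                try:
                    label = result['results'][0]["response"]
                except Exception:
                    label = "error"
                question_results.append(label)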
            all_results.append(question_results)

        return {"results": all_results}
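
The prompt asks the LLM for a JSON object with a single `response` key, and the evaluator falls back to `"error"` when no label can be read. The sketch below illustrates that contract on raw strings. The helper `parse_judgment` is hypothetical: it is not part of Haystack and does not describe how `LLMEvaluator` parses replies internally.

```python
import json

VALID_LABELS = {"full", "partial", "no"}


def parse_judgment(raw_reply: str) -> str:
    """Return the label from a reply such as '{"response": "full"}', or "error" if it is unusable."""
    try:
        label = json.loads(raw_reply)["response"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not valid JSON, no "response" key, or not a JSON object at all.
        return "error"
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return "error"


print(parse_judgment('{"response": "partial"}'))
print(parse_judgment("I think the context is relevant."))
```

Mapping every unusable reply to a single `"error"` label keeps one label per candidate document, so the labels stay aligned with the documents they describe.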

In [None]:
evaluator = SingleContextRelevanceEvaluator(raise_on_failure=False)

evaluator.generator = OpenAIGenerator(
    model="gpt-4-turbo",
    generation_kwargs={"response_format": {"type": "json_object"}, "seed": 42},

)

results = evaluator.run(questions=questions, 
                        contexts=contexts_contents, 
                        answers=answers)

Process the list of candidate documents by adding the LLM labels.

In [31]:
candidate_docs_with_LLM_scores = []

for dataset_row, row_context_results in zip(candidate_docs, results['results']):
    new_row = dataset_row.copy()
    contexts_with_scores = []
    for context, context_score in zip(dataset_row['docs'], row_context_results):
        context['LLM_judgment'] = context_score
        contexts_with_scores.append(context)
    new_row['docs'] = contexts_with_scores
    candidate_docs_with_LLM_scores.append(new_row)

In [32]:
candidate_docs_with_LLM_scores[1]

{'question': 'What model sizes are reported for BERT, and what are their specifications?',
 'answer': 'BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).',
 'filename': 'bert.pdf',
 'docs': [{'content': '3\nWe primarily report results on two model sizes:\nBERT BASE (L=12, H=768, A=12, Total Param-\neters=110M) and BERT LARGE (L=24, H=1024,\nA=16, Total Parameters=340M).\nBERT BASE was chosen to have the same model\nsize as OpenAI GPT for comparison purposes.\nCritically, however, the BERT Transformer uses\nbidirectional self-attention, while the GPT Trans-\nformer uses constrained self-attention where every\ntoken can only attend to context to its left.',
   'score': 194.82163561646738,
   'method': 'embedding msmarco',
   'LLM_judgment': 'full'},
  {'content': ', 2018). By contrast, BERT BASE\ncontains 110M parameters and BERT LARGE con-\ntains 340M parameters.\nIt has long been known that increasing the\nmodel size will lead

## Clean & save the dataset

In this section, we remove any QA pair that has no document labelled as relevant by the LLM. We also sort the documents, so that documents that have been scored as a "full" match rank higher than the ones that have been given a "partial" match. This ordering matters for some of the retrieval metrics used later.

In [33]:
def prioritise_scores(score):
    if score == "full":
        return 0
    if score == "partial":
        return 1
    return 2

In [34]:
final_dataset = []
for row in candidate_docs_with_LLM_scores:
    new_row = row.copy()
    docs = row['docs']
    filtered_docs = [doc for doc in docs if doc['LLM_judgment'] in ["full", "partial"]]
    sorted_docs = sorted(filtered_docs, key=lambda x: prioritise_scores(x['LLM_judgment']))
    if len(sorted_docs):
        new_row['docs'] = sorted_docs
        final_dataset.append(new_row)
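
The effect of the filtering and sorting can be checked on toy data. This is a minimal sketch: the helper `keep_and_rank` and the toy documents are hypothetical, and only the `LLM_judgment` key mirrors the notebook.

```python
from typing import Dict, List

# Lower value means higher priority; labels missing from the mapping are dropped.
PRIORITY = {"full": 0, "partial": 1}


def keep_and_rank(docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop documents that are not labelled "full" or "partial", then put "full" matches first."""
    relevant = [doc for doc in docs if doc["LLM_judgment"] in PRIORITY]
    return sorted(relevant, key=lambda doc: PRIORITY[doc["LLM_judgment"]])


toy_docs = [
    {"content": "A", "LLM_judgment": "partial"},
    {"content": "B", "LLM_judgment": "no"},
    {"content": "C", "LLM_judgment": "full"},
    {"content": "D", "LLM_judgment": "partial"},
    {"content": "E", "LLM_judgment": "error"},
]
ranked = keep_and_rank(toy_docs)
print([doc["content"] for doc in ranked])
```

Python's `sorted` is stable, so documents with the same label keep the order in which they were added to the candidate list.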

In [35]:
final_dataset[1]

{'question': 'What model sizes are reported for BERT, and what are their specifications?',
 'answer': 'BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).',
 'filename': 'bert.pdf',
 'docs': [{'content': '3\nWe primarily report results on two model sizes:\nBERT BASE (L=12, H=768, A=12, Total Param-\neters=110M) and BERT LARGE (L=24, H=1024,\nA=16, Total Parameters=340M).\nBERT BASE was chosen to have the same model\nsize as OpenAI GPT for comparison purposes.\nCritically, however, the BERT Transformer uses\nbidirectional self-attention, while the GPT Trans-\nformer uses constrained self-attention where every\ntoken can only attend to context to its left.',
   'score': 194.82163561646738,
   'method': 'embedding msmarco',
   'LLM_judgment': 'full'},
  {'content': ', 2018). By contrast, BERT BASE\ncontains 110M parameters and BERT LARGE con-\ntains 340M parameters.\nIt has long been known that increasing the\nmodel size will lead

In [36]:
# json.dump(final_dataset, open("final_dataset.json", "w"))
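
A minimal sketch of saving the dataset with a context manager and reading it back to verify the round trip. The helpers `save_dataset` and `load_dataset` and the toy row are hypothetical; the file name is taken from the commented line above, and the sketch writes to a temporary directory rather than the working directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_dataset(rows: List[Dict[str, Any]], path: Path) -> None:
    """Write the rows as UTF-8 JSON; the file is closed when the block exits."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def load_dataset(path: Path) -> List[Dict[str, Any]]:
    """Read the rows back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy row with the same keys as the rows built in this notebook.
toy_dataset = [{
    "question": "q",
    "answer": "a",
    "filename": "bert.pdf",
    "docs": [{"content": "c", "score": 1.0, "method": "BM25", "LLM_judgment": "full"}],
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "final_dataset.json"
    save_dataset(toy_dataset, out_path)
    round_trip = load_dataset(out_path)

print(round_trip == toy_dataset)
```

Setting `ensure_ascii=False` keeps non-ASCII characters from the PDF text readable in the saved file instead of escaping them.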