# RAG-based Q&A on D&D #

# 1. Pulling the API-data from the website #

The first step is to pull the information from the API website (link: https://www.dnd5eapi.co/api/2014) and save the entries from its tables into dictionaries. These dictionaries are then written to JSON files, so that the information is stored persistently and is independent of the API and its availability.

The steps that were taken to pull the information into dictionaries can be found in file_construction.ipynb.
Please note that the content of the API is older and that major changes to the API might cause that code to stop working.
The content of the API is stored in the "api_data" folder and should not be overwritten.
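
The pulling step itself lives in file_construction.ipynb and is not shown here. As a rough sketch of the idea only: the helper below walks over a list of categories, reads the listing of each category and stores every entry under its index. The name `collect_api_tables`, the injected `fetch_json` callable and the assumed response layout (a `results` list whose items carry `index` and `url`) are illustrative assumptions, not the code that was actually used.

```python
from typing import Callable, Dict, List

BASE_PATH = "/api/2014"  # path taken from the link above


def collect_api_tables(
    fetch_json: Callable[[str], dict], categories: List[str]
) -> Dict[str, Dict[str, dict]]:
    """Collect all entries per category into nested dictionaries.

    fetch_json is any callable that maps an API path to the decoded JSON
    (for example a thin wrapper around an HTTP client). Passing it in keeps
    the sketch independent of the network.
    """
    api_data: Dict[str, Dict[str, dict]] = {}
    for category in categories:
        listing = fetch_json(f"{BASE_PATH}/{category}")
        entries = {}
        # Assumption: every listing has a "results" list with "index" and "url".
        for item in listing.get("results", []):
            entries[item["index"]] = fetch_json(item["url"])
        api_data[category] = entries
    return api_data


# Usage with a tiny in-memory stand-in for the API (made-up data):
fake_api = {
    "/api/2014/races": {"results": [{"index": "elf", "url": "/api/2014/races/elf"}]},
    "/api/2014/races/elf": {"index": "elf", "name": "Elf", "speed": 30},
}
tables = collect_api_tables(fake_api.__getitem__, ["races"])
```

The resulting layout (category, then index, then entry) matches what the conversion function further down iterates over.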

In [None]:
# All needed packages and installations
%pip install -U datasets huggingface_hub fsspec
!python -m spacy download en_core_web_sm
%pip install haystack-ai
%pip install google-genai-haystack
%pip install "sentence-transformers>=4.1.0"
%pip install "fsspec==2023.9.2"
%pip install "sentence-transformers>=4.1.0" "huggingface_hub>=0.23.0"
%pip install transformers[torch,sentencepiece]
%pip install huggingface_hub[hf_xet]

In [None]:
# All needed imports
import pprint
import json
import spacy
import os
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever, InMemoryBM25Retriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator
from haystack.utils import Secret
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.joiners import DocumentJoiner
from haystack.components.rankers import SentenceTransformersSimilarityRanker

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\susib\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

### The next steps ###

The next step is to create a dataset, and then a document store, from the completed JSON file, which is later used to retrieve information. To make the important fields 'name' and 'desc' the retrievable content and the other fields the metadata, each entry of the JSON file has to follow this format:

dict: {
    'content': 'name' and 'desc',
    'meta': every field except 'desc' that contains information
}

In addition, a new metadata field called 'category' is added for better response filtering later on. Its value is the key under which each entry was stored in the previous dictionary.

### The sentence and later retrieval transformer: ###
The embedding model is multi-qa-distilbert-cos-v1 ("This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search" - https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1). As the model has a limit of 512 word pieces (longer text is truncated), the documents are passed through a DocumentSplitter with a split length of 512 and an overlap of 50 before they are written to the document store.
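
To make the idea of overlapping chunks concrete, here is a minimal sketch of a sliding window over a list of units. Whether a unit is a word, a sentence or a token depends on how the splitter is configured; the function name and the toy numbers are illustrative assumptions and not the splitter's actual implementation.

```python
def split_with_overlap(units, split_length, split_overlap):
    """Cut a list of units into chunks of split_length that share split_overlap units."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        # Stop once the current chunk already reaches the end of the input.
        if start + split_length >= len(units):
            break
    return chunks


# Ten units, chunks of four, one unit shared between neighbouring chunks:
chunks = split_with_overlap(list(range(10)), split_length=4, split_overlap=1)
```

Each chunk starts `split_length - split_overlap` units after the previous one, so the last units of one chunk reappear at the start of the next.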

According to Hugging Face, "the model was trained with MultipleNegativesRankingLoss using Mean-pooling, cosine-similarity as similarity function, and a scale of 20" (https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1), which is the basis of the similarity scores it produces.
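
For readers unfamiliar with this loss, the following sketch shows the general idea in plain Python: every question in a batch is compared with every answer via cosine similarity, the similarities are multiplied by the scale, and a cross-entropy loss rewards the matching answer over the other answers in the batch. The function names and the two-dimensional toy vectors are assumptions for illustration; this is not the sentence-transformers implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(queries, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for queries[i]
    and all other answers in the batch act as negatives."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, answer) for answer in answers]
        # Numerically stable log-sum-exp over all candidate answers.
        highest = max(logits)
        log_sum = highest + math.log(sum(math.exp(x - highest) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


aligned = multiple_negatives_ranking_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = multiple_negatives_ranking_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

With matching pairs the loss is close to zero, while swapping the answers makes it large, which is the behaviour that pushes questions and their answers together in the vector space.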

Besides the content, almost all metadata will be embedded as well, so that this information is also accessible to the retrievers. Metadata filters themselves do not provide information; they only set the scope of the documents that are searched. This is why the search task becomes difficult in some cases (for example 'race').
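
What "including metadata in the embedding" roughly amounts to can be sketched as follows: the selected metadata values are turned into strings and joined with the content, and that combined text is what gets embedded. The helper name, the separator and the toy entry are assumptions for illustration, simplified from how the document embedder behaves.

```python
def text_to_embed(content, meta, meta_fields, separator="\n"):
    """Join the chosen metadata values and the content into one string."""
    meta_values = [str(meta[key]) for key in meta_fields if meta.get(key) is not None]
    return separator.join(meta_values + [content])


# Made-up entry: "size" is requested but missing, so it is simply skipped.
combined = text_to_embed(
    content="Dwarf. Bold and hardy.",
    meta={"category": "races", "speed": 25},
    meta_fields=["category", "speed", "size"],
)
```

Because the metadata text becomes part of the embedded string, a term such as 'race' in any embedded field can make a document look relevant to a query about races.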

In [22]:
# File path variables.
file_path_rag = 'api_data/rag_data.json'
file_path = 'api_data/api_data.json'

In [23]:
# To avoid re-accessing the API and reloading each dictionary, the already structured file is used to restructure the data into the desired RAG format:
with open(file_path, 'r') as f:
    api_info = json.load(f)  # the with-block closes the file automatically

# The category (the key under which each dict was stored) is added to the metadata to make later filtering easier.
def convertToRAGFormat (information):
    expected_docs = []
    for category, dicts in information.items():
        for index, items in dicts.items():
            later_content = []
            later_content.append(items.get('name'))
            if items.get('desc') and items.get('desc') != items.get('name'):
                later_content.append(items.get('desc'))
            if items.get('alignment') and category != 'monsters':
                later_content.append(items.get('alignment'))
            meta_info = {intern_key: intern_value for intern_key, intern_value in items.items() if intern_key != 'desc'}
            each_doc = {
                'content': '. '.join(later_content),
                'meta': {**meta_info,'category': category}
            }
            expected_docs.append(each_doc)
    
    return expected_docs

rag_docs = convertToRAGFormat(api_info)
# ensure_ascii=False keeps characters such as apostrophes unescaped, so they can be filtered later if necessary.
# indent=4 makes the structure of the file visible and easier to read.
with open(file_path_rag, 'w') as fr:
    json.dump(rag_docs, indent=4, ensure_ascii=False, fp=fr)



In [30]:
# Initializing Pipeline parts:
document_store = InMemoryDocumentStore()
document_joiner = DocumentJoiner(join_mode='merge')

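# split_length and split_overlap are counted in the unit given by split_by (here: period-delimited sentences).
document_splitter = DocumentSplitter(split_by="period", split_length=512, split_overlap=50)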

In [31]:
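# The LLM needs a Google API key. It is read from the environment (or asked for once) instead of being hard-coded in the notebook:
from getpass import getpass
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("GOOGLE_API_KEY: ")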

In [32]:
# To store every entry, the file is re-opened and each entry is saved as a Document in the previously constructed format, which structures it into 'content' and 'meta' data.
with open(file_path_rag, 'r') as f:
    dataset = json.load(f)

docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]
print(len(docs)) # In order to check whether some docs have been lost, the length of the docs-list will be printed out.

# To embed every metadata field together with the corresponding content, all metadata keys are collected in a set.
# First an empty set is initialized.
meta_keys = set()
# Then, for every document in the list, its keys are added via update(); keys that are already contained are ignored.
for doc in docs:
    meta_keys.update(doc.meta.keys())
# Finally the set is converted to a list, so that it can be passed to the doc_embedder and all meta fields are included in the embedding too.
meta_keys = list(meta_keys)

# If desired they can be looked at here:
# print(meta_keys)

2018


In [33]:
# Now the entries have to be embedded with an embedder:
doc_embedder = SentenceTransformersDocumentEmbedder(model='multi-qa-distilbert-cos-v1', meta_fields_to_embed=meta_keys)
doc_embedder.warm_up()


In [34]:
# Before embedding the documents and adding them to the document store, they are split into chunks with the document_splitter:
split_docs = document_splitter.run(docs)
docs_w_embeddings = doc_embedder.run(split_docs['documents'])

Batches: 100%|██████████| 64/64 [01:56<00:00,  1.83s/it]


In [10]:
# To check whether all metadata was considered:
embedded_docs = docs_w_embeddings['documents']
# With this, the length of the embeddings can be checked; they are all 768 dimensions long, as described in the official documentation: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
#for doc in embedded_docs:
    #print(len(doc.embedding))
print(embedded_docs)



In [35]:
# All the embedded documents are added to the document store:
document_store.write_documents(docs_w_embeddings["documents"])

2018

# RAG-Pipeline #

Apart from the standard parts (text embedder, LLM, prompt builder, retriever), this pipeline combines a BM25 retriever with a dense retriever to build a hybrid search and boost the results.
The results from both retrievers are joined with the DocumentJoiner and ranked according to their score.
However, even with the hybrid search a filtering mechanism is still needed. Without one, resources cannot be found reliably, because the metadata fields are named dynamically and very similarly.
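
The joining of the two result lists can be pictured with the simplified sketch below: documents found by both retrievers are deduplicated and their weighted scores are added up. The function name, the equal weights and the made-up scores are assumptions, and the scores are assumed to be on a comparable scale; it illustrates the principle rather than the DocumentJoiner's exact behaviour.

```python
def merge_results(bm25_hits, dense_hits, weights=(0.5, 0.5)):
    """Combine two {doc_id: score} mappings into one ranked list.

    A document returned by both retrievers appears once, with the weighted
    sum of its two scores.
    """
    combined = {}
    for weight, hits in zip(weights, (bm25_hits, dense_hits)):
        for doc_id, score in hits.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: "b" is found by both retrievers and therefore rises to the top.
ranking = merge_results({"a": 0.8, "b": 0.4}, {"b": 0.9, "c": 0.6})
ranked_ids = [doc_id for doc_id, _ in ranking]
```

In the pipeline below, the joined list is additionally re-ranked by a cross-encoder before it reaches the prompt builder.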

Queries like 'What races can I play as?' return every document that contains the word 'race' somewhere in its metadata. This is often the case for race-related skills, weapons or classes. To alleviate this effect, the 'category' key derived from api_data.json has been selected to function as a filter.

An improved filtering mechanism would be possible with, for example, a multi-label classifier; however, this would need a lot of training data, which is not available in this context. Another alternative to hard-coding filtering rules would be to let an LLM decide which category or categories the question falls into. As our chosen LLM only provides 5 calls per day (similar to the free plans of other LLMs), we did not integrate this.
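
The effect of a category filter on the 'race' problem described above can be illustrated with a few made-up documents. The helper names and the toy entries are assumptions; the point is only that the filter narrows the set of candidates before any matching happens.

```python
# Made-up documents in the 'content' / 'meta' format used in this notebook.
toy_docs = [
    {"content": "Elf", "meta": {"category": "races"}},
    {"content": "High Elf", "meta": {"category": "subraces", "race": "elf"}},
    {"content": "Favored Enemy", "meta": {"category": "features", "note": "two races of humanoid"}},
    {"content": "Longsword", "meta": {"category": "equipment"}},
]


def mentions(doc, term):
    """True if the term occurs in the content or in any metadata value."""
    searchable = " ".join([doc["content"], *map(str, doc["meta"].values())]).lower()
    return term.lower() in searchable


def filter_by_category(docs, categories):
    """Keep only documents whose category is one of the allowed ones."""
    allowed = set(categories)
    return [doc for doc in docs if doc["meta"].get("category") in allowed]


keyword_hits = [doc["content"] for doc in toy_docs if mentions(doc, "race")]
scoped_hits = [
    doc["content"]
    for doc in filter_by_category(toy_docs, ["races", "subraces"])
    if mentions(doc, "race")
]
```

Without the filter, the feature that merely mentions races is returned as well; with the filter, only the race and subrace entries remain.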

In [36]:
template = [
    ChatMessage.from_user(
        """
You are a D&D expert. Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
    Metadata:
    {% if document.meta %}
        {% for key, value in document.meta.items() %}
            {{ key }}: {{ value }}
        {% endfor %}
    {% endif %}
{% endfor %}


Question: {{question}}
Answer:
"""
    )
]

prompt_builder_hybrid = ChatPromptBuilder(template=template)

ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.


In [37]:
chat_generator_hybrid = GoogleGenAIChatGenerator(model="gemini-2.0-flash")

In [38]:
cross_model = 'cross-encoder/ms-marco-MiniLM-L-6-v2'

text_embedder = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')
text_embedder_retr = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')

embedding_retriever = InMemoryEmbeddingRetriever(document_store)
embedding_retriever_retr = InMemoryEmbeddingRetriever(document_store)

bm25_retriever = InMemoryBM25Retriever(document_store)
bm25_retriever_retr = InMemoryBM25Retriever(document_store)

ranker = SentenceTransformersSimilarityRanker(model=cross_model)
ranker_retr = SentenceTransformersSimilarityRanker(model=cross_model)

document_joiner_retr = DocumentJoiner(join_mode='merge')

In [39]:
# Complete pipeline including the LLM
hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)

# new:
hybrid_retrieval.add_component("prompt_builder", prompt_builder_hybrid)
hybrid_retrieval.add_component("llm", chat_generator_hybrid)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect('bm25_retriever','document_joiner')
hybrid_retrieval.connect('embedding_retriever', 'document_joiner')
hybrid_retrieval.connect("document_joiner", "ranker")

# new:
hybrid_retrieval.connect("ranker", "prompt_builder")
hybrid_retrieval.connect("prompt_builder.prompt", "llm.messages")

<haystack.core.pipeline.pipeline.Pipeline object at 0x00000211822AF3D0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
  - prompt_builder: ChatPromptBuilder
  - llm: GoogleGenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])
  - ranker.documents -> prompt_builder.documents (list[Document])
  - prompt_builder.prompt -> llm.messages (list[ChatMessage])

In [40]:
# To evaluate the retrieval results, the same pipeline is built again without prompt builder and LLM, so that the ranked documents can be accessed directly:
hb_nollm_pipeline = Pipeline()
hb_nollm_pipeline.add_component("text_embedder", text_embedder_retr)
hb_nollm_pipeline.add_component("embedding_retriever", embedding_retriever_retr)
hb_nollm_pipeline.add_component("bm25_retriever", bm25_retriever_retr)
hb_nollm_pipeline.add_component("document_joiner", document_joiner_retr)
hb_nollm_pipeline.add_component("ranker", ranker_retr)

hb_nollm_pipeline.connect("text_embedder", "embedding_retriever")
hb_nollm_pipeline.connect('bm25_retriever','document_joiner')
hb_nollm_pipeline.connect('embedding_retriever', 'document_joiner')
hb_nollm_pipeline.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x00000211822A6820>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])

In [None]:
query = "Want to create a new character and I want to make a hollow one dwarf. So i see in lineage how to add a hollow one and it says if I add this to what my race is I get the traits of the hollow one and my dwarf, but how do you add hollow one to the race? I don't see a button or link. I see in hollow one I can 2 skills but that's it. I tried custom lineage also and didn't see anything there either. Am I missing something?"
result = hb_nollm_pipeline.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "top_k": 20}, "embedding_retriever": {"top_k": 20}, "ranker": {"query": query}}
)
for doc in result['ranker']['documents']:
    print("Content:", doc.content)
    print("Metadata:", doc.meta['category'])
    print('final document score:', doc.score)
    print("----")

In [121]:
query = "what races can i play as?"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query,"top_k": 20},"embedding_retriever":{"top_k": 20}, "ranker": {"query": query}, "prompt_builder":{"question": query}}
)
# hybrid_retrieval.draw("hybrid-retrieval.png")
print(result["llm"]["replies"][0])

Batches: 100%|██████████| 1/1 [00:00<00:00, 27.14it/s]


ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Based on the provided text, you can play as a Lightfoot Halfling or a High Elf.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 3224, 'completion_tokens': 20, 'total_tokens': 3244}})


## Applying filters ##

Another, more efficient way to automatically assign the correct category would be multi-label classification, which could assign two or more fitting labels to a search query. However, as we do not have enough queries and data to train our own classifier, we used a zero-shot classifier to assign category labels to our queries.
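
As a sketch of how classifier output can be turned into a filter: the helper below takes a result with parallel `labels` and `scores` lists, keeps the best-scoring labels above a threshold and builds one filter with the `in` operator. The function name, the threshold parameter and the example scores are assumptions for illustration and differ slightly from the fixed top-3 approach used in the next cell.

```python
def labels_to_filter(result, top_n=3, min_score=0.0):
    """Build a metadata filter from a zero-shot classification result."""
    ranked = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Category names use underscores, so spaces in labels are normalised.
    chosen = [
        label.replace(" ", "_")
        for label, score in ranked[:top_n]
        if score >= min_score
    ]
    return {"field": "meta.category", "operator": "in", "value": chosen}


# Made-up classifier output for a question about playable races:
example_result = {
    "labels": ["races", "subraces", "traits", "spells"],
    "scores": [0.93, 0.81, 0.42, 0.05],
}
category_filter = labels_to_filter(example_result, top_n=3, min_score=0.5)
```

A score threshold avoids forcing three categories onto a query that clearly belongs to only one or two.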

In [41]:
# To improve the results, zero-shot classification is added, so that the filters can be applied accordingly.
# The multi_label parameter is set to True, so that no information is lost when a question covers a broader context.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli')
# Example question.
question = "what races can i be?"
# The candidate labels consist of the metadata categories that are to be searched after classification.
candidate_labels = ["rules", "rule_sections", "races", "subraces", "classes", "subclasses", "skills", "feats", "languages", "ability_scores", "traits", "proficiencies", "features", "example_character_background", "conditions", "equipment", "equipment_categories","weapon_properties","magic_items","magic_schools","damage_types","spells","monsters"]

def apply_filters(query):
    # The classification model is very big; pass device=0 to pipeline(...) above to use your GPU (if your device supports CUDA).
    # As stated in the documentation, labels and scores are returned in descending order of score,
    # so the first 3 labels can simply be taken as filter values.
    res = classifier(query, candidate_labels, multi_label=True)
    # Spaces are replaced so that the labels match the underscore spelling of meta.category.
    labels = [label.replace(" ", "_") for label in res['labels'][:3]]

    applied_filter = {
        "operator": "OR",
        "conditions": [{"field": "meta.category", "operator": "==", "value": label} for label in labels],
    }
    print(applied_filter)
    return applied_filter

Device set to use cpu


In [None]:
query = "what races can i be?"
filters = apply_filters(query)  # classify once and reuse the filter for both retrievers
result = hb_nollm_pipeline.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "filters": filters, "top_k": 25}, "embedding_retriever": {"filters": filters, "top_k": 25}, "ranker": {"query": query}}
)
for doc in result['ranker']['documents']:
    print("Content:", doc.content)
    print("Metadata:", doc.meta['category'])
    print('final document score:', doc.score)
    print("----")

Batches: 100%|██████████| 1/1 [00:00<00:00, 22.00it/s]


Content: Favored Enemy (2 types). Beginning at 1st level  you have significant experience studying  tracking  hunting  and even talking to a certain type of enemy. Choose a type of favored enemy: aberrations  beasts  celestials  constructs  dragons  elementals  fey  fiends  giants  monstrosities  oozes  plants  or undead. Alternatively  you can select two races of humanoid such as gnolls and orcs as favored enemies. You have advantage on Wisdom Survival checks to track your favored enemies  as well as on Intelligence checks to recall information about them. When you gain this feature  you also learn one language of your choice that is spoken by your favored enemies  if they speak one at all. You choose one additional favored enemy  as well as an associated language  at 6th and 14th level. As you gain levels  your choices should reflect the types of monsters you have encountered on your adventures.
Metadata: features
final document score: 0.04323430359363556
----
Content: Favored Enemy 

In [None]:
query = "what races can i play as?"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query,"top_k": 20},"embedding_retriever":{"top_k": 20}, "ranker": {"query": query}, "prompt_builder":{"question": query}}
)
# hybrid_retrieval.draw("hybrid-retrieval.png")
print(result["llm"]["replies"][0])

## Evaluation ## 

As the API used for this RAG QA pipeline is from 2014 and the game has since seen changes, including new releases and alterations, we needed real queries whose content can be answered with the API data.
Some were taken from https://www.dndbeyond.com/?msockid=201f823801c3644c068896b30048651a

The queries were all annotated by hand, and the original queries were saved in the 'original_questions.jsonl' file. Because the queries contained a lot of chatter and poorly formulated questions, they were shortened and, if needed, reformulated in the 'queries.jsonl' file.

We only included queries that:
- were understandable to us,
- were formulated as questions and did not contain too much additional context,
- did not concern homebrew content,
- addressed content that is contained in the used API,
- asked about a subject (we excluded every forum post that contained polls or gathered ideas on character building or similar).

In total we annotated 50 queries. Sometimes the question was taken from the forum post's title and sometimes from its text, as some users wrote their question only in the title.
When looking at the 'original_questions.jsonl' file, it is noticeable that some index numbers have been skipped. This is the case when an original question was divided into multiple questions.

The zero-shot classifier, which selects the categories to be searched for each query, also has to be evaluated.
To evaluate the retrieval, the ground truth needs to be established. The task here is to collect not only the relevant documents but also the relevant categories for each query that is to be tested.

To establish the ground truth for our documents, we largely followed the procedure described by Phong Cau and Geisa Faustino in their blog post
"Efficient Ground Truth Generation for Search Evaluation" (30.05.2025. Online at: https://devblogs.microsoft.com/ise/efficient-ground-truth-generation-search-evaluation/. Last accessed: 10.09.2025).
To establish a ground truth efficiently, they passed their queries through a hybrid search approach of text- and vector-based retrievers and retrieved the top 100 results for each query.
The results were joined and, before being labeled manually, passed to an LLM that re-evaluated their relevance.
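
The pooling idea behind this procedure can be sketched in a few lines: the top results of both retrievers are cut at a fixed depth and merged into one deduplicated candidate list that is then judged. The function name and the toy ids are assumptions for illustration; the depth of 100 is the value mentioned above.

```python
def pool_candidates(bm25_ranking, dense_ranking, depth=100):
    """Merge the top `depth` ids of both rankings, keeping first occurrences."""
    pooled = []
    seen = set()
    for ranking in (bm25_ranking[:depth], dense_ranking[:depth]):
        for doc_id in ranking:
            if doc_id not in seen:
                seen.add(doc_id)
                pooled.append(doc_id)
    return pooled


# Toy example with a depth of 2 instead of 100:
pool = pool_candidates(["d1", "d2", "d3"], ["d3", "d4", "d1"], depth=2)
```

Only documents in this pool receive a relevance judgement, so documents that neither retriever ranks within the depth stay unjudged.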

As we also use a hybrid pipeline, we follow the same steps, except for the LLM step (due to the limited number of calls), and evaluate the relevance of our documents to the queries.
To access the retrieval results of our hybrid pipeline, we use a separate pipeline that does not include an LLM. Just like Cao and Faustino, we use the top 100 results from both retrievers, merge them with
our DocumentJoiner and evaluate relevance manually through user input.
In addition to the document evaluation, each query is assigned 4 categories, which serve as the ground truth for evaluating our zero-shot classifier.
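
One possible way to compare the classifier's categories with the annotated ones is a simple set overlap per query, sketched below. The text does not fix a metric for this comparison, so the precision/recall choice, the function name and the example labels are assumptions.

```python
def category_overlap(predicted, annotated):
    """Precision and recall of predicted categories against annotated ones."""
    predicted_set, annotated_set = set(predicted), set(annotated)
    hits = predicted_set & annotated_set
    precision = len(hits) / len(predicted_set) if predicted_set else 0.0
    recall = len(hits) / len(annotated_set) if annotated_set else 0.0
    return {"precision": precision, "recall": recall}


# Three predicted labels against four annotated ones (made-up example):
overlap = category_overlap(
    predicted=["spells", "classes", "monsters"],
    annotated=["spells", "classes", "features", "rules"],
)
```

Averaging these values over all queries would give one summary number per metric for the classifier.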

Following metrics will be used:
For the hybrid retrieval:
- Recall@k
- MRR 

For the zero-shot re-ranker (Crossencoder):
- nDCG
- MRR

For the LLM:
- Human evaluation
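
The ranking metrics listed above can be sketched as follows. All functions work on the graded relevance scores (0, 1, 2) of the documents in ranked order; treating every score above 0 as relevant, computing recall only over judged documents and using the exponential gain in nDCG are assumptions of this sketch, as are the function names.

```python
import math


def recall_at_k(y_true, k):
    """Share of all relevant judged documents that appear in the top k."""
    total_relevant = sum(1 for rel in y_true if rel > 0)
    if total_relevant == 0:
        return 0.0
    return sum(1 for rel in y_true[:k] if rel > 0) / total_relevant


def reciprocal_rank(y_true):
    """1 / rank of the first relevant document, 0.0 if there is none."""
    for rank, rel in enumerate(y_true, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """MRR: the reciprocal rank averaged over several queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings) if rankings else 0.0


def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg_at_k(y_true, y_scores, k):
    """DCG of the ranking induced by y_scores, normalised by the ideal DCG."""
    order = sorted(range(len(y_true)), key=lambda i: y_scores[i], reverse=True)
    ranked = [y_true[i] for i in order][:k]
    ideal_dcg = dcg(sorted(y_true, reverse=True)[:k])
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k([2, 1, 0], [0.9, 0.5, 0.1], k=3)
reversed_order = ndcg_at_k([0, 1, 2], [0.9, 0.5, 0.1], k=3)
```

A ranking that places the highly relevant document first reaches an nDCG of 1, while the reversed ranking scores lower, which is what makes nDCG suitable for graded judgements.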

In [46]:
# Creates an empty dictionary that stores each query id as key and the query text as value:
query_dict = {}
# Loading all queries into the dictionary:
with open('groundtruth/queries.jsonl', 'r') as json_files:
    for line in json_files:  # reading each line of the file separately
        i_d = json.loads(line)  # parsing each line as a JSON object
        query_dict[i_d['_id']] = i_d['text']  # each query is then added to the dictionary

print(query_dict)

{'id01': 'Can i further my range of the teleportation spell with familiar or clairvoyance?', 'id02': 'Assuming all spell slots are used, how many zombies or skeletons could a necromancer raise with the spell Animate Dead?', 'id03': 'What could be done with multiclassing a necromancer and a sorcerer?', 'id04': 'I have a doubt with the loading property, can I use 2 attacks in a round if I have 3 actions?', 'id05': 'Is it possible to twin a spell (like invisibility) from the ring of storing?', 'id06': 'When I use Divine Smite as a Paladin I expend spell slots, so is Divine Smite then considered a spell?', 'id07': 'When a character has regenerate and his health drops to 0, does the generation stops?', 'id08': 'How do you handle darkness in melee combat?', 'id09': 'Can i, as the DM, use the move Petrifying Bite of the Basilisk on every turn?', 'id10': 'What does a monsters challenge rating (CR) mean?', 'id11': 'Does proficiency matter when choosing weapons for two weapon fighting?', 'id12':

In [None]:
# As some of the docs might be only partially relevant, relevance is annotated on a graded scale:
from IPython.display import clear_output
groundtruth_dict = {}

groundtruth = 'groundtruth/groundtruth_docs.json'

def annotate_relevance(queries):
    for _id, text in queries.items():
        groundtruth_dict[_id] = {}
        relevant_docs = {}
        print(candidate_labels)
        print('Question: ' + text)
        cats = input("Which 4 of the candidate labels above fit the query best (comma-separated)? : ")
        # The answer is split at the commas and surrounding whitespace is removed.
        groundtruth_dict[_id]['categories'] = [label.strip() for label in cats.split(',')]
        print(groundtruth_dict[_id]['categories'])
        result = hb_nollm_pipeline.run(
            {"text_embedder": {"text": text}, "bm25_retriever": {"query": text, "top_k": 100}, "embedding_retriever": {"top_k": 100}, "ranker": {"query": text, "top_k": 100}})

        for index, doc in enumerate(result['ranker']['documents']):
            print("----", index, "----")
            print("Content:", doc.content, flush=True)
            print("Metadata:", doc.meta['category'], flush=True)
            print('final document score:', doc.score)
            print("----")
            relevance = input('What score would you give this document (0 = irrelevant, 1 = partially relevant, 2 = highly relevant): ')
            relevant_docs[doc.id] = {'relevance_score': int(relevance)}
            print("Document with id:", doc.id, "was added to the groundtruth with the following score:", relevance)
        groundtruth_dict[_id]['documents'] = relevant_docs
        print('------ Next query -----')
        clear_output(wait=True)
    # The annotations are written to disk once all queries have been processed.
    with open(groundtruth, 'w') as fr:
        json.dump(groundtruth_dict, indent=4, ensure_ascii=False, fp=fr)

annotate_relevance(query_dict)

In [None]:
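def evaluate_classifier(model, groundtruth_dict, query_dict, top_k=10):
    # Scores the annotated documents of every query with the cross-encoder `model` and compares
    # the resulting ranking with the manual relevance scores.
    # Assumptions: `model.predict` accepts a list of (query, text) pairs (as a sentence-transformers
    # CrossEncoder does), and recall_at_k / ndcg_at_k are helper functions defined elsewhere.
    id_to_text = {doc.id: doc.content for doc in document_store.filter_documents()}
    results = {}
    for _id, query in query_dict.items():
        judged = groundtruth_dict[_id]['documents']
        doc_ids = [doc_id for doc_id in judged if doc_id in id_to_text]

        # Score with cross-encoder
        scores = model.predict([(query, id_to_text[doc_id]) for doc_id in doc_ids])
        ranked = sorted(zip(scores, doc_ids), key=lambda pair: pair[0], reverse=True)

        y_true = [judged[doc_id]["relevance_score"] for _, doc_id in ranked]
        y_scores = [float(score) for score, _ in ranked]

        recall = recall_at_k(y_true, k=top_k)
        ndcg = ndcg_at_k(y_true, y_scores, k=top_k)

        results[_id] = {"recall@k": recall, "ndcg@k": ndcg}
    return results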