# A Retrieval-Augmented QA System for Dungeons & Dragons

## 0. Installation

The following Python packages and libraries are required to execute the code.

In [None]:
%pip install -U datasets huggingface_hub fsspec
%pip -m spacy download en_core_web_sm
%pip install haystack-ai
%pip install google-genai-haystack
%pip install 'sentence-transformers>=4.1.0'
%pip install 'fsspec==2023.9.2'
%pip install 'sentence-transformers>=4.1.0" "huggingface_hub>=0.23.0'
%pip install transformers[torch,sentencepiece]
%pip install huggingface_hub[hf_xet]
%pip install evaluate

import pprint
import json
import os
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever, InMemoryBM25Retriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator
from haystack.utils import Secret
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.joiners import DocumentJoiner
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.rankers import SentenceTransformersSimilarityRanker 
import evaluate
from sklearn.metrics import f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer
import math
import numpy as np

Collecting graphviz
  Downloading graphviz-0.21-py3-none-any.whl.metadata (12 kB)
Downloading graphviz-0.21-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.21
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


## 1. Data Crawling

The first step for creating the Question-Answering (QA) system was to pull the required information about *Dungeons & Dragons* (D&D) from the *D&D 5e SRD API* which is an Application Programming Interface (API) including all Systems Reference Document (SRD) data (which means it is open source) from the 5th Edition published in 2014 (https://www.dnd5eapi.co/, last accessed on 09/19/2025).
The according code is located in the `file_construction.ipynb` file and has been separated from the `main.ipynb` file because the JavaScript Object Notation (JSON) file including the API data should be created once and not be overwritten. Reasons for this are that the API and its structure might change over time which may lead to the code below not functioning properly as it is based on the originally created JSON file.

Once the `api_data.json` is created, it has to be restructured into a new JSON file that is used to retrieve information later on. In order to make the important information fields of `name` and `desc` the information source and the additional fields the `meta_data`, the new file needs to follow this format:
```python
dict: {
    'content': 'desc',
    'meta_data': every other field containig information
}
```
A new `meta_data` field called `category` is also added to improve response filtering later on. It is based on the key given to each dictionary entry in the original JSON file. Next to the content, almost all meta information will be included in order to make that information accessible to the retrievers as well. Meta data filters do not provide information, they only set the scope for the searched documents. This is why the search task becomes difficult in some cases (e.g., 'race').

These are the file paths for both the original JSON file `api_data.json` containing the structured API data and the JSON file `rag_data.json` into which the restructured data will be stored.

In [None]:
file_path = 'api_data/api_data.json'
file_path_rag = 'api_data/rag_data.json'

In order to not reaccess the API and reload each dictionary, the JSON file that has already been created in the `file_construction.ipynb` file is used to restructure the data into the desired format that is compatible with retrieval-augmented generation (RAG).

In [None]:
with open(file_path, 'r') as f:
    api_info = json.load(f)

For this, the function `convert_to_rag_format` iterates through all categories and their entries in the loaded dictionary from the original JSON file. For each entry, a content string is built that combines fields such as `name`, `desc`, and, where applicable, `alignment`. 

The meta data for the entry is preserved by copying all fields, except the description, and adding the additional keys of `category` and `document_id` as a unique identifier derived from the category and name. 

Each processed entry is stored as a dictionary with the two fields of `content`, which is the searchable textual information relevant for retrieval, and `meta`, which are the structured meta data fields and identifiers for filtering. 

All entries are collected into a list, which is then saved as a JSON file with identation for readability and preservation of non-ASCII (American Standard Code for Information Interchange) characters so they are written correctly without being escaped. This format ensures that the data is directly usable in a RAG pipeline.

In [None]:
def convert_to_rag_format (information):
    expected_docs = []

    for category, dicts in information.items():
        for index, items in dicts.items():
            later_content = [items.get('name')]
            if items.get('desc') and items.get('desc') != items.get('name'):
                later_content.append(items.get('desc'))
            if items.get('alignment') and category != 'monsters':
                later_content.append(items.get('alignment'))
            
            meta_info = {intern_key: intern_value for intern_key, intern_value in items.items() if intern_key != 'desc'}

            document_id = (category + " " + items.get('name').lower()).replace(' ','_').replace(' ','_')

            each_doc = {
                'content': '. '.join(later_content),
                'meta': {**meta_info, 'category': category, 'document_id': document_id}
            }
            expected_docs.append(each_doc)
    
    return expected_docs

rag_docs = convert_to_rag_format(api_info)

with open(file_path_rag, 'w') as fr:
    json.dump(
        rag_docs, 
        indent=4, 
        ensure_ascii=False, 
        fp=fr
    )

## 2. RAG Pipeline

The sentence (and later retrieval) transformer used is `multi-qa-distilbert-cos-v1`. According to the website, it is a sentence-transformers model which "maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search" and "[i]t has been trained on 215M (question, answer) pairs from diverse sources" (https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1, last accessed on 09/19/2025). It is also mentioned that the model has a word limit of 512 word pieces for input texts, the documents will be split in accordingly sized token chunks with a token overlap of 50 before writing them into the document store. Additionally, "the model was trained with MultipleNegativesRankingLoss using Mean-pooling, cosine-similarity as similarity function, and a scale of 20" in order to aquire their scorings.

To be able to use the Large Language Model (LLM), the following Google API key is used.

In [None]:
os.environ["GOOGLE_API_KEY"] = 'AIzaSyD3Bb1km908nqdn39vE_0RT-hhWHFtcOJ4'

The next step initalizes the main building blocks of the retrieval pipeline. Firstly, an `InMemoryDocumentStore` is created to temporarily hold all processed documents in memory. Secondly, a `DocumentJoiner` with `join_mode='merge'` is set up to combine related documents into a single entry when needed. And thirdly, a `DocumentSplitter` is defined to break longer texts into smaller segments based on sentence boundaries (`split_by="period"`). Each segment is limited to a maximum length of 512 tokens with an overlap of 50 tokens between segments to preserve context across splits. This aligns with the word limit of the `multi-qa-distilbert-cos-v1` model.

In [None]:
document_store = InMemoryDocumentStore()
document_joiner = DocumentJoiner(join_mode='merge')
document_splitter = DocumentSplitter(split_by="period", split_length=512, split_overlap=50)

Now the RAG-formatted JSON file created earlier is re-opened to create a list of documents. For this, each entry is reconstructed into a `Document` object with a unique `id` from `document_id`, a `content` field containing the text, and a `meta` field holding all associated meta data. These documents are collected into the list `docs`.

In [None]:
with open(file_path_rag, 'r') as f:
    dataset = json.load(f)

docs = [Document(id=doc["meta"]["document_id"],content=doc["content"], meta=doc["meta"]) for doc in dataset]

Next, all meta data keys across the documents are collected to be able to use every single meta data key entry that is supposed to be embedded with the corresponding content. 

An empty set `meta_keys` is initialized, and for each document in the `docs` list the meta data keys are added to the set using `update()`. The key `document_id` is removed since it is only used internally for unique identification and not intended for embedding. Finally, the set is converted into a list so that it can be passed to the document embedder, ensuring that all relevant meta data fields are included alongside the text content during embedding. The meta data keys can be printed if desired.

In [None]:
meta_keys = set()

for doc in docs:
    meta_keys.update(doc.meta.keys())
    meta_keys.remove('document_id')

meta_keys = list(meta_keys)

print(meta_keys)

As a first step in order to embed the entries with an embedder, the `SentenceTransformersDocumentEmbedder` is initialized with the pretrained model `multi-qa-distilbert-cos-v1`. The parameter `meta_fields_to_embed` ensures that metadata fields collected earlier are embedded together with the main content, so that both text and metadata contribute to semantic similarity. The `warm_up()` method is called once to pre-load the model into memory and reduce latency for subsequent embedding operations.

In [None]:
doc_embedder = SentenceTransformersDocumentEmbedder(model='multi-qa-distilbert-cos-v1', meta_fields_to_embed=meta_keys)
doc_embedder.warm_up()

Before embedding, the documents are split into smaller chunks using the previously defined `document_splitter`. This ensures that long texts are divided into manageable segments, preventing input length issues and improving retrieval quality. The split documents are then passed into the embedder with `doc_embedder.run()`, producing vector embeddings that represent both their content and selected meta data.

In [None]:
split_docs = document_splitter.run(docs)
docs_w_embeddings = doc_embedder.run(split_docs['documents'])

Batches: 100%|██████████| 64/64 [02:56<00:00,  2.76s/it]


The resulting documents with embeddings are accessed through `docs_w_embeddings['documents']`. 

At this stage, it is possible to verify that all meta data was correctly preserved and that embeddings were generated with the expected dimensionality. In this case, each embedding vector has a length of 768, as specified in the official documentation for the `multi-qa-distilbert-cos-v1` model.

In [None]:
embedded_docs = docs_w_embeddings['documents']

for doc in embedded_docs:
    print(len(doc.embedding))

print(embedded_docs)



Finally, all the embedded documents are written into the previously defined `document_store` using `write_documents()`. This step makes them available for retrieval in the pipeline, enabling semantic search and other RAG operations based on both the content and metadata of the embedded documents.

In [None]:
document_store.write_documents(docs_w_embeddings["documents"])

Here, a prompt template is created to structure how the RAG pipeline communicates with the LLM. The template defines a conversational message from the user role, instructing the model to act as a *Dungeons & Dragons* expert. The template inserts both the retrieved context documents and the user's question into a structured prompt: For each document, its content is added, all its available meta data is iterated over and displayed as key-value pairs, and the user's original question is appended, followed by the placeholder for the answer. This ensures the model has both the relevant textual content and the associated meta data when formulating a response. 

In [None]:
template = [
    ChatMessage.from_user(
        """
        You are a D&D expert. Given the following information, answer the question.

        Context:
        {% for document in documents %}
            {{ document.content }}
            Metadata:
            {% if document.meta %}
                {% for key, value in document.meta.items() %}
                    {{ key }}: {{ value }}
                {% endfor %}
            {% endif %}
        {% endfor %}


        Question: {{question}}
        Answer:
        """
    )
]

The template is then passed into a `ChatPromptBuilder`, which transforms it into a dynamic prompt generation tool. When executed, it automatically fills in the retrieved documents and the user's query, producing a fully formatted input for the LLM.

In [None]:
prompt_builder_hybrid = ChatPromptBuilder(
    template=template,
    required_variables=["documents", "question"]
)

This step connects the RAG pipeline to the LLM, enabling it to generate natural language answers from the embedded and retrieved documents. For this, a chat-based language model generator using the `GoogleGenAIChatGenerator` is initialized. For generating responses, the pipeline uses Google's `gemini-2.0-flash` model. The resulting `chat_generator_hybrid` object can then be used to generate answers in a conversational format, taking structured prompts (like the ones created with `ChatPromptBuilder`) and producing text outputs based on the given context and user query.

In [11]:
chat_generator_hybrid = GoogleGenAIChatGenerator(model="gemini-2.0-flash")

The RAG pipeline extends the standard components (text embedder, LLM, prompt builder, retriever) with a BM25 retriever for key-word based search and a dense retriever for semantic search. This results in a hybrid search that improves retrieval performance. The results from both retrievers are merged using the `DocumentJoiner` and then ranked according to their scores.

To implement this, multiple specialized components are initialized:

The cross-encoder ranker `cross-encoder/ms-marco-MiniLM-L-6-v2` is used to rerank candidate documents based on semantic similarity, improving the quality of results from the hybrid retrieval.

In [None]:
cross_model = 'cross-encoder/ms-marco-MiniLM-L-6-v2'

ranker = SentenceTransformersSimilarityRanker(model=cross_model)
ranker_retrieval = SentenceTransformersSimilarityRanker(model=cross_model)

The `SentenceTransformersTextEmbedder` models (`multi-qa-distilbert-cos-v1`) are used to generate dense embeddings for documents with separate instances for the main pipeline, the retriever branch, and a branch without ranking.

In [None]:
text_embedder = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')
text_embedder_retrieval = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')
text_embedder_without_ranking = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')

The `InMemoryEmbeddingRetriever` models wrap the document store to perform similarity-based retrieval on the dense embeddings. Again, separate instances correspond to different branches of the pipeline.

In [None]:
embedding_retriever = InMemoryEmbeddingRetriever(document_store)
embedding_retriever_retrieval = InMemoryEmbeddingRetriever(document_store)
embedding_retriever_without_ranking = InMemoryEmbeddingRetriever(document_store)

The `InMemoryBM25Retriever` models are used for keyword-based retrieval in each branch.

In [None]:
bm25_retriever = InMemoryBM25Retriever(document_store)
bm25_retriever_retrieval = InMemoryBM25Retriever(document_store)
bm25_retriever_without_ranking = InMemoryBM25Retriever(document_store)

The `DocumentJoiner` objects with `join_mode='merge'` are used to combine documents from different retrieval paths, ensuring that content from multiple sources is aggregated before being sent to the LLM.

In [None]:
document_joiner_retrieval = DocumentJoiner(join_mode='merge')
document_joiner_without_ranking = DocumentJoiner(join_mode='merge')

With this setup, the pipeline is able to flexibly combine dense retrieval, which uses embeddings generated by neural networks like `multi-qa-distilbert-cos-v1` to capture semantic meaning, and sparse retrieval, which uses keyword-based search like BM25. It then ranks the results effectively and prepares them for downstream processing, which includes merging retrieved documents, embedding meta data, and formatting prompts for the LLM so that the model can generate accurate and contextually relevant responses.

Next, the RAG pipeline including the LLM is completed. In order to do this, several components are added to the hybrid retrieval pipeline:

The `text_embedder` generates embeddings for semantic search, the `embedding_retriever` performs dense similarity search using embeddings, the `bm25_retriever` performs keyword-based retrieval, the `document_joiner` merges documents from both retrievers, the `ranker` reranks merged documents by semantic relevance, the `prompt_builder` formats the retrieved and ranked documents along with the user query into a structured prompt, and the `llm` generates the final answer using the structured prompt.

Afterwards, connections between the components are created to define the flow of data:

The `text_embedder` generates embeddings for the `embedding_retriever`, the dense (`text_embedder`) and sparse (`bm25_retriever`) retrieval feed into the `document_joiner` so it can merge the documents from both retrievers, the `document_joiner` passes the merged documents to the `ranker` so it is able to rerank them, the `ranker` gives these documents to the `prompt_builder` which formats them and produces a structured prompt, and finally the `prompt_builder` passes the prompt to the `llm` which generates the answer.

In [None]:
hybrid_retrieval = Pipeline()

hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)
hybrid_retrieval.add_component("prompt_builder", prompt_builder_hybrid)
hybrid_retrieval.add_component("llm", chat_generator_hybrid)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect('bm25_retriever','document_joiner')
hybrid_retrieval.connect('embedding_retriever', 'document_joiner')
hybrid_retrieval.connect("document_joiner", "ranker")
hybrid_retrieval.connect("ranker", "prompt_builder")
hybrid_retrieval.connect("prompt_builder.prompt", "llm.messages")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000020C868CE220>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
  - prompt_builder: ChatPromptBuilder
  - llm: GoogleGenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])
  - ranker.documents -> prompt_builder.documents (list[Document])
  - prompt_builder.prompt -> llm.messages (list[ChatMessage])

Now another pipeline identical to the hybrid one but excluding the LLM is constructed. 

In [None]:
# In order to evaluate the results from retrieval, the exact same pipeline was built in order to 
# acces the ranked results before the LLM tries to build a prompt:
hb_nollm_pipeline = Pipeline()
hb_nollm_pipeline.add_component("text_embedder", text_embedder_retrieval)
hb_nollm_pipeline.add_component("embedding_retriever", embedding_retriever_retrieval)
hb_nollm_pipeline.add_component("bm25_retriever", bm25_retriever_retrieval)
hb_nollm_pipeline.add_component("document_joiner", document_joiner_retrieval)
hb_nollm_pipeline.add_component("ranker", ranker_retrieval)

hb_nollm_pipeline.connect("text_embedder", "embedding_retriever")
hb_nollm_pipeline.connect('bm25_retriever','document_joiner')
hb_nollm_pipeline.connect('embedding_retriever', 'document_joiner')
hb_nollm_pipeline.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000020C868CE9D0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])

In [None]:
# In order to evaluate the results from retrieval, the exact same pipeline was built in order to acces the ranked results before the LLM tries to build a prompt:
# For groundtruth:
hb_noranker_pipeline = Pipeline()
hb_noranker_pipeline.add_component("text_embedder", text_embedder_without_ranking)
hb_noranker_pipeline.add_component("embedding_retriever", embedding_retriever_without_ranking)
hb_noranker_pipeline.add_component("bm25_retriever", bm25_retriever_without_ranking)
hb_noranker_pipeline.add_component("document_joiner", document_joiner_without_ranking)

hb_noranker_pipeline.connect("text_embedder", "embedding_retriever")
hb_noranker_pipeline.connect('bm25_retriever','document_joiner')
hb_noranker_pipeline.connect('embedding_retriever', 'document_joiner')


<haystack.core.pipeline.pipeline.Pipeline object at 0x0000020C868CE7F0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])

In [None]:
query = "Want to create a new character and I want to make a hollow one dwarf. So i see in lineage how to add a hollow one and it says if I add this to what my race is I get the traits of the hollow one and my dwarf, but how do you add hollow one to the race? I don't see a button or link. I see in hollow one I can 2 skills but that's it. I tried custom lineage also and didn't see anything there either. Am I missing something?"
result = hb_nollm_pipeline.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "top_k": 20}, "embedding_retriever": {"top_k": 20}, "ranker": {"query": query}}
)
for doc in result['ranker']['documents']:
    print("Content:", doc.content)
    print("Metadata:", doc.meta['category'])
    print('final document score:', doc.score)
    print("----")

In [None]:
query = "what races can i play as?"
result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query,"top_k": 20},"embedding_retriever":{"top_k": 20}, "ranker": {"query": query}, "prompt_builder":{"question": query}}
)
print(result["llm"]["replies"][0])

Batches: 100%|██████████| 1/1 [00:00<00:00, 27.14it/s]


ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Based on the provided text, you can play as a Lightfoot Halfling or a High Elf.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 3224, 'completion_tokens': 20, 'total_tokens': 3244}})


Even with hybrid search, however, a filtering mechanism is still required. Without it, the retrieval process is unreliable due to the dynamic and homogenous naming of meta data fields. For example, a query such as "What races can I play?" would return nearly every document containing the word "race" in its meta data, even if it only refers to race-related skills, weapons, or classes. To address this, the key `category` from `api_data.json` is used as a filter to narrow down results to the most relevant entries.

A more advanced filtering mechanism could be achieved using a multi-label classifier, but this would require a large amount of labeled training data, which is not available in this context. Another alternative would be to let an LLM dynamically assign categories to a query instead of relying on hard-coded filtering rules. But since the chosen LLM is restricted to only five calls per day (similar to other free-tier LLM plans), this option was not integrated.

## Applying filters ##

Another more efficient way to automatically assign the correct category would be the using multi-label classification. This could assign 2 or more fitting labels to search queries. However as we don't have enough queries and data to train our own classifier, we used a zero-shot classifier in order to assign the correct category labels to our queries.

As we have no available training data, we are again using a zero-shot model to automatically assign a probability for each of the candidate labels, from whom the highest 4 predictions are taken and constructed into filters.
We oriented ourselves on the documentation of usage of this model on the official huggingface page (https://huggingface.co/facebook/bart-large-mnli) when using the model. 

This zero-shot classifier uses Natural Language Inference (NLI) in order to assign probabilities to the different candidate labels (1).
The query gets passed to the model as a premise and each one of the candidate labels can be formulated into an according hypotheses that also get passed to the model (1).
As explained by Davison, for each of the candidate labels the model determines the probabiity that the current candidate label entails the premise (our query) (1).
These probabilities are returned to us, and the 4 highest-scoring labels are returned as filters that we pass to our RAG.

(1) Davison, Joe (2020): Zero-shot Learning in Modern NLP. State-of-the-art NLP models for text classification without annotated data. https://joeddav.github.io/blog/2020/05/29/ZSL.html. (last accessed: 11.09.2025).
(The blog is also accessbile through the model page on huggingface!)

In [16]:
# In order to improve the results, zero-shot classification was added, so that the filters could be applied accordingly:
# The multi_label-parameter is set to true, so that no information is lost in case some questions cover more broad context.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli')
# Example question.
question = "what races can i be?"
# The candidate labels consist of of meta-data-categories, that are supposed to be searched after classification.
candidate_labels = ["rules", "rule_sections", "races", "subraces", "classes", "subclasses", "skills", "feats", "languages", "ability_scores", "traits", "proficiencies", "features", "example_character_background", "conditions", "equipment", "equipment_categories","weapon_properties","magic_items","magic_schools","damage_types","spells","monsters"]

def apply_filters(query):
    # The model we use for the classification is very big, so use device = 0 below to use your GPU (if your device supports CUDA)
    # As seen on the documentation, the labels and scores are listed in a descending order, so simply the first 3 items can be extracted a set as filter variables:
    res = classifier(query, candidate_labels, multi_label = True)
    four_highest_scores = []
    labels = []
    high_score_dict = {}

    for label in res['labels'][:3]:
        for score in res['scores'][:3]:
            four_highest_scores.append(score)
            high_score_dict[label] = score
        if " " in label:
            completed_label = str(label)
            new_label = completed_label.replace(" ","_")
            labels.append(new_label)
        else:
            labels.append(label)
    
    applied_filter = {"operator": "OR", "conditions": [{"field": "meta.category", "operator": "==", "value": labels[0]},{"field": "meta.category", "operator": "==", "value": labels[1]},{"field": "meta.category", "operator": "==", "value": labels[2]}, {"field": "meta.category", "operator": "==", "value": "rule_sections"}]}
    print(applied_filter)
    return applied_filter

Device set to use cpu


In [50]:
query = "what races can i be?"
result = hb_nollm_pipeline.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query,"filters": apply_filters(query), "top_k": 25}, "embedding_retriever": {"filters":apply_filters(query),"top_k": 25}, "ranker": {"query": query}}
)
for doc in result['ranker']['documents']:
    print("Content:", doc.content)
    print("Metadata:", doc.meta['category'])
    print('final document score:', doc.score)
    print("----")

Batches: 100%|██████████| 1/1 [00:00<00:00, 22.00it/s]


Content: Favored Enemy (2 types). Beginning at 1st level  you have significant experience studying  tracking  hunting  and even talking to a certain type of enemy. Choose a type of favored enemy: aberrations  beasts  celestials  constructs  dragons  elementals  fey  fiends  giants  monstrosities  oozes  plants  or undead. Alternatively  you can select two races of humanoid such as gnolls and orcs as favored enemies. You have advantage on Wisdom Survival checks to track your favored enemies  as well as on Intelligence checks to recall information about them. When you gain this feature  you also learn one language of your choice that is spoken by your favored enemies  if they speak one at all. You choose one additional favored enemy  as well as an associated language  at 6th and 14th level. As you gain levels  your choices should reflect the types of monsters you have encountered on your adventures.
Metadata: features
final document score: 0.04323430359363556
----
Content: Favored Enemy 

In [None]:
query = "what races can i play as?"

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query,"top_k": 20},"embedding_retriever":{"top_k": 20}, "ranker": {"query": query}, "prompt_builder":{"question": query}}
)
# hybrid_retrieval.draw("hybrid-retrieval.png")
print(result["llm"]["replies"][0])

## Evaluation ## 

As the API used for this RAG QA pipeline is from 2014 and there have been some changes including new releases in the game and altercations, we worked with the API and therefore need to use queries that were real and similar in content. 
Some were from https://www.dndbeyond.com/?msockid=201f823801c3644c068896b30048651a

The queries were all annotated by hand and original queries saved in the 'original_questions.jsonl'-file. Due to the queries containing a lot of chatter and misformulted questions, they were shortened and reformulated (if needed) in the 'queries.jsonl'-file. 

We only included queries that:
- were understandable to us,
- formulated as questions and did not contain too much additional context, 
- not concerning homwbrew content,
- addressing content that is contained in the used API,
- asked about a subject (we excluded every forum post that contained polls and the gathering of ideas on character building or similar)

In total we annotated 50 queries, sometimes the question was taken from the forums title and sometimes from the text, as some users wrote their questions only into the forums title. 
If looking at the "original-questions"-file, it is noticable that some number in the index, have been skipped. This is the case when the current question has been divided into multiple questions.

The zero-shot classifier, that filtered the categories that should be looked in for each query, also has to be evaluated.
In order to evaluate the retrieval, the groundtruth needs to be established. The task here is not only to collect the relevant documents but also the relevant categories for each query that is supposed to be tested.

For the establishment of the groundtruth for our documents we largely oriented ourselves on the proceedings in the blog of Phong Cau and Geisa Faustino in
"Efficient Ground Truth Generation for Search Evaluation"(30.05.2025. Online at:https://devblogs.microsoft.com/ise/efficient-ground-truth-generation-search-evaluation/. last accessed: 10.09.2025).
In order to efficiently establish a groundtruth, they passed their queries through their hybrid search approach of text- and vector-based retrievers to retrieve the top 100 results for each queries.
The results were joined and before getting manually labeled, the queries were passed to a LLM to again re-evaluate relevancy.

As we also used a hybrid pipeline, we will integrate the same steps, exept for the integration of the LLM due to limited calls, and evaluate relevance of our documents to the queries.
To access the retrieval results from our hybrid pipeline, we will use a separate pipeline that does not include an LLM. Just as Cao and Faustino, we will use the top 100 results from both retrievers, merge the results with
out DocumentJoiner and manually evaluate relevancy through user input.
What is also supposed to happen is that next to the evaluation of the documents each query, gets assigned 4 categories , that serve as the groundtruth for the evaluation of our used zero-shot classifer. 

In [17]:
# Creates an empty dictionary to save the queries names and the texts as key, aslo included is a list that contains the relevant categories:
query_dict = {}
# Loading all queries into a dictionary:
with open('groundtruth/queries.jsonl', 'r') as json_files:
    for line in json_files:  # reading each line the file separately and
        i_d = json.loads(line)  # loading each line as a json object to acquire the correct encoding
        query_dict[i_d['_id']] = i_d['text'] # each line is then added to the dictionary

json_files.close()

print(query_dict)

{'id01': 'Can i further my range of the teleportation spell with familiar or clairvoyance?', 'id02': 'Assuming all spell slots are used, how many zombies or skeletons could a necromancer raise with the spell Animate Dead?', 'id03': 'Does the character with the highest Initiative go first?', 'id04': 'I have a doubt with the loading property, can I use 2 attacks in a round if I have 3 actions?', 'id05': 'Is it possible to twin a spell (like invisibility) from the ring of storing?', 'id06': 'When I use Divine Smite as a Paladin I expend spell slots, so is Divine Smite then considered a spell?', 'id07': 'When a character has regenerate and his health drops to 0, does the generation stop?', 'id08': 'How do you handle darkness in melee combat?', 'id09': 'Can i, as the DM, use the move Petrifying Bite of the Basilisk on every turn?', 'id10': 'Can Sight Rot disease be cured with Lesser Restoration?', 'id11': 'Does proficiency matter when choosing weapons for two weapon fighting?', 'id12': 'Do 

In [28]:
# As some of the docs might be partially relevant but not highly relevant there will be a ranked relevance:
from IPython.display import clear_output
groundtruth_dict = {}

groundtruth = 'groundtruth/groundtruth_docs.json'

def annotate_relevance(dictionary):
    if os.path.getsize(groundtruth) > 0:
        try:
            with open(groundtruth, "r") as f:
                groundtruth_dict = json.load(f)
        except json.JSONDecodeError:
            print("Groundtruth file is empty or broken, using empty dictionary")
            groundtruth_dict = {}
    else:
        groundtruth_dict = {}

    for _id, text in dictionary.items():
        # Skip already annotated queries
        if _id in groundtruth_dict:
            print(f"Skipping {_id}, already annotated.")
            continue
        else:
            groundtruth_dict[_id] = {}

            relevant_docs = {}
            print (candidate_labels, flush= True)
            print('Question: ' , _id," ", text, flush=True)
            cats = input("which 4 ones of the stated candidate labels apply to the query the best? : ")
            cats_cats = cats.split(',')
            cattys = [string.strip() for string in cats_cats]
            result = hb_noranker_pipeline.run(
            {"text_embedder": {"text": text}, "bm25_retriever": {"query": text, "top_k": 100}, "embedding_retriever": {"top_k": 100}})

            for index, doc in enumerate(result['document_joiner']['documents']):
                print("----", index , "----")
                print("Content:", doc.content, flush=True)
                print("Metadata:", doc.meta['category'], flush=True)
                print('final document score:', doc.score)
                print("----")
                relevance = input('What score would you give this document (0 = irrelevant, 1 = partially relevant, 2 = highly relevant): ')
                doc_id = doc.meta['document_id']
                if doc_id not in relevant_docs:
                    relevant_docs[doc_id] = {'relevance_score': int(relevance)}
                elif doc_id in relevant_docs and relevant_docs[doc_id]['relevance_score'] < int(relevance):
                    relevant_docs[doc_id] = {'relevance_score': int(relevance)}
            
                print("Document with id:", doc_id ," was added to the groundtruth with the following score: ", int(relevance))

                if index == 100:
                    clear_output(wait=True)
                    print('Question: ', text, " ", _id)

            groundtruth_dict[_id]['documents'] = relevant_docs
            groundtruth_dict[_id]['categories'] = cattys
            with open(groundtruth, 'w') as fr:
                json.dump(groundtruth_dict, indent=4, ensure_ascii=False, fp=fr)

            print('------ Next query -----')
            clear_output(wait=False)


annotate_relevance(query_dict)

In [None]:
groundtruth = 'groundtruth/groundtruth_docs.json'
grounds = {}

with open(groundtruth,'r') as f:
    grounds = json.load(f)

# Filtering all relevant queries and their docs for direct comparison:
    bin_truths = {}
    graded_truths = {}
    results_no_filter_ranker_dict = {}
    results_filter_dict = {}
    results_ranker_dict = {}
    results_filter_ranker_dict = {}

for query_id, query in query_dict.items():

    filters = apply_filters(query)
    # Looking up all queries with no filter and no ranker:
    results_no_filter_ranker = hb_noranker_pipeline.run({"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "top_k": 100}, "embedding_retriever": {"top_k": 100}})
    results_no_filter_ranker_dict[query_id] = [doc.meta['document_id'] for doc in results_no_filter_ranker['document_joiner']['documents']]

    # Looking up all queries with filters and no ranker:
    results_filter = hb_noranker_pipeline.run({"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "filters": filters, "top_k": 100}, "embedding_retriever": {"filters": filters,"top_k": 100}})
    results_filter_dict[query_id] = [doc.meta['document_id'] for doc in results_filter['document_joiner']['documents']]

    # Looking up all queries with no filters and the ranker:
    results_ranker = hb_nollm_pipeline.run({"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "top_k": 20}, "embedding_retriever": {"top_k": 20}, "ranker": {"query": query}})
    results_ranker_dict[query_id] = [doc.meta['document_id'] for doc in results_ranker['ranker']['documents']]

    # Looking up all queries with filters and the ranker:
    results_filter_ranker = hb_nollm_pipeline.run({"text_embedder": {"text":query}, "bm25_retriever": {"query": query,"filters": filters, "top_k": 20}, "embedding_retriever": {"filters": filters,"top_k": 20}, "ranker": {"query": query}})
    results_filter_ranker_dict[query_id] = [doc.meta['document_id'] for doc in results_filter_ranker['ranker']['documents']]

    bin_truths[query_id] = [doc_id for doc_id, data in grounds[query_id]['documents'].items() if data['relevance_score'] > 0]
    graded_truths[query_id] = [doc_id for doc_id, data in grounds[query_id]['documents'].items() if data['relevance_score']]

{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'skills'}, {'field': 'meta.category', 'operator': '==', 'value': 'proficiencies'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.72it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.73it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.03it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.40it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 15.24it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.73it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.44it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.35it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'proficiencies'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 20.50it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 29.54it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 15.29it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 31.03it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'weapon_properties'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 12.23it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.10it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 23.72it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.67it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'magic_items'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.09it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 18.69it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 30.66it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.15it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.78it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 18.52it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.59it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.37it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.61it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 19.98it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 23.66it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.31it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'skills'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 20.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.25it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.38it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 39.89it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.12it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.32it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 21.06it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.34it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'damage_types'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 23.01it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 31.46it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 30.77it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 61.25it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'proficiencies'}, {'field': 'meta.category', 'operator': '==', 'value': 'skills'}, {'field': 'meta.category', 'operator': '==', 'value': 'weapon_properties'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.14it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.48it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.34it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.48it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'equipment'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.40it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 17.66it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 39.81it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'races'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 23.52it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 27.00it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.57it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 13.32it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 16.83it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 29.51it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 28.97it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.41it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 27.52it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.71it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.65it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 23.96it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 24.92it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.06it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.11it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 31.39it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.54it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 18.99it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 19.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 30.16it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 31.04it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.27it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.59it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.44it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'traits'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 23.63it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 27.01it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 33.10it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 31.23it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'subraces'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'ability_scores'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 36.12it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 31.43it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 21.13it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 40.44it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 13.57it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 13.74it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 15.48it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 17.50it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'weapon_properties'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.28it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 18.21it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.56it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.29it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 22.32it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 21.51it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 30.74it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 36.14it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'skills'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.34it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.15it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.56it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.50it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'traits'}, {'field': 'meta.category', 'operator': '==', 'value': 'skills'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.47it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.58it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.13it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 17.11it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 20.04it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.19it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 23.92it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 39.03it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'damage_types'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.02it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 29.81it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 30.71it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.12it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'equipment'}, {'field': 'meta.category', 'operator': '==', 'value': 'magic_items'}, {'field': 'meta.category', 'operator': '==', 'value': 'weapon_properties'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.61it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 19.85it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.95it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 41.25it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.35it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.29it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 17.66it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 22.88it/s]


{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 23.84it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 24.21it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 34.14it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 20.37it/s]


In [None]:
# Evaluating pipelines:
top_5 = 5
top_10 = 10
top_50 = 50

def precision_at_k(retrieved, truths, query_dict, k):
    # Every precision is saved in this list:
    prec_per_query = []
    # Walking through the query dictionary:
    for query_id in query_dict.keys():
        # The top k doc from the retrieved docs are accessed: 
        retrieved_docs = retrieved[query_id][:k]
        # The groundtruth (here we are using the groundtruth containing the cos that are highly and slightly relevant) is converted to a set,
        # just In case it contains duplicates.
        relevant_docs = set(truths[query_id])
        # The counter sums up 1 for every document that appears in the top k retrieved docs and in the groundtruth:
        num_rel = sum(1 for doc in retrieved_docs if doc in relevant_docs)
        # Then the number of relevant docs is divided by k to calculate precision@k for this query:
        prec_per_query.append(num_rel / k)
        # The results per query is averaged and an average precision@k is returned:
    return float(np.mean(prec_per_query))


def recall_at_k(retrieved, truth, query_dict, k):
    # The recall per query is saved here:
    recall_per_q = []
    # Again the query dictionary keys are accessed and walked through:
    for query_id in query_dict.keys():
        # The top k doc from the retrieved docs are accessed: 
        retrieved_docs = retrieved[query_id][:k]
        # The groundtruth (here we are using the groundtruth containing the cos that are highly and slightly relevant) is converted to a set,
        # just In case it contains duplicates.
        relevant_docs = set(truth[query_id])
        # The counter sums up 1 for every document that appears in the top k retrieved docs and in the groundtruth:
        num_rel = sum(1 for doc in retrieved_docs if doc in relevant_docs)
        # From every set of grountruth documents how many are returned @k;
        recall_per_q.append(num_rel / len(relevant_docs))
    # The values in the list are averaged and returned as an average recall@k:
    return float(np.mean(recall_per_q)) if recall_per_q else 0.0


def mrr(retrieved, query_dict,truth):
    # Here every MRR is saved in this list:
    rr_list = []
    # Again the query dictionary keys are accessed and walked through:
    for query_id in query_dict.keys():
        # The doc at the currect query_id from the retrieved docs are accessed: 
        retrieved_docs =  retrieved[query_id]
        # The groundtruth (here we are using the groundtruth containing the cos that are highly and slightly relevant) is converted to a set,
        # just In case it contains duplicates.
        relevant_docs = set(truth[query_id])
        # rr is set to 0:
        rr = 0.0
        # The docs are walked through assigning ranks starting from one:
        for i, doc in enumerate(retrieved_docs, start=1):
            # If the currenct doc is found in the groundtruth, the MRR is calculated:
            if doc in relevant_docs:
                # MRR at rank i:
                rr = 1.0 / i
                # It's done for this query so it will break and start with the next query:
                print(i)
                break
        # The MRR is added to the list:
        rr_list.append(rr)
    # The values are averaged and returned.
    return float(np.mean(rr_list))


def ndcg_at_k(retrieved, truth, k):
    # returns a value refrlecting the quality of the ordering through the ranker: The higher the value the better.
    all_ndcg = []
    
    for query_index, docs in retrieved.items():
        # Extract the relevance scores for this query
        rel_map = {doc_id: data['relevance_score'] 
                   for doc_id, data in truth[query_index]['documents'].items()}
        
        # Compute DCG
        dcg = 0.0
        for i, doc in enumerate(docs[:k], start=1):
            rel = rel_map.get(doc, 0.0)
            dcg += (2**rel - 1) / math.log2(i + 1)
        
        # Compute IDCG
        ideal_rels = sorted(rel_map.values(), reverse=True)[:k]
        idcg = sum((2**rel - 1) / math.log2(idx + 2) for idx, rel in enumerate(ideal_rels))
        
        ndcg = dcg / idcg if idcg > 0 else 0.0
        all_ndcg.append(ndcg)
    return float(np.mean(all_ndcg))


# Calculating the MAP in order to evaluate whether the ranking improved versus working with no ranker:
def mean_average_precision(retrieved, truth):
    ap_list = []
    # The loop walks through the retrieved docs for each query:
    for query_index, retrieved_docs in retrieved.items():
        # Counter that adds up, when relevant docs are discovered:
        amount_relevant = 0
        # Collects our precision values:
        precisions = []
        # Then every retrieved doc is accessed for each query and given a rank based on the appearance in the retrieved documents as they were saved in the retrieved order.
        for rank, doc in enumerate(retrieved_docs, start=1):
            # If the current document appears in the groundtruth for this query, the amount_relevant variable is counted up.
            if doc in truth[query_index]:
                amount_relevant += 1
                # The precision at this current rank is calculated and added to the precisions list:
                precisions.append(amount_relevant / rank)
        # For every then the precisions are summed up and divided through the length/amount of relevant documents.
        ap_for_query = sum(precisions) / len(truth[query_index])
        # This value is saved for every query.
        ap_list.append(ap_for_query)
        # Returned is the mean average precision, that expresses the quality of the ranking.
    return sum(ap_list) / len(ap_list)


In [20]:
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_5)," at top_k = 5")
print("Precision@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("Precision@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("----------------")
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_10)," at top_k = 10")
print("Precision@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("Precision@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("----------------")
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_50)," at top_k = 50")
print("Precision@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_50)," at top_k = 50")
print("Precision@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_50)," at top_k = 50")
print("Precision@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_50)," at top_k = 50")

Precision@k of the retrieval pipeline without filters and ranker: 0.33333333333333326  at top_k = 5
Precision@k of the retrieval pipeline with filters and no ranker: 0.27999999999999997  at top_k = 5
Precision@k of the retrieval pipeline without filters and ranker: 0.31999999999999995  at top_k = 5
Precision@k of the retrieval pipeline with filters and ranker: 0.2066666666666667  at top_k = 5
----------------
Precision@k of the retrieval pipeline without filters and ranker: 0.23666666666666666  at top_k = 10
Precision@k of the retrieval pipeline with filters and no ranker: 0.17666666666666667  at top_k = 10
Precision@k of the retrieval pipeline without filters and ranker: 0.20999999999999996  at top_k = 10
Precision@k of the retrieval pipeline with filters and ranker: 0.1633333333333333  at top_k = 10
----------------
Precision@k of the retrieval pipeline without filters and ranker: 0.08133333333333333  at top_k = 50
Precision@k of the retrieval pipeline with filters and no ranker: 0.0

In [22]:
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_5)," at top_k = 5")
print("Recall@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("Recall@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_5)," at top_k = 5")
print("----------------")
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_10)," at top_k = 10")
print("Recall@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("Recall@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_10)," at top_k = 10")
print("----------------")
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_no_filter_ranker_dict, bin_truths, query_dict, top_50)," at top_k = 50")
print("Recall@k of the retrieval pipeline with filters and no ranker:", precision_at_k(results_filter_dict, bin_truths, query_dict,top_50)," at top_k = 50")
print("Recall@k of the retrieval pipeline without filters and ranker:",precision_at_k(results_ranker_dict, bin_truths, query_dict,top_50)," at top_k = 50")
print("Recall@k of the retrieval pipeline with filters and ranker:",precision_at_k(results_filter_ranker_dict, bin_truths, query_dict,top_50)," at top_k = 50")

Recall@k of the retrieval pipeline without filters and ranker: 0.33333333333333326  at top_k = 5
Recall@k of the retrieval pipeline with filters and no ranker: 0.27999999999999997  at top_k = 5
Recall@k of the retrieval pipeline without filters and ranker: 0.31999999999999995  at top_k = 5
Recall@k of the retrieval pipeline with filters and ranker: 0.2066666666666667  at top_k = 5
----------------
Recall@k of the retrieval pipeline without filters and ranker: 0.23666666666666666  at top_k = 10
Recall@k of the retrieval pipeline with filters and no ranker: 0.17666666666666667  at top_k = 10
Recall@k of the retrieval pipeline without filters and ranker: 0.20999999999999996  at top_k = 10
Recall@k of the retrieval pipeline with filters and ranker: 0.1633333333333333  at top_k = 10
----------------
Recall@k of the retrieval pipeline without filters and ranker: 0.08133333333333333  at top_k = 50
Recall@k of the retrieval pipeline with filters and no ranker: 0.054000000000000006  at top_k = 

In [26]:
# These two pipelines are used to see whether the ranker actually improves the ranking:
print("nDCG@k of the retrieval pipeline without a ranker", ndcg_at_k(results_no_filter_ranker_dict, grounds, top_5), " at top_k = 5")
print("nDCG@k of the retrieval pipeline with a ranker", ndcg_at_k(results_ranker_dict, grounds, top_5), " at top_k = 5")
print("----------------")
print("nDCG@k of the retrieval pipeline without a ranker", ndcg_at_k(results_no_filter_ranker_dict, grounds, top_10), " at top_k = 10")
print("nDCG@k of the retrieval pipeline with a ranker", ndcg_at_k(results_ranker_dict, grounds, top_10), " at top_k = 10")
print("----------------")
print("nDCG@k of the retrieval pipeline without a ranker", ndcg_at_k(results_no_filter_ranker_dict, grounds, top_50), " at top_k = 50")
print("nDCG@k of the retrieval pipeline with a ranker", ndcg_at_k(results_ranker_dict, grounds, top_50), " at top_k = 50")

nDCG@k of the retrieval pipeline without a ranker 0.4909211210773096  at top_k = 5
nDCG@k of the retrieval pipeline with a ranker 0.5053654344895262  at top_k = 5
----------------
nDCG@k of the retrieval pipeline without a ranker 0.5481141491572366  at top_k = 10
nDCG@k of the retrieval pipeline with a ranker 0.5371740066539225  at top_k = 10
----------------
nDCG@k of the retrieval pipeline without a ranker 0.594630316851101  at top_k = 50
nDCG@k of the retrieval pipeline with a ranker 0.5231236455238748  at top_k = 50


In [40]:
# MRR for the following pipelines:
print("MRR of the retrieval pipeline without a ranker and without filters: ", mrr(results_no_filter_ranker_dict, query_dict, bin_truths))
print("----------------")
print("MRR of the retrieval pipeline without a ranker and with filters: ", mrr(results_filter_dict, query_dict, bin_truths))
print("----------------")
print("MRR of the retrieval pipeline with a ranker and without filters: ", mrr(results_ranker_dict, query_dict, bin_truths))
print("----------------")
print("MRR of the pipeline with a ranker and filters: ", mrr(results_filter_ranker_dict, query_dict, bin_truths))

1
3
2
1
1
2
8
1
2
1
1
3
1
20
2
1
16
1
1
106
1
1
1
1
1
3
1
1
1
9
MRR of the retrieval pipeline without a ranker and without filters:  0.7119348357791754
----------------
1
2
1
1
1
1
5
3
1
1
1
2
5
102
1
4
2
4
1
16
1
1
1
1
4
MRR of the retrieval pipeline without a ranker and with filters:  0.5685212418300654
----------------
1
1
2
1
1
1
2
1
1
1
1
9
1
1
1
4
2
7
1
2
2
1
1
1
2
1
1
1
3
MRR of the retrieval pipeline with a ranker and without filters:  0.7612433862433862
----------------
1
1
1
1
1
10
10
1
1
1
1
8
3
3
8
2
1
3
2
1
1
1
MRR of the pipeline with a ranker and filters:  0.515


In [28]:
# Here bin_truths is used, because even partially relevant documents should be inlcuded in the metric.
# MAP regarding the pipeline without a ranker:
print(mean_average_precision(results_no_filter_ranker_dict, bin_truths))
# And with a ranker: 
print(mean_average_precision(results_ranker_dict, bin_truths))

0.43528059163283245
0.3776953737620404


In [None]:
# 2. Evaluate the multilabel classifier that was used to predict the category labels that should be searched in:
# As our model didn't "predict" the labels but was used in a pipeline as instructed on the huggingface page.
#  As the 4 highest scoring predictions were turned into filtering labels, we will treat these labels as the model's "predictions".
#  In order to evaluate our classifier we will use the scores: 

def evaluate_classifier(groundtruth_dict, query_dict):
    results = {}
    true_cats = {}
    labels_dict = {}
    # Walking through the query dictionary:
    for _id, query in query_dict.items():
        # From the original groundtruth, the true categories are saved in the true_cats under the corresponding id:
        true_cats[_id] = groundtruth_dict[_id]['categories']
        # Then the questions, the candidate labels are given to the classifier for every query:
        result = classifier(query, candidate_labels, multi_label = True)
        high_score_dict = {}
        labels_dict[_id] = []
        # The same way when filtering the 3 highest scoring labels are saved in a dict
        for label, score in zip(result['labels'][:3], result['scores'][:3]):
            high_score_dict[label] = score
            if " " in label:
                completed_label = str(label)
                new_label = completed_label.replace(" ","_")
                labels_dict[_id].append(new_label)
            else:
                labels_dict[_id].append(label)

    # In order to claculate the f1-micro and f1-macro we are initializing the MultilabelBinarizer to convert our labels into binary classes:
    mlb = MultiLabelBinarizer(classes=candidate_labels)
    print(mlb)
    # The values in our dicts are converted into matrices:
    y_true = mlb.fit_transform(list(true_cats.values()))
    y_pred = mlb.transform(list(labels_dict.values()))
    # From this the scores can be calculated:
    f1_micro = f1_score(y_true, y_pred, average='micro')
    f1_macro = f1_score(y_true, y_pred, average='macro')
    h_loss = hamming_loss(y_true, y_pred)

    print("Classifier evaluation (top-3 predictions):")
    print(f"F1-micro: {f1_micro:.2f}")
    print(f"F1-macro: {f1_macro:.2f}")

    return results

evaluate_classifier(grounds, query_dict)

# Classifier evaluation (top-3 predictions):
# F1-micro: 0.40
# F1-macro: 0.33

MultiLabelBinarizer(classes=['rules', 'rule_sections', 'races', 'subraces',
                             'classes', 'subclasses', 'skills', 'feats',
                             'languages', 'ability_scores', 'traits',
                             'proficiencies', 'features',
                             'example_character_background', 'conditions',
                             'equipment', 'equipment_categories',
                             'weapon_properties', 'magic_items',
                             'magic_schools', 'damage_types', 'spells',
                             'monsters'])
Classifier evaluation (top-3 predictions):
F1-micro: 0.40
F1-macro: 0.33


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{}

In [None]:
# Evaluation of LLM answers: 
import re
from sentence_transformers import SentenceTransformer, util
groundtruth_a = 'groundtruth/groundtruth_answers.jsonl'
llm_path = 'llm_answers/answers.jsonl'


groundtruth_ans = {}

with open(groundtruth_a, 'r') as f:
    for line in f:
        if line.strip():
            # As the answers file is a JSONL, every line needs to be accessed manually,
            # and added manually
            item = json.loads(line)
            query_id = item['_id']  
            groundtruth_ans[query_id] = item

# Load a small semantic similarity model
sim_model = SentenceTransformer('all-MiniLM-L6-v2')

def evaluate_answers(query_dict, groundtruth_answers):
    results = {}
    llm_answers_dict = {}
    try:
        if os.path.getsize(llm_path) > 0:
            with open(llm_path, "r") as f:
                llm_answers_dict = json.load(f)
    except json.JSONDecodeError:
        print("LLM-answers file is empty or broken!")

    # Going through every item in the query-dictionary:
    for query_id, query in query_dict.items():
        # Skip already annotated queries
        if query_id in llm_answers_dict:
            print(f"Skipping {query_id}, already saved.")
            continue
        else:
            try:
                # Attempting to run the LLM pipeline, checking if all the calls have been used up:
                # Every query is sent to the hybrid pipeline with apllied filters:
                filters = apply_filters(query)
                result_pipeline = hybrid_retrieval.run({"text_embedder": {"text": query}, "bm25_retriever": {"filters":filters,"query": query,"top_k": 20},"embedding_retriever":{"filters": filters,"top_k": 20}, "ranker": {"query": query}, "prompt_builder":{"question": query}})
                print(result_pipeline)
                # The result is saved here:
                pipeline_answers_raw = result_pipeline['llm']['replies'][0].text
                # The groundtruth answers for the corresponding query_id is saved here:
                groundtruth_answer_raw = groundtruth_answers[query_id]['text']

                # Lowercasing both strings:
                pipeline_answers = pipeline_answers_raw.lower()
                groundtruth_answer = groundtruth_answer_raw.lower()

                # Remove symbols like **:
                pipeline_answers = re.sub(r"(\*)", "", pipeline_answers_raw)
                # Replace multiple newlines with a single space
                pipeline_answers = re.sub(r"(\\n)", "", pipeline_answers)
                pipeline_answers = re.sub(r'(\\")', "", pipeline_answers)
                groundtruth_answer = re.sub(r'(\\")', "", pipeline_answers)
                # Stripping unecessary whitespaces:
                pipeline_answers = pipeline_answers.strip()
                
                # Calculate the semantic similarity between the groundtruth answer and the generated answer:
                gen_emb = sim_model.encode(pipeline_answers, convert_to_tensor=True)
                true_embs = sim_model.encode(groundtruth_answer, convert_to_tensor=True)
                cosine_scores = util.cos_sim(gen_emb, true_embs)

                # Returns the similarity score:
                similarity_score = cosine_scores.item()
                
                # The results from the computation is stored in a dict containing a reuslt for every query:
                results[query_id] = {
                    'question': query,
                    'generated_answer': pipeline_answers_raw,
                    'similarity_score': similarity_score,
                    'true_answer': groundtruth_answer_raw
                }
                # For every query_id, the answer from the LLM is saved in a dictionary:
                llm_answers_dict[query_id] = results[query_id]
                # Due to limited calls, every answer will immediatly be saved to the file below, in order to keep the results:
                with open(llm_path, 'w') as fr:
                    json.dump(llm_answers_dict, indent=4, ensure_ascii=False, fp=fr)
            
                # Compute average semantic similarity:
                avg_similarity = sum(r['similarity_score'] for r in results.values()) / len(results)
                print("Average semantic similarity:", avg_similarity)
            except Exception as e:
                # Print the error, but continue to the next query
                print(f"Error processing query {query_id}: {e}")
                break
    return results

evaluate_answers(query_dict, groundtruth_ans)

Skipping id01, already saved.
Skipping id02, already saved.
Skipping id03, already saved.
Skipping id04, already saved.
Skipping id05, already saved.
Skipping id06, already saved.
Skipping id07, already saved.
Skipping id08, already saved.
Skipping id09, already saved.
Skipping id10, already saved.
Skipping id11, already saved.
Skipping id12, already saved.
Skipping id13, already saved.
Skipping id14, already saved.
Skipping id15, already saved.
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 23.98it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='The text states, "Illusion spells deceive the senses or minds of others. They cause people to see things that are not there, to miss things that are there, to hear phantom noises, or to remember things that never happened. Some illusions create phantom images that any creature can see, but the most insidious illusions plant an image directly in the mind of a creature." This implies illusions do not inherently block or reflect light, but rather alter perceptions. Therefore, they don\'t physically interact with light.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7292, 'completion_tokens': 102, 'total_tokens': 7394}})]}}
Average semantic similarity: 1.0000001192092896
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'

Batches: 100%|██████████| 1/1 [00:00<00:00, 17.45it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Based on the provided texts, there isn\'t a direct equivalent to the "spellcraft" ability that allows you to identify a spell as it\'s being cast. However, here are some ways to gain information about spells being cast that might approximate that functionality:\n\n1.  **Identify (on the caster):** The *Identify* spell can be used on a creature to learn what spells, if any, are currently affecting it. However, this would only work *after* the spell has been cast and is actively affecting the target. It wouldn\'t reveal the spell as it\'s being cast.\n\n2.  **Detect Thoughts:** If a spellcaster is casting a spell, that might be the surface thought of the caster, which you could potentially read using *Detect Thoughts.* This is very circumstantial, requiring you to be within 30 feet, be able to see the caster, and for the spellcasting to be at the forefront of their mind. It also wouldn\'

Batches: 100%|██████████| 1/1 [00:00<00:00, 29.47it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='When you make a high jump, you leap into the air a number of feet equal to 3 + your Strength modifier if you move at least 10 feet on foot immediately before the jump. When you make a standing high jump, you can jump only half that distance. Either way, each foot you clear on the jump costs a foot of movement.\n\nIn some circumstances, your GM might allow you to make a Strength Athletics check to jump higher than you normally can.\n\nYou can extend your arms half your height above yourself during the jump. Thus, you can reach above you a distance equal to the height of the jump plus 1.5 times your height.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 9969, 'completion_tokens': 137, 'total_tokens': 10106}})]}}
Average semantic similarity: 1.0000000794728596
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator'

Batches: 100%|██████████| 1/1 [00:00<00:00, 20.26it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='The provided text does not contain information about adamantine mail shirts, but it does specify the AC for Adamantine objects. The source also states that you must be proficient with the armor you are wearing in order to cast a spell.')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 13576, 'completion_tokens': 46, 'total_tokens': 13622}})]}}
Average semantic similarity: 1.0000000596046448
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'subraces'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'ability_scores'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 19.76it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='This document does not contain any rules for directly countering a counter.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16311, 'completion_tokens': 15, 'total_tokens': 16326}})]}}
Average semantic similarity: 1.0000000476837159
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 10.11it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text="No, you cannot cast spells with your action as normal on rounds 2 and 3 if you are readying a spell like Magic Missile.\n\nHere's why:\n\n*   **Readying a Spell:** Readying a spell means you are holding the spell's energy, waiting for a specific trigger. You are using your action to prepare the spell for release when the trigger occurs.\n*   **Action Economy:** You only have one action per turn. If you use your action on round 1 to ready the *Magic Missile* spell, then you've already spent your action for that turn. You can't then use that action to cast another spell. You continue to hold the readied spell, doing nothing with your action, until either the trigger occurs or you decide to not use the readied action.\n*   **Concentration (Not Applicable Here):** The concentration rules don't come into play here because *Magic Missile* doesn't require concentration. However, if you were r

Batches: 100%|██████████| 1/1 [00:00<00:00, 20.09it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='The provided text does not contain any information about Faraday cages or how armor might interact with electricity beyond basic AC bonuses. Therefore, I cannot answer whether chain or plate armor would function as a Faraday cage in the context of D&D.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 5653, 'completion_tokens': 48, 'total_tokens': 5701}})]}}
Average semantic similarity: 1.000000034059797
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 16.57it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text="The text doesn't directly address whether the *bless* spell remains active on a polymorphed character. However, we can infer based on the description of *polymorph* and the general rules of spellcasting.\n\nThe *polymorph* spell transforms a creature into a new form. The text provided describes spells as discrete magical effects. Since *bless* is a spell effect on a creature, not an inherent property of the creature, *polymorph* would likely end the *bless* effect.\n\nIt is important to note that this is an interpretation based on general rules and spell descriptions. A specific DM could rule differently.\n")], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 13016, 'completion_tokens': 129, 'total_tokens': 13145}})]}}
Average semantic similarity: 1.0000000447034836
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator

Batches: 100%|██████████| 1/1 [00:00<00:00, 24.01it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='The text does not explicitly state whether Sneak Attack can be used with ranged spells like Eldritch Blast. However, here\'s what we can infer:\n\n*   **Sneak Attack:** The rogue\'s Sneak Attack feature is mentioned in the context of critical hits and damage dice, but its specific requirements aren\'t detailed in this document.\n*   **Ranged Attacks:** The "Making an Attack" section describes ranged attacks using weapons like bows and crossbows. It doesn\'t explicitly include spells but states "Many spells also involve making a ranged attack".\n*   **Spell Attacks:** The "Actions in Combat" section mentions spellcasters using spells to great effect in combat and that casting a spell is not necessarily an action.\n*   **Ability Modifier:** Some spells require an attack roll, and the ability modifier used for a spell attack depends on the spellcasting ability of the spellcaster.\n\nTo de

Batches: 100%|██████████| 1/1 [00:00<00:00, 19.70it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Based on the provided text snippets, there is no explicit rule that states you can learn traits, effects, moves & general features from other classes without multiclassing.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 3748, 'completion_tokens': 34, 'total_tokens': 3782}})]}}
Average semantic similarity: 1.0000000476837159
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'example_character_background'}, {'field': 'meta.category', 'operator': '==', 'value': 'features'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 15.55it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='I am sorry, but I cannot answer that question with the information I have.')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 8466, 'completion_tokens': 16, 'total_tokens': 8482}})]}}
Average semantic similarity: 1.0000000541860408
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'monsters'}, {'field': 'meta.category', 'operator': '==', 'value': 'damage_types'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 18.59it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Yes, undead creatures can be affected by psychic damage unless they have a specific immunity or resistance to it.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 5148, 'completion_tokens': 22, 'total_tokens': 5170}})]}}
Average semantic similarity: 1.0000000496705372
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'equipment'}, {'field': 'meta.category', 'operator': '==', 'value': 'magic_items'}, {'field': 'meta.category', 'operator': '==', 'value': 'weapon_properties'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 22.82it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Yes, the description of the Staff of Healing says "This staff has 10 charges."\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 3933, 'completion_tokens': 20, 'total_tokens': 3953}})]}}
Average semantic similarity: 1.0000000641896174
{'operator': 'OR', 'conditions': [{'field': 'meta.category', 'operator': '==', 'value': 'spells'}, {'field': 'meta.category', 'operator': '==', 'value': 'rules'}, {'field': 'meta.category', 'operator': '==', 'value': 'conditions'}, {'field': 'meta.category', 'operator': '==', 'value': 'rule_sections'}]}


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.56it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text="The *Shatter* spell states that a creature made of inorganic material has disadvantage on the saving throw. Magic resistance would not counteract this. Magic resistance typically grants advantage on saving throws against spells and other magical effects, and sometimes advantage on all saving throws, but it doesn't remove disadvantage imposed by other conditions or spell effects. Disadvantage and advantage cancel each other out. If the golem has some form of magic resistance that grants advantage on Constitution saving throws against spells, and *Shatter* is a spell, then the golem would roll normally, as the advantage and disadvantage would cancel each other out. If the golem's magic resistance does not grant advantage on saving throws, then it would have disadvantage on its saving throw against *Shatter*.\n")], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', '

Batches: 100%|██████████| 1/1 [00:00<00:00, 26.27it/s]


{'llm': {'replies': [ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='No, the rules text for "What Is a Spell?" states "The ritual version of a spell takes 10 minutes longer to cast than normal. It also doesn\'t expend a spell slot, which means the ritual version of a spell can\'t be cast at a higher level." The text does not mention anything about the material components being bypassed.\n\nThe "Casting a Spell" section says "A character can use a component pouch or a spellcasting focus found in "Equipment" in place of the components specified for a spell. But if a cost is indicated for a component, a character must have that specific component before he or she can cast the spell." This implies that the component cost still applies even if cast as a ritual.\n\nTherefore, the spell component cost still applies if the spell is cast as a ritual.\n')], _name=None, _meta={'model': 'gemini-2.0-flash', 'finish_reason': 'stop', 'usage': {'prompt_tokens': 7898, '

{'id16': {'generated': 'The text states, "Illusion spells deceive the senses or minds of others. They cause people to see things that are not there, to miss things that are there, to hear phantom noises, or to remember things that never happened. Some illusions create phantom images that any creature can see, but the most insidious illusions plant an image directly in the mind of a creature." This implies illusions do not inherently block or reflect light, but rather alter perceptions. Therefore, they don\'t physically interact with light.\n',
  'similarity_score': 1.0000001192092896,
  'question': 'Does an illusion block light or reflect light?',
  'true_answer': "It is only an illusion, it can't actually do anything as it isn't a physical thing. The important part of the spell is. The image can't create sound, light, smell, or any other sensory effect. Physical interaction with the image reveals it to be an illusion, because things can pass through it. So it wouldn't block light, or 

In [None]:
# Visualizations:
output_path = "a-retrieval-augmented-qa-system-for-dungeons-and-dragons/visualisations"

# Make sure the folder exists
os.makedirs(output_path, exist_ok=True)

#hybrid_retrieval.draw(path=output_path)
#hybrid_retrieval.draw(path=output_path)

with open(output_path, "w") as f:
    f.write("Hello")

In [None]:
import matplotlib.pyplot as plt

# Einfach dann über den file path das dokument öffnen und einfügen. Ist hier in dem fall answers.jsonl 
query_ids = list(results.keys())
similarities = [r['similarity_score'] for r in results.values()]
f1_scores = [r.get('f1', 0) for r in results.values()]

plt.figure(figsize=(12,6))
plt.bar(query_ids, similarities, alpha=0.6, label='Semantic similarity')
plt.bar(query_ids, f1_scores, alpha=0.6, label='F1-score')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Score')
plt.title('RAG Evaluation per Query')
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns

sns.histplot(similarities, bins=10, kde=True)
plt.title('Distribution of Semantic Similarity Across Queries')
plt.xlabel('Semantic similarity')
plt.show()

In [None]:
from collections import defaultdict

cat_scores = defaultdict(list)
for qid, r in results.items():
    category = groundtruth_dict[qid]['category']
    cat_scores[category].append(r['max_similarity'])

avg_cat_scores = {cat: sum(vals)/len(vals) for cat, vals in cat_scores.items()}

### Revision ###

As no training data was available to us and the amount of data on the API was not sufficient for training, we heavily relied on zero-shot models. Multiple problems could arise from this. The first one is the misalignment of semantic information, for example the word "race" not aligning with the fictional races in the game and therefore them not being semantically retrieved when asking for them. 
Another critic point is the embedding of all metadata key from the data. 

The reason as to why it was implemented this way, was the desire to not have the user having to implement filters to increase results and the ability for the user to ask about every piece of information that is stored in the metadata. This signifies the trade-off we experienced between more noise in the data and the availability of information for retrieval.