# RAG-based Q&A on D&D #

# 1. Pulling the API-data from the website #

The first step is to pull the information from the api-website (link: https://www.dnd5eapi.co/api/2014) and save the entries from the tables into dictionaries, so that they can then be written to json files and become permeated information that is indepentent from the api and its availability.

In [None]:
# All needed modules and installments
%pip install -U datasets huggingface_hub fsspec
%pip -m spacy download en_core_web_sm
%pip install haystack-ai
%pip install google-genai-haystack
%pip install "sentence-transformers>=4.1.0"
%pip install "fsspec==2023.9.2"
%pip install "sentence-transformers>=4.1.0" "huggingface_hub>=0.23.0"
%pip install markdown-it-py mdit_plain pypdf
%pip install transformers[torch,sentencepiece]

In [95]:
# All needed imports
import requests
import pprint
import json
import spacy
from bs4 import BeautifulSoup
import re
import time
import os
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever, InMemoryBM25Retriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator
from haystack.utils import Secret
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.joiners import DocumentJoiner
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.rankers import SentenceTransformersSimilarityRanker 

In [44]:
# As there is a specific rate limit of 10k requests per second, code to respect that rate limit including a buffer were integrated:
MAX_REQUESTS_PER_SECOND = 5000
DELAY = 1 / MAX_REQUESTS_PER_SECOND

In [57]:
# Below is the website link and code to generally acces the api and display the different tables that are supposed to get saved.
# This general code was taken from the api-website to get an understanding of the access.
url = "https://www.dnd5eapi.co/api/2014/classes/druid"

payload = {}
headers = {
  'Accept': 'application/json'
}

response = requests.request("GET", url, headers=headers, data=payload)
answer_whole = response.text
print(answer_whole)

{"index":"druid","name":"Druid","hit_die":8,"proficiency_choices":[{"desc":"Choose two from Arcana, Animal Handling, Insight, Medicine, Nature, Perception, Religion, and Survival","choose":2,"type":"proficiencies","from":{"option_set_type":"options_array","options":[{"option_type":"reference","item":{"index":"skill-arcana","name":"Skill: Arcana","url":"/api/2014/proficiencies/skill-arcana"}},{"option_type":"reference","item":{"index":"skill-animal-handling","name":"Skill: Animal Handling","url":"/api/2014/proficiencies/skill-animal-handling"}},{"option_type":"reference","item":{"index":"skill-insight","name":"Skill: Insight","url":"/api/2014/proficiencies/skill-insight"}},{"option_type":"reference","item":{"index":"skill-medicine","name":"Skill: Medicine","url":"/api/2014/proficiencies/skill-medicine"}},{"option_type":"reference","item":{"index":"skill-nature","name":"Skill: Nature","url":"/api/2014/proficiencies/skill-nature"}},{"option_type":"reference","item":{"index":"skill-percept

In [45]:
# When looking at some of the textual entries, there were multiple entries containing '#', '\n' and multiple whitespaces so they were all removed.
nlp = spacy.load("en_core_web_sm")
text_p = 'How does this work?'

# This method was taken from our exercise class:
def remove_xml_tags(review_text):
    return BeautifulSoup(review_text, "html.parser").text

# This method was also inspired from the one in our class but changed so if fits the context.
def preprocess_text(text):
    # Some of the handled descriptions were lists of strings, so it was checked whether that was the case for each entry string.
    # If they were in a list, the entries were joined to one single string.
    if isinstance(text,list):
        text = ' '.join(text)  
    # Possible html tags were removed
    free_text = remove_xml_tags(text)

    # The unwanted characters were removed - lowering the text and removing stopwords and punctuation was not done, because the llm later needs to restrucutre the given text into an response,
    # and to keep the 'sense' of the description, the stopwords weren't removed.
    # In order to remove these characters, they were filtered by a regex.
    free_text = re.sub(r"[#_*\\(\)\n]", "", text)
    free_text = re.sub(r"\s{2,}", " ", free_text)
    free_text = re.sub(r"[{2,}-]", " ", free_text)
    return free_text.strip()



In [None]:

# These tables were all handled at once, because when looking at them, they had the same basic structure:
list_of_indices = ['conditions','damage-types','magic-schools','weapon-properties']
# Empty dictionaries to later store the information were initialized:
dict_of_conditions = {}
dict_of_damage_types = {}
dict_of_magic_schools = {}
dict_of_weapon_properties = {}

# This method takes the str-input that functions as a identifier for the dict and the indexing word.
def create_dict(type):
    # A dictionary that holds the response data.
    dict_of_response_data = {}

    time.sleep(DELAY)
    
    url = "https://www.dnd5eapi.co/api/2014/"+type
    response = requests.request("GET", url, headers=headers, data=payload)
    resp = response.json()
    # Every entry in the results is walked through and information such as 'name' and 'description' is saved in a variable.
    for entry in resp['results']:
        response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
        resp2 = response2.json()
        # They are joined in a dictionary specific for each entry in the response list.
        response_data = {
            'name': entry['name'].lower(),
            'desc': "".join(preprocess_text(resp2['desc']))
        }
        # For every index in the list an dictionary entry is added to the returned dictionary.
        dict_of_response_data[entry['index']] = response_data
    return dict_of_response_data

# All similar dictionaries are created below:
dict_of_conditions = create_dict(list_of_indices[0])
dict_of_damage_types = create_dict(list_of_indices[1])
dict_of_magic_schools = create_dict(list_of_indices[2])
dict_of_weapon_properties = create_dict(list_of_indices[3])
# An example from above:
pprint.pprint(dict_of_conditions)

{'blinded': {'desc': "A blinded creature can't see and automatically fails any "
                     'ability check that requires sight.   Attack rolls '
                     "against the creature have advantage  and the creature's "
                     'attack rolls have disadvantage.',
             'name': 'blinded'},
 'charmed': {'desc': "A charmed creature can't attack the charmer or target "
                     'the charmer with harmful abilities or magical effects.   '
                     'The charmer has advantage on any ability check to '
                     'interact socially with the creature.',
             'name': 'charmed'},
 'deafened': {'desc': "A deafened creature can't hear and automatically fails "
                      'any ability check that requires hearing.',
              'name': 'deafened'},
 'exhaustion': {'desc': 'Some special abilities and environmental hazards  '
                        'such as starvation and the long term effects of '
                

In [103]:
# This cell handles the possible rules and their sections:
url = "https://www.dnd5eapi.co/api/2014/rule-sections"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_rule_sections = {}
section_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)

     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     section_data = {
         'name': name,
         'desc': preprocess_text(resp2['desc']),
    }
     
     dict_of_rule_sections[entry['index']] = section_data
pprint.pprint(dict_of_rule_sections)

{'ability-checks': {'desc': 'Ability ChecksAn ability check tests a '
                            "character's or monster's innate talent and "
                            'training in an effort to overcome a challenge. '
                            'The GM calls for an ability check when a '
                            'character or monster attempts an action other '
                            'than an attack that has a chance of failure. When '
                            'the outcome is uncertain  the dice determine the '
                            'results.For every ability check  the GM decides '
                            'which of the six abilities is relevant to the '
                            'task at hand and the difficulty of the task  '
                            'represented by a Difficulty Class.The more '
                            'difficult a task  the higher its DC. The Typical '
                            'Difficulty Classes table shows the most common '
    

In [49]:
# This cell handles the possible skills:
url = "https://www.dnd5eapi.co/api/2014/skills"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_skills = {}
skill_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)

     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     skill_data = {
         'name': name,
         'desc': preprocess_text(resp2['desc']),
         'ability_score': resp2['ability_score']['name']
    }
     
     dict_of_skills[entry['index']] = skill_data
pprint.pprint(dict_of_skills)
     

{'acrobatics': {'ability_score': 'DEX',
                'desc': 'Your Dexterity Acrobatics check covers your attempt '
                        'to stay on your feet in a tricky situation  such as '
                        "when you're trying to run across a sheet of ice  "
                        'balance on a tightrope  or stay upright on a rocking '
                        "ship's deck. The GM might also call for a Dexterity "
                        'Acrobatics check to see if you can perform acrobatic '
                        'stunts  including dives  rolls  somersaults  and '
                        'flips.',
                'name': 'Acrobatics'},
 'animal-handling': {'ability_score': 'WIS',
                     'desc': 'When there is any question whether you can calm '
                             'down a domesticated animal  keep a mount from '
                             "getting spooked  or intuit an animal's "
                             'intentions  the GM might call for 

In [50]:
# This cell handles the possible feats. The API contain in this and the background table only one entry due to Copyright reasons.
url = "https://www.dnd5eapi.co/api/2014/feats"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_feats = {}
feat_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)

     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     feat_data = {
         'name': name,
         'desc': preprocess_text(resp2['desc'])
    }
     if resp2.get('prerequisites'):
          feat_data['prerequisites']= [{'ability_score': item['ability_score']['name'], 'minimum_score':item['minimum_score'] }for item in resp2['prerequisites']]
     
     dict_of_feats[entry['index']] = feat_data
pprint.pprint(dict_of_feats)

{'grappler': {'desc': 'You’ve developed the Skills necessary to hold your own '
                      'in close  quarters Grappling. You gain the following '
                      'benefits:   You have advantage on Attack Rolls against '
                      'a creature you are Grappling.   You can use your action '
                      'to try to pin a creature Grappled by you. To do so  '
                      'make another grapple check. If you succeed  you and the '
                      'creature are both Restrained until the grapple ends.',
              'name': 'Grappler',
              'prerequisites': [{'ability_score': 'STR', 'minimum_score': 13}]}}


In [51]:
# This cell handles the ability score table from the API:
url = "https://www.dnd5eapi.co/api/2014/ability-scores"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_ability_scores = {}
ability_score_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)
     
     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     ability_score_data = {
         'abbreviation': name,
         'name': resp2['full_name'],
         'desc': preprocess_text(resp2['desc'])
    }
     if resp2.get('skills'):
          ability_score_data['skills']= [item['name'] for item in resp2['skills']]
     
     dict_of_ability_scores[entry['index']] = ability_score_data
pprint.pprint(dict_of_ability_scores)

{'cha': {'abbreviation': 'CHA',
         'desc': 'Charisma measures your ability to interact effectively with '
                 'others. It includes such factors as confidence and '
                 'eloquence  and it can represent a charming or commanding '
                 'personality. A Charisma check might arise when you try to '
                 'influence or entertain others  when you try to make an '
                 'impression or tell a convincing lie  or when you are '
                 'navigating a tricky social situation. The Deception  '
                 'Intimidation  Performance  and Persuasion skills reflect '
                 'aptitude in certain kinds of Charisma checks.',
         'name': 'Charisma',
         'skills': ['Deception', 'Intimidation', 'Performance', 'Persuasion']},
 'con': {'abbreviation': 'CON',
         'desc': 'Constitution measures health  stamina  and vital force. '
                 'Constitution checks are uncommon  and no skills apply to '
    

In [56]:
# This cel handles the language table from the API:
url = "https://www.dnd5eapi.co/api/2014/languages"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_languages = {}
language_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)
     
     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     language_data = {
         'name': name,
         'type': resp2['type'],
         'typical_speakers': resp2['typical_speakers']
    }
     if resp2.get('script'):
          language_data['script']= resp2['script']
     
     dict_of_languages[entry['index']] = language_data
pprint.pprint(dict_of_languages)

{'abyssal': {'name': 'Abyssal',
             'script': 'Infernal',
             'type': 'Exotic',
             'typical_speakers': ['Demons']},
 'celestial': {'name': 'Celestial',
               'script': 'Celestial',
               'type': 'Exotic',
               'typical_speakers': ['Celestials']},
 'common': {'name': 'Common',
            'script': 'Common',
            'type': 'Standard',
            'typical_speakers': ['Humans']},
 'deep-speech': {'name': 'Deep Speech',
                 'type': 'Exotic',
                 'typical_speakers': ['Aboleths', 'Cloakers']},
 'draconic': {'name': 'Draconic',
              'script': 'Draconic',
              'type': 'Exotic',
              'typical_speakers': ['Dragons', 'Dragonborn']},
 'dwarvish': {'name': 'Dwarvish',
              'script': 'Dwarvish',
              'type': 'Standard',
              'typical_speakers': ['Dwarves']},
 'elvish': {'name': 'Elvish',
            'script': 'Elvish',
            'type': 'Standard',
         

In [58]:
# This cell handles the different classes there are in the API:
url = "https://www.dnd5eapi.co/api/2014/classes"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_classes = {}
class_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)

     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     class_data = {
         'name': name,
         'hit_die': resp2['hit_die']
    }
     # This structure tries to identify whether certain 'variables' exist in the table,
     # they follow the same structure but some don't have values stored in them and these ones are not supposed to be saved, so there is less data.
     if resp2.get('proficiency_choices'):
         class_data['proficiency_choices'] = preprocess_text([item['desc'] for item in resp2['proficiency_choices']])

     if resp2.get('proficiencies'):
         class_data['proficiencies'] = [item['name'] for item in resp2['proficiencies']]

     if resp2.get('saving_throws'):
         class_data['saving_throws'] = [item['name'] for item in resp2['saving_throws']]
     
     if resp2.get('starting_equipment'):
         class_data['starting_equipment'] = [{'name': item['equipment']['name'], 'quantity': item['quantity']} for item in resp2['starting_equipment']]
     
     if resp2.get('starting_equipment_options'):
         class_data['starting_equipment_options'] = preprocess_text([item['desc'] for item in resp2['starting_equipment_options']])
     

     time.sleep(DELAY)
     
     # This table includes annother link for each class the link to this varaible looks like this f.ex. 'https://www.dnd5eapi.co/api/2014/classes/barbarian/levels'
     response3 = requests.request("GET", url+f"/{entry['index']}/levels", headers=headers, data=payload)
     resp3 = response3.json()

     level_changes = []
     # for every entry there are changes to the character and these changes will be saved in the level_changes list and later added to the
     # class structure above under the key "class_levels":
     for lvl_entries in resp3:
          level_dict = {
               'level': lvl_entries['level'],
               'ability_score_bonuses': lvl_entries['ability_score_bonuses'],
               'proficienciy_bonus': lvl_entries['prof_bonus'],
               'features': [item['name'] for item in lvl_entries['features']],
               'class_specific': lvl_entries['class_specific']
          }
          level_changes.append(level_dict)

     if resp2.get('class_levels'):
         class_data['class_levels'] = level_changes
     
     if resp2.get('multi_classing'):
         multi_class_dict = {}
         if resp2['multi_classing'].get('prerequisites'):
            prerequisites = [{'ability': item['ability_score']['name'], 'minimum_score': item['minimum_score']} for item in resp2['multi_classing']['prerequisites']]
            multi_class_dict['prerequisites'] = prerequisites
         if resp2['multi_classing'].get('proficiencies'):
            multi_class_dict['proficienies'] = [item['name'] for item in resp2['multi_classing']['proficiencies']]
         class_data['multi_classing'] = [multi_class_dict]
     
     if resp2.get('subclasses'):
         class_data['subclasses'] = [item['name'] for item in resp2['subclasses']]
     

     # At the end all of indices are saved into the dictionary.
     dict_of_classes[entry['index']] = class_data
pprint.pprint(dict_of_classes)


{'barbarian': {'class_levels': [{'ability_score_bonuses': 0,
                                 'class_specific': {'brutal_critical_dice': 0,
                                                    'rage_count': 2,
                                                    'rage_damage_bonus': 2},
                                 'features': ['Rage', 'Unarmored Defense'],
                                 'level': 1,
                                 'proficienciy_bonus': 2},
                                {'ability_score_bonuses': 0,
                                 'class_specific': {'brutal_critical_dice': 0,
                                                    'rage_count': 2,
                                                    'rage_damage_bonus': 2},
                                 'features': ['Reckless Attack',
                                              'Danger Sense'],
                                 'level': 2,
                                 'proficienciy_bonus': 2},
                

In [59]:
# This cell handles the subclasses the player can be in the API:
url = "https://www.dnd5eapi.co/api/2014/subclasses"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_subclasses = {}
subclass_data = {}

for entry in resp['results']:
     name = entry['name']

     time.sleep(DELAY)

     response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
     resp2 = response2.json()

     subclass_data = {
         'name': name,
         'class': resp2['class']['name'],
         'subclass_flavor': resp2['subclass_flavor'],
         'desc': preprocess_text(resp2['desc'])
    }
     
     time.sleep(DELAY)
     
      # This table includes annother link for each class the link to this varaible looks like this f.ex. 'https://www.dnd5eapi.co/api/2014/classes/barbarian/levels'
     response3 = requests.request("GET", url+f"/{entry['index']}/levels", headers=headers, data=payload)
     resp3 = response3.json()

     sublevel_changes = []

     for lvl_entries in resp3:
          sublevel_dict = {
               'level': lvl_entries['level'],
               'features': [item['name'] for item in lvl_entries['features']]
          }
          sublevel_changes.append(level_dict)

     if resp2.get('subclass_levels'):
         subclass_data['subclass_levels'] = level_changes
     # At the end all of indices are saved into the dictionary.
     dict_of_subclasses[entry['index']] = subclass_data
pprint.pprint(dict_of_subclasses)

{'berserker': {'class': 'Barbarian',
               'desc': 'For some barbarians  rage is a means to an end  that '
                       'end being violence. The Path of the Berserker is a '
                       'path of untrammeled fury  slick with blood. As you '
                       "enter the berserker's rage  you thrill in the chaos of "
                       'battle  heedless of your own health or well being.',
               'name': 'Berserker',
               'subclass_flavor': 'Primal Path',
               'subclass_levels': [{'ability_score_bonuses': 0,
                                    'class_specific': {'arcane_recovery_levels': 1},
                                    'features': ['Spellcasting: Wizard',
                                                 'Arcane Recovery'],
                                    'level': 1,
                                    'proficienciy_bonus': 2},
                                   {'ability_score_bonuses': 0,
                      

In [60]:
# This cell handles the establishment of the trait dictionary. This cell and all the ones below concerning the api follow the same structure in general:
# 1. The information is pulled out.
# 2. The relevant information is saved in a dictionary (entries like 'url' were ignored, because they don't contain relevant information)
# 3. The information for each entry gets saved in a bigger dictionary.

url = "https://www.dnd5eapi.co/api/2014/traits"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_traits = {}
trait_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    trait_data = {
         'name': name,
          'desc': "".join(preprocess_text(resp2['desc']))
    }
     # This structure tries to identify whether certain 'variables' exist in the table,
     # they follow the same structure but some don't have values stored in them and these ones are not supposed to be saved, so there is less data.
    if resp2.get('races'):
         trait_data['races'] = [item['name'] for item in resp2['races']]

    if resp2.get('subraces'):
         trait_data['subraces'] = [item['name'] for item in resp2['subraces']]

    if resp2.get('proficiencies'):
         trait_data['proficiencies'] = [item['name'] for item in resp2['proficiencies']]
     
     # At the end all of indices are saved into the dictionary.
    dict_of_traits[entry['index']] = trait_data
pprint.pprint(dict_of_traits)

{'artificers-lore': {'desc': 'Whenever you make an Intelligence History check '
                             'related to magic items  alchemical objects  or '
                             'technological devices  you can add twice your '
                             'proficiency bonus  instead of any proficiency '
                             'bonus you normally apply.',
                     'name': "Artificer's Lore",
                     'subraces': ['Rock Gnome']},
 'brave': {'desc': 'You have advantage on saving throw against being '
                   'frightened.',
           'name': 'Brave',
           'races': ['Halfling']},
 'breath-weapon': {'desc': 'You can use your action to exhale destructive '
                           'energy. Your draconic ancestry determines the '
                           'size  shape  and damage type of the exhalation. '
                           'When you use your breath weapon  each creature in '
                           'the area of the exhala

In [61]:
# This cell handles the establishment of the rule dictionary. 
url = "https://www.dnd5eapi.co/api/2014/rules"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_rules = {}
rule_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    rule_data = {
        'name' : name,
        'desc': "".join(preprocess_text(resp2['desc']))
    }
    # In some cases, the variables stored in the tables contain lists of dictionaries. So each dictionary is accessed and the name of the responding subsection is saved in a list.
    if resp2.get('subsections'):
         rule_data['subsection_in_rule_sections'] = [item['name'] for item in resp2['subsections']]

    dict_of_rules[entry['index']] = rule_data
pprint.pprint(dict_of_rules)

{'adventuring': {'desc': 'Adventuring',
                 'name': 'Adventuring',
                 'subsection_in_rule_sections': ['Time',
                                                 'Movement',
                                                 'The Environment',
                                                 'Traps',
                                                 'Diseases',
                                                 'Madness',
                                                 'Resting',
                                                 'Between Adventures']},
 'appendix': {'desc': 'Appendix',
              'name': 'Appendix',
              'subsection_in_rule_sections': ['Fantasy-Historical Pantheons',
                                              'The Planes of Existence']},
 'combat': {'desc': 'Combat',
            'name': 'Combat',
            'subsection_in_rule_sections': ['The Order of Combat',
                                            'Movement and Position',
     

In [62]:
# This cell handles the establishment of the spell dictionary.
url = "https://www.dnd5eapi.co/api/2014/spells"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_spells = {}
spell_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    spell_data = {
        'name' : name,
        'desc': "".join(preprocess_text(resp2['desc'])),
        'range': resp2['range'],
        'components':resp2['components'],
        'ritual':resp2['ritual'],
        'duration': resp2['duration'],
        'concentration': resp2['concentration'],
        'casting_time': resp2['casting_time'],
        'level': resp2['level'],
        'school_of_magic': [resp2['school']['name']],
        'classes': [item['name'] for item in resp2['classes']],
    }

    if resp2.get('material'):
         spell_data['material'] = resp2['material']

    if resp2.get('subclasses'):
        spell_data['subclasses'] = [item['name'] for item in resp2['subclasses']]

    if resp2.get('higher_level'):
        spell_data['higher_level'] = resp2['higher_level']

    if resp2.get('damage'):
        slot_level_damage = {}
        if 'damage_type' in resp2['damage']:
            spell_data['damage_type'] = [resp2['damage']['damage_type']['name']]
        if 'damage_at_slot_level' in resp2['damage']:
            for slot_level, damage in resp2['damage']['damage_at_slot_level'].items():
                slot_level_damage[slot_level] = damage
            spell_data['damage_at_slot_level'] = slot_level_damage 

    if resp2.get('atack_type'):
       spell_data['attack_type'] = resp2['attack_type']
    
    dict_of_spells[entry['index']] = spell_data
pprint.pprint(dict_of_spells)

{'acid-arrow': {'casting_time': '1 action',
                'classes': ['Wizard'],
                'components': ['V', 'S', 'M'],
                'concentration': False,
                'damage_at_slot_level': {'2': '4d4',
                                         '3': '5d4',
                                         '4': '6d4',
                                         '5': '7d4',
                                         '6': '8d4',
                                         '7': '9d4',
                                         '8': '10d4',
                                         '9': '11d4'},
                'damage_type': ['Acid'],
                'desc': 'A shimmering green arrow streaks toward a target '
                        'within range and bursts in a spray of acid. Make a '
                        'ranged spell attack against the target. On a hit  the '
                        'target takes 4d4 acid damage immediately and  d4 acid '
                        'damage at the end of 

In [63]:
# This cell handles the different races a player can be:
url = "https://www.dnd5eapi.co/api/2014/races"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_races = {}
race_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)

    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    bonus_dict = {}
    for item in resp2['ability_bonuses']:
        bonus_dict[item['ability_score']['name']] = item['bonus']
    
    race_data = {
        'name' : name,
        'speed': resp2['speed'],
        'ability_bonuses': bonus_dict,
        'alignment': resp2['alignment'],
        'age': resp2['age'],
        'size': resp2['size'],
        'size_description': resp2['size_description'],
        'languages': [item['name'] for item in resp2['languages']],
        'language_description': resp2['language_desc'],
        'traits': [item['name'] for item in resp2['traits']],
    }

    if resp2.get('subraces') :
         race_data['subraces'] = [item['name'] for item in resp2['subraces']]

    if resp2.get('starting_proficiencies'):
        race_data['starting_proficiencies'] =  [item['name'] for item in resp2['starting_proficiencies']]


        
    dict_of_races[entry['index']] = race_data

pprint.pprint(dict_of_races)

{'dragonborn': {'ability_bonuses': {'CHA': 1, 'STR': 2},
                'age': 'Young dragonborn grow quickly. They walk hours after '
                       'hatching, attain the size and development of a '
                       '10-year-old human child by the age of 3, and reach '
                       'adulthood by 15. They live to be around 80.',
                'alignment': 'Dragonborn tend to extremes, making a conscious '
                             'choice for one side or the other in the cosmic '
                             'war between good and evil. Most dragonborn are '
                             'good, but those who side with evil can be '
                             'terrible villains.',
                'language_description': 'You can speak, read, and write Common '
                                        'and Draconic. Draconic is thought to '
                                        'be one of the oldest languages and is '
                                       

In [None]:
url = "https://www.dnd5eapi.co/api/2014/subraces"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_subraces = {}
subraces_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    subraces_data = {
        'name' : name,
        'desc': "".join(preprocess_text(resp2['desc'])),
        'race': resp2['race']['name'],
        'ability_bonuses': bonus_dict,
        'racial_traits': [item['name'] for item in resp2['racial_traits']],
    }

    if resp2.get('starting_proficiencies'):
        subraces_data['starting_proficiencies'] = [item['name'] for item in resp2['starting_proficiencies']]

    if resp2.get('languages'):
        subraces_data['languages'] = resp2['languages']

    if resp2.get('language_options'):
        languages = resp2['language_options']['from']['options']
        language_names = [lang['item']['name'] for lang in languages]
        subraces_data['language_options'] = language_names

    bonus_dict = {}
    for item in resp2['ability_bonuses']:
        bonus_dict[item['ability_score']['name']] = item['bonus'] 

    dict_of_subraces[entry['index']] = subraces_data
    
pprint.pprint(dict_of_subraces)

{'high-elf': {'ability_bonuses': {'CHA': 2, 'INT': 1},
              'desc': 'As a high elf  you have a keen mind and a mastery of at '
                      'least the basics of magic. In many fantasy gaming '
                      'worlds  there are two kinds of high elves. One type is '
                      'haughty and reclusive  believing themselves to be '
                      'superior to non elves and even other elves. The other '
                      'type is more common and more friendly  and often '
                      'encountered among humans and other races.',
              'language_options': ['Dwarvish',
                                   'Giant',
                                   'Gnomish',
                                   'Goblin',
                                   'Halfling',
                                   'Orc',
                                   'Abyssal',
                                   'Celestial',
                                   'Draconic',
  

In [65]:
url = "https://www.dnd5eapi.co/api/2014/proficiencies"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_proficiencies = {}
proficiency_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    proficiency_data = {
        'name' : name,
        'type' : resp2['type'],
    }

    if resp2.get('classes'):
         proficiency_data['classes'] = [item['name'] for item in resp2['classes']]

    if resp2.get('races'):
        proficiency_data['races'] = [item['name'] for item in resp2['races']]

    dict_of_proficiencies[entry['index']] = proficiency_data
pprint.pprint(dict_of_proficiencies)


{'alchemists-supplies': {'name': "Alchemist's Supplies",
                         'type': "Artisan's Tools"},
 'all-armor': {'classes': ['Fighter', 'Paladin'],
               'name': 'All armor',
               'type': 'Armor'},
 'bagpipes': {'name': 'Bagpipes', 'type': 'Musical Instruments'},
 'battleaxes': {'name': 'Battleaxes', 'races': ['Dwarf'], 'type': 'Weapons'},
 'blowguns': {'name': 'Blowguns', 'type': 'Weapons'},
 'breastplate': {'name': 'Breastplate', 'type': 'Armor'},
 'brewers-supplies': {'name': "Brewer's Supplies", 'type': "Artisan's Tools"},
 'calligraphers-supplies': {'name': "Calligrapher's Supplies",
                            'type': "Artisan's Tools"},
 'carpenters-tools': {'name': "Carpenter's Tools", 'type': "Artisan's Tools"},
 'cartographers-tools': {'name': "Cartographer's Tools",
                         'type': "Artisan's Tools"},
 'chain-mail': {'name': 'Chain Mail', 'type': 'Armor'},
 'chain-shirt': {'name': 'Chain Shirt', 'type': 'Armor'},
 'clubs': {'cl

In [66]:
url = "https://www.dnd5eapi.co/api/2014/equipment"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()
dict_of_equipment = {}
equipment_data = {}

for entry in resp['results']:
    name_of_equip = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()
    
    equipment_data =  {
        'name' : name_of_equip,
        'equipment-category': resp2['equipment_category']['name'],
        'gear-category': resp2.get('gear_category',{}).get('name')
    }

    if resp2.get('desc'):
         equipment_data['desc'] = "".join(preprocess_text(resp2['desc']))

    if resp2.get('special'):
         equipment_data['special'] = resp2['special']

    if resp2.get('properties'):
         equipment_data['properties'] = [item['name'] for item in resp2['properties']]

    if resp2.get('contents'):
        equipment_data['contents'] = [{'name': item['item']['name']} for item in resp2['contents']]

    dict_of_equipment[entry['index']] = equipment_data
pprint.pprint(dict_of_equipment)


{'abacus': {'equipment-category': 'Adventuring Gear',
            'gear-category': 'Standard Gear',
            'name': 'Abacus'},
 'acid-vial': {'desc': 'As an action  you can splash the contents of this vial '
                       'onto a creature within 5 feet of you or throw the vial '
                       'up to  0 feet  shattering it on impact. In either '
                       'case  make a ranged attack against a creature or '
                       'object  treating the acid as an improvised weapon. On '
                       'a hit  the target takes  d6 acid damage.',
               'equipment-category': 'Adventuring Gear',
               'gear-category': 'Standard Gear',
               'name': 'Acid (vial)'},
 'alchemists-fire-flask': {'desc': 'This sticky  adhesive fluid ignites when '
                                   'exposed to air. As an action  you can '
                                   'throw this flask up to  0 feet  shattering '
                            

In [67]:
url = "https://www.dnd5eapi.co/api/2014/features"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_features = {}
feature_data = {}

for entry in resp['results']:
    
    time.sleep(DELAY)

    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    feature_data = {
        'name' : entry['name'],
        'desc' : "".join(preprocess_text(resp2['desc'])),
        'class': resp2['class']['name'],
        'level': resp2['level']
    }

    if resp2.get('prerequisites'):
        feature_data['prerequisites'] =  resp2['prerequisites']
    
    dict_of_features[entry['index']] = feature_data
pprint.pprint(dict_of_features)

{'action-surge-1-use': {'class': 'Fighter',
                        'desc': 'Starting at  nd level  you can push yourself '
                                'beyond your normal limits for a moment. On '
                                'your turn  you can take one additional action '
                                'on top of your regular action and a possible '
                                'bonus action. Once you use this feature  you '
                                'must finish a short or long rest before you '
                                'can use it again. Starting at 17th level  you '
                                'can use it twice before a rest  but only once '
                                'on the same turn.',
                        'level': 2,
                        'name': 'Action Surge (1 use)'},
 'action-surge-2-uses': {'class': 'Fighter',
                         'desc': 'Starting at  nd level  you can push yourself '
                                 'beyond you

In [68]:
url = "https://www.dnd5eapi.co/api/2014/magic-items"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_magic_items = {}
item_data = {}

for entry in resp['results']:
    name_of_item = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    item_data = {
        'name' : entry['name'],
        'desc' : "".join(preprocess_text(resp2['desc'])),
        'equipment-category': resp2['equipment_category']['name'],
        'rarity': resp2['rarity']['name'],
    }

    if resp2.get('variants'):
         item_data['variants'] = [item['name'] for item in resp2['variants']]

    dict_of_magic_items[entry['index']] = item_data
pprint.pprint(dict_of_magic_items)


{'adamantine-armor': {'desc': 'Armor medium or heavy  but not hide  uncommon '
                              'This suit of armor is reinforced with '
                              'adamantine  one of the hardest substances in '
                              "existence. While you're wearing it  any "
                              'critical hit against you becomes a normal hit.',
                      'equipment-category': 'Armor',
                      'name': 'Adamantine Armor',
                      'rarity': 'Uncommon'},
 'ammunition': {'desc': 'Weapon any ammunition  uncommon +1  rare +   or very '
                        'rare +3 You have a bonus to attack and damage rolls '
                        'made with this piece of magic ammunition. The bonus '
                        'is determined by the rarity of the ammunition. Once '
                        'it hits a target  the ammunition is no longer '
                        'magical.',
                'equipment-category': 'Ammuni

In [69]:
url = "https://www.dnd5eapi.co/api/2014/monsters"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_monsters = {}
monster_data = {}

for entry in resp['results']:
    name_of_monster = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    senses = {}
    for key, value in resp2['senses'].items():
        senses[key] = value
    
    movement = {}
    for key, value in resp2['speed'].items():
        movement[key] = value

    monster_data = {
        'name' : name_of_monster,
        'size': resp2['size'],
        'type':resp2['type'],
        'alignment': resp2['alignment'],
        'hit_points':resp2['hit_points'],
        'hit_dice':resp2['hit_dice'],
        'hit_points_roll': resp2['hit_points_roll'],
        'speed': movement,
        'strength': resp2['strength'],
        'dexterity':resp2['dexterity'],
        'constitution': resp2['constitution'],
        'intelligence': resp2['intelligence'],
        'wisdom': resp2['wisdom'],
        'charisma': resp2['charisma'],
        'senses': senses,
        'languages': resp2['languages'],
        'challenge_rating': resp2['challenge_rating'],
        'proficiency_bonus': resp2['proficiency_bonus'],
        'gained_experience': resp2['xp']
    }

    if resp2.get('armor_class'):
       armor_class = {}
       for item in resp2['armor_class']:
            armor_class[item['type']] = item['value']
       monster_data['armor_class'] = armor_class

    if resp2.get('damage_vulnerabilities'):
        monster_data['damage_vulnerabilites'] = resp2['damage_vulnerabilities']

    if resp2.get('damage_resistances'):
        monster_data['damage_resistances'] = resp2['damage_resistances']
    
    if resp2.get('damage_immunities'):
        monster_data['damage_immunities'] = resp2['damage_immunities']

    if resp2.get('condition_immunities'):
        monster_data['condition_immunities'] = [item['name'] for item in resp2['condition_immunities']]

    if resp2.get('special_abilites'):
        special = {}
        for items in resp2['special_abilities']:
            special['name'] = items['name']
            special['desc'] = items['desc']
            if 'damage' in items:
                special['damage'] = items['damage']
            if 'dc' in items:
                dc = {}
                dc['name'] = items['dc']['dc_type']['name']
                dc['value'] = items['dc']['dc_value']
                special['dc'] = dc
        monster_data['special_abilities'] = special
    
    if resp2.get('actions'):
        for items in resp2['actions']:
            actions = {}
            actions['name'] = items['name']
            actions['desc'] = items['desc']
        monster_data['actions'] = actions

    if resp2.get('legendary_actions'):
        legendary = {}
        for items in resp2['legendary_actions']:
            legendary['name'] = items['name']
            legendary['action_desc'] = items['desc']
        monster_data['legendary_actions'] = legendary

    if resp2.get('forms'):
        monster_data['forms'] = [item['name'] for item in resp2['forms']]

    if resp2.get('reactions'):
        reactions = {}
        for item in resp2['reactions']:
             reactions['name'] = item['name']
             reactions['desc'] = item['desc']
        monster_data['reactions'] = reactions
    
    proficiency_monster = {}
    if resp2.get('proficiencies'):
        for items in resp2['proficiencies']:
            proficiency_monster[items['proficiency']['name']] = items['value']
        monster_data['proficiencies'] = proficiency_monster

    dict_of_monsters[entry['index']] = monster_data
# pprint.pprint(dict_of_monsters)



In [None]:
url = "https://www.dnd5eapi.co/api/2014/equipment-categories"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_equipment_categories = {}
category_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    category_data = {
        'name' : name
    }

    if resp2.get('equipment'):
        category_data['type'] = [item['name'] for item in resp2['equipment']]

   
    dict_of_equipment_categories[entry['index']] = category_data

pprint.pprint(dict_of_equipment_categories)


{'adventuring-gear': {'name': 'Adventuring Gear',
                      'type': ['Abacus',
                               'Acid (vial)',
                               "Alchemist's fire (flask)",
                               'Arrow',
                               'Blowgun needle',
                               'Crossbow bolt',
                               'Sling bullet',
                               'Alms box',
                               'Amulet',
                               'Antitoxin (vial)',
                               'Backpack',
                               'Ball bearings (bag of 1,000)',
                               'Barrel',
                               'Basket',
                               'Bedroll',
                               'Bell',
                               'Blanket',
                               'Block and tackle',
                               'Block of incense',
                               'Book',
                               'B

In [None]:
url = "https://www.dnd5eapi.co/api/2014/backgrounds"
response = requests.request("GET", url, headers=headers, data=payload)
resp = response.json()

dict_of_backgrounds = {}
background_data = {}

for entry in resp['results']:
    name = entry['name']

    time.sleep(DELAY)
    
    response2 = requests.request("GET", url+f"/{entry['index']}", headers=headers, data=payload)
    resp2 = response2.json()

    background_data = {
        'name' : name
    }

    if resp2.get('starting_proficiencies'):
        background_data['starting_proficiencies'] = [item['name'] for item in resp2['starting_proficiencies']]

    if resp2.get('language_options'):
        background_data['language_options'] = resp2['language_options']['choose']

    if resp2.get('starting_equipment'):
        background_data['starting_equipment'] = [item['equipment']['name'] for item in resp2['starting_equipment']]

    if resp2.get('starting_equipment_options'):
        starting_equip_options = {}
        starting_equip_options['choose'] = [item['choose'] for item in resp2['starting_equipment_options']]
        starting_equip_options['equipment_options'] = [item['from']['equipment_category']['name'] for item in resp2['starting_equipment_options']]
        background_data['starting_equipment_options'] = starting_equip_options

    personality = { 'options':[] } 
    personality['amount_of_options'] = resp2['personality_traits']['choose'] 
    for item in resp2['personality_traits']['from']['options']: 
        personality['options'].append(item['string'])
    background_data['personality_traits'] = personality

    if resp2.get('feature'):
        feat_dict = {}
        feat_dict['name'] = resp2['feature']['name']
        feat_dict['desc'] =  "".join(resp2['feature']['desc'])
        background_data['feature'] = feat_dict

    if resp2.get('ideals'):
        ideals = {}
        ideals['choose'] = resp2['ideals']['choose']
        ideal_option = {}
        ideals['possible_ideals'] = []
        for item in resp2['ideals']['from']['options']:
            ideal_option['desc'] = item['desc']
            ideal_option['alignments'] = [item['name'] for item in item['alignments']]
            ideals['possible_ideals'].append(ideal_option)
        background_data['ideals'] = ideals
    
    if resp2.get('bonds'):
        bonds = {}
        bonds['choose'] = resp2['bonds']['choose']
        bonds['bond_options'] = [item['string'] for item in resp2['bonds']['from']['options']]
        background_data['bonds'] = bonds

    if resp2.get('flaws'):
        flaws = {}
        flaws['choose'] = resp2['flaws']['choose']
        flaws['flaw_options'] = [item['string'] for item in resp2['flaws']['from']['options']]
        background_data['flaws'] = flaws

    dict_of_backgrounds[entry['index']] = background_data

pprint.pprint(dict_of_backgrounds)


{'acolyte': {'bonds': {'bond_options': ['I would die to recover an ancient '
                                        'relic of my faith that was lost long '
                                        'ago.',
                                        'I will someday get revenge on the '
                                        'corrupt temple hierarchy who branded '
                                        'me a heretic.',
                                        'I owe my life to the priest who took '
                                        'me in when my parents died.',
                                        'Everything I do is for the common '
                                        'people.',
                                        'I will do anything to protect the '
                                        'temple where I served.',
                                        'I seek to preserve a sacred text that '
                                        'my enemies consider heretical and '
 

In [104]:
# Now all dicts will be saved into the api_data.json. In order to structure the data in the json itself, a new dict is constructed, saving each dictionary under a thematically responding key. 
file_path = 'api_data/api_data.json'
json_dict = {
            'rules': dict_of_rules,
            'rule_sections': dict_of_rule_sections,
            'races': dict_of_races,
            'subraces': dict_of_subraces,
            'classes': dict_of_classes,
            'subclasses': dict_of_subclasses,
            'skills': dict_of_skills,
            'feats': dict_of_feats,
            'languages': dict_of_languages,
            'ability_scores': dict_of_ability_scores,
            'traits': dict_of_traits,
            'proficiencies': dict_of_proficiencies,
            'features': dict_of_features,
            'example_character_background': dict_of_backgrounds,
            'conditions': dict_of_conditions,
            'equipment': dict_of_equipment,
            'equipment_categories': dict_of_equipment_categories,
            'weapon_properties': dict_of_weapon_properties,
            'magic_items': dict_of_magic_items,
            'magic_schools': dict_of_magic_schools,
            'damage_types': dict_of_damage_types,
            'spells': dict_of_spells,
            'monsters': dict_of_monsters
        }

with open(file_path, 'w') as f:
    # The previously constructed dictionary is written to the json file:
    json.dump(json_dict,
        indent=4, # For better readability and visible structure four indents are added.
        ensure_ascii=False, # This is set to false, so f.ex. apostrophes aren't converted.
        fp=f
    )
    f.close()



### The next steps ###

What has to be done next is create a dataset and then document store out of our completed json-file, that later is used to retrieve information. However to make the important field 'desc' and 'name' our later retrieved information source and the other fields our meta-data-fields, we need our json-dict to follow the format:

dict: {
    'content': 'desc',
    'meta_data': every other field containig information
}

Also a new meta-data field called 'category' is added for better response filtering later on. The category variable orients itself on the key given to each dictionary entry in the previous dictionary.

### The sentence and later retrieval transformer: ###
multi-qa-distilbert-cos-v1 	("This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search" - https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) - as it has a word limit of 512 word, before writing the documents to the Document store they are split in accordingly sized token chunks with a token overlap of 50.

In [96]:
# File path variables.
file_path_rag = 'api_data/rag_data.json'
file_path = 'api_data/api_data.json'

In [97]:
# In order to not reaccess the api and reload each dictionary, the already structured file is used to re-structure the rag-file into the desired format:
with open(file_path, 'r') as f:
    api_info = json.load(f)
    f.close()

# Every category that saves each loaded dict is added into the metadata to relieve later filtering.
def convertToRAGFormat (information):
    expected_docs = []
    for category, dicts in information.items():
        for index, items in dicts.items():
            later_content = []
            later_content.append(items.get('name'))
            if items.get('desc') and items.get('desc') != items.get('name'):
                later_content.append(items.get('desc'))

            meta_info = {intern_key: intern_value for intern_key, intern_value in items.items() if intern_key != 'desc'}
            each_doc = {
                'content': '. '.join(later_content),
                'meta': {**meta_info,'category': category}
            }
            expected_docs.append(each_doc)
    
    return expected_docs

rag_docs = convertToRAGFormat(api_info)
with open(file_path_rag, 'w') as fr:
    json.dump(rag_docs, indent=4, ensure_ascii=False, fp=fr)
    fr.close()
    # The ascii-encoding is set to false, so f.ex. apostrophes aren't converted and can later be filtered if neccessary.
    # For better readability and visible structure four indents are added.



In [98]:
# Initializing Pipeline parts:
document_store = InMemoryDocumentStore()
document_joiner = DocumentJoiner(join_mode='reciprocal_rank_fusion')
document_splitter = DocumentSplitter(split_by="word", split_length=512, split_overlap=50)

In [99]:
# In order to be able to use the LLM, that api key is used here:
os.environ["GOOGLE_API_KEY"] = 'AIzaSyD3Bb1km908nqdn39vE_0RT-hhWHFtcOJ4'

In [100]:
# In order to save each entry in the dicts, the file is re-opened and every entry is saved as a document with the new format that was previously constructed, strucutring the document into 'content' and 'meta'-data.
with open(file_path_rag, 'r') as f:
    dataset = json.load(f)
    f.close()

docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]
print(len(docs)) # In order to check whether some docs have been lost, the length of the docs-list will be printed out.

# In order to be able to use every single meta-data key entry that is supposed to be embedded with the corresponding content, every single key is added to a set.
# First an empty set is initialized.
meta_keys = set()
# Then for every document in the document-list, the keys are added to the set, if they aren't already contained via the update()-method.
for doc in docs:
    meta_keys.update(doc.meta.keys())
# After that the set is converted to a list, so that it can be added to the doc_embedder so that all the meta-fields are includded im the embdding too.
meta_keys = list(meta_keys)

# If desired they can be looked at here:
# print(meta_keys)

1985


In [101]:
# Now the entries have to be embedded with an embedder:
doc_embedder = SentenceTransformersDocumentEmbedder(model='multi-qa-distilbert-cos-v1', meta_fields_to_embed=meta_keys)
doc_embedder.warm_up()


In [102]:
# Before embedding the documents and adding them to the document store, they are split into chunks with the document_splitter:
split_docs = document_splitter.run(docs)
docs_w_embeddings = doc_embedder.run(split_docs['documents'])

Batches: 100%|██████████| 63/63 [02:00<00:00,  1.91s/it]


In [None]:
# Because it needs to be checked whether all metadata was considered:
embedded_docs = docs_w_embeddings['documents']
# with this, the correct length of all embeddings can be checked and they are all 768 dimension long as described in the official documentation: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
#for doc in embedded_docs:
    #print(len(doc.embedding))
print(embedded_docs)


In [83]:
# All the embedded documents are added to the document store:
document_store.write_documents(docs_w_embeddings["documents"])

2010

# RAG-Pipeline #

This pipeline contains apart from the standard parts (textual embedder, llm, promptbuilder, retriever) a BM25-retriever to construct a hybrid search as well as dense retriever in order to boost results.
 Results from both retrievers get joined with the Document joiner and ranked according to their score.
However even with the hybrid search a filtering mechanism is still needed. Without a filtering mechanism, a reliable finding of resources will not work reliably because of the dynamic and homogenous naming of metadata fields. 

Queries like : 'What races can I play as?' return every single document, that somewhere contains the word 'race' in it's meta-data. This is often the case when looking at race-related skill or race-related weapons or classes. To alleviate this effect the key 'catgegory' from the api_data.json has been selected to function as a filter. 

An improved filtering mechanism could be possible with for example a multi-label classifier, however this would need a lot of training data, which is not accessible in this contenxt. Another mechanism other than hard-coding filtering rules, would be by letting an LLM decide which category/categories the question falls into. As our chosen LLM only provides 5 calls per day (similar to other free plans of LLMs), we did not integrate this.

In [None]:
template = [
    ChatMessage.from_user(
        """
You are a D&D expert. Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
    Metadata:
    {% if document.meta %}
        {% for key, value in document.meta.items() %}
            {{ key }}: {{ value }}
        {% endfor %}
    {% endif %}
{% endfor %}


Question: {{question}}
Answer:
"""
    )
]

prompt_builder_hybrid = ChatPromptBuilder(template=template)

ChatPromptBuilder has 3 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.


In [85]:
chat_generator_hybrid = GoogleGenAIChatGenerator(model="gemini-2.0-flash")

In [86]:
cross_model = 'cross-encoder/ms-marco-TinyBERT-L2-v2'

text_embedder = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')
text_embedder_retr = SentenceTransformersTextEmbedder(model='multi-qa-distilbert-cos-v1')

embedding_retriever = InMemoryEmbeddingRetriever(document_store)
embedding_retriever_retr = InMemoryEmbeddingRetriever(document_store)

bm25_retriever = InMemoryBM25Retriever(document_store)
bm25_retriever_retr = InMemoryBM25Retriever(document_store)

ranker = SentenceTransformersSimilarityRanker(model=cross_model)
ranker_retr = SentenceTransformersSimilarityRanker(model=cross_model)

document_joiner_retr = DocumentJoiner(join_mode='reciprocal_rank_fusion')

In [87]:
# Complete pipeline including the LLM
hybrid_retrieval = Pipeline()
hybrid_retrieval.add_component("text_embedder", text_embedder)
hybrid_retrieval.add_component("embedding_retriever", embedding_retriever)
hybrid_retrieval.add_component("bm25_retriever", bm25_retriever)
hybrid_retrieval.add_component("document_joiner", document_joiner)
hybrid_retrieval.add_component("ranker", ranker)

# new:
hybrid_retrieval.add_component("prompt_builder", prompt_builder_hybrid)
hybrid_retrieval.add_component("llm", chat_generator_hybrid)

hybrid_retrieval.connect("text_embedder", "embedding_retriever")
hybrid_retrieval.connect('bm25_retriever','document_joiner')
hybrid_retrieval.connect('embedding_retriever', 'document_joiner')
hybrid_retrieval.connect("document_joiner", "ranker")

# new:
hybrid_retrieval.connect("ranker", "prompt_builder")
hybrid_retrieval.connect("prompt_builder.prompt", "llm.messages")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000016993410850>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
  - prompt_builder: ChatPromptBuilder
  - llm: GoogleGenAIChatGenerator
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])
  - ranker.documents -> prompt_builder.documents (list[Document])
  - prompt_builder.prompt -> llm.messages (list[ChatMessage])

In [88]:
# In order to evaluate the results from retrieval, the exact same pipeline was built in order to acces the ranked results before the LLM tries to build a prompt:
hb_nollm_pipeline = Pipeline()
hb_nollm_pipeline.add_component("text_embedder", text_embedder_retr)
hb_nollm_pipeline.add_component("embedding_retriever", embedding_retriever_retr)
hb_nollm_pipeline.add_component("bm25_retriever", bm25_retriever_retr)
hb_nollm_pipeline.add_component("document_joiner", document_joiner_retr)
hb_nollm_pipeline.add_component("ranker", ranker_retr)

hb_nollm_pipeline.connect("text_embedder", "embedding_retriever")
hb_nollm_pipeline.connect('bm25_retriever','document_joiner')
hb_nollm_pipeline.connect('embedding_retriever', 'document_joiner')
hb_nollm_pipeline.connect("document_joiner", "ranker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x0000016993403CA0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - embedding_retriever: InMemoryEmbeddingRetriever
  - bm25_retriever: InMemoryBM25Retriever
  - document_joiner: DocumentJoiner
  - ranker: SentenceTransformersSimilarityRanker
🛤️ Connections
  - text_embedder.embedding -> embedding_retriever.query_embedding (list[float])
  - embedding_retriever.documents -> document_joiner.documents (list[Document])
  - bm25_retriever.documents -> document_joiner.documents (list[Document])
  - document_joiner.documents -> ranker.documents (list[Document])

In [22]:
query = "Want to create a new character and I want to make a hollow one dwarf. So i see in lineage how to add a hollow one and it says if I add this to what my race is I get the traits of the hollow one and my dwarf, but how do you add hollow one to the race? I don't see a button or link. I see in hollow one I can 2 skills but that's it. I tried custom lineage also and didn't see anything there either. Am I missing something?"
result = hb_nollm_pipeline.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query, "top_k":5}, "embedding_retriever": {"top_k": 5}, "ranker": {"query": query}}
)
for doc in result['ranker']['documents']:
    print("Content:", doc.content)
    print("Metadata:", doc.meta['category'])
    print('final document score:', doc.score)
    print("----")

Batches: 100%|██████████| 1/1 [00:00<00:00, 19.71it/s]

Content: Reincarnate. You touch a dead humanoid or a piece of a dead humanoid. Provided that the creature has been dead no longer than 10 days  the spell forms a new adult body for it and then calls the soul to enter that body. If the target's soul isn't free or willing to do so  the spell fails. The magic fashions a new body for the creature to inhabit  which likely causes the creature's race to change. The GM rolls a d 100 and consults the following table to determine what form the creature takes when restored to life  or the GM chooses a form. | d100 | Race | |   |   | | 01 04 | Dragonborn | | 05 13 | Dwarf  hill | | 14  1 | Dwarf  mountain | |     5 | Elf  dark | |  6 34 | Elf  high | | 35 4  | Elf  wood | | 43 46 | Gnome  forest | | 47 5  | Gnome  rock | | 53 56 | Half elf | | 57 60 | Half orc | | 61 68 | Halfling  lightfoot | | 69 76 | Halfling  stout | | 77 96 | Human | | 97 00 | Tiefling | The reincarnated creature recalls its former life and experiences. It retains the capabil




In [91]:
query = "How do you add hollow one to the race? I don't see a button or link. I see in hollow one I can 2 skills but that's it. I tried custom lineage also and didn't see anything there either."

result = hybrid_retrieval.run(
    {"text_embedder": {"text": query}, "bm25_retriever": {"query": query}, "ranker": {"query": query}}
)
# hybrid_retrieval.draw("hybrid-retrieval.png")
print(result["llm"]["replies"][0])

Batches: 100%|██████████| 1/1 [00:00<00:00, 20.19it/s]


PipelineRuntimeError: The following component failed to run:
Component name: 'prompt_builder'
Component type: 'ChatPromptBuilder'
Error: 'str object' has no attribute 'meta'

Batches: 100%|██████████| 1/1 [00:00<00:00, 42.72it/s]

Content: Draconic Ancestry (White). You have draconic ancestry. Choose one type of dragon from the Draconic Ancestry table. Your breath weapon and damage resistance are determined by the dragon type  as shown in the table.
Metadata: traits
similarity/fitness: 0.5011509764689459
----
Content: Darkvision. You have superior vision in dark and dim conditions. You can see in dim light within 60 feet of you as if it were bright light  and in darkness as if it were dim light. You cannot discern color in darkness  only shades of gray.
Metadata: traits
similarity/fitness: 0.5011092888124974
----
Content: Draconic Ancestry (Black). You have draconic ancestry. Choose one type of dragon from the Draconic Ancestry table. Your breath weapon and damage resistance are determined by the dragon type  as shown in the table.
Metadata: traits
similarity/fitness: 0.5011036141695698
----
Content: Draconic Ancestry (Green). You have draconic ancestry. Choose one type of dragon from the Draconic Ancestry table




## Applying filters ##

Another more efficient way to automatically assign the correct category would be the using multi-label classification. This could assign 2 or more fitting labels to search queries. However as we don't have enough queries and data to train, the queries were constructed based on keyword-filters:

In [None]:
def extract_filters(question: str):
    if "race" in question.lower() or "races" in question.lower():
        print('race was chosen')
        return  {"field": "meta.category", "operator": "==", "value": "races"}
    if "weapon" in question.lower():
        print('weapon was chosen')
        return {"field": "meta.category", "operator": "==", "value": "equipment-categories"}
    if "spell" in question.lower():
        return {"field": "meta.category", "operator": "==", "value": "spells"}
    return {}

## Evaluation ## 

As the API used for this RAG QA pipeline is from 2014 and there have been some changes including new releases in the game and altercations, we worked with the API and therefore need to use queries that were real and similar in content. 
Some were from https://www.dndbeyond.com/?msockid=201f823801c3644c068896b30048651a

As there aren't that many ways to get to the queries: 
- 3-5 queries per category (23 categories)

Then haystack built in evaluation for the retrieval:
- Recall@k and Precision@k
- MMR 

For the LLM we used: 
- 

In [28]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ Want to create a new character and I want to make a hollow one dwarf. So i see in lineage how to add a hollow one and it says if I add this to what my race is I get the traits of the hollow one and my dwarf, but how do you add hollow one to the race? I don't see a button or link. I see in hollow one I can 2 skills but that's it. I tried custom lineage also and didn't see anything there either. Am I missing something?"""
print(summarizer(ARTICLE, max_length=20, min_length=10, do_sample=False))

Device set to use cpu


[{'summary_text': "How do you add hollow one to the race? I don't see a button or"}]
