# LoreSmith

**LoreSmith** is an AI-powered search and retrieval system designed to help Dungeon Masters and players quickly uncover the rules, spells, and monsters hidden within the Dungeons & Dragons 5e System Reference Document (SRD). By combining semantic embeddings, hybrid search, and generative summarization, LoreSmith transforms raw SRD text into fast, accurate, and citation-grounded answers.

**Forge Fast Answers**: Ask natural questions like “Which CR 5 monsters resist poison?” or “What level 3 spells can heal multiple allies?” and get concise, structured answers.

**Rule-Accurate Guidance**: Every response is grounded in official SRD text, with citations for transparency.

**Choose-Your-Own-Adventure Search**: After each answer, LoreSmith suggests next steps—filter by class, level, challenge rating, environment, or even compare options side by side.

Built for Table Use: LoreSmith saves time during play sessions by replacing page-flipping with instant, context-rich search.

With LoreSmith, Dungeon Masters can focus less on rules lookups and more on storytelling, while players gain a reliable oracle to guide their adventures.

In [1]:
# Clean Install (run in a fresh kernel)
!pip uninstall -y keras tensorflow tensorflow-intel tf-keras protobuf || true

# CPU-only PyTorch + pinned, TF-free stack + protobuf<5
!pip install --upgrade --no-cache-dir \
  torch torchvision torchaudio \
  "protobuf<5" \
  "transformers==4.43.3" \
  "sentence-transformers==2.7.0" \
  "chromadb==0.5.5" \
  "rank_bm25==0.2.2" \
  "ipywidgets==8.1.2" \
  "tabulate==0.9.0" \
  "openai>=1.40.0"

# (Classic Notebook users only) enable widgets
# In JupyterLab 3+ you don't need this.
!jupyter nbextension enable --py widgetsnbextension -y || true

[0mFound existing installation: protobuf 4.25.8
Uninstalling protobuf-4.25.8:
  Successfully uninstalled protobuf-4.25.8
Collecting protobuf<5
  Downloading protobuf-4.25.8-cp37-abi3-macosx_10_9_universal2.whl.metadata (541 bytes)
Downloading protobuf-4.25.8-cp37-abi3-macosx_10_9_universal2.whl (394 kB)
Installing collected packages: protobuf
Successfully installed protobuf-4.25.8
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
               [--paths] [--json] [--debug]
               [subcommand]

Jupyter: Interactive Computing

positional arguments:
  subcommand     the subcommand to launch

options:
  -h, --help     show this help message and exit
  --version      show the versions of core jupyter packages and exit
  --config-dir   show Jupyter config dir
  --data-dir     show Jupyter data dir
  --runtime-dir  show Jupyter runtime dir
  --paths        show all Jupyter paths. Add --json for machine-readable
                 format.
  --json         outpu

### Setup

In [2]:
import os, json, math, hashlib, textwrap
from dataclasses import dataclass
from typing import List, Dict, Any, Optional

import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
from IPython.display import display, Markdown, HTML
import ipywidgets as W
from tabulate import tabulate

import pandas as pd

from rank_bm25 import BM25Okapi
import chromadb
from chromadb.utils import embedding_functions

In [11]:
# Notebook config
PERSIST_DIR = "./chroma_srd"
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
RERANK_MODEL_NAME = "BAAI/bge-reranker-large"

os.environ["OPENAI_API_KEY"] = "sk-...yourOpenAIkey"

USE_OPENAI = True  # set True if you want OpenAI for generation which we do

### Load SRD Data (Spells & Monsters)

In [42]:
# Kaggle raw file URLs
SPELLS_CSV_URL = "https://raw.githubusercontent.com/wjsutton/games_night_viz/main/challenges/9_bonus_challenges/202206_datafamcon_dnd/dnd-spells.csv"
MONSTERS_CSV_URL = "https://raw.githubusercontent.com/wjsutton/games_night_viz/main/challenges/9_bonus_challenges/202206_datafamcon_dnd/dnd_monsters.csv"

spells_df = pd.read_csv(SPELLS_CSV_URL)
monsters_df = pd.read_csv(MONSTERS_CSV_URL)

print("Spells:", spells_df.shape)
print("Monsters:", monsters_df.shape)

Spells: (554, 12)
Monsters: (762, 17)


In [43]:
spells_df.head()

Unnamed: 0,name,classes,level,school,cast_time,range,duration,verbal,somatic,material,material_cost,description
0,Acid Splash,"Artificer, Sorcerer, Wizard",0,Conjuration,1 Action,60 Feet,Instantaneous,1,1,0,,You hurl a bubble of acid. Choose one creature...
1,Blade Ward,"Bard, Sorcerer, Warlock, Wizard",0,Abjuration,1 Action,Self,1 round,1,1,0,,You extend your hand and trace a sigil of ward...
2,Booming Blade,"Artificer, Sorcerer, Warlock, Wizard",0,Evocation,1 Action,Self (5-foot radius),1 round,0,1,1,a melee weapon worth at least 1 sp,You brandish the weapon used in the spell’s ca...
3,Chill Touch,"Sorcerer, Warlock, Wizard",0,Necromancy,1 Action,120 Feet,1 round,1,1,0,,"You create a ghostly, skeletal hand in the spa..."
4,Control Flames,"Druid, Sorcerer, Wizard",0,Transmutation,1 Action,60 Feet,Instantaneous or 1 hour,0,1,0,,You choose nonmagical flame that you can see w...


In [44]:
monsters_df.head()

Unnamed: 0,name,url,cr,type,size,ac,hp,speed,align,legendary,source,str,dex,con,int,wis,cha
0,aarakocra,https://www.aidedd.org/dnd/monstres.php?vo=aar...,1/4,humanoid (aarakocra),Medium,12,13,fly,neutral good,,Monster Manual (BR),10.0,14.0,10.0,11.0,12.0,11.0
1,abjurer,,9,humanoid (any race),Medium,12,84,,any alignment,,Volo's Guide to Monsters,,,,,,
2,aboleth,https://www.aidedd.org/dnd/monstres.php?vo=abo...,10,aberration,Large,17,135,swim,lawful evil,Legendary,Monster Manual (SRD),21.0,9.0,15.0,18.0,15.0,18.0
3,abominable-yeti,,9,monstrosity,Huge,15,137,,chaotic evil,,Monster Manual,,,,,,
4,acererak,,23,undead,Medium,21,285,,neutral evil,,Adventures (Tomb of Annihilation),,,,,,


### Normalize and Unify Records with Chunking

In [45]:
def get_col(row, *names, default=None):
    for n in names:
        if n in row and pd.notna(row[n]):
            return row[n]
    return default

def as_list(val):
    if val is None or (isinstance(val, float) and math.isnan(val)): return []
    if isinstance(val, list): return val
    return [x.strip() for x in str(val).split(",") if x and str(x).strip()]

# Spells → records
def record_from_spell(r):
    title = get_col(r, "name", "Name", default="").strip()
    level = get_col(r, "level", "Level", default=None)
    school = get_col(r, "school", "School", default=None)
    classes = as_list(get_col(r, "classes", "Classes", "class", "Class", default=""))
    rng = get_col(r, "range", "Range", default=None)
    duration = get_col(r, "duration", "Duration", default=None)
    components = get_col(r, "components", "Components", default=None)
    text = (
        get_col(r, "text", "desc", "Desc", "description", "Description", default="") 
        or ""
    )
    casting_time = get_col(r, "casting_time", "Casting Time", "Cast Time", default=None)
    extra = []
    if casting_time: extra.append(f"Casting Time: {casting_time}")
    if rng: extra.append(f"Range: {rng}")
    if duration: extra.append(f"Duration: {duration}")
    if components: extra.append(f"Components: {components}")
    if extra: text = "\n".join(extra) + ("\n\n" + text if text else "")

    return {
        "id": f"spell:{title.lower().replace(' ','_')}",
        "title": title,
        "metadata": {
            "doc_type": "spell",
            "level": int(level) if str(level).isdigit() else level,
            "school": school, 
            "classes": classes,
            "range": rng, "duration": duration, "components": components,
            "source": "SRD Spells (GitHub CSV)"
        },
        "content": str(text).strip()
    }

# Monsters → records
def record_from_monster(r):
    title = get_col(r, "name", "Name", default="").strip()
    cr = get_col(r, "cr", "CR", "challenge_rating", default=None)
    mtype = get_col(r, "type", "Type", default=None)
    size = get_col(r, "size", "Size", default=None)
    ac = get_col(r, "ac", "Armor Class", "armor_class", default=None)
    hp = get_col(r, "hp", "Hit Points", "hit_points", default=None)
    resist = as_list(get_col(r, "resistances", "Damage Resistances", default=""))
    immun = as_list(get_col(r, "immunities", "Damage Immunities", default=""))
    env = as_list(get_col(r, "environment", "Environments", default=""))
    traits = get_col(r, "traits", "Traits", default="")
    actions = get_col(r, "actions", "Actions", default="")
    legendary = get_col(r, "legendary_actions", "Legendary Actions", default="")
    desc = get_col(r, "text", "desc", "Description", default="")

    parts = []
    if traits: parts.append(f"Traits:\n{traits}")
    if actions: parts.append(f"Actions:\n{actions}")
    if legendary: parts.append(f"Legendary Actions:\n{legendary}")
    if desc: parts.append(str(desc))
    content = "\n\n".join([p for p in parts if str(p).strip()])

    return {
        "id": f"monster:{title.lower().replace(' ','_')}",
        "title": title,
        "metadata": {
            "doc_type": "monster",
            "cr": float(cr) if str(cr).replace(".","",1).isdigit() else cr,
            "type": mtype, "size": size, "ac": ac, "hp": hp,
            "resistances": resist, "immunities": immun, "environments": env,
            "source": "SRD Monsters (GitHub CSV)"
        },
        "content": content or str(desc) or ""
    }

spell_records = [record_from_spell(r._asdict() if hasattr(r,'_asdict') else r) for _, r in spells_df.iterrows()]
monster_records = [record_from_monster(r._asdict() if hasattr(r,'_asdict') else r) for _, r in monsters_df.iterrows()]
records = [*spell_records, *monster_records]

def chunk_long(rec, max_chars=2000, overlap=200):
    txt = rec["content"] or ""
    if len(txt) <= max_chars: return [rec]
    out, i, start = [], 0, 0
    while start < len(txt):
        end = min(len(txt), start + max_chars)
        part = txt[start:end]
        c = dict(rec)
        c["id"] = f"{rec['id']}::part{i}"
        c["content"] = part
        out.append(c)
        start = max(end - overlap, start + 1)
        i += 1
    return out

chunks = []
for rec in records:
    chunks.extend(chunk_long(rec, max_chars=2000, overlap=200))

len(records), len(chunks)

(1316, 6140)

In [53]:
# Heuristic enrichment from monster free text

import re
import pandas as pd
from collections import Counter

# Build a text blob per monster row by concatenating all string columns
if "___blob" not in monsters_df.columns:
    str_cols = [c for c in monsters_df.columns if monsters_df[c].dtype == object]
    def row_blob(row):
        parts = []
        for c in str_cols:
            v = row.get(c)
            if isinstance(v, str) and v.strip():
                parts.append(v)
        return " ".join(parts)
    monsters_df["___blob"] = monsters_df.apply(row_blob, axis=1)

# Regex helpers
def _split_terms(s):
    parts = re.split(r"[;,]| and | or ", s, flags=re.I)
    return [p.strip().lower() for p in parts if p.strip()]

RES_PATTS = [
    r"\bresistant\s+to\s+([A-Za-z ,;/\-]+)",
    r"\bresistance(?:s)?\s+to\s+([A-Za-z ,;/\-]+)",
]
IMM_PATTS = [
    r"\bimmune\s+to\s+([A-Za-z ,;/\-]+)",
    r"\bimmunity(?:ies)?\s+to\s+([A-Za-z ,;/\-]+)",
]
ENV_KEYWORDS = {
    "abyss": ["abyss", "abyssal"],
    "underground": ["underground", "cavern", "cave", "dungeon"],
    "mountains": ["mountain", "mountains"],
    "swamp": ["swamp", "bog", "marsh"],
    "forest": ["forest", "woods", "jungle"],
    "arctic": ["arctic", "tundra", "snow", "ice"],
    "desert": ["desert", "dune", "wastes"],
    "coastal": ["coast", "coastal", "shore", "beach"],
    "urban": ["city", "town", "urban", "sewer"],
}

def extract_resists(blob):
    bag = []
    for p in RES_PATTS:
        for m in re.finditer(p, blob, flags=re.I):
            bag += _split_terms(m.group(1))
    return sorted(set(bag))

def extract_immunes(blob):
    bag = []
    for p in IMM_PATTS:
        for m in re.finditer(p, blob, flags=re.I):
            bag += _split_terms(m.group(1))
    return sorted(set(bag))

def extract_envs(blob):
    blob_l = blob.lower()
    hits = []
    for env, kws in ENV_KEYWORDS.items():
        if any(k in blob_l for k in kws):
            hits.append(env)
    return sorted(set(hits))

# Build lookup by name
name_col = next((c for c in ["name","Name","monster","Monster"] if c in monsters_df.columns), None)
row_by_name = {str(r[name_col]).strip().lower(): r for _, r in monsters_df.iterrows()} if name_col else {}

# Enrich chunks in-place ONLY if structured fields are missing/empty
fixed_poison, fixed_env = 0, 0
for ch in chunks:
    md = ch["metadata"]
    if md.get("doc_type") != "monster":
        continue
    nm = ch["title"].strip().lower()
    row = row_by_name.get(nm)
    if row is None:
        continue
    blob = row["___blob"]

    if not md.get("resistances") and not md.get("immunities"):
        res = extract_resists(blob)
        imm = extract_immunes(blob)
        if res or imm:
            md["resistances"] = res
            md["immunities"] = imm
            if any("poison" in x for x in (res + imm)):
                fixed_poison += 1

    if not md.get("environments"):
        envs = extract_envs(blob)
        if envs:
            md["environments"] = envs
            if "abyss" in envs:
                fixed_env += 1

# Rebuild chunks_map
chunks_map = {c["id"]: c for c in chunks}

# Audit
poison_count = sum(
    1 for c in chunks
    if c["metadata"].get("doc_type")=="monster" and any(
        "poison" in x for x in (c["metadata"].get("resistances", []) + c["metadata"].get("immunities", []))
    )
)
abyss_count = sum(
    1 for c in chunks
    if c["metadata"].get("doc_type")=="monster" and "abyss" in (c["metadata"].get("environments") or [])
)

# Example monster names that matched
examples_poison = [c["title"] for c in chunks 
                   if c["metadata"].get("doc_type")=="monster" and any("poison" in x for x in 
                   (c["metadata"].get("resistances", []) + c["metadata"].get("immunities", [])))]
examples_abyss  = [c["title"] for c in chunks 
                   if c["metadata"].get("doc_type")=="monster" and "abyss" in (c["metadata"].get("environments") or [])]

### Build BM25 and Chroma Vector Indices

In [54]:
# Safety: ensure chunks exist
if "chunks" not in globals() or not chunks:
    raise RuntimeError("No 'chunks' found. Run your normalization/chunking cell first to create 'chunks'.")

# BM25 lexical index
bm25_corpus = [f"{c['title']}\n{json.dumps(c['metadata'])}\n{c['content']}" for c in chunks]
bm25_tokens = [doc.lower().split() for doc in bm25_corpus]
bm25 = BM25Okapi(bm25_tokens)

# Chroma vector index
PERSIST_DIR = "./chroma_srd"
client = chromadb.PersistentClient(path=PERSIST_DIR)

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

def sanitize_metadata(md: dict) -> dict:
    """Flatten lists/sets and JSON-serialize dicts so Chroma accepts only str/int/float/bool values."""
    clean = {}
    for k, v in md.items():
        if v is None or v == {}:
            continue
        if isinstance(v, (list, tuple, set)):
            clean[k] = ", ".join(map(str, v))
        elif isinstance(v, dict):
            clean[k] = json.dumps(v, ensure_ascii=False)
        else:
            clean[k] = v
    return clean

# Recreate (or create) the collection cleanly so we can add sanitized metadata
try:
    # delete if exists (ignores if not present)
    client.delete_collection("srd")
except Exception:
    pass

collection = client.get_or_create_collection(
    name="srd",
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(EMBED_MODEL)
)

# Add documents to Chroma
collection.add(
    ids=[c["id"] for c in chunks],
    documents=[c["content"] for c in chunks],
    metadatas=[{"title": c["title"], **sanitize_metadata(c["metadata"])} for c in chunks],
)

chunks_map = {c["id"]: c for c in chunks}

print("BM25 docs:", len(bm25_corpus))
print("Chroma docs:", collection.count())

res = collection.query(query_texts=["poison resistance underground"], n_results=5)
print("Sample query top IDs:", res["ids"][0] if res and res.get("ids") else [])

BM25 docs: 6140
Chroma docs: 6140
Sample query top IDs: ['spell:earthquake::part149', 'spell:earthquake::part150', 'spell:earthquake::part147', 'spell:earthquake::part146', 'spell:earthquake::part135']


### Hybrid Retrieval Functions

In [55]:
# Simple caches
Q_CACHE, R_CACHE = {}, {}

def bm25_search(q, topk=100):
    scores = bm25.get_scores(q.lower().split())
    idxs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topk]
    return [chunks[i]["id"] for i in idxs]

def dense_search(q, topk=100):
    res = collection.query(query_texts=[q], n_results=topk)
    return res["ids"][0]

def rrf_fuse(lists, k=80, c=60):
    scores = {}
    for lst in lists:
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0/(c + rank)
    return [d for d,_ in sorted(scores.items(), key=lambda x: x[1], reverse=True)][:k]

def hybrid_retrieve(q, topk=20):
    bm = bm25_search(q, topk=topk)
    dn = dense_search(q, topk=topk)
    fused = rrf_fuse([bm, dn], k=topk)
    return fused

### Generative Layer

In [56]:
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def build_context(ids):
    blocks = []
    for did in ids:
        d = chunks_map[did]
        m = d["metadata"]
        excerpt = d["content"][:800]
        blocks.append(f"{d['title']} — {m.get('doc_type','')}\n{excerpt}")
    return "\n\n".join(blocks)

def generate_answer(query, ids):
    ctx = build_context(ids)
    sys_msg = "You are LoreSmith, an SRD rules assistant. Use ONLY the provided context."
    user_msg = f"Q: {query}\n\nCONTEXT:\n{ctx}\n\nAnswer concisely with bullet points + citations."
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role":"system","content":sys_msg},
            {"role":"user","content":user_msg}
        ],
        temperature=0.2
    )
    return resp.choices[0].message.content

### Hints and Helper Functions

In [57]:
MONSTER_PATTERNS = [
    r"\bmonster(s)?\b", r"\bcreature(s)?\b", r"\bbeast(s)?\b",
    r"\bundead\b", r"\bfiend(s)?\b", r"\bdragon(s)?\b", r"\booze(s)?\b", r"\baberration(s)?\b",
    r"\bstat ?block(s)?\b", r"\bchallenge rating\b", r"\bCR\b", r"\blair\b|\blair actions\b"
]
SPELL_PATTERNS = [
    r"\bspell(s)?\b", r"\bcast(ing)?\b", r"\bspell slot(s)?\b",
    r"\bverbal\b|\bsomatic\b|\bmaterial\b", r"\bconcentration\b", r"\bspell list(s)?\b"
]
STRONG_PATTERNS = [r"\bstrong(est)?\b", r"\bhighest\b", r"\btop\b", r"\bpowerful(est)?\b", r"\bCR\s*\d+"]

def infer_intent(query: str) -> Dict[str, Any]:
    q = query.lower()
    def hit(pats): return any(re.search(p, q) for p in pats)
    prefer = None
    if hit(MONSTER_PATTERNS) and not hit(SPELL_PATTERNS):
        prefer = "monster"
    elif hit(SPELL_PATTERNS) and not hit(MONSTER_PATTERNS):
        prefer = "spell"
    # If both or neither: leave None (mixed/ambiguous)
    want_poison = bool(re.search(r"\bpoison(ed|ing)?|toxin(s)?\b", q))
    want_strong = bool(any(re.search(p, q) for p in STRONG_PATTERNS))
    return {"doc_type_preference": prefer, "needs_poison_resist": want_poison, "wants_strongest": want_strong}

# Retrieval helpers
def bm25_search(q, topk=100) -> List[str]:
    scores = bm25.get_scores(q.lower().split())
    idxs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topk]
    return [chunks[i]["id"] for i in idxs]

def dense_search(q, topk=100) -> List[str]:
    res = collection.query(query_texts=[q], n_results=topk)
    return res["ids"][0]

def rrf_fuse(lists, k=80, c=60):
    scores = {}
    for lst in lists:
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0/(c + rank)
    return [d for d,_ in sorted(scores.items(), key=lambda x: x[1], reverse=True)][:k]

def _filter_by_type(ids: List[str], doc_type: str) -> List[str]:
    return [did for did in ids if chunks_map[did]["metadata"].get("doc_type")==doc_type]

def _prefer_poison(ids: List[str]) -> List[str]:
    # Stable resort: poison-resistant/immune monsters first
    with_boost, rest = [], []
    for did in ids:
        m = chunks_map[did]["metadata"]
        res = [x.lower() for x in (m.get("resistances", []) + m.get("immunities", []))]
        ((with_boost if any("poison" in x for x in res) else rest)).append(did)
    return with_boost + rest

def _sort_monsters_by_cr_desc(ids: List[str]) -> List[str]:
    def cr_val(did):
        cr = chunks_map[did]["metadata"].get("cr")
        return float(cr) if isinstance(cr,(int,float,str)) and str(cr).replace(".","",1).isdigit() else -1.0
    return sorted(ids, key=cr_val, reverse=True)

def smart_retrieve_hardfiltered(query: str, topk=12) -> List[str]:
    intent = infer_intent(query)
    bm = bm25_search(query, topk=120)
    dn = dense_search(query, topk=120)
    fused = rrf_fuse([bm, dn], k=120)

    # If we have a preferred type, HARD-FILTER context to that type
    preferred = intent.get("doc_type_preference")
    ids = fused
    if preferred:
        ids_pref = _filter_by_type(fused, preferred)
        # Top-up pass if too few: run a second fused with a biased query (e.g., add 'monster')
        if len(ids_pref) < topk:
            bias_q = f"{query} {preferred}"
            bm2 = bm25_search(bias_q, topk=200)
            dn2 = dense_search(bias_q, topk=200)
            fused2 = rrf_fuse([bm2, dn2], k=200)
            addl = [d for d in _filter_by_type(fused2, preferred) if d not in ids_pref]
            ids_pref = ids_pref + addl
        if preferred == "monster":
            if intent.get("needs_poison_resist"): 
                ids_pref = _prefer_poison(ids_pref)
            if intent.get("wants_strongest"):
                ids_pref = _sort_monsters_by_cr_desc(ids_pref)
        ids = ids_pref

    return ids[:max(12, topk)]

try:
    reranker
except NameError:
    def rerank(q, ids, topk=12): 
        return ids[:topk]

# Generation: enforce type in prompt + validate after generation

def build_context(ids: List[str]) -> str:
    blocks = []
    for did in ids:
        d = chunks_map[did]; m = d["metadata"]
        meta_bits = []
        if m.get("doc_type") == "monster":
            for k in ("cr","type","size","ac","hp","resistances","immunities","environments"):
                v = m.get(k)
                if v not in [None,"",[],{}]: meta_bits.append(f"{k}:{v}")
        else:
            for k in ("level","school","classes","range","duration","components"):
                v = m.get(k)
                if v not in [None,"",[],{}]: meta_bits.append(f"{k}:{v}")
        meta_str = ", ".join(meta_bits)
        excerpt = (d["content"] or "")[:900]
        blocks.append(f"{d['title']} — {m.get('doc_type')} ({meta_str})\n{excerpt}")
    return "\n\n".join(blocks)

def generate_answer_smart(query: str, ids: List[str]) -> str:
    intent = infer_intent(query)
    target = intent.get("doc_type_preference")
    ctx = build_context(ids)
    if target == "monster":
        steer = (
            "ONLY list MONSTERS found in CONTEXT (no spells/items). "
            "Prefer higher CR if the question asks for 'strongest'."
        )
        fmt = "Return 5–10 concise bullets: **Name** (CR) — key traits (resist/immune, notable actions), environments. Then a table: Name | CR | Resist/Immune | Environments."
    elif target == "spell":
        steer = "ONLY list SPELLS found in CONTEXT (no monsters/items)."
        fmt = "Return 5–10 concise bullets, then table: Name | Level | School | Classes | Effect."
    else:
        steer = "Answer using whichever entity type(s) are most relevant in CONTEXT; avoid mixing unless both are clearly requested."
        fmt = "Return concise bullets and a compact table based on entities present."

    sys_msg = (
        "You are LoreSmith, a D&D 5e SRD assistant. Use ONLY the provided CONTEXT. "
        "If an entity type is requested but not present in CONTEXT, say so."
    )
    user_msg = f"Question: {query}\n\n{steer}\n\nCONTEXT:\n{ctx}\n\n{fmt}\nUse bracketed citations like [Title; SRD]."

    resp = _oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role":"system","content":sys_msg},{"role":"user","content":user_msg}],
        temperature=0.2
    )
    return resp.choices[0].message.content

### Chunk Audit

In [58]:
from collections import Counter
import itertools, math
def head(ls, n=5): 
    return list(itertools.islice(ls, n))

doc_types = [c["metadata"].get("doc_type") for c in chunks]
print("Doc type counts:", Counter(doc_types))

# Peek at a few supposed monsters by ID prefix
mon_by_prefix = [c for c in chunks if c["id"].startswith("monster:")]
print("Monster-by-prefix count:", len(mon_by_prefix))
print("Examples:", [c["title"] for c in head(mon_by_prefix, 5)])

# Peek at titles that *look* like monsters
likely_monster = [c for c in chunks if any(w in (c["metadata"].get("type","") or "").lower() for w in ["beast","undead","fiend","dragon","aberration","ooze","monstrosity","construct","humanoid"])]
print("Likely monster by type:", len(likely_monster))
print("Examples:", [c["title"] for c in head(likely_monster, 5)])

Doc type counts: Counter({'spell': 5378, 'monster': 762})
Monster-by-prefix count: 762
Examples: ['aarakocra', 'abjurer', 'aboleth', 'abominable-yeti', 'acererak']
Likely monster by type: 644
Examples: ['aarakocra', 'abjurer', 'aboleth', 'abominable-yeti', 'acererak']


In [60]:
# Normalize monster metadata
import re
from fractions import Fraction

def _listify(v):
    """Turn strings like 'Fire, Poison, Psychic' into ['fire','poison','psychic'].
       Pass through lists/tuples, ignore None/empty."""
    if v is None: return []
    if isinstance(v, (list, tuple, set)):
        return [str(x).strip().lower() for x in v if str(x).strip()]
    s = str(v).strip()
    if not s: return []
    # split on commas, semicolons, pipes
    parts = re.split(r"[,;|/]", s)
    return [p.strip().lower() for p in parts if p.strip()]

def _parse_cr(x):
    if x is None: return None
    s = str(x).strip()
    if not s or s == "-": return None
    try:
        return float(s)
    except:
        try:
            return float(Fraction(s))
        except:
            return None

# Normalize in-place on our local chunk copies
for c in chunks:
    md = c["metadata"]
    if md.get("doc_type") == "monster":
        md["cr"] = _parse_cr(md.get("cr"))
        md["resistances"] = _listify(md.get("resistances"))
        md["immunities"]  = _listify(md.get("immunities"))
        md["environments"] = _listify(md.get("environments"))

# Rebuild map used by retrieval/filters
chunks_map = {c["id"]: c for c in chunks}

# Replace structured fallback helpers to use normalized lists

ENV_SYNONYMS = {
    "underground": ["underground", "cave", "dungeon"],
    "swamp": ["swamp", "marsh", "bog"],
    "forest": ["forest", "woods", "jungle"],
    "arctic": ["arctic", "tundra", "ice", "snow"],
    "desert": ["desert", "dunes", "wastes"],
    "coastal": ["coastal", "shore", "beach"],
    "urban": ["urban", "city", "town", "sewer"],
    # planes/realms (add more if you like)
    "abyss": ["abyss", "lower planes", "demonic realms"],
    "aver nus": ["avernus"],
}

def _env_hits_from_query(q: str):
    ql = q.lower()
    hits = set()
    # match synonyms
    for env, words in ENV_SYNONYMS.items():
        if any(w in ql for w in words):
            hits.add(env)
    # also capture any single capitalized word like 'Abyss' typed by user
    # fallback: if a token from query appears in any monster environments we have, keep it
    if not hits:
        tokens = set(re.findall(r"[a-zA-Z][a-zA-Z\-']+", ql))
        # collect known envs in data
        known_envs = set(e for c in chunks if c["metadata"].get("doc_type")=="monster"
                         for e in (c["metadata"].get("environments") or []))
        inter = tokens.intersection(known_envs)
        hits.update(inter)
    return list(hits)

def _want_poison(q: str):
    return bool(re.search(r"\bpoison(ed|ing)?|toxin(s)?\b", q.lower()))

def _want_strongest(q: str):
    return bool(re.search(r"\bstrong(est)?\b|\bhighest\b|\btop\b|\bmost powerful\b", q.lower()))

def _parse_cr_range(q: str):
    ql = q.lower()
    m_range = re.search(r"\bcr\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\b", ql)
    m_plus  = re.search(r"\bcr\s*(\d+(?:\.\d+)?)[+]\b", ql)
    m_exact = re.search(r"\bcr\s*(\d+(?:\.\d+)?)\b", ql)
    if m_range: return float(m_range.group(1)), float(m_range.group(2))
    if m_plus:  return float(m_plus.group(1)), 30.0
    if m_exact: 
        v = float(m_exact.group(1)); return v-0.25, v+0.25
    return None

def structured_monster_candidates(query: str, topk=20):
    want_poison = _want_poison(query)
    want_strong = _want_strongest(query)
    cr_range = _parse_cr_range(query)
    env_hits = _env_hits_from_query(query)

    mons = [c for c in chunks if c["metadata"].get("doc_type")=="monster"]

    def passes(c):
        m = c["metadata"]
        # CR
        if cr_range:
            lo, hi = cr_range
            cr = m.get("cr")
            if cr is None or not (lo <= cr <= hi):
                return False
        # environment (normalized list of lowercase strings)
        if env_hits:
            envs = set(m.get("environments") or [])
            if not any(e in envs for e in env_hits):
                return False
        # poison
        if want_poison:
            bag = set((m.get("resistances") or []) + (m.get("immunities") or []))
            if not any("poison" in x for x in bag):
                return False
        return True

    filt = [c for c in mons if passes(c)]
    if want_strong:
        filt = sorted(filt, key=lambda c: (c["metadata"].get("cr") or -1), reverse=True)
        if not filt:
            filt = sorted(mons, key=lambda c: (c["metadata"].get("cr") or -1), reverse=True)

    return [c["id"] for c in filt[:max(12, topk)]]


# Structured fallback for monster queries
def build_context_from_ids(ids):
    blocks = []
    for did in ids:
        d = chunks_map[did]; m = d["metadata"]
        meta_bits = []
        for k in ("cr","type","size","ac","hp","resistances","immunities","environments"):
            v = m.get(k)
            if v not in [None,"",[],{}]: meta_bits.append(f"{k}:{v}")
        meta_str = ", ".join(meta_bits)
        excerpt = (d["content"] or "")[:900]
        blocks.append(f"{d['title']} — monster ({meta_str})\n{excerpt}")
    return "\n\n".join(blocks)

def ask(query: str, topk=12) -> str:
    # First try your hybrid retrieval with hard-filtering (from your previous cell)
    ids = smart_retrieve_hardfiltered(query, topk=topk)

    # If we inferred monsters and got none (or too few), use structured fallback from metadata
    if "monster" in query.lower() or wants_strongest_from_query(query) or want_poison_from_query(query):
        if len([d for d in ids if chunks_map[d]["metadata"].get("doc_type")=="monster"]) < 5:
            ids = structured_monster_candidates(query, topk=topk)

    # Final safety: if still nothing, say so
    if not ids:
        return "No monsters were found in the SRD context for that query."

    try:
        final_ids = rerank(query, ids, topk=topk)
    except NameError:
        final_ids = ids[:topk]

    # Reuse smart generator, but it will now see **only monsters** in context
    return generate_answer_smart(query, final_ids)

print("✅ Monster normalization complete. Examples:")
# quick sanity checks
poison_count = sum(1 for c in chunks if c["metadata"].get("doc_type")=="monster" 
                   and any("poison" in x for x in (c["metadata"].get("resistances",[])+c["metadata"].get("immunities",[]))))
abyss_count = sum(1 for c in chunks if c["metadata"].get("doc_type")=="monster" 
                  and "abyss" in (c["metadata"].get("environments") or []))

✅ Monster normalization complete. Examples:


### Initial Query Testing

In [39]:
print(ask("Which monsters are the strongest?"))

- **Tarrasque** (CR 30.0) — Monstrosity (Titan), Gargantuan size, AC 25, HP 676, immune to fire and poison, resistant to bludgeoning, piercing, and slashing from nonmagical attacks. Notable for its legendary actions and regeneration. Typically found in any terrain.
  
- **Tiamat** (CR 30.0) — Fiend, Gargantuan size, AC 25, HP 615, immune to fire, poison, and lightning damage, resistant to cold and acid. Notable for its breath weapons and legendary actions. Often found in lairs or mountainous regions.

- **Demogorgon** (CR 26.0) — Fiend (Demon), Huge size, AC 22, HP 406, immune to psychic damage, resistant to cold, fire, and lightning. Notable for its multiattack and ability to cause madness. Typically found in the Abyss.

- **Orcus** (CR 26.0) — Fiend (Demon), Huge size, AC 17, HP 405, immune to necrotic damage, resistant to cold and fire. Notable for its spells and the ability to raise undead. Commonly found in the Abyss.

- **Zariel** (CR 26.0) — Fiend (Devil), Large size, AC 21, HP 

In [61]:
print(ask("Which monsters are Dragons?"))

- **adult-black-dragon** (CR 14.0) — Known for its cunning and cruelty, it has resistance to acid and is immune to poison. It often dwells in swamps and marshes.
- **adult-blue-dragon** (CR 16.0) — A master of deception, it has resistance to lightning and is immune to paralysis. Typically found in arid deserts and rocky hills.
- **adult-brass-dragon** (CR 13.0) — Friendly and talkative, it has resistance to fire and is immune to sleep. Prefers warm deserts and coastal areas.
- **adult-bronze-dragon** (CR 15.0) — Noble and wise, it has resistance to lightning and is immune to paralysis. Commonly found near coastlines and in temperate climates.
- **adult-copper-dragon** (CR 14.0) — Known for its playful nature, it has resistance to acid and is immune to paralysis. Usually inhabits hills and mountains.

| Name                   | CR   | Resist/Immune           | Environments                |
|------------------------|------|-------------------------|-----------------------------|
| adult-

In [65]:
print(ask("Is there a swamps monster?"))

- **nilbog** (CR 1.0) — Type: humanoid (goblinoid), AC: 13, HP: 7, notable actions include the ability to reverse damage dealt to it. Environments: swamp.
- **boggle** (CR 0.125) — Type: fey, AC: 14, HP: 18, notable actions include the ability to create illusions and teleport short distances. Environments: swamp.

| Name    | CR   | Resist/Immune | Environments |
|---------|------|----------------|--------------|
| nilbog  | 1.0  | None specified | swamp        |
| boggle  | 0.125| None specified | swamp        |


In [69]:
# Replaces the earlier show_eval(). Uses the SAME retrieval & fallback logic as ask().

import textwrap, html, re
from IPython.display import display, HTML

# Small helpers pulled from pipeline
def _intent_pref(query: str):
    q = query.lower()
    # plural-aware intent
    is_monster = bool(re.search(r"\b(monster|monsters|creature|creatures|stat ?block|cr|challenge rating)\b", q))
    is_spell   = bool(re.search(r"\b(spell|spells|casting|spell slot|concentration)\b", q))
    if is_monster and not is_spell: return "monster"
    if is_spell and not is_monster: return "spell"
    return None

def _want_poison(q: str):
    return bool(re.search(r"\bpoison(ed|ing)?|toxin(s)?\b", q.lower()))

def _want_strong(q: str):
    return bool(re.search(r"\bstrong(est)?\b|\bhighest\b|\btop\b|\bmost powerful\b", q.lower()))

# Fallback candidate selector
def _structured_monster_candidates(query: str, topk=20):
    # expects chunks_map to have normalized: md["cr"] (float), md["resistances"], md["immunities"], md["environments"] (lists, lowercase)
    from fractions import Fraction
    ql = query.lower()

    # env synonyms (add plurals)
    ENV_SYNONYMS = {
        "underground": ["underground","cavern","cave","caves","dungeon","dungeons"],
        "swamp": ["swamp","swamps","bog","bogs","marsh","marshes"],
        "forest": ["forest","forests","woods","jungle","jungles"],
        "arctic": ["arctic","tundra","snow","ice"],
        "desert": ["desert","deserts","dune","dunes","wastes"],
        "coastal": ["coast","coasts","coastal","shore","beach","beaches"],
        "urban": ["city","cities","town","towns","urban","sewer","sewers"],
        "abyss": ["abyss","abyssal"],
        "mountains": ["mountain","mountains","hills","rocky hills"],
    }

    def parse_cr(x):
        if x is None: return None
        s = str(x).strip()
        if not s or s == "-": return None
        try: return float(s)
        except:
            try: return float(Fraction(s))
            except: return None

    # parse CR range from query
    m_range = re.search(r"\bcr\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\b", ql)
    m_plus  = re.search(r"\bcr\s*(\d+(?:\.\d+)?)[+]\b", ql)
    m_exact = re.search(r"\bcr\s*(\d+(?:\.\d+)?)\b", ql)
    cr_range = None
    if m_range: cr_range = (float(m_range.group(1)), float(m_range.group(2)))
    elif m_plus: cr_range = (float(m_plus.group(1)), 30.0)
    elif m_exact: v = float(m_exact.group(1)); cr_range = (v-0.25, v+0.25)

    # env hits
    env_hits = set()
    for env, words in ENV_SYNONYMS.items():
        if any(w in ql for w in words):
            env_hits.add(env)

    want_poison = _want_poison(query)
    want_strong = _want_strong(query)

    mons = [c for c in chunks if c["metadata"].get("doc_type")=="monster"]
    # ensure CR parsed
    for c in mons:
        c["metadata"]["cr"] = parse_cr(c["metadata"].get("cr"))

    def passes(c):
        m = c["metadata"]
        # CR
        if cr_range:
            cr = m.get("cr")
            if cr is None or not (cr_range[0] <= cr <= cr_range[1]): return False
        # env
        if env_hits:
            envs = set((m.get("environments") or []))
            if not envs.intersection(env_hits): return False
        # poison
        if want_poison:
            bag = set((m.get("resistances") or []) + (m.get("immunities") or []))
            if not any("poison" in x for x in bag): return False
        return True

    filt = [c for c in mons if passes(c)]
    if want_strong:
        filt = sorted(filt, key=lambda c: (c["metadata"].get("cr") or -1), reverse=True)
        if not filt:
            filt = sorted(mons, key=lambda c: (c["metadata"].get("cr") or -1), reverse=True)

    return [c["id"] for c in filt[:max(12, topk)]]

# ---- CSS for clean screenshots
_LS_CSS = """
<style>
.ls-wrap{font-family:ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto}
.ls-panel{border:1px solid #e2e8f0;border-radius:12px;margin:14px 0;padding:12px 14px;background:#fff;box-shadow:0 1px 2px rgba(16,24,40,.04)}
.ls-head{font-weight:700;margin-bottom:6px;font-size:16px}
.ls-sub{color:#475569;margin-bottom:10px}
.ls-table{width:100%;border-collapse:separate;border-spacing:0 6px;font-size:14px}
.ls-table th{background:#f8fafc;color:#334155;text-align:left;padding:8px;border-bottom:1px solid #e2e8f0;}
.ls-table td{background:#fff;padding:8px;vertical-align:top;border-top:1px solid #e5e7eb;border-bottom:1px solid #e5e7eb;}
.ls-table td.title{width:22%}
.ls-table td.type{width:10%;color:#334155}
.ls-table td.key{width:10%;color:#334155}
.ls-table td.snippet{width:48%;color:#0f172a}
.ls-table td.src{width:10%;color:#475569}
.ls-note{color:#64748b;font-size:12px;margin-top:8px}
code{background:#f1f5f9;padding:2px 6px;border-radius:6px}
.ls-answer{line-height:1.55;font-size:15px;white-space:pre-wrap}
</style>
"""

def _snippet(text, width=420):
    s = (text or "").replace("\n"," ")
    return html.escape(textwrap.shorten(s, width=width, placeholder="…"))

def _entity_key(m):
    return m.get("level","") if m.get("doc_type")=="spell" else m.get("cr","")

def _retrieve_ids_with_fallback(query: str, topk_search: int):
    """Use same flow as ask(): hybrid → hard-filter → structured monster fallback when needed."""
    pref = _intent_pref(query)
    # primary retrieval
    try:
        ids = smart_retrieve_hardfiltered(query, topk=topk_search)  # prefer your hard-filtered
    except Exception:
        try:
            ids = smart_retrieve(query, topk=topk_search)
        except Exception:
            ids = hybrid_retrieve(query, filters={}, topk=topk_search)

    # if user clearly asked about monsters and we have few/none, use structured fallback
    if pref == "monster":
        mon_ids = [d for d in ids if chunks_map[d]["metadata"].get("doc_type")=="monster"]
        if len(mon_ids) < 3:  # top up
            fallback = _structured_monster_candidates(query, topk=topk_search)
            if fallback:
                # merge unique preserving order
                seen = set(ids)
                ids = mon_ids + [d for d in fallback if d not in seen]

    # rerank
    try:
        ids = rerank(query, ids, topk=topk_search)
    except Exception:
        ids = ids[:topk_search]
    return ids

def _build_search_panel(ids, query, topn=3):
    rows_html=[]
    for did in ids[:topn]:
        d = chunks_map.get(did, {})
        m = d.get("metadata", {})
        rows_html.append(f"""
        <tr>
          <td class="title"><b>{html.escape(str(d.get('title','')))}</b></td>
          <td class="type">{html.escape(str(m.get('doc_type','')))}</td>
          <td class="key">{html.escape(str(_entity_key(m)))}</td>
          <td class="snippet">{_snippet(d.get('content',''), 420)}</td>
          <td class="src">{html.escape(str(m.get('source','SRD')))}</td>
        </tr>""")
    if not rows_html:
        rows_html.append('<tr><td colspan="5" style="text-align:center;color:#666;">No results found.</td></tr>')
    return f"""
    <div class="ls-wrap">
      <div class="ls-panel">
        <div class="ls-head">🔎 Search Layer — Top {topn}</div>
        <div class="ls-sub">Query: <code>{html.escape(query)}</code></div>
        <table class="ls-table">
          <thead><tr><th>Title</th><th>Type</th><th>Lvl/CR</th><th>Snippet</th><th>Source</th></tr></thead>
          <tbody>{''.join(rows_html)}</tbody>
        </table>
        <div class="ls-note">Retrieval output (fusion + optional rerank), before generation.</div>
      </div>
    </div>
    """

def _build_answer_panel(answer, query):
    safe = answer if isinstance(answer,str) else str(answer)
    return f"""
    <div class="ls-wrap">
      <div class="ls-panel">
        <div class="ls-head">✨ Generation Layer — Final Answer</div>
        <div class="ls-sub">Query: <code>{html.escape(query)}</code></div>
        <div class="ls-answer">{safe}</div>
      </div>
    </div>
    """

def show_eval(query: str, topk_search: int = 12, topn_show: int = 3):
    ids = _retrieve_ids_with_fallback(query, topk_search=topk_search)
    display(HTML(_LS_CSS))
    display(HTML(_build_search_panel(ids, query, topn=topn_show)))

    # generation
    try:
        answer = generate_answer_smart(query, ids[:topn_show])
    except Exception:
        answer = generate_answer(query, ids[:topn_show])
    display(HTML(_build_answer_panel(answer, query)))

In [70]:
show_eval("Is there a swamps monster?")

Title,Type,Lvl/CR,Snippet,Source
boggle,monster,0.125,,SRD Monsters (GitHub CSV)
nilbog,monster,1.0,,SRD Monsters (GitHub CSV)


In [71]:
show_eval("Which monsters are Dragons?")

Title,Type,Lvl/CR,Snippet,Source
aarakocra,monster,0.25,,SRD Monsters (GitHub CSV)
abjurer,monster,9.0,,SRD Monsters (GitHub CSV)
aboleth,monster,10.0,,SRD Monsters (GitHub CSV)


In [72]:
show_eval("Which monsters are the strongest?")

Title,Type,Lvl/CR,Snippet,Source
tarrasque,monster,30.0,,SRD Monsters (GitHub CSV)
tiamat,monster,30.0,,SRD Monsters (GitHub CSV)
demogorgon,monster,26.0,,SRD Monsters (GitHub CSV)


In [74]:
show_eval("What are the most interesting spells?")

Title,Type,Lvl/CR,Snippet,Source
Find the Path,spell,6,"Range: Self Duration: Concentration, up to 1 day This spell allows you to find the shortest, most direct physical route to a specific fixed location that you are familiar with on the same plane of existence. If you name a destination on another plan of existence, a destination that moves (such as a mobile fortress), or a destination that isn’t specific (such as ""a green dragon’s lair”), the spell fails. For the…",SRD Spells (GitHub CSV)
Prismatic Wall,spell,9,spells and magical effects.,SRD Spells (GitHub CSV)
Identify,spell,1,"Range: Touch Duration: Instantaneous You choose one object that you must touch throughout the casting of the spell. If it is a magic item or some other magic-imbued object, you learn its properties and how to use them, whether it requires attunement to use, and how many charges it has, if any. You learn whether any spells are affecting the item and what they are. If the item was created by a spell, you learn which…",SRD Spells (GitHub CSV)
