# RAG × Groq Llama 3.1 — Hindu Scripture 


**What this notebook does**  
- Downloads the corpus (Bhagavad Gita, Rig Veda, Upanishads)
- Creates embeddings for the data and saves chunks locally
- Builds a hybrid BM25 + MiniLM index and answers queries with Groq Llama 3.1, with citations

> Created: 2025-11-11


## 1) Setup

In [11]:
!pip3 install -q sentence-transformers rank-bm25 requests numpy scipy beautifulsoup4 lxml



In [12]:

# If running the first time, uncomment these installs.
# %pip install -q sentence-transformers rank-bm25 requests numpy scipy
# Optional (speeds up BLAS on some systems): 
# %pip install -q accelerate

import os, sys, json, math, time, textwrap, traceback, random
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple
import numpy as np
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import requests

# ---- Configure Groq API ----
# 1) Set your key here (or export as an env var before starting Jupyter):
# os.environ['GROQ_API_KEY'] = "paste-your-key-here"
GROQ_API_KEY = os.getenv("GROQ_API_KEY", None)

# Pick a Groq model (model IDs and free-tier availability change; check Groq's current model list)
GROQ_MODEL = "llama-3.1-8b-instant"  # alt: "llama-3.1-70b-versatile"

print("Groq key present?", bool(GROQ_API_KEY))
print("Using model:", GROQ_MODEL)


Groq key present? True
Using model: llama-3.1-8b-instant


## Data Download (Gita, Upanishads, Rig Veda)

In [None]:
#  example schema structure for data, enabling scraping from Vedas/Upanishads/Gita
example_schema = {
  "id": "RV.1.001.001",                # stable ID for your corpus
  "work": "Rig Veda",
  "collection": "Vedas",               # optional: Vedas/Upanishads/Gita
  "canonical_ref": "RV 1.1.1",         # human-facing reference
  "book": 1,                           # Rig: mandala/book number
  "hymn": 1,                           # Rig: hymn number (sukta)
  "verse": 1,                          # Rig: verse within hymn (ṛc)
  "upanishad": None,                   # Upanishads: name string if used
  "chapter": None,                     # Upanishads/Gita: chapter if applicable
  "section": None,                     # khanda/anuvaka etc.
  "translator": "R. T. H. Griffith (1896) — Public Domain",
  "year": 1896,
  "license": "Public Domain",
  "source_url": "https://…",
  "lang": "en",
  "text": "Agni I invoke, the household priest…"
}
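
A record in this shape can be assembled with a small helper. The sketch below is illustrative only: `make_rigveda_record` is a hypothetical name, and the zero-padded ID format is an assumption inferred from the example `id` above, not something the notebook defines.

```python
def make_rigveda_record(book: int, hymn: int, verse: int, text: str, source_url: str) -> dict:
    """Build one Rig Veda record following the example schema (hypothetical helper)."""
    return {
        "id": f"RV.{book}.{hymn:03d}.{verse:03d}",      # zero-padded like "RV.1.001.001" (assumed format)
        "work": "Rig Veda",
        "collection": "Vedas",
        "canonical_ref": f"RV {book}.{hymn}.{verse}",   # human-facing reference
        "book": book,
        "hymn": hymn,
        "verse": verse,
        "year": 1896,
        "license": "Public Domain",
        "source_url": source_url,
        "lang": "en",
        "text": text,
    }

rec = make_rigveda_record(1, 1, 1, "Agni I invoke, the household priest…", "https://…")
print(rec["id"], "|", rec["canonical_ref"])
```

Keeping the stable `id` separate from the human-facing `canonical_ref` lets the ID sort lexicographically while citations stay readable.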


In [17]:
# --- Robust fetch & poem-line parser for Sacred-Texts (Arnold PD) ---
# If needed: %pip install -q beautifulsoup4 lxml

import re, json, time, pathlib, requests
from bs4 import BeautifulSoup, NavigableString

BASE = "https://www.sacred-texts.com/hin/gita/"
CHAPTERS = [f"bg{str(i).zfill(2)}.htm" for i in range(1, 19)]
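# Write into a subfolder so the embedding step (which walks subfolders of data/) picks it up
OUT_DIR = pathlib.Path("data/gita_arnold"); OUT_DIR.mkdir(exist_ok=True, parents=True)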
OUT_PATH = OUT_DIR / "bg_arnold.jsonl"

HDRS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

def fetch(url, tries=3, sleep=0.8):
    for _ in range(tries):
        r = requests.get(url, timeout=30, headers=HDRS)
        if r.ok: return r.text
        time.sleep(sleep)
    r.raise_for_status()

def clean(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip(" \u00a0")

def looks_like_heading(s: str) -> bool:
    s2 = s.strip()
    # skip chapter titles / subtitles / navigation
    if len(s2) <= 2: return True
    if s2.upper() == s2 and any(w.isalpha() for w in s2) and len(s2) < 60: return True
    if re.match(r"^(chapter|book)\b", s2, re.I): return True
    if re.match(r"^\(?[IVXLC]+\)?\.*$", s2): return True  # roman numerals
    if re.match(r"^\d+\.$", s2): return True              # stray number
    if "sacred-texts.com" in s2.lower(): return True
    return False

def parse_poem_lines(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.select("script, style, header, footer, nav"): tag.decompose()

    body = soup.find("body") or soup
    # Heuristic: scan center/blockquote/div/p blocks and keep the poem-like ones
    candidates = body.find_all(["center","blockquote","div","p"])
    lines = []
    for el in candidates:
        # split on <br> boundaries by walking strings
        chunk_lines = []
        buf = []
        for node in el.descendants:
            if isinstance(node, NavigableString):
                buf.append(str(node))
            elif getattr(node, "name", None) == "br":
                txt = clean(" ".join(buf))
                if txt:
                    chunk_lines.append(txt)
                buf = []
        # flush tail
        tail = clean(" ".join(buf))
        if tail:
            chunk_lines.append(tail)
        # filter headings / nav
        chunk_lines = [ln for ln in chunk_lines if not looks_like_heading(ln)]
        # keep only sections that look poem-y (multiple shortish lines)
        if sum(len(x) for x in chunk_lines) > 120 and len(chunk_lines) >= 5:
            lines.extend(chunk_lines)

    # if nothing found, fall back to the full page text split on line breaks
    # (split before clean(): clean() collapses all whitespace, so splitting afterwards is a no-op)
    if not lines:
        raw = body.get_text("\n", strip=True)
        parts = [clean(p) for p in raw.split("\n")]
        lines = [p for p in parts if p and not looks_like_heading(p)]

    # final cleaning: drop empty or single-character lines
    lines = [ln for ln in lines if len(ln) > 1]
    return lines

# ---- Download & parse all chapters into sequentially-numbered "verses" ----
full = []
for chap, fname in enumerate(CHAPTERS, start=1):
    url = BASE + fname
    print("Fetching", url)
    html = fetch(url)
    poem = parse_poem_lines(html)

    # DEBUG: show first 8 lines so you can eyeball quality
    print(f"  → poem lines parsed: {len(poem)}")
    for preview in poem[:8]:
        print("     ·", preview[:100])

    # assign sequential verse numbers per chapter
    for i, text in enumerate(poem, start=1):
        full.append({
            "canon_id": f"BG {chap}.{i}",
            "work": "Bhagavad Gita",
            "chapter": chap,
            "verse": i,
            "translator": "Sir Edwin Arnold (1885) — Public Domain",
            "license": "Public Domain",
            "source": url,
            "text": text
        })

print("Chapters parsed:", len({x['chapter'] for x in full}))
print("Total verse-units:", len(full))
print("Sample:", full[:3])

with open(OUT_PATH, "w", encoding="utf-8") as f:
    for e in full:
        f.write(json.dumps(e, ensure_ascii=False) + "\n")

print("Saved to", OUT_PATH)


Fetching https://www.sacred-texts.com/hin/gita/bg01.htm
  → poem lines parsed: 133
     · Dhritirashtra. Ranged thus for battle on the sacred plain-
     · On Kurukshetra- say, Sanjaya! say
     · What wrought my people, and the Pandavas?
     · Sanjaya. When he beheld the host of Pandavas,
     · Raja Duryodhana to Drona drew,
     · And spake these words: "Ah, Guru! see this line,
     · How vast it is of Pandu fighting-men,
     · Embattled by the son of Drupada,
Fetching https://www.sacred-texts.com/hin/gita/bg02.htm
  → poem lines parsed: 270
     · Sanjaya. Him, filled with such compassion and such grief,
     · With eyes tear-dimmed, despondent, in stern words
     · The Driver, Madhusudan, thus addressed:
     · Krishna. How hath this weakness taken thee?
     · Whence springs
     · The inglorious trouble, shameful to the brave,
     · Barring the path of virtue? Nay, Arjun!
     · Forbid thyself to feebleness! it mars
Fetching https://www.sacred-texts.com/hin/gita/bg03.htm
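
Because each parsed unit is a single line of verse, a unit can stop mid-sentence (for example "Whence springs" in the preview above). One possible refinement, which this notebook does not use, is to group consecutive lines into small overlapping windows before indexing. The sketch assumes a hypothetical helper `window_lines`; the window size and stride are arbitrary choices.

```python
from typing import Dict, List

def window_lines(lines: List[str], size: int = 4, stride: int = 2) -> List[Dict[str, object]]:
    """Group consecutive lines into overlapping windows (hypothetical helper)."""
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive")
    windows: List[Dict[str, object]] = []
    for start in range(0, len(lines), stride):
        chunk = lines[start:start + size]
        windows.append({
            "first_line": start + 1,           # 1-based, like the per-chapter numbering
            "last_line": start + len(chunk),
            "text": " ".join(chunk),
        })
        if start + size >= len(lines):         # this window already reaches the end
            break
    return windows

sample = [
    "Krishna. How hath this weakness taken thee?",
    "Whence springs",
    "The inglorious trouble, shameful to the brave,",
    "Barring the path of virtue? Nay, Arjun!",
    "Forbid thyself to feebleness! it mars",
]
windows = window_lines(sample, size=3, stride=2)
for w in windows:
    print(w["first_line"], w["last_line"], w["text"][:60])
```

Each window keeps its first and last line number, so a citation can still point back to the original line range.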
  

In [24]:
# ===== Upanishads (SBE 1 & 15) → download, parse, and save JSONL shards =====
# If needed:  %pip install -q beautifulsoup4 lxml
import re, json, time, pathlib, requests, urllib.parse
from bs4 import BeautifulSoup, NavigableString

BASES = [
    # Index pages for Max Müller SBE translations (Public Domain)
    "https://www.sacred-texts.com/hin/sbe01/index.htm",   # Part I: Chandogya, Kena (Talavakara), Aitareya, Kaushitaki, Isa (Vajasaneyi)
    "https://www.sacred-texts.com/hin/sbe15/index.htm",   # Part II: Katha, Mundaka, Taittiriya, Brihadaranyaka, Svetasvatara, Prasna, Maitrayana
]
OUT_ROOT = pathlib.Path("data/upanishads_sbe")
OUT_ROOT.mkdir(parents=True, exist_ok=True)

HDRS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

def fetch(url, tries=3, sleep=0.7):
    for _ in range(tries):
        r = requests.get(url, headers=HDRS, timeout=30)
        if r.ok:
            return r.text
        time.sleep(sleep)
    r.raise_for_status()

def abs_url(base, href):
    return urllib.parse.urljoin(base, href)

def clean(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip(" \u00a0")

# very loose heading detector to skip navigation and headers
def looks_like_heading(s: str) -> bool:
    s2 = s.strip()
    if not s2: return True
    if len(s2) <= 2: return True
    if re.match(r"^(chapter|book|section|khanda|adhy[aā]ya|lesson)\b", s2, re.I): return True
    if re.match(r"^[IVXLC]+\.*$", s2.strip(), re.I): return True
    if "sacred-texts.com" in s2.lower(): return True
    return False

def guess_upanishad_slug(title_or_url: str) -> str:
    t = title_or_url.lower()
    # First match wins; duplicate and unreachable keys removed (behavior unchanged)
    for key in ["isa", "kena", "katha", "prashna", "prasna", "mundaka", "mandukya", "taittiriya", "aitareya",
                "kaushitaki", "kau", "svetasvatara", "svesh", "svet", "brihadaranyaka", "brhad", "chandogya",
                "upanishad"]:
        if key in t:
            return re.sub(r"[^a-z0-9]+","_", key)
    # fallback: last path piece
    leaf = title_or_url.rsplit("/",1)[-1]
    leaf = leaf.split(".")[0]
    return re.sub(r"[^a-z0-9]+","_", leaf)

def extract_title(soup: BeautifulSoup) -> str:
    for tag in ["h1","h2","title","center","b","strong"]:
        el = soup.find(tag)
        if el and el.get_text(strip=True):
            return clean(el.get_text(" ", strip=True))
    return "Upanishad"

def parse_page_to_lines(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    for t in soup.select("script, style, header, footer, nav"): t.decompose()
    body = soup.find("body") or soup
    # split on <br> and <p>, skip headings
    lines = []
    for el in body.find_all(["p","div","blockquote","center"]):
        buf = []
        for node in el.descendants:
            if isinstance(node, NavigableString):
                buf.append(str(node))
            elif getattr(node, "name", None) == "br":
                txt = clean(" ".join(buf))
                if txt and not looks_like_heading(txt):
                    lines.append(txt)
                buf = []
        tail = clean(" ".join(buf))
        if tail and not looks_like_heading(tail):
            lines.append(tail)
    # light prune
    lines = [ln for ln in lines if len(ln) > 1]
    return lines

# Crawl index pages → follow .htm links in the same directory tree
visited, pages = set(), []
for idx_url in BASES:
    idx_html = fetch(idx_url)
    soup = BeautifulSoup(idx_html, "lxml")
    base_dir = idx_url.rsplit("/",1)[0] + "/"
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if not href.lower().endswith(".htm"): continue
        url = abs_url(idx_url, href)
        if not url.startswith(base_dir): continue
        if "index" in url.lower(): continue
        pages.append(url)

pages = sorted(set(pages))
print(f"Discovered {len(pages)} candidate pages from SBE indices.")

# Group pages by upanishad slug (heuristic)
groups = {}
for url in pages:
    slug = guess_upanishad_slug(url)
    groups.setdefault(slug, []).append(url)

print("Grouped into", len(groups), "upanishad buckets (heuristic).")

# Parse each bucket → sequentially numbered units per upanishad
for slug, urls in groups.items():
    urls = sorted(urls)
    out_path = OUT_ROOT / f"{slug}.jsonl"
    items = []
    unit_no = 0
    for url in urls:
        try:
            html = fetch(url)
            lines = parse_page_to_lines(html)
            # Parse the page title once per page rather than once per line
            title = extract_title(BeautifulSoup(html, "lxml"))
            # SBE vol. 1 appeared in 1879 and vol. 15 in 1884
            year = 1879 if "/sbe01/" in url else 1884
            # Debug preview
            print(f"[{slug}] {url} → {len(lines)} lines")
            for ln in lines:
                unit_no += 1
                items.append({
                    "id": f"{slug}.{unit_no:04d}",
                    "collection": "Upanishads",
                    "work": slug,                      # normalized slug for filtering
                    "upanishad": title,
                    "canonical_ref": f"{slug.upper()} {unit_no}",
                    "chapter": None, "section": None, "verse": unit_no,
                    "translator": "Max Müller (Sacred Books of the East) — Public Domain",
                    "year": year,
                    "license": "Public Domain",
                    "source_url": url,
                    "lang": "en",
                    "text": ln
                })
        except Exception as e:
            print(f"  !! Failed: {url} → {e}")
    if not items:
        print(f"  !! No items for {slug}, skipping.")
        continue
    with open(out_path, "w", encoding="utf-8") as f:
        for it in items:
            f.write(json.dumps(it, ensure_ascii=False) + "\n")
    print(f"[OK] Saved {len(items)} units → {out_path}")


Discovered 367 candidate pages from SBE indices.
Grouped into 365 upanishad buckets (heuristic).
[errata] https://www.sacred-texts.com/hin/sbe01/errata.htm → 1 lines
[errata] https://www.sacred-texts.com/hin/sbe15/errata.htm → 2 lines
[OK] Saved 3 units → data/upanishads_sbe/errata.jsonl
[pageidx] https://www.sacred-texts.com/hin/sbe01/pageidx.htm → 1 lines
[pageidx] https://www.sacred-texts.com/hin/sbe15/pageidx.htm → 1 lines
[OK] Saved 2 units → data/upanishads_sbe/pageidx.jsonl
[sbe01000] https://www.sacred-texts.com/hin/sbe01/sbe01000.htm → 10 lines
[OK] Saved 10 units → data/upanishads_sbe/sbe01000.jsonl
[sbe01001] https://www.sacred-texts.com/hin/sbe01/sbe01001.htm → 23 lines
[OK] Saved 23 units → data/upanishads_sbe/sbe01001.jsonl
[sbe01002] https://www.sacred-texts.com/hin/sbe01/sbe01002.htm → 121 lines
[OK] Saved 121 units → data/upanishads_sbe/sbe01002.jsonl
[sbe01003] https://www.sacred-texts.com/hin/sbe01/sbe01003.htm → 79 lines
[OK] Saved 79 units → data/upanishads_sbe/sbe
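
The log above shows the slug heuristic falling back to the file name, since page URLs such as `sbe01002.htm` carry no Upanishad name. A hedged alternative is to key pages on the volume and page sequence encoded in the URL, which also separates content pages from `errata.htm` and `pageidx.htm`. `parse_sbe_url` is a hypothetical helper, and the `sbeVVNNN.htm` pattern is inferred only from the URLs printed above.

```python
import re
from typing import Optional, Tuple

# Matches .../sbe01/sbe01002.htm: the volume appears in the folder and again in the file name
SBE_PAGE = re.compile(r"/sbe(\d{2})/sbe\1(\d{3})\.htm$")

def parse_sbe_url(url: str) -> Optional[Tuple[int, int]]:
    """Return (volume, page_sequence) for content pages, or None for anything else."""
    m = SBE_PAGE.search(url)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))

urls = [
    "https://www.sacred-texts.com/hin/sbe01/sbe01002.htm",
    "https://www.sacred-texts.com/hin/sbe01/errata.htm",
    "https://www.sacred-texts.com/hin/sbe15/pageidx.htm",
]
content_pages = [u for u in urls if parse_sbe_url(u) is not None]
print(content_pages)
```

Mapping page sequences to named Upanishads would still need the index page's link text; the URL alone does not provide it.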

In [25]:
# ===== Rig Veda (Griffith) → download, parse, and save JSONL per book =====
import re, json, time, pathlib, requests, urllib.parse
from bs4 import BeautifulSoup, NavigableString

# Index with links to Books (Mandala I..X). If this ever changes, set BOOK_LINKS manually.
RV_INDEX = "https://www.sacred-texts.com/hin/rigveda/index.htm"
OUT_DIR = pathlib.Path("data/rigveda_griffith")
OUT_DIR.mkdir(parents=True, exist_ok=True)

HDRS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

def fetch(url, tries=3, sleep=0.7):
    for _ in range(tries):
        r = requests.get(url, headers=HDRS, timeout=30)
        if r.ok:
            return r.text
        time.sleep(sleep)
    r.raise_for_status()

def abs_url(base, href):
    return urllib.parse.urljoin(base, href)

def clean(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip(" \u00a0")

LEAD_NUM = re.compile(r"^\s*(\d{1,3})\s*[\.\):\-–—]?\s*(.*)$", re.U | re.I)

def get_links_from(url, pattern=None, same_dir=True):
    html = fetch(url)
    soup = BeautifulSoup(html, "lxml")
    base_dir = url.rsplit("/",1)[0] + "/"
    out = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if not href.lower().endswith(".htm"): continue
        full = abs_url(url, href)
        if same_dir and not full.startswith(base_dir): continue
        if pattern and not re.search(pattern, full, re.I): continue
        if "index" in full.lower(): continue
        out.append(full)
    return sorted(set(out))

# 1) From the index, find Book pages (mandala pages)
book_pages = get_links_from(RV_INDEX, pattern=r"(book|bk|mandala|rv|rvi|rvs)")
print("Found candidate Book pages:", len(book_pages))
if not book_pages:
    print("No book pages discovered from index. You may set BOOK_PAGES manually.")

# 2) From each Book page, find Hymn pages; then parse verses inside each hymn page
def parse_hymn_page(html):
    soup = BeautifulSoup(html, "lxml")
    for t in soup.select("script, style, header, footer, nav"): t.decompose()
    body = soup.find("body") or soup
    lines = []
    # Many hymn pages number verses as "1. text", "2. text", etc.
    # We'll split on <br> and <p> and capture leading numbers.
    verse_buffer = []
    for el in body.find_all(["p","div","blockquote","center"]):
        buf = []
        for node in el.descendants:
            if isinstance(node, NavigableString):
                buf.append(str(node))
            elif getattr(node, "name", None) == "br":
                txt = clean(" ".join(buf))
                buf = []
                if not txt: continue
                m = LEAD_NUM.match(txt)
                if m:
                    # start a new numbered verse
                    lines.append((int(m.group(1)), clean(m.group(2))))
                elif lines:
                    # continuation of current verse
                    vno, prev = lines[-1]
                    lines[-1] = (vno, clean(prev + " " + txt))
        tail = clean(" ".join(buf))
        if tail:
            m = LEAD_NUM.match(tail)
            if m:
                lines.append((int(m.group(1)), clean(m.group(2))))
            elif lines:
                vno, prev = lines[-1]
                lines[-1] = (vno, clean(prev + " " + tail))
    # de-dup and prune
    seen, out = set(), []
    for vno, txt in lines:
        key = (vno, txt)
        if key in seen: continue
        seen.add(key)
        if txt and len(txt) > 1:
            out.append((vno, txt))
    return out

def parse_book(book_url, book_num_guess=None):
    # discover hymn pages from the book page
    hymn_pages = get_links_from(book_url, pattern=r"(hymn|hym|rv|rvi|rvs|bk|book)", same_dir=False)
    print(f"Book page {book_url} → {len(hymn_pages)} hymn pages")
    # guess book number from URL/title if not provided
    book_num = book_num_guess
    if book_num is None:
        m = re.search(r"book\s*(\d+)", fetch(book_url), re.I)
        if m: book_num = int(m.group(1))
    if book_num is None:
        # last-resort heuristic
        m = re.search(r"/(\d{1,2})[^/]*\.htm$", book_url)
        book_num = int(m.group(1)) if m else 0

    all_rows = []
    hymn_idx = 0
    for hymn_url in hymn_pages:
        hymn_idx += 1
        try:
            html = fetch(hymn_url)
            verses = parse_hymn_page(html)
            if not verses:
                # fallback: treat whole page as one verse
                txt = clean(BeautifulSoup(html, "lxml").get_text(" ", strip=True))
                if txt:
                    verses = [(1, txt)]
            for vno, text in verses:
                all_rows.append((book_num, hymn_idx, vno, text, hymn_url))
        except Exception as e:
            print("  !! hymn fail:", hymn_url, "→", e)
    return all_rows

# Parse each discovered book page and write per-book JSONL
total = 0
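for i, b in enumerate(book_pages, start=1):
    # Take the mandala number from the URL (e.g. rvi01.htm → 1). The enumerate index is
    # off by one whenever a non-book page such as rv01000.htm is discovered first.
    m = re.search(r"rvi(\d+)\.htm$", b, re.I)
    rows = parse_book(b, book_num_guess=int(m.group(1)) if m else None)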
    if not rows:
        print("!! empty book:", b)
        continue
    out_path = OUT_DIR / f"rv_book{i:02d}.jsonl"
    with open(out_path, "w", encoding="utf-8") as f:
        for (book, hymn, verse, text, src) in rows:
            item = {
                "id": f"RV.{book}.{hymn}.{verse}",
                "collection": "Vedas",
                "work": "Rig Veda",
                "book": book, "hymn": hymn, "verse": verse,
                "canonical_ref": f"RV {book}.{hymn}.{verse}",
                "translator": "R. T. H. Griffith — Public Domain",
                "year": 1896,
                "license": "Public Domain",
                "source_url": src,
                "lang": "en",
                "text": text
            }
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            total += 1
    print(f"[OK] Book {i:02d}: {len(rows)} verses → {out_path}")

print("Total Rig Veda units written:", total)


Found candidate Book pages: 11
Book page https://www.sacred-texts.com/hin/rigveda/rv01000.htm → 2 hymn pages
[OK] Book 01: 10 verses → data/rigveda_griffith/rv_book01.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi01.htm → 194 hymn pages
[OK] Book 02: 1982 verses → data/rigveda_griffith/rv_book02.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi02.htm → 46 hymn pages
[OK] Book 03: 436 verses → data/rigveda_griffith/rv_book03.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi03.htm → 65 hymn pages
[OK] Book 04: 620 verses → data/rigveda_griffith/rv_book04.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi04.htm → 61 hymn pages
[OK] Book 05: 592 verses → data/rigveda_griffith/rv_book05.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi05.htm → 90 hymn pages
[OK] Book 06: 729 verses → data/rigveda_griffith/rv_book06.jsonl
Book page https://www.sacred-texts.com/hin/rigveda/rvi06.htm → 78 hymn pages
[OK] Book 07: 770 verses → data/rigved
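
The verse-splitting rule in `parse_hymn_page` is easier to follow on plain strings. This sketch reuses the notebook's `LEAD_NUM` pattern; `merge_numbered_lines` is a hypothetical stand-in for the HTML-walking loop, and the sample lines are placeholders rather than real hymn text.

```python
import re
from typing import List, Tuple

LEAD_NUM = re.compile(r"^\s*(\d{1,3})\s*[\.\):\-–—]?\s*(.*)$")

def merge_numbered_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Start a verse at each numbered line; append unnumbered lines to the current verse."""
    verses: List[Tuple[int, str]] = []
    for raw in lines:
        txt = " ".join(raw.split())            # collapse whitespace
        if not txt:
            continue
        m = LEAD_NUM.match(txt)
        if m:
            verses.append((int(m.group(1)), m.group(2).strip()))
        elif verses:
            vno, prev = verses[-1]
            verses[-1] = (vno, f"{prev} {txt}")
        # unnumbered text before the first verse (titles, navigation) is dropped
    return verses

lines = [
    "HYMN I. Agni.",
    "1. first line of verse one,",
    "second line of verse one.",
    "2 only line of verse two.",
]
print(merge_numbered_lines(lines))
```

One caveat of this rule: a continuation line that happens to begin with a number is treated as the start of a new verse.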

## 2) Save embeddings for the three scriptures

In [26]:
# ===== Build embeddings for all shards under data/** =====
import os, json, pathlib, numpy as np
from sentence_transformers import SentenceTransformer

ROOT = pathlib.Path("data")
EMB_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # 384-d
embedder = SentenceTransformer(EMB_MODEL)

def read_jsonl(p: pathlib.Path):
    with p.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            yield json.loads(line)

def build_folder(folder: pathlib.Path):
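    # Collect all .jsonl files within this folder (non-recursive), skipping the combined
    # chunks.jsonl written below so a re-run does not ingest every record twice
    jsonls = sorted([p for p in folder.glob("*.jsonl") if p.is_file() and p.name != "chunks.jsonl"])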
    if not jsonls:
        return None
    print(f"\n[Shard] {folder}  ({len(jsonls)} JSONL files)")
    items, texts = [], []
    for j in jsonls:
        for rec in read_jsonl(j):
            items.append(rec)
            texts.append(rec.get("text",""))
    if not items:
        print("  (no items)"); return None

    # Embed (batched), normalize, and save Float16
    print(f"  Embedding {len(items)} units …")
    embs = embedder.encode(texts, normalize_embeddings=True, batch_size=128, show_progress_bar=True)
    embs = np.asarray(embs, dtype=np.float32)
    dim = embs.shape[1]
    embs_f16 = embs.astype(np.float16)

    # Save combined chunks.jsonl and embeddings.f16.bin in the folder
    combined_chunks = folder / "chunks.jsonl"
    with combined_chunks.open("w", encoding="utf-8") as f:
        for rec in items:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    emb_bin = folder / "embeddings.f16.bin"
    embs_f16.tofile(emb_bin)

    # Write a small manifest
    manifest = {
        "count": len(items),
        "dim": int(dim),
        "model": EMB_MODEL,
        "files": [str(p.name) for p in jsonls],
        "combined_chunks": combined_chunks.name,
        "embeddings_bin": emb_bin.name,
    }
    with (folder / "manifest.json").open("w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    print(f"  → Saved {len(items)} chunks, dim={dim}  to {folder.name}/")
    return manifest

# Walk immediate subfolders of data/ and build each shard
manifests = {}
for sub in sorted([p for p in ROOT.glob("*") if p.is_dir()]):
    m = build_folder(sub)
    if m:
        manifests[sub.name] = m

# Top-level manifest for the whole corpus
with (ROOT / "manifest.json").open("w", encoding="utf-8") as f:
    json.dump(manifests, f, indent=2)
print("\n[Done] Manifests written for shards:", ", ".join(manifests.keys()))



[Shard] data/gita_arnold  (1 JSONL files)
  Embedding 2436 units …



  → Saved 2436 chunks, dim=384  to gita_arnold/

[Shard] data/rigveda_griffith  (11 JSONL files)
  Embedding 10562 units …



  → Saved 10562 chunks, dim=384  to rigveda_griffith/

[Shard] data/upanishads_sbe  (365 JSONL files)
  Embedding 8412 units …



  → Saved 8412 chunks, dim=384  to upanishads_sbe/

[Done] Manifests written for shards: gita_arnold, rigveda_griffith, upanishads_sbe
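
The `.bin` file written by `tofile` holds raw float16 values with no header, so the array shape has to come from the manifest. Below is a minimal sketch of reading a shard back, assuming the file layout written above. `load_shard` is a hypothetical helper, and the round trip uses a temporary folder with random data rather than the real corpus.

```python
import json
import pathlib
import tempfile

import numpy as np

def load_shard(folder: pathlib.Path):
    """Load chunks and embeddings for one shard folder (hypothetical helper)."""
    manifest = json.loads((folder / "manifest.json").read_text(encoding="utf-8"))
    embs = np.fromfile(folder / manifest["embeddings_bin"], dtype=np.float16)
    # Restore the (count, dim) shape recorded in the manifest, then upcast for math
    embs = embs.reshape(manifest["count"], manifest["dim"]).astype(np.float32)
    with (folder / manifest["combined_chunks"]).open("r", encoding="utf-8") as f:
        chunks = [json.loads(line) for line in f if line.strip()]
    return chunks, embs

with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)
    fake = np.random.default_rng(0).random((3, 4), dtype=np.float32)  # stand-in embeddings
    fake.astype(np.float16).tofile(folder / "embeddings.f16.bin")
    (folder / "chunks.jsonl").write_text(
        "\n".join(json.dumps({"text": f"unit {i}"}) for i in range(3)) + "\n",
        encoding="utf-8",
    )
    (folder / "manifest.json").write_text(
        json.dumps({"count": 3, "dim": 4,
                    "combined_chunks": "chunks.jsonl",
                    "embeddings_bin": "embeddings.f16.bin"}),
        encoding="utf-8",
    )
    chunks, embs = load_shard(folder)
    print(len(chunks), embs.shape, embs.dtype)
```

Loading the saved shard this way would avoid re-encoding the corpus in later cells, at the cost of float16 rounding in the stored vectors.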


## 3) Build Hybrid Index (BM25 + MiniLM embeddings)

In [19]:

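# Load the Gita shard saved in step 2; the retrieval cells below index this corpus.
# (Uses read_jsonl and pathlib from the embedding cell above.)
CORPUS = list(read_jsonl(pathlib.Path("data/gita_arnold/chunks.jsonl")))

# Tokenization for BM25
def tokenize(s: str) -> List[str]:
    return s.lower().split()

bm25 = BM25Okapi([tokenize(c["text"]) for c in CORPUS])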

# Sentence embeddings
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(EMB_MODEL_NAME)
doc_texts = [c["text"] for c in CORPUS]
doc_emb = embedder.encode(doc_texts, normalize_embeddings=True, show_progress_bar=False)
doc_emb = np.asarray(doc_emb, dtype=np.float32)
doc_emb.shape


(2436, 384)
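
`tokenize` splits on whitespace only, so punctuation stays attached to tokens: `"KARMA?"` becomes `"karma?"`, which does not match a query token `"karma"`. One common alternative, shown here as an assumption rather than the notebook's method, is a regex word tokenizer. `tokenize_words` is a hypothetical name.

```python
import re
from typing import List

# A word is a run of letters/digits, optionally followed by an apostrophe suffix ("one's")
_WORD = re.compile(r"\w+(?:'\w+)?")

def tokenize_words(s: str) -> List[str]:
    """Lowercase and extract word tokens, dropping surrounding punctuation."""
    return _WORD.findall(s.lower())

line = "Thy work, the KARMA? Tell me what it is"
print(line.lower().split())     # whitespace split keeps "work," and "karma?"
print(tokenize_words(line))     # regex tokens drop the punctuation
```

If this were swapped in, the same tokenizer would need to be applied to both the corpus and the query so that BM25 compares like with like.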

## 4) Retrieval helpers

In [20]:

from typing import List, Dict, Any, Tuple

def embed_query(q: str) -> np.ndarray:
    return embedder.encode([q], normalize_embeddings=True)[0].astype(np.float32)

def cosine_topk(q_emb: np.ndarray, mat: np.ndarray, k: int = 5) -> List[Tuple[int, float]]:
    # mat: (N, D)
    scores = (mat @ q_emb).tolist()
    idxs = list(range(len(scores)))
    idxs.sort(key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in idxs[:k]]

def bm25_topk(q: str, k: int = 5) -> List[Tuple[int, float]]:
    scores = bm25.get_scores(tokenize(q))
    idxs = list(range(len(scores)))
    idxs.sort(key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in idxs[:k]]

def hybrid_retrieve(q: str, k_vec=5, k_bm=5, top_final=5) -> List[Dict[str, Any]]:
    q_emb = embed_query(q)
    vec = cosine_topk(q_emb, doc_emb, k=k_vec)
    bow = bm25_topk(q, k=k_bm)
    # merge by index, keeping the higher of the cosine score and the weighted BM25 score
    combined = {}
    for i, s in vec:
        combined[i] = max(combined.get(i, 0.0), float(s))
    # Normalize BM25 scores by the top BM25 hit so they are comparable to cosine scores
    if bow:
        max_bm = max(s for _, s in bow) or 1.0
        for i, s in bow:
            combined[i] = max(combined.get(i, 0.0), float(s)/max_bm * 0.9)  # down-weight BM25 slightly
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_final]
    results = [CORPUS[i] | {"_score": float(sc)} for i, sc in ranked]
    return results

# quick smoke test
hybrid_retrieve("What is nishkama karma?")


[{'canon_id': 'BG 8.3',
  'work': 'Bhagavad Gita',
  'chapter': 8,
  'verse': 3,
  'translator': 'Sir Edwin Arnold (1885) — Public Domain',
  'license': 'Public Domain',
  'source': 'https://www.sacred-texts.com/hin/gita/bg08.htm',
  'text': 'Thy work, the KARMA? Tell me what it is',
  '_score': 0.9},
 {'canon_id': 'BG 13.37',
  'work': 'Bhagavad Gita',
  'chapter': 13,
  'verse': 37,
  'translator': 'Sir Edwin Arnold (1885) — Public Domain',
  'license': 'Public Domain',
  'source': 'https://www.sacred-texts.com/hin/gita/bg13.htm',
  'text': 'And what is otherwise is ignorance!',
  '_score': 0.556723896743809},
 {'canon_id': 'BG 8.12',
  'work': 'Bhagavad Gita',
  'chapter': 8,
  'verse': 12,
  'translator': 'Sir Edwin Arnold (1885) — Public Domain',
  'license': 'Public Domain',
  'source': 'https://www.sacred-texts.com/hin/gita/bg08.htm',
  'text': 'Causing all life to live, is KARMA called:',
  '_score': 0.5285435914993286},
 {'canon_id': 'BG 2.250',
  'work': 'Bhagavad Gita',
  'c
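
The merge rule in `hybrid_retrieve` can be traced with toy numbers. The sketch below isolates that step in a hypothetical `merge_scores` function; the indices and scores are invented for illustration. The top BM25 hit always receives exactly the BM25 weight (0.9 here) unless its cosine score is higher, which is consistent with the first result above scoring 0.9.

```python
from typing import Dict, List, Tuple

Hits = List[Tuple[int, float]]

def merge_scores(vec: Hits, bow: Hits, bm_weight: float = 0.9) -> Hits:
    """Max-merge cosine hits with BM25 hits normalized by the top BM25 score."""
    combined: Dict[int, float] = {}
    for i, s in vec:
        combined[i] = max(combined.get(i, 0.0), float(s))
    if bow:
        max_bm = max(s for _, s in bow) or 1.0   # avoid dividing by zero
        for i, s in bow:
            combined[i] = max(combined.get(i, 0.0), float(s) / max_bm * bm_weight)
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)

vec = [(7, 0.55), (2, 0.50)]    # (doc index, cosine similarity)
bow = [(4, 12.0), (7, 6.0)]     # (doc index, raw BM25 score)
merged = merge_scores(vec, bow)
print(merged)
```

Document 4 wins on BM25 alone, while document 7 keeps its cosine score of 0.55 because its weighted BM25 score (0.45) is lower.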

## 5) Answer using Groq Llama (with citations), fallback to extractive

In [21]:

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

SYSTEM_PROMPT = '''You are a Hindu scripture assistant.
Follow these rules strictly:
1. Answer using ONLY the provided passages.
2. After each claim, include citations like `BG 2.47` or `BG 3.19`.
3. If the answer is not directly supported, say "Not found in current corpus."
4. Be concise and neutral about doctrinal schools; mention when views differ.
'''

def build_context(retrieved: List[Dict[str, Any]]) -> str:
    blocks = []
    for r in retrieved:
        cid = r.get("canon_id","?")
        txt = r["text"]
        trn = r.get("translator","")
        blocks.append(f"[{cid}] {txt} (Translator: {trn})")
    return "\n\n".join(blocks)

def extractive_answer(query: str, retrieved: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Simple rule-based summary of top chunks
    if not retrieved:
        return {"answer":"Not found in current corpus.", "citations":[]}
    snippets = []
    cites = []
    for r in retrieved[:3]:
        snippets.append(f"- {r['text']} ({r['canon_id']})")
        cites.append(r["canon_id"])
    # No canned summary here: a fixed summary would be wrong for most queries
    answer = (
        "Here are relevant passages:\n" + "\n".join(snippets) + "\n\n"
        "(citations: " + ", ".join(cites) + ")"
    )
    return {"answer": answer, "citations": cites}

def answer_query(query: str, k_vec=6, k_bm=6, top_final=5, temperature=0.2) -> Dict[str, Any]:
    retrieved = hybrid_retrieve(query, k_vec=k_vec, k_bm=k_bm, top_final=top_final)
    context = build_context(retrieved)
    if not GROQ_API_KEY:
        print("No GROQ_API_KEY found; using extractive fallback.\n")
        return extractive_answer(query, retrieved)

    payload = {
        "model": GROQ_MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Query: {query}\n\nPassages:\n{context}\n\nAnswer with citations."}
        ],
        "temperature": temperature,
        "max_tokens": 600,
        "stream": False
    }
    try:
        resp = requests.post(
            GROQ_URL,
            headers={
                "Authorization": f"Bearer {GROQ_API_KEY}",
                "Content-Type": "application/json"
            },
            json=payload,
            timeout=60
        )
        resp.raise_for_status()
        data = resp.json()
        text = data["choices"][0]["message"]["content"].strip()
        # naive citation scrape for demo
        cites = [c["canon_id"] for c in retrieved if c["canon_id"] in text]
        return {"answer": text, "citations": cites, "retrieved": retrieved}
    except Exception as e:
        print("Groq call failed, falling back to extractive. Error:\n", e)
        return extractive_answer(query, retrieved)
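
The citation scrape in `answer_query` uses a substring test, so a retrieved ID such as `BG 2.4` would also count as cited when the answer mentions `BG 2.47`, and IDs the model cites from outside the retrieved set go unnoticed. A stricter check is sketched below, assuming citations follow the `BG chapter.verse` shape requested in the system prompt. `check_citations` is a hypothetical helper.

```python
import re
from typing import Dict, List

CITE = re.compile(r"\bBG\s+(\d+)\.(\d+)\b")

def check_citations(answer: str, retrieved_ids: List[str]) -> Dict[str, List[str]]:
    """Split the IDs cited in an answer into those backed by retrieval and those not."""
    cited: List[str] = []
    for chap, verse in CITE.findall(answer):
        cid = f"BG {chap}.{verse}"
        if cid not in cited:                 # keep first-seen order, no duplicates
            cited.append(cid)
    allowed = set(retrieved_ids)
    return {
        "supported": [c for c in cited if c in allowed],
        "unsupported": [c for c in cited if c not in allowed],
    }

report = check_citations(
    "Act without attachment (BG 2.47). See also BG 3.19 and BG 2.47.",
    ["BG 2.47", "BG 2.4", "BG 5.36"],
)
print(report)
```

A non-empty `unsupported` list is a cheap signal that the model cited something it was never shown.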


## 6) Quick Smoke Tests

In [23]:

tests = [
    "What does gita say about parents?",
    "Explain nishkama karma with references.",
    "Is it ok to be attached?",
]

for q in tests:
    print("="*80)
    print("Q:", q)
    out = answer_query(q)
    print("\nAnswer:\n", out["answer"])
    if "retrieved" in out:
        print("\nTop sources:")
        for r in out["retrieved"]:
            print(f"  - {r['canon_id']}  score={r['_score']:.3f}")


Q: What does gita say about parents?

Answer:
 The Bhagavad Gita does not directly mention parents in the provided passages. However, it emphasizes the importance of selfless actions and the concept of sacrifice, which can be applied to one's relationships with family members, including parents.

It is worth noting that in Hindu tradition, parents are considered sacred and are often referred to as "guru" or "pitru" (ancestors). The Gita's emphasis on selfless actions and sacrifice can be seen as a way of honoring and respecting one's parents.

Not found in current corpus for direct mention of parents.

Top sources:
  - BG 4.78  score=0.900
  - BG 1.2  score=0.592
  - BG 18.7  score=0.517
  - BG 17.58  score=0.486
  - BG 2.83  score=0.459
Q: Explain nishkama karma with references.

Answer:
 Nishkama karma refers to selfless action, where one performs their duties without attachment to the outcome. This concept is described in the Bhagavad Gita as follows:

BG 5.36 states, "With life, wi