# RAG-Lite with TaskVine + BM25

This notebook shows how to build a **simple RAG-style pipeline** using:

- **TaskVine** to clean and chunk the books in parallel.
- **BM25** (keyword-based retrieval) to find relevant chunks for a question.

This version is **RAG-lite**: no embeddings, no GPU, just classical IR + distributed preprocessing.
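
To make "keyword-based retrieval" concrete, here is a minimal sketch of BM25 scoring in pure Python. It uses a common textbook variant of the formula with a non-negative IDF, not necessarily the exact variant the retriever library used later in this notebook implements. The function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight (non-negative IDF variant)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents (illustrative only, not taken from the corpus)
toy_docs = [
    "alice fell down the rabbit hole",
    "hamlet speaks to the ghost",
    "the rabbit looked at its watch",
]
toy_scores = bm25_scores("rabbit hole", toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best, [round(s, 3) for s in toy_scores])
```

A document only scores if it shares at least one exact token with the query, which is why BM25 needs no embeddings or GPU but also cannot match synonyms or paraphrases.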


## Imports and basic setup

In [1]:
import os
import math
import json
from pathlib import Path

import ndcctools.taskvine as vine

print("Python + TaskVine imports OK")

# Data directory with your local Gutenberg files
DATA_DIR = Path("data")

book_paths = sorted(DATA_DIR.glob("*.txt"))

if not book_paths:
    raise RuntimeError(f"No .txt files found in {DATA_DIR}")

print("Found books:")
for p in book_paths:
    print(f" - {p.name} ({p.stat().st_size} bytes)")

# Simple book_id from filename (without extension)
def book_id_from_path(path: Path) -> str:
    return path.stem

book_ids = [book_id_from_path(p) for p in book_paths]
print("\nBook IDs:", book_ids)

Python + TaskVine imports OK
Found books:
 - alice.txt (151191 bytes)
 - frankenstein.txt (421633 bytes)
 - pg64317.txt (306594 bytes)
 - shakespeare_complete.txt (5638525 bytes)

Book IDs: ['alice', 'frankenstein', 'pg64317', 'shakespeare_complete']


## Worker Function
This function runs on TaskVine workers, once per book: it cleans the text, splits it into chunks, and returns the chunks for RAG retrieval.


In [2]:
def clean_and_chunk_book(local_filename: str, book_id: str):
    """
    Read a local Gutenberg .txt file (staged by TaskVine on the worker),
    clean it, and chunk it into segments of at most 1000 characters
    (with up to 200 characters of overlap) for RAG-style use.

    Args:
        local_filename: filename as seen on the worker (e.g., "book.txt")
        book_id:        identifier for the book (e.g., "alice")

    Returns:
        list[dict]: one dict per chunk, e.g.
            {
              "book_id": ...,
              "chunk_id": ...,
              "total_chunks": ...,
              "relative_position": 0.0–1.0,
              "text": "...",
              "chunk_length": ...,
              "n_chars": ...,
              "n_words": ...,
              "preview": "...",
            }
    """
    # Imports INSIDE so the function is self-contained on the worker
    import re
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # --- 1. Read full file ---
    with open(local_filename, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()

    # --- 2. Strip Gutenberg boilerplate (best-effort) ---

    # Header: everything before a "START OF" marker
    start_markers = [
        "*** START OF THIS PROJECT GUTENBERG",
        "*** START OF THE PROJECT GUTENBERG",
        "***START OF THE PROJECT GUTENBERG",
        "*END*THE SMALL PRINT",  # older texts
    ]
    for marker in start_markers:
        idx = text.find(marker)
        if idx != -1:
            text = text[idx + len(marker):]
            break

    # Footer: everything after an "END OF" marker
    end_markers = [
        "*** END OF THIS PROJECT GUTENBERG",
        "*** END OF THE PROJECT GUTENBERG",
        "***END OF THE PROJECT GUTENBERG",
    ]
    for marker in end_markers:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
            break

    # --- 3. Basic normalization ---

    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Collapse multiple blank lines
    text = re.sub(r"\n\s*\n\s*\n+", "\n\n", text)

    # Collapse multiple spaces
    text = re.sub(r" +", " ", text)

    # Strip leading/trailing whitespace
    text = text.strip()

    # --- 4. Chunking with LangChain's RecursiveCharacterTextSplitter ---

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )

    chunks = splitter.split_text(text)
    total_chunks = len(chunks)

    # --- 5. Build result records with richer metadata ---

    results = []
    for i, chunk in enumerate(chunks):
        # Word count
        words = re.findall(r"\b\w+\b", chunk)
        n_words = len(words)

        # Relative position of this chunk in the book (0.0 = start, 1.0 = end)
        if total_chunks > 1:
            relative_position = i / (total_chunks - 1)
        else:
            relative_position = 0.0

        # Short single-line preview
        preview = chunk[:160].replace("\n", " ")

        results.append({
            "book_id": book_id,
            "chunk_id": i,
            "total_chunks": total_chunks,
            "relative_position": relative_position,
            "text": chunk,
            "chunk_length": len(chunk),
            "n_chars": len(chunk),
            "n_words": n_words,
            "preview": preview,
        })

    return results

print("✓ worker function clean_and_chunk_book() defined")

✓ worker function clean_and_chunk_book() defined
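
To illustrate what `chunk_size=1000` and `chunk_overlap=200` mean, the sketch below uses a plain fixed-size sliding window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at the listed separators, so its chunks are usually shorter than `chunk_size` and its overlaps are approximate. The helper name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 2500 characters of dummy text
sample_text = "".join(str(i % 10) for i in range(2500))
window_chunks = sliding_window_chunks(sample_text)
print([len(c) for c in window_chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, at the cost of storing some text twice.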


## Set Up the TaskVine Manager

In [3]:
import os

manager_name = os.environ.get("VINE_MANAGER_NAME")
print(f"Manager name: {manager_name}")

# Comma-separated port or port range, e.g. "9123" or "9123, 9150"
ports_str = os.environ.get("VINE_MANAGER_PORTS", "9123, 9150")
ports = [int(p.strip()) for p in ports_str.split(",")]

# A single value is passed as an int; two values are passed as a [low, high] range
if len(ports) == 1:
    ports = ports[0]

print(f"Manager Ports: {ports}")

Manager name: floability-910fe741-dcac-447c-9f5e-5171d1cbd423
Manager Ports: [9123, 9150]


In [4]:
import ndcctools.taskvine as vine
m = vine.Manager(ports, name=manager_name)

In [5]:
print(f"TaskVine Manager listening on port {m.port}, {m.name}")

TaskVine Manager listening on port 9123, floability-910fe741-dcac-447c-9f5e-5171d1cbd423


## Submit Tasks

In [6]:
task_meta = {}
total_tasks = 0

for path in book_paths:
    book_id = book_id_from_path(path)
    file_size = path.stat().st_size

    # Declare the local file; TaskVine will cache it on workers
    f = m.declare_file(str(path), cache="worker")

    # Create PythonTask: function + arguments
    #   first arg: the filename as it will appear on the worker ("book.txt")
    #   second arg: book_id
    task = vine.PythonTask(
        clean_and_chunk_book,
        "book.txt",
        book_id,
    )

    # Attach declared file as input, staged on worker as "book.txt"
    task.add_input(f, "book.txt")

    # Optionally set cores
    task.set_cores(1)

    t_id = m.submit(task)
    task_meta[t_id] = {
        "book_id": book_id,
        "path": str(path),
        "file_size": file_size,
    }
    total_tasks += 1

    print(f"Submitted task {t_id} for book {book_id} ({path.name}, {file_size} bytes)")

print(f"\nTotal tasks submitted: {total_tasks}")

Submitted task 1 for book alice (alice.txt, 151191 bytes)
Submitted task 2 for book frankenstein (frankenstein.txt, 421633 bytes)
Submitted task 3 for book pg64317 (pg64317.txt, 306594 bytes)
Submitted task 4 for book shakespeare_complete (shakespeare_complete.txt, 5638525 bytes)

Total tasks submitted: 4


## Wait for tasks and collect the corpus

In [7]:
corpus = []
completed = 0

while not m.empty():
    t = m.wait(5)  # wait up to 5 seconds
    if not t:
        continue

    completed += 1
    meta = task_meta[t.id]
    book_id = meta["book_id"]

    if t.successful():
        book_chunks = t.output
        corpus.extend(book_chunks)
        print(f"[{completed}/{total_tasks}] ✓ Task {t.id} for book {book_id} -> {len(book_chunks)} chunks")
    else:
        print(f"[{completed}/{total_tasks}] ✗ Task {t.id} for book {book_id} FAILED: {t.result}")

print("\nAll tasks done.")
print(f"Total chunks collected: {len(corpus)}")

[1/4] ✓ Task 1 for book alice -> 189 chunks
[2/4] ✓ Task 3 for book pg64317 -> 359 chunks
[3/4] ✓ Task 2 for book frankenstein -> 616 chunks
[4/4] ✓ Task 4 for book shakespeare_complete -> 7198 chunks

All tasks done.
Total chunks collected: 8362
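
Tasks are returned in whatever order the workers finish them (in the run above, task 3 completed before task 2), so the order of `corpus` depends on timing. If a reproducible order matters, one option is to sort by book and chunk index after collection. A minimal sketch, where `sort_corpus` and the demo records are hypothetical:

```python
def sort_corpus(chunks):
    """Return chunks ordered by book, then by position within the book."""
    return sorted(chunks, key=lambda c: (c["book_id"], c["chunk_id"]))

# Demo records in "arrival" order (illustrative only)
arrival_order = [
    {"book_id": "pg64317", "chunk_id": 0},
    {"book_id": "alice", "chunk_id": 1},
    {"book_id": "alice", "chunk_id": 0},
]
ordered = sort_corpus(arrival_order)
print([(c["book_id"], c["chunk_id"]) for c in ordered])
```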


## Save corpus and basic stats

In [8]:
import json
from collections import Counter

output_file = "gutenberg_corpus.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=2)

print(f"✓ Saved {len(corpus)} chunks to {output_file}")

# Quick stats
book_counts = Counter(c["book_id"] for c in corpus)
print("\nChunks per book:")
for b, cnt in sorted(book_counts.items()):
    print(f"  {b}: {cnt} chunks")

chunk_lengths = [c["chunk_length"] for c in corpus]
if chunk_lengths:
    print("\nChunk length stats:")
    print(f"  min: {min(chunk_lengths)}")
    print(f"  max: {max(chunk_lengths)}")
    print(f"  avg: {sum(chunk_lengths)/len(chunk_lengths):.1f}")

✓ Saved 8362 chunks to gutenberg_corpus.json

Chunks per book:
  alice: 189 chunks
  frankenstein: 616 chunks
  pg64317: 359 chunks
  shakespeare_complete: 7198 chunks

Chunk length stats:
  min: 35
  max: 999
  avg: 822.2
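
Because the corpus is saved as plain JSON, the retrieval part of the notebook could be rerun later without a TaskVine manager by reloading the file. The sketch below shows the round trip on a tiny demo corpus written to a temporary directory; `save_corpus`, `load_corpus`, and the demo file name are hypothetical helpers, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path

def save_corpus(chunks, path):
    """Write chunk records as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)

def load_corpus(path):
    """Read chunk records back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_corpus.json"
    records_in = [{"book_id": "alice", "chunk_id": 0, "text": "Alice’s Adventures"}]
    save_corpus(records_in, demo_path)
    records_out = load_corpus(demo_path)

print(records_out == records_in)
```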


In [9]:
corpus[0]

{'book_id': 'alice',
 'chunk_id': 0,
 'total_chunks': 189,
 'relative_position': 0.0,
 'text': 'EBOOK 11 ***\n\n[Illustration]\n\nAlice’s Adventures in Wonderland\n\nby Lewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3.0\n\nContents\n\n CHAPTER I. Down the Rabbit-Hole\n CHAPTER II. The Pool of Tears\n CHAPTER III. A Caucus-Race and a Long Tale\n CHAPTER IV. The Rabbit Sends in a Little Bill\n CHAPTER V. Advice from a Caterpillar\n CHAPTER VI. Pig and Pepper\n CHAPTER VII. A Mad Tea-Party\n CHAPTER VIII. The Queen’s Croquet-Ground\n CHAPTER IX. The Mock Turtle’s Story\n CHAPTER X. The Lobster Quadrille\n CHAPTER XI. Who Stole the Tarts?\n CHAPTER XII. Alice’s Evidence\n\nCHAPTER I.\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into\nthe book her sister was reading, but it had no pictures or\nconversations in it, “and what is the use of a book,” thought Alice\n“without pictur
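
The first chunk begins with `EBOOK 11 ***` because the worker function cuts the text immediately after the matched marker string, leaving the rest of the marker line in place. One possible refinement is to cut at the end of the line that contains the marker. A minimal sketch, assuming the marker always sits on a single line; `strip_header` and the sample text are hypothetical, and applying this would change the recorded outputs in this notebook:

```python
def strip_header(text, markers):
    """Drop everything up to and including the line containing the first marker found."""
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            # Skip to the end of the marker line instead of the end of the marker
            line_end = text.find("\n", idx)
            return text[line_end + 1:] if line_end != -1 else ""
    return text  # no marker found: leave the text unchanged

header_sample = (
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK 11 ***\n"
    "\n"
    "Alice's Adventures\n"
)
stripped = strip_header(header_sample, ["*** START OF THE PROJECT GUTENBERG"])
print(repr(stripped))
```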

## BM25 retriever + simple RAG helper

In [10]:
# STEP: Build BM25 retriever and a simple RAG-style query function

from langchain_core.documents import Document
from langchain_community.retrievers import BM25Retriever  # uses rank-bm25 under the hood

# Convert corpus -> LangChain Documents
documents = [
    Document(
        page_content=c["text"],
        metadata={
            "book_id": c["book_id"],
            "chunk_id": c["chunk_id"],
            "relative_position": c["relative_position"],
            "n_words": c["n_words"],
        },
    )
    for c in corpus
]

print(f"Built {len(documents)} Documents from corpus")

# Create BM25 retriever
retriever = BM25Retriever.from_documents(documents)
retriever.k = 4  # top-k context chunks per query

print("✓ BM25 retriever ready (k=4)")

def rag_query(query: str, k: int = 4):
    """
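    Retrieve the top-k chunks for `query` with BM25, print a short preview of
    each, and return (results, prompt), where prompt is an LLM-ready string.
    No LLM is called here. Uses the modern LangChain 'invoke' interface.
    """
    retriever.k = k  # note: this updates the shared module-level retriever
    results = retriever.invoke(query)  # invoke() is correct for LC 1.x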

    print("\n" + "="*70)
    print(f"RAG query: {query!r}")
    print("="*70)
    print(f"Retrieved {len(results)} chunks:\n")

    for i, doc in enumerate(results, 1):
        meta = doc.metadata
        pos_pct = f"{100 * meta.get('relative_position', 0.0):.1f}%"
        print(f"[{i}] book_id={meta['book_id']}  chunk_id={meta['chunk_id']}  pos={pos_pct}")
        preview = doc.page_content[:200].replace("\n", " ")
        print(f"    {preview}...")
        print()

    context = "\n\n".join(
        f"[book {doc.metadata['book_id']} | chunk {doc.metadata['chunk_id']}] {doc.page_content}"
        for doc in results
    )

    prompt = f"""You are a helpful assistant answering questions about classic literature.

Use ONLY the following context to answer the question. If the answer is not in the context, say you don't know.

Context:
{context}

Question: {query}
Answer:"""

    # print("="*70)
    # print("Example LLM prompt (truncated):")
    # print("="*70)
    # print(prompt[:800] + ("..." if len(prompt) > 800 else ""))

    return results, prompt


Built 8362 Documents from corpus
✓ BM25 retriever ready (k=4)
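
BM25 only matches exact tokens, so tokenization matters. If the retriever splits on whitespace alone (which appears to be the default for `BM25Retriever`, but is worth checking against the installed version), then `hole?` in a query will not match `rabbit-hole` in a chunk. The sketch below shows a simple normalizing tokenizer; `normalize_tokens` is a hypothetical helper, and the commented-out line assumes `from_documents` accepts a `preprocess_func` argument.

```python
import re

def normalize_tokens(text):
    """Lowercase the text and split it into runs of word characters,
    dropping punctuation such as '?' and splitting hyphenated words."""
    return re.findall(r"\w+", text.lower())

query_tokens = normalize_tokens("What happens when Alice falls down the rabbit hole?")
chunk_tokens = normalize_tokens("pop down a large rabbit-hole under the hedge.")
shared = sorted(set(query_tokens) & set(chunk_tokens))
print(shared)

# Assumption: the installed BM25Retriever supports a custom tokenizer like this:
# retriever = BM25Retriever.from_documents(documents, preprocess_func=normalize_tokens)
```

With normalization, both `rabbit` and `hole` are shared between the query and the chunk, which a whitespace-only split would miss.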


## Sample RAG Query

In [11]:
query_alice = "What happens when Alice falls down the rabbit hole?"
query_hamlet = "What does Hamlet mean when he says ‘To be or not to be’?"

In [12]:
r, p = rag_query(query_alice)


RAG query: 'What happens when Alice falls down the rabbit hole?'
Retrieved 4 chunks:

[1] book_id=alice  chunk_id=2  pos=1.1%
    There was nothing so _very_ remarkable in that; nor did Alice think it so _very_ much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over af...

[2] book_id=shakespeare_complete  chunk_id=5999  pos=83.4%
    QUINTUS. My sight is very dull, whate’er it bodes.  MARTIUS. And mine, I promise you. Were it not for shame, Well could I leave our sport to sleep awhile.  [_He falls into the pit._]  QUINTUS. What, a...

[3] book_id=alice  chunk_id=104  pos=55.3%
    “Well, I’d hardly finished the first verse,” said the Hatter, “when the Queen jumped up and bawled out, ‘He’s murdering the time! Off with his head!’”  “How dreadfully savage!” exclaimed Alice.  “And ...

[4] book_id=alice  chunk_id=8  pos=4.3%
    Alice was not a bit hurt, and she jumped up on to her feet in a moment: she looked up, but it was al
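
Result [2] comes from `shakespeare_complete` even though the question is about Alice, because BM25 ranks purely on shared keywords across the whole corpus. When the target book is known, one simple option is to retrieve a few extra chunks and then filter on the `book_id` metadata. A minimal sketch; `Doc` is a stand-in for the LangChain `Document` class so the example runs on its own, and `filter_by_book` is a hypothetical helper:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_book(docs, book_id, k=4):
    """Keep only documents from the given book, preserving rank order, up to k."""
    kept = [d for d in docs if d.metadata.get("book_id") == book_id]
    return kept[:k]

# Illustrative ranked results (placeholder text, not real retriever output)
ranked = [
    Doc("first hit", {"book_id": "alice", "chunk_id": 2}),
    Doc("second hit", {"book_id": "shakespeare_complete", "chunk_id": 5999}),
    Doc("third hit", {"book_id": "alice", "chunk_id": 104}),
]
alice_only = filter_by_book(ranked, "alice")
print([d.metadata["chunk_id"] for d in alice_only])
```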

In [13]:
print(p)

You are a helpful assistant answering questions about classic literature.

Use ONLY the following context to answer the question. If the answer is not in the context, say you don't know.

Context:
[book alice | chunk 2] There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh
dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the
time it all seemed quite natural); but when the Rabbit actually _took a
watch out of its waistcoat-pocket_, and looked at it, and then hurried
on, Alice started to her feet, for it flashed across her mind that she
had never before seen a rabbit with either a waistcoat-pocket, or a
watch to take out of it, and burning with curiosity, she ran across the
field after it, and fortunately was just in time to see it pop down a
large rabbit-hole under the hedge.

In another moment down went Alice after

In [27]:
r, p = rag_query(query_hamlet)


RAG query: 'What does Hamlet mean when he says ‘To be or not to be’?'
Retrieved 4 chunks:

[1] book_id=shakespeare_complete  chunk_id=1491  pos=20.7%
    Enter King, Queen, Laertes, Lords, Osric and Attendants with foils &c.  KING. Come, Hamlet, come, and take this hand from me.  [_The King puts Laertes’s hand into Hamlet’s._]  HAMLET. Give me your par...

[2] book_id=shakespeare_complete  chunk_id=1464  pos=20.3%
    SECOND CLOWN. Go to.  FIRST CLOWN. What is he that builds stronger than either the mason, the shipwright, or the carpenter?  SECOND CLOWN. The gallows-maker; for that frame outlives a thousand tenants...

[3] book_id=shakespeare_complete  chunk_id=469  pos=6.5%
    CLEOPATRA. [_Aside to Enobarbus_.] What does he mean?  ENOBARBUS. [_Aside to Cleopatra_.] To make his followers weep.  ANTONY. Tend me tonight; May be it is the period of your duty. Haply you shall no...

[4] book_id=shakespeare_complete  chunk_id=6581  pos=91.4%
    into the company of three or four gentleman

In [28]:
print(p)

You are a helpful assistant answering questions about classic literature.

Use ONLY the following context to answer the question. If the answer is not in the context, say you don't know.

Context:
[book shakespeare_complete | chunk 1491] Enter King, Queen, Laertes, Lords, Osric and Attendants with foils &c.

KING.
Come, Hamlet, come, and take this hand from me.

[_The King puts Laertes’s hand into Hamlet’s._]

HAMLET.
Give me your pardon, sir. I have done you wrong;
But pardon’t as you are a gentleman.
This presence knows, and you must needs have heard,
How I am punish’d with sore distraction.
What I have done
That might your nature, honour, and exception
Roughly awake, I here proclaim was madness.
Was’t Hamlet wrong’d Laertes? Never Hamlet.
If Hamlet from himself be ta’en away,
And when he’s not himself does wrong Laertes,
Then Hamlet does it not, Hamlet denies it.
Who does it, then? His madness. If’t be so,
Hamlet is of the faction that is wrong’d;
His madness is poor Hamlet’s enemy.