# 5. Document parsing/extraction
As their name indicates, LLMs work with language, i.e. text. This poses a problem when we want to work with files. While it's trivial to load text from formats such as TXT, JSON, and YAML, more complex formats like PDF and DOCX are more challenging, to the point that entire libraries are dedicated to this purpose and it remains an active area of research and development.

For educational purposes, the corpus in the source_documents directory used in this notebook consists of 10 recent articles from the Canadian publications Maclean's and The Walrus. Since these are all very recent, we can be reasonably confident that they were not part of our LLM's training data, which helps demonstrate that our RAG system is in fact working.

Feel free to replace these with your choice of PDF documents such as other recent articles, scientific papers, corporate or nonprofit reports, government/policy briefs, etc.

In [1]:
from pathlib import Path
import pymupdf

In [2]:
def list_pdfs(directory="source_documents"):
    base = Path(directory)
    return sorted([p for p in base.glob("*.pdf") if p.is_file()])

def extract_pdf_text(pdf_path):
    parts = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text")
            if text:
                parts.append(text)
    return "\n".join(parts).strip()

def extract_many_pdf_texts(paths):
    results = []
    for path in paths:
        try:
            text = extract_pdf_text(path)
        except Exception as e:
            # Skip unreadable files so error messages are never chunked and embedded downstream
            print(f"Skipping {path}: {e}")
            continue
        results.append((path, text))
    return results


In [3]:
pdf_dir = Path('source_documents')

pdf_paths = list_pdfs(pdf_dir)
if not pdf_paths:
    print(f"No PDF files found in {pdf_dir.resolve()}")
else:
    results = extract_many_pdf_texts(pdf_paths)
    for path, text in results:
        print("=" * 80)
        print(f"FILE: {path.name}")
        print("-" * 80)
        # Print first 500 chars to keep output readable; adjust as needed
        preview = text[:500] + ("..." if len(text) > 500 else "")
        print(preview)
        print()


FILE: Canada Protects Banks From Fraud. Why Not Investors_ - Macleans.ca.pdf
--------------------------------------------------------------------------------
Think bigger, Canada. Subscribe today.
Illustration by Maclean’s/iStock
Canada Protects Banks From Fraud. Why
Not Investors?
One scam can erase a lifetime of savings. A compensation fund can bring some of it back.
BY JEAN-PAUL BUREAUD
 Listen to this article
SEPTEMBER 18, 2025
B
ridging Finance Inc. was once one of Canada’s largest private
lenders, managing more than $2 billion for more than 26,000
investors—many of them everyday Canadians. In 2024, David and

Natasha Sharpe, the husband-and-wif...

FILE: How I Managed to Write a Book without Going (Too) Broke _ The Walrus.pdf
--------------------------------------------------------------------------------
ADVERTISEMENT
Arts & Culture
How I Managed to Write a Book
without Going (Too) Broke
A grant, a small advance, a supportive spouse, and the $100 I found outside the library
BY D

# 6. Chunking
In the context of RAG applications, we rarely want to feed entire documents into our LLM. This leads to the need for chunking, or splitting source documents into more manageable "chunks".

Note that the limiting factor for chunk size is usually not the LLM's context size but the embedding model's context size, which can be quite limited (see https://huggingface.co/spaces/mteb/leaderboard).

As a quick rule of thumb and with lots of caveats, you can think of one token as equivalent to roughly 4 characters.
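To make this rule of thumb concrete, the sketch below estimates a token count from character length and checks it against a limit. The helper names `estimate_tokens` and `fits_in_limit` are hypothetical, and the 256-token limit in the example is an assumed figure for illustration; check the model card of your embedding model for the real value, and use the model's own tokenizer when you need exact counts.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count (rule of thumb only)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def fits_in_limit(text, max_tokens, chars_per_token=4.0):
    """Return True if the estimated token count is within max_tokens."""
    return estimate_tokens(text, chars_per_token) <= max_tokens


# A 1000-character chunk, the same size as chunk_size in the splitter below
chunk = "x" * 1000
print(estimate_tokens(chunk))
# 256 is an assumed embedding-model limit, used here for illustration only
print(fits_in_limit(chunk, max_tokens=256))
```

If the estimate lands close to the limit, it is safer to lower the chunk size than to rely on the approximation.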

In [4]:
# Chunk each parsed document and collect chunks for embedding; also print a preview

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # characters per chunk
    chunk_overlap=150,     # overlap to preserve context
    separators=["\n\n", "\n", " ", ""]  # recursion order
)

# Number of chunks to show per document
num_preview_chunks = 3

# Accumulate across all documents for downstream embedding/storage
all_chunks = []
all_metadatas = []
chunk_ids = []

for doc_idx, (path, text) in enumerate(results):
    chunks = splitter.split_text(text or "")
    avg_len = int(round(sum(len(c) for c in chunks) / len(chunks))) if chunks else 0
    print("=" * 80)
    print(f"FILE: {Path(path).name} | total chunks: {len(chunks)} | avg chunk length: {avg_len}")
    print("-" * 80)
    for i, chunk in enumerate(chunks[:num_preview_chunks]):
        print(f"[chunk {i}] (length: {len(chunk)})")
        print(chunk)
        print()
    # Collect for embedding/vector storage
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        all_metadatas.append({
            "source": str(Path(path).name),
            "doc_index": doc_idx,
            "chunk_index": chunk_idx,
        })
        chunk_ids.append(f"{Path(path).stem}-{doc_idx}-{chunk_idx}")

print(f"Total chunks collected across documents: {len(all_chunks)}")

FILE: Canada Protects Banks From Fraud. Why Not Investors_ - Macleans.ca.pdf | total chunks: 15 | avg chunk length: 791
--------------------------------------------------------------------------------
[chunk 0] (length: 463)
Think bigger, Canada. Subscribe today.
Illustration by Maclean’s/iStock
Canada Protects Banks From Fraud. Why
Not Investors?
One scam can erase a lifetime of savings. A compensation fund can bring some of it back.
BY JEAN-PAUL BUREAUD
 Listen to this article
SEPTEMBER 18, 2025
B
ridging Finance Inc. was once one of Canada’s largest private
lenders, managing more than $2 billion for more than 26,000
investors—many of them everyday Canadians. In 2024, David and

[chunk 1] (length: 955)
Natasha Sharpe, the husband-and-wife duo who ran the firm, were found
guilty of fraud. Their misconduct included accepting kickbacks from
borrowers and diverting a $40-million loan to themselves. 
In June, a tribunal ordered them to pay more than $27 million in penalties.
But the real 

In [None]:
# Better splitter: full sentence or paragraph overlap, etc.
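
One possible direction for the splitter idea above is to pack whole sentences into chunks and carry the last sentence or two over as overlap, so that no chunk starts mid-sentence. The following is a minimal sketch under simplifying assumptions: the function names are hypothetical, sentences are detected with a naive regular expression (abbreviations such as "Inc." will be split incorrectly), and a chunk may exceed `max_chars` when a single sentence, or the overlap plus the next sentence, is longer than the limit.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    # Collapse whitespace first, since PDF extraction breaks lines mid-sentence
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return re.split(r"(?<=[.!?])\s+", normalized)


def chunk_by_sentences(text, max_chars=1000, overlap_sentences=1):
    """Pack whole sentences into chunks, repeating trailing sentences as overlap."""
    chunks = []
    current = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentences into the next chunk to preserve context
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One. Two. Three. Four."
print(chunk_by_sentences(sample, max_chars=10, overlap_sentences=1))
```

A production splitter would more likely rely on a dedicated sentence tokenizer, but the packing and overlap logic would look similar.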

# 7. Embedding generation

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
if not all_chunks:
    raise ValueError("No chunks available. Run the chunking cell first.")
embeddings = model.encode(all_chunks).tolist()

print(f"{len(embeddings)} vector embeddings generated with {len(embeddings[0])} dimensions")

281 vector embeddings generated with 384 dimensions


# 8. Vector storage using Chroma

While we can work with embeddings directly, this quickly becomes unwieldy at scale, so we turn to specialized databases: vector stores.
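
To illustrate what working with embeddings directly means, here is a minimal brute-force search sketch in plain Python. The function names and the two-dimensional toy vectors are made up for illustration. Every query compares against every stored vector, so the cost grows linearly with the collection size; vector stores avoid this with index structures such as HNSW.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, vectors, top_k=3):
    """Score every stored vector against the query and return the top_k (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in brute_force_search([1.0, 0.1], toy_vectors, top_k=2):
    print(index, round(score, 3))
```

For a few hundred chunks this approach is perfectly workable; the case for a vector store becomes stronger as the collection grows or when you need persistence and metadata filtering.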

In [6]:
import chromadb

# Create an in-memory Chroma client and (re)create the collection with cosine distance
client = chromadb.Client()
try:
    client.delete_collection("chunks")
except Exception:
    pass
collection = client.get_or_create_collection(name="chunks", metadata={"hnsw:space": "cosine"})

# Add all chunks with their embeddings and metadata
collection.add(
    ids=chunk_ids,
    documents=all_chunks,
    metadatas=all_metadatas,
    embeddings=embeddings,
)

print(f"Added {collection.count()} items to collection '{collection.name}' (in-memory)")

Added 281 items to collection 'chunks' (in-memory)


# 9. Retrieval

In [None]:
# Simple top-k retrieval with similarity threshold

query_text = "Why did the condo bubble burst?"
top_k = 5
min_similarity = 0.50  # keep results with cosine similarity >= threshold

# Compute embedding for the query
query_embedding = model.encode([query_text])[0].tolist()

# Query the vector store for top-k results
res = collection.query(
    query_embeddings=[query_embedding],
    n_results=top_k,
    include=["documents", "metadatas", "distances"],
)

# Convert distances to similarities. The collection was created with cosine distance
# ("hnsw:space": "cosine"), so sim = 1 - dist
ids = res.get("ids", [[]])[0]
docs = res.get("documents", [[]])[0]
metas = res.get("metadatas", [[]])[0]
dists = res.get("distances", [[]])[0]

ranked = []
for rank, (_id, doc, meta, dist) in enumerate(zip(ids, docs, metas, dists), start=1):
    sim = 1.0 - float(dist) if dist is not None else None
    if sim is None or sim >= min_similarity:
        ranked.append({
            "rank": rank,
            "id": _id,
            "similarity": sim,
            "distance": dist,
            "document": doc,
            "metadata": meta,
        })

print(f"Query: {query_text}")
print(f"Top-k requested: {top_k}; min_similarity: {min_similarity}")
if not ranked:
    print("No results passed the threshold.")
else:
    for r in ranked:
        print("-" * 80)
        if r["similarity"] is not None:
            print(f"[rank {r['rank']}] id={r['id']} sim={r['similarity']:.3f} dist={r['distance']:.3f}")
        else:
            print(f"[rank {r['rank']}] id={r['id']} dist={r['distance']}")
        print(f"doc: {r['document']}")
        print(f"meta: {r['metadata']}")


# 10. Retrieval-augmented generation (RAG)

Putting it all together!

In [7]:
import os
import json
from dotenv import load_dotenv
from openai import OpenAI

In [8]:
load_dotenv()
if not os.getenv("OPENROUTER_API_KEY"):
    print("OPENROUTER_API_KEY not found!")
    
api_key = os.getenv("OPENROUTER_API_KEY")

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=api_key,
)

In [17]:
TOP_K = 5  # number of chunks to retrieve for context

In [36]:
def answer_rag(question, top_k=TOP_K):
    # Embed the question
    query_embedding = model.encode([question])[0].tolist()

    # Retrieve similar chunks
    res = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas"],
    )

    docs = res.get("documents", [[]])[0]
    metas = res.get("metadatas", [[]])[0]

    # Build a simple context string with sources
    context_parts = []
    for i, (doc, meta) in enumerate(zip(docs, metas), start=1):
        src = meta.get("source", "unknown")
        context_parts.append(f"[{i}] ({src})\n{doc}")
    context_text = "\n\n".join(context_parts) if context_parts else "No context retrieved."

    # Minimal prompt: instruction + context + question
    messages = [
        {
            "role": "system",
            "content": (
                "You answer user queries using ONLY the provided context. "
                "If the answer is not in the context, say you don't know. "
                "Keep answers brief (3-6 sentences) and always cite sources by filename.\n\n"
                f"Context:\n{context_text}"
            ),
        },
        {
            "role": "user",
            "content": question,
        },
    ]

    completion = client.chat.completions.create(
        model="google/gemma-2-9b-it:free",
        messages=messages,
        temperature=0.0,
    )

    return (completion.choices[0].message.content, context_text)

In [43]:
# Minimal smoke test

# answer_rag returns a tuple: (answer text, retrieved context)
a = answer_rag("Why did the condo bubble pop?")

print(a[0])
print(a[1])



The condo bubble was caused by a combination of factors, including low interest rates, strong immigration, a booming economy, and a lack of government regulation. 


Investors were attracted to the high returns on condo investments, and the belief that prices would continue to rise. 

[1] (The Condo Crash - Macleans.ca.pdf)
photo illustration by andrew b. myers, photo retouching by kevin luc
The Condo Crash
For years, low interest rates fuelled a big-city condo-flipping frenzy. Profits got bigger and condos
got smaller. Now the bubble has popped, leaving behind thousands of unsellable, unlivable units.
BY ALI AMAD

[2] (The Condo Crash - Macleans.ca.pdf)
I
n 2005, the benchmark price for a condo in the Greater Toronto
Area was about $200,000. A decade later, it was more than $300,000.
By 2020, it was nearly $600,000—tripling in value in 15 years. By
comparison, the total value of the Toronto stock exchange merely doubled
over the same period. And unlike the stock market, condos seeme

In [42]:
q = input("User: ").strip()

a = answer_rag(q)

print("\nAssistant:\n", a[0])
print("\nRetrieved context: \n", a[1])

User:  What caused the condo bubble



Assistant:
 The condo bubble was caused by a combination of factors, including low interest rates, strong immigration, a booming economy, and a lack of government regulation. 


Investors were attracted to the high returns on condo investments, and the belief that prices would continue to rise.

Retrieved context: 
 [1] (The Condo Crash - Macleans.ca.pdf)
photo illustration by andrew b. myers, photo retouching by kevin luc
The Condo Crash
For years, low interest rates fuelled a big-city condo-flipping frenzy. Profits got bigger and condos
got smaller. Now the bubble has popped, leaving behind thousands of unsellable, unlivable units.
BY ALI AMAD

[2] (The Condo Crash - Macleans.ca.pdf)
I
n 2005, the benchmark price for a condo in the Greater Toronto
Area was about $200,000. A decade later, it was more than $300,000.
By 2020, it was nearly $600,000—tripling in value in 15 years. By
comparison, the total value of the Toronto stock exchange merely doubled
over the same period. And unlike

You should see by now that where and how we inject the context is a significant aspect of RAG systems, especially in a conversation (as opposed to one-off question answering). Experimenting with this is left as an exercise for the reader.
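
As a starting point for that experiment, the sketch below builds the message list in two ways: with the retrieved context in the system message (as `answer_rag` does) or attached to the latest user turn, with earlier conversation turns kept in between. The `build_messages` helper, its `placement` parameter, and the placeholder history are hypothetical and only meant to show the shape of the two options.

```python
INSTRUCTIONS = (
    "Answer using ONLY the provided context. "
    "If the answer is not in the context, say you don't know."
)


def build_messages(question, context_text, history=None, placement="system"):
    """Assemble chat messages with retrieved context in the system or the user turn."""
    past_turns = list(history or [])
    if placement == "system":
        # Context lives in the system message and is replaced on every turn
        system_content = f"{INSTRUCTIONS}\n\nContext:\n{context_text}"
        user_content = question
    elif placement == "user":
        # Context travels with the question it was retrieved for
        system_content = INSTRUCTIONS
        user_content = f"Context:\n{context_text}\n\nQuestion: {question}"
    else:
        raise ValueError("placement must be 'system' or 'user'")
    return [
        {"role": "system", "content": system_content},
        *past_turns,
        {"role": "user", "content": user_content},
    ]


# Placeholder history and context, for illustration only
history = [
    {"role": "user", "content": "Earlier question"},
    {"role": "assistant", "content": "Earlier answer"},
]
for placement in ("system", "user"):
    messages = build_messages("Follow-up question", "[1] (example.pdf)\n...", history, placement)
    print(placement, [m["role"] for m in messages])
```

With system placement, earlier answers stay in the history while the context they were based on is replaced each turn. With user placement, each context block stays next to its question if you store the full user message in the history, at the cost of a longer prompt.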

# 11. A brief note on long-term memory

While long-term memory (a chatbot's ability to recall information from previous conversations) is out of scope for today's workshop, you now have all the pieces needed to understand it, since it is based on RAG. In essence, facts are extracted from each conversation and stored as a collection of "memories". During subsequent conversations, RAG is used to surface relevant information from this collection and enrich the model's context, in the hope of producing more relevant responses.
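
The store-and-recall loop can be sketched in a few lines. Everything in this example is a toy stand-in: `MemoryStore` is a hypothetical class, the "embedding" is a simple word count rather than a real embedding model, and the facts are hard-coded, whereas a real system would typically use an LLM to extract them from the conversation.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def toy_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryStore:
    """Hypothetical minimal memory: store facts, recall the most similar ones."""

    def __init__(self):
        self._memories = []

    def remember(self, fact):
        self._memories.append((fact, toy_embed(fact)))

    def recall(self, query, top_k=2, min_similarity=0.1):
        query_vec = toy_embed(query)
        scored = [(toy_similarity(query_vec, vec), fact) for fact, vec in self._memories]
        scored = [pair for pair in scored if pair[0] >= min_similarity]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:top_k]]


store = MemoryStore()
# In a real system these facts would be extracted from past conversations
store.remember("The user lives in Toronto.")
store.remember("The user prefers short answers.")
print(store.recall("Which city does the user live in?", top_k=1))
```

The recalled facts would then be added to the model's context in the same way as the retrieved chunks in `answer_rag`.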