# Enhanced Contextual Semantic Chunking Demo

**Combines three techniques:**
1. Semantic chunking (natural boundaries)
2. LLM contextual enhancement (situating context)
3. Hybrid search (vector + FTS)

**Best for**: Production systems that need higher retrieval accuracy (roughly 20-30% better)
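
Hybrid search has to merge a vector ranking and a full-text (FTS) ranking into one list. This notebook does not show how `ContextualAgnoKnowledgeBase` does that, so the following is only a minimal sketch of one common approach, reciprocal rank fusion (RRF). The function name, the chunk IDs, and the constant `k = 60` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    that rank well in more than one list rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Made-up chunk IDs standing in for real search hits.
vector_hits = ["c3", "c1", "c7"]  # ranked by embedding similarity
fts_hits = ["c1", "c9", "c3"]     # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
print(fused)
```

Chunks found by both searches (`c1`, `c3`) outrank chunks found by only one, which is the behaviour hybrid search is after.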

## Setup

In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))
pdf_dir = project_root / "data" / "pdfs"

print(f"✅ Project root: {project_root}")

## 1. Download Sample PDFs

In [None]:
# import subprocess

# if pdf_dir.exists() and list(pdf_dir.glob("*.pdf")):
#     print(f"✅ PDFs already downloaded: {len(list(pdf_dir.glob('*.pdf')))} files")
# else:
#     print("📥 Downloading sample PDFs...")
#     result = subprocess.run([sys.executable, str(project_root / "scripts" / "download_pdfs.py")], 
#                           capture_output=True, text=True, cwd=str(project_root))
#     print(result.stdout)

## 2. Initialize Enhanced Knowledge Base

In [None]:
from src.rag.agno import ContextualAgnoKnowledgeBase

kb = ContextualAgnoKnowledgeBase(table_name="economics_enhanced_gemini")
print("✅ Enhanced knowledge base initialized")
print("   - Semantic chunking: ON")
print("   - Contextual enhancement: ON")
print("   - Hybrid search: ON")
print(f"   - Table: {kb.knowledge.vector_db.table_name}")

## 3. Ingest PDF with Context Enhancement

This will:
1. Extract text from PDF
2. Chunk semantically (natural boundaries)
3. Add LLM-generated context to each chunk
4. Store with hybrid indexing
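
Steps 2 and 3 above can be sketched in plain Python. This is a minimal sketch, not the implementation inside `ContextualAgnoKnowledgeBase`: `split_paragraphs`, `enhance_chunks`, and `stub_describe` are hypothetical names, splitting on blank lines stands in for semantic chunking, and a stub replaces the LLM call.

```python
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Stand-in for semantic chunking: split on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def enhance_chunks(
    title: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prefix each chunk with a short note situating it in the document."""
    return [f"[CONTEXT: {describe(title, chunk)}]\n{chunk}" for chunk in chunks]


def stub_describe(title: str, chunk: str) -> str:
    """Hypothetical stand-in for the LLM that writes the context note."""
    opening = " ".join(chunk.split()[:5])
    return f"From '{title}', passage starting '{opening}'"


# Made-up sample text, not taken from the ingested PDFs.
sample_text = (
    "Chapter one introduces the main idea.\n\n"
    "Chapter two gives a worked example."
)
enhanced = enhance_chunks("Sample Doc", split_paragraphs(sample_text), stub_describe)
for chunk in enhanced:
    print(chunk)
    print("-" * 40)
```

In a real pipeline, `describe` would call a model with the full document and the chunk, and the enhanced text is what gets embedded and indexed.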

In [None]:
# pdf_path = pdf_dir / "The Richest Man In Babylon.pdf"
# kb.ingest_pdf(str(pdf_path))
# print("\n✅ PDF ingested with contextual enhancement")

kb.ingest_directory(str(pdf_dir))
print("✅ All PDFs ingested")

## 4. Compare: Regular vs Enhanced Chunks

Enhanced chunks carry a `[CONTEXT: ...]` prefix that explains their role in the document.
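
To inspect the prefix and the original text separately, the two can be split apart. The sketch below assumes the prefix sits at the very start of the chunk and that the context note contains no `]` character; the exact format written by the knowledge base is not shown here, and `split_context` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches a leading "[CONTEXT: ...]" and any whitespace that follows it.
CONTEXT_PREFIX = re.compile(r"^\[CONTEXT:\s*(?P<context>.*?)\]\s*", re.DOTALL)


def split_context(chunk_text: str) -> Tuple[Optional[str], str]:
    """Return (context, body); context is None when there is no prefix."""
    match = CONTEXT_PREFIX.match(chunk_text)
    if match is None:
        return None, chunk_text
    return match.group("context").strip(), chunk_text[match.end():]


# Made-up chunk text for illustration.
context, body = split_context("[CONTEXT: Summary of chapter 3]\nBody text here.")
print(context)
print(body)
```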

In [None]:
# Query in Portuguese: "Explain briefly and concisely what the fifth golden rule is"
results = kb.search("Explique de forma resumida e concisa o que é a quinta regra de ouro", limit=2)

print("Enhanced Chunks with Context:\n" + "="*80)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(result.content[:500])
    print("-"*80)

## 5. Query with Agent

In [None]:
from agno.agent import Agent
from agno.models.google import Gemini
from src.config import settings

agent = Agent(
    model=Gemini(id="gemini-2.5-flash", api_key=settings.google_api_key),
    knowledge=kb.knowledge,
    search_knowledge=True,
    markdown=True,
    # debug_mode=True,
)

agent.print_response(
    "Explique de forma resumida e concisa o que é a quinta regra de ouro",
    stream=True
)

## 6. Accuracy Comparison

Test the same query against different approaches:

In [None]:
from src.rag.agno import AgnoKnowledgeBase, ContextualAgnoKnowledgeBase

kb_regular = AgnoKnowledgeBase(table_name="economics_docs_gemini")
kb_enhanced = ContextualAgnoKnowledgeBase(table_name="economics_enhanced_gemini")

query = "Explique de forma resumida e concisa o que é a quinta regra de ouro"

print("REGULAR SEMANTIC CHUNKING:")
print("="*80)
results_regular = kb_regular.search(query, limit=1)
print(results_regular[0].content[:600] if results_regular else "No results")

print("\n\nENHANCED CONTEXTUAL SEMANTIC CHUNKING:")
print("="*80)
results_enhanced = kb_enhanced.search(query, limit=1)
print(results_enhanced[0].content[:600] if results_enhanced else "No results")

## 7. Multi-Document Queries

In [None]:
queries = [
    "What are the key principles of economics?",
    "How do markets self-regulate?",
    "What is the relationship between labor and value?"
]

for query in queries:
    print(f"\n{'='*80}")
    print(f"Q: {query}")
    print('='*80)
    agent.print_response(query, stream=True)
    print("\n")