# 🔍 Advanced RAG Techniques

In this notebook, we'll go beyond the basics of Retrieval-Augmented Generation (RAG) and explore three advanced retrieval techniques that can improve the quality of generated answers.

### 🧠 What we'll build:

We'll start by loading 10-K filings from four companies (**Amazon**, **Tesla**, **Nvidia**, and **Apple**) and storing them in a **vector database**.

Then, we'll build a simple RAG pipeline and progressively apply the following advanced retrieval techniques:

- 🔄 **Re-ranking**: Reorder retrieved chunks based on relevance to improve answer quality.
- 🔗 **Multi-hop Retrieval**: Decompose complex questions and retrieve supporting information across multiple documents.
- 🧭 **Hybrid Search**: Combine sparse (keyword-based) and dense (embedding-based) retrieval for better recall.
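
As one illustration of the hybrid search idea in the list above, a common way to merge a keyword ranking with an embedding ranking is reciprocal rank fusion (RRF). The sketch below is a minimal example under stated assumptions: the function name `reciprocal_rank_fusion`, the constant `k=60`, and the chunk IDs are illustrative and not taken from this notebook, which may combine results differently.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; each ID scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (BM25) search and a dense search
sparse_hits = ["chunk_a", "chunk_b", "chunk_c"]
dense_hits = ["chunk_b", "chunk_d", "chunk_a"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Chunks that appear near the top of both lists rise to the top of the fused list, while chunks found by only one retriever are kept but ranked lower.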

> This notebook gives you a working playground, not just slides, to see how these techniques perform on real-world financial filings.

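**Note:** Download the 10-K filings from the SEC (https://www.sec.gov/search-filings) and save them as `Amazon.pdf`, `Apple.pdf`, `Nvidia.pdf`, and `Tesla.pdf` in the `../data` directory, which is where Step 1 looks for them.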

## 📋 Notebook Progress Tracker

✅ **Step 1**: Environment Setup & Configuration  
⏳ **Step 2**: Document Loading & Chunking  
⏳ **Step 3**: Embedding Generation & Storage  
⏳ **Step 4**: Basic RAG Implementation  
⏳ **Step 5**: Re-ranking with Cohere  
⏳ **Step 6**: Multi-Hop Retrieval  
⏳ **Step 7**: Hybrid Search (BM25 + Dense)  
⏳ **Step 8**: Evaluation & Comparison  

---

**Current Status**: Setting up environment and dependencies

## 🔧 Step 1: Environment Setup & Configuration

First, let's set up our environment variables and import all necessary libraries.

**Progress**: Loading environment variables and checking dependencies...

In [3]:
# Environment Setup
import os
from dotenv import load_dotenv
import re
from typing import Dict, List, Tuple
import uuid

# Load environment variables from .env file
load_dotenv()

# Verify environment variables are loaded
required_vars = ['PINECONE_API_KEY', 'PINECONE_INDEX', 'PINECONE_URL', 'OPENAI_API_KEY','COHERE_API_KEY']

print("🔧 Environment Variables Status:")
print("-" * 30)
for var in required_vars:
    value = os.getenv(var)
    if value:
        print(f"✅ {var}: Set")
    else:
        print(f"❌ {var}: Missing")

# Check if all required variables are present
missing_vars = [var for var in required_vars if not os.getenv(var)]

if missing_vars:
    print(f"\n❌ Missing variables: {missing_vars}")
    print("Please create a .env file and add all required variables")
else:
    print(f"\n🎉 All environment variables loaded successfully!")
    print(f"📋 Pinecone Index: {os.getenv('PINECONE_INDEX')}")

# Setup a data directory for PDFs
DATA_DIR = "../data"
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)
    print(f"📁 Created directory: {DATA_DIR}")
    print(f"Please add your PDF files to the '{DATA_DIR}' directory.")

# Verify PDF files in the data directory
print(f"\n📂 Checking for PDF files in '{DATA_DIR}' directory...")
pdf_files = {}
expected_companies = ['Amazon', 'Apple', 'Nvidia', 'Tesla']

for filename in os.listdir(DATA_DIR):
    if filename.endswith('.pdf'):
        # Extract company name from filename
        company = filename.replace('.pdf', '').capitalize()
        if company in expected_companies:
            pdf_files[company.lower()] = os.path.join(DATA_DIR, filename)
            file_size = os.path.getsize(os.path.join(DATA_DIR, filename)) / (1024 * 1024)  # Size in MB
            print(f"✅ Found {filename} - {file_size:.1f} MB")

# Check if all expected files are present
PDF_FILES = pdf_files
total_files = len(PDF_FILES)

if total_files == len(expected_companies):
    print(f"\n🎉 All {total_files} PDF files found successfully!")
    print("Ready to proceed with document loading and chunking.")
else:
    print(f"\n⚠️  Found {total_files}/{len(expected_companies)} expected files.")
    print(f"Please make sure Amazon.pdf, Apple.pdf, Nvidia.pdf, and Tesla.pdf are in the '{DATA_DIR}' directory.")

print("\n✅ Step 1 Complete: Environment setup finished!")

🔧 Environment Variables Status:
------------------------------
✅ PINECONE_API_KEY: Set
✅ PINECONE_INDEX: Set
✅ PINECONE_URL: Set
✅ OPENAI_API_KEY: Set
✅ COHERE_API_KEY: Set

🎉 All environment variables loaded successfully!
📋 Pinecone Index: advance-rag

📂 Checking for PDF files in '../data' directory...
✅ Found Nvidia.pdf - 4.1 MB
✅ Found Tesla.pdf - 8.7 MB
✅ Found Apple.pdf - 3.9 MB
✅ Found Amazon.pdf - 3.1 MB

🎉 All 4 PDF files found successfully!
Ready to proceed with document loading and chunking.

✅ Step 1 Complete: Environment setup finished!


## 📄 Step 2: Document Loading & Chunking

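Now we'll load the 10-K documents and split each page into chunks of up to 500 characters (with a 50-character overlap), attaching metadata such as company, year, section, and page number to every chunk.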
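
To make the chunk size and overlap settings concrete, here is a simplified fixed-window splitter. It is only a stand-in for intuition: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces rather than at fixed offsets, and the helper name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Placeholder text (repeating digits), not real filing content
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

Each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. That overlap reduces the chance that a sentence straddling a boundary is lost to both chunks.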

**Progress**: Loading PDFs and creating chunks with metadata...

In [4]:
# Document Loading and Chunking
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len
)

def extract_year_from_filename(filename: str) -> str:
    """Extract year from filename, default to 2023 if not found."""
    year_match = re.search(r'20\d{2}', filename)
    return year_match.group() if year_match else "2023"

def detect_section(text: str, page_num: int = None) -> str:
    """
    Detect 10K section based on text content and common section headers.
    Returns the most likely section name.
    """
    text_upper = text.upper()

    # Common 10K sections with their typical identifiers
    section_patterns = [
        ("Business", ["ITEM 1", "BUSINESS", "OUR BUSINESS", "THE BUSINESS"]),
        ("Risk Factors", ["ITEM 1A", "RISK FACTORS", "RISKS", "RISK FACTOR"]),
        ("Legal Proceedings", ["ITEM 3", "LEGAL PROCEEDINGS", "LITIGATION"]),
        ("Management Discussion", ["ITEM 7", "MD&A", "MANAGEMENT'S DISCUSSION", "MANAGEMENT DISCUSSION"]),
        ("Financial Statements", ["ITEM 8", "FINANCIAL STATEMENTS", "CONSOLIDATED STATEMENTS", "BALANCE SHEET"]),
        ("Controls and Procedures", ["ITEM 9A", "CONTROLS AND PROCEDURES", "INTERNAL CONTROL"]),
        ("Directors and Officers", ["ITEM 10", "DIRECTORS", "EXECUTIVE OFFICERS", "GOVERNANCE"]),
        ("Executive Compensation", ["ITEM 11", "EXECUTIVE COMPENSATION", "COMPENSATION"]),
        ("Security Ownership", ["ITEM 12", "SECURITY OWNERSHIP", "BENEFICIAL OWNERSHIP"]),
        ("Exhibits", ["ITEM 15", "EXHIBITS", "INDEX TO EXHIBITS"]),
    ]

    # Score each section by counting whole-token keyword matches, so that
    # "ITEM 1" is not also counted inside "ITEM 1A" or "ITEM 10"
    section_scores = {}
    for section_name, keywords in section_patterns:
        score = 0
        for keyword in keywords:
            pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
            score += len(re.findall(pattern, text_upper))
        section_scores[section_name] = score

    # Return section with highest score, or "General" if no clear match
    best_section = max(section_scores.items(), key=lambda x: x[1])
    return best_section[0] if best_section[1] > 0 else "General"

def create_chunk_id(company: str, year: str, section: str, chunk_index: int) -> str:
    """Create a standardized chunk ID."""
    company_clean = company.lower().replace(" ", "_")
    section_clean = section.lower().replace(" ", "_").replace("'", "")
    return f"{company_clean}_{year}_{section_clean}_{chunk_index:02d}"

def get_source_doc_id(filename: str) -> str:
    """Return the file's base name (e.g. 'Nvidia.pdf') as the document ID."""
    return os.path.basename(filename)

def process_company_documents(company: str, filename: str) -> List[Document]:
    """Process a single company's 10K document with enhanced metadata."""
    print(f"\n📄 Processing {company.upper()}: {filename}")
    print("-" * 40)

    try:
        # Load PDF using PyMuPDFLoader
        loader = PyMuPDFLoader(filename)
        documents = loader.load()
        print(f"   ✅ Loaded {len(documents)} pages")

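        # Extract metadata. The filing year is hardcoded; extract_year_from_filename()
        # is not called here, and with filenames such as Nvidia.pdf it would
        # return its "2023" default.
        year = "2024"
        source_doc_id = get_source_doc_id(filename)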

        company_chunks = []
        chunk_index = 0

        # Process each page separately to maintain page number tracking
        for page_num, doc in enumerate(documents, 1):
            page_content = doc.page_content
            page_chars = len(page_content)

            if page_chars < 50:  # Skip very short pages
                continue

            # Detect section for this page
            section = detect_section(page_content, page_num)

            # Split page into chunks
            page_chunks = text_splitter.split_text(page_content)

            # Create Document objects for each chunk
            for chunk_text in page_chunks:
                chunk_id = create_chunk_id(company, year, section, chunk_index)

                chunk_doc = Document(
                    page_content=chunk_text,
                    metadata={
                        "company": company,
                        "year": int(year),
                        "section": section,
                        "chunk_id": chunk_id,
                        "source_doc_id": source_doc_id,
                        "page_number": page_num,
                        "chunk_text": chunk_text,
                        "chunk_index": chunk_index,
                        "chunk_size": len(chunk_text),
                        "source_file": filename
                    }
                )

                company_chunks.append(chunk_doc)
                chunk_index += 1

        print(f"   ✂️  Created {len(company_chunks)} chunks across {len(documents)} pages")
        print(f"   📊 Total characters processed: {sum(len(doc.page_content) for doc in documents):,}")

        # Section summary
        sections_found = {}
        for chunk in company_chunks:
            section = chunk.metadata['section']
            sections_found[section] = sections_found.get(section, 0) + 1

        print(f"   📋 Sections detected: {', '.join(sections_found.keys())}")

        return company_chunks

    except Exception as e:
        print(f"   ❌ Error processing {filename}: {str(e)}")
        return []

# Main processing loop
all_documents = []
chunk_counts = {}
section_breakdown = {}

print("📚 Loading and chunking PDF documents with enhanced metadata...")
print("=" * 60)

for company, filename in PDF_FILES.items():
    company_chunks = process_company_documents(company, filename)

    if company_chunks:
        all_documents.extend(company_chunks)
        chunk_counts[company] = len(company_chunks)

        # Track sections per company
        company_sections = {}
        for chunk in company_chunks:
            section = chunk.metadata['section']
            company_sections[section] = company_sections.get(section, 0) + 1
        section_breakdown[company] = company_sections

        print(f"   ✅ {company.capitalize()}: {len(company_chunks)} chunks processed")
    else:
        chunk_counts[company] = 0

print("\n" + "=" * 60)
print("📊 ENHANCED PROCESSING SUMMARY")
print("=" * 60)

# Print chunks per company
for company, count in chunk_counts.items():
    print(f"📋 {company.capitalize()}: {count:,} chunks")
    if company in section_breakdown:
        for section, section_count in section_breakdown[company].items():
            print(f"   └── {section}: {section_count} chunks")

# Overall summary
total_chunks = len(all_documents)
total_companies = len([c for c in chunk_counts.values() if c > 0])

print(f"\n🎯 TOTALS:")
print(f"   📚 Total chunks: {total_chunks:,}")
print(f"   🏢 Companies processed: {total_companies}/{len(PDF_FILES)}")
if total_companies > 0:
    print(f"   📄 Average chunks per company: {total_chunks/total_companies:.0f}")

print(f"\n✅ Step 2 Complete: Document loading and chunking finished!")

📚 Loading and chunking PDF documents with enhanced metadata...

📄 Processing NVIDIA: ../data/Nvidia.pdf
----------------------------------------
   ✅ Loaded 230 pages
   ✂️  Created 1461 chunks across 230 pages
   📊 Total characters processed: 555,707
   📋 Sections detected: Controls and Procedures, Directors and Officers, Business, Risk Factors, General, Executive Compensation, Management Discussion, Financial Statements, Legal Proceedings, Exhibits
   ✅ Nvidia: 1461 chunks processed

📄 Processing TESLA: ../data/Tesla.pdf
----------------------------------------
   ✅ Loaded 188 pages
   ✂️  Created 1292 chunks across 188 pages
   📊 Total characters processed: 499,103
   📋 Sections detected: General, Controls and Procedures, Business, Risk Factors, Financial Statements, Directors and Officers, Executive Compensation, Exhibits
   ✅ Tesla: 1292 chunks processed

📄 Processing APPLE: ../data/Apple.pdf
----------------------------------------
   ✅ Loaded 123 pages
   ✂️  Created 776 chunks 

## 🤖 Step 3: Embedding Generation & Storage

Now we'll generate embeddings for all document chunks and store them in a Pinecone vector database.
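
The cell below passes `normalize_embeddings=True`, which scales every embedding to unit length. The following pure-Python sketch, using toy 3-dimensional vectors and hypothetical helper names, shows why that is convenient: for unit-length vectors the plain dot product equals the cosine similarity.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real 1024-dimensional embeddings
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

cosine = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cosine, dot_of_normalized)
```

Both values come out at approximately 0.96, so an index that scores by cosine or by dot product ranks normalized vectors the same way.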

**Progress**: Loading embedding model and storing vectors in Pinecone...
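
The storage cell below gives every vector a random `uuid.uuid4()` ID, so running it a second time would add a new copy of each chunk rather than overwriting the first. One possible alternative, sketched here under the assumption that each `chunk_id` is unique, derives the vector ID from the chunk ID with `uuid.uuid5`. The helper name `stable_vector_id`, the namespace choice, and the example chunk IDs are illustrative.

```python
import uuid


def stable_vector_id(chunk_id: str) -> str:
    """Map a chunk ID to the same UUID string on every run."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))


# Example chunk IDs in the company_year_section_index format used in Step 2
first = stable_vector_id("nvidia_2024_risk_factors_00")
second = stable_vector_id("nvidia_2024_risk_factors_00")
other = stable_vector_id("nvidia_2024_risk_factors_01")

print(first == second, first == other)
```

With stable IDs, a repeated upsert replaces the existing record for a chunk, which keeps the index size equal to the number of chunks across re-runs.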

In [5]:
# Embedding Generation and Storage
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

# Initialize embedding model
print("🤖 Loading multilingual-e5-large model...")
model = SentenceTransformer('intfloat/multilingual-e5-large')
print("✅ Model loaded successfully")

# Test embedding to verify dimensions
test_embedding = model.encode("test", normalize_embeddings=True)
print(f"📊 Embedding dimensions: {len(test_embedding)}")

# Initialize Pinecone
print("\n🔗 Connecting to Pinecone...")
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
index_name = os.getenv('PINECONE_INDEX')

# Check that the index exists (it is not created automatically; see the commented example below)
if index_name not in pc.list_indexes().names():
    print(f"⚠️ Index '{index_name}' not found. Please create it in your Pinecone project.")
    # Example of how to create an index (adjust dimension as needed)
    # from pinecone import ServerlessSpec
    # pc.create_index(
    #     name=index_name,
    #     dimension=len(test_embedding),
    #     metric="cosine",
    #     spec=ServerlessSpec(
    #         cloud='aws',
    #         region='us-west-2'
    #     )
    # )
    # print(f"✅ Created index: {index_name}")

index = pc.Index(index_name)
print(f"✅ Connected to index: {index_name}")

# Generate embeddings and store in Pinecone
print("\n🚀 Generating embeddings and storing in Pinecone...")
print("=" * 60)

batch_size = 100  # Process in batches
total_stored = 0
company_stored = {}

for i in range(0, len(all_documents), batch_size):
    batch_docs = all_documents[i:i + batch_size]

    print(f"\n📦 Processing batch {i//batch_size + 1}/{(len(all_documents)-1)//batch_size + 1}")
    print(f"   📄 Documents {i+1}-{min(i+batch_size, len(all_documents))} of {len(all_documents)}")

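    # Extract texts from batch.
    # Note: the E5 model card recommends prefixing inputs with "passage: " (documents)
    # or "query: " (queries); this cell embeds the raw chunk text.
    texts = [doc.page_content for doc in batch_docs]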

    # Generate embeddings
    print("   🤖 Generating embeddings...")
    embeddings = model.encode(texts, normalize_embeddings=True)

    # Prepare vectors for Pinecone
    vectors = []
    for doc, embedding in zip(batch_docs, embeddings):
        vector_id = str(uuid.uuid4())

        # Prepare metadata, reusing the IDs already computed during chunking
        metadata = {
            'company': doc.metadata['company'],
            'year': doc.metadata['year'],
            'section': doc.metadata.get('section', 'Financial Statements'),
            'chunk_id': doc.metadata['chunk_id'],
            'source_doc_id': doc.metadata['source_doc_id'],
            'page_number': doc.metadata.get('page_number', 1),
            'chunk_size': f"{len(doc.page_content)} characters",
            'source': doc.metadata['source_file'],
            'chunk_text': doc.page_content
        }

        vector = {
            'id': vector_id,
            'values': embedding.tolist(),
            'metadata': metadata
        }
        vectors.append(vector)

    # Store in Pinecone
    print("   📤 Uploading to Pinecone...")
    try:
        index.upsert(vectors=vectors)

        # Count by company
        for doc in batch_docs:
            company = doc.metadata['company']
            company_stored[company] = company_stored.get(company, 0) + 1

        total_stored += len(vectors)
        print(f"   ✅ Batch stored successfully ({len(vectors)} vectors)")

    except Exception as e:
        print(f"   ❌ Error storing batch: {str(e)}")

print("\n" + "=" * 60)
print("🎯 EMBEDDING & STORAGE SUMMARY")
print("=" * 60)

# Print storage by company
for company, count in company_stored.items():
    print(f"📋 {company.capitalize()}: {count:,} vectors stored")

print(f"\n📊 TOTALS:")
print(f"   🗄️  Total vectors stored: {total_stored:,}")
print(f"   🏢 Companies: {len(company_stored)}")
print(f"   📐 Embedding dimensions: {len(test_embedding)}")
print(f"   🤖 Model: intfloat/multilingual-e5-large")

# Verify index stats
try:
    print(f"\n🔍 Verifying Pinecone index...")
    stats = index.describe_index_stats()
    print(f"   📈 Total vectors in index: {stats.total_vector_count}")
    if hasattr(stats, 'namespaces') and stats.namespaces:
        print(f"   📁 Namespaces: {list(stats.namespaces.keys())}")
except Exception as e:
    print(f"   ⚠️  Could not retrieve index stats: {str(e)}")

if total_stored == len(all_documents):
    print(f"\n🎉 SUCCESS! All {total_stored} document chunks embedded and stored!")
    print("✅ Ready for RAG querying!")
else:
    print(f"\n⚠️  Stored {total_stored}/{len(all_documents)} chunks")
    print("Some chunks may have failed to store.")

print(f"\n✅ Step 3 Complete: Embedding generation and storage finished!")

🤖 Loading multilingual-e5-large model...
✅ Model loaded successfully


📊 Embedding dimensions: 1024

🔗 Connecting to Pinecone...
✅ Connected to index: advance-rag

🚀 Generating embeddings and storing in Pinecone...

📦 Processing batch 1/43
   📄 Documents 1-100 of 4282
   🤖 Generating embeddings...


   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 2/43
   📄 Documents 101-200 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 3/43
   📄 Documents 201-300 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 4/43
   📄 Documents 301-400 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 5/43
   📄 Documents 401-500 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 6/43
   📄 Documents 501-600 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅ Batch stored successfully (100 vectors)

📦 Processing batch 7/43
   📄 Documents 601-700 of 4282
   🤖 Generating embeddings...
   📤 Uploading to Pinecone...
   ✅