# Multimodal RAG Pipeline with LangChain + Ollama + Unstructured

## Complete RAG System Using Open-Source Tools

This notebook implements a production-ready multimodal RAG pipeline using:

### 🛠️ Tech Stack:
- **LangChain**: Orchestration framework
- **Ollama**: Local LLM inference (llama3.1, mistral, etc.)
- **Unstructured**: Advanced document parsing (PDFs, tables, images)
- **ChromaDB**: Vector database
- **MultiVectorRetriever**: Handle summaries + raw documents

### 📦 Features:
- ✅ Extract text, tables, and images from PDFs
- ✅ Summarize each element for better retrieval
- ✅ Store summaries in vector DB, raw content separately
- ✅ Retrieve relevant context and generate grounded answers
- ✅ 100% local and open-source
- ✅ No API costs
- ✅ **🆕 Process multiple PDFs with source tracking**
- ✅ **🆕 Query specific documents or across all PDFs**
- ✅ **🆕 Compare information between documents**
- ✅ **🆕 Incrementally add new PDFs to existing database**

## Step 0: Install Dependencies

In [75]:
# Install all required packages
%pip install -q langchain langchain-community langchain-ollama
%pip install -q chromadb
%pip install -q "unstructured[all-docs]" python-magic
%pip install -q pillow pdf2image pdfminer-six
%pip install -q pandas openpyxl tabulate
%pip install -q pytesseract

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 6.32.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [76]:
%pip install --force-reinstall numpy pandas scikit-learn

Collecting numpy
  Using cached numpy-2.3.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting pandas
  Using cached pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.7.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Using cached scipy-1.16.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (62 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>

## 🚀 Quick Start: Multi-PDF Processing

### Single PDF Mode
```python
pdf_path = "paper.pdf"
use_batch_mode = False
```

### Multi-PDF Mode (Recommended)
```python
pdf_paths = [
    "paper1.pdf",
    "paper2.pdf",
    "research_doc.pdf"
]
use_batch_mode = True
```

### Key Features:
- **Batch Processing**: Process multiple PDFs in one go
- **Source Tracking**: Every document knows which PDF it came from
- **Filtered Queries**: Query specific documents or all at once
- **Incremental Updates**: Add new PDFs without reprocessing everything
- **Document Comparison**: Compare information across sources

See Step 4 to configure your PDF paths!
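
The incremental-update idea listed above can be sketched in plain Python: compare the candidate PDF paths with the filenames that are already indexed and parse only the difference. The helper name `find_new_pdfs` and the `indexed_sources` set are assumptions for illustration; in this notebook such a set could be built from the `source_file` metadata stored with each summary.

```python
from pathlib import Path
from typing import Iterable, List, Set


def find_new_pdfs(pdf_paths: Iterable[str], indexed_sources: Set[str]) -> List[str]:
    """Return the paths whose filename is not yet indexed (hypothetical helper)."""
    new_paths = []
    seen = set(indexed_sources)
    for path in pdf_paths:
        name = Path(path).name  # source tracking here uses the bare filename
        if name not in seen:
            new_paths.append(path)
            seen.add(name)  # also skips duplicates within the same batch
    return new_paths


already_indexed = {"paper1.pdf", "paper2.pdf"}
candidates = ["pdfs/paper1.pdf", "pdfs/paper2.pdf", "pdfs/research_doc.pdf"]
print(find_new_pdfs(candidates, already_indexed))
```

Because the filename is the key, two different files with the same name in different folders would be treated as one document; a content hash would be a safer key if that can happen.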

## Prerequisites: Setup Ollama

### 1. Install Ollama
Download from: **https://ollama.ai/download**

### 2. Start Ollama Server
```bash
ollama serve
```

### 3. Pull Required Models
```bash
# For text generation and summarization
ollama pull gemma3:4b

# For embeddings (lighter model; matches the model used in Step 2)
ollama pull embeddinggemma:300m

# Optional: for image descriptions (matches the vision model used in Step 6)
ollama pull llava:7b
```

### 4. Verify Installation
```bash
ollama list
```

## Step 1: Import Libraries and Test Connection

In [1]:
import os
import uuid
import base64
from pathlib import Path
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# LangChain imports
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Unstructured imports
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.auto import partition

# Other imports
import pandas as pd
from IPython.display import Image as IPImage, display, HTML
import requests

print("✓ All imports successful!")

✓ All imports successful!


In [2]:
# Test Ollama connection
def test_ollama():
    """Test Ollama connection and list available models."""
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get('models', [])
            print("✓ Ollama is running!")
            print(f"✓ Found {len(models)} models:\n")
            for model in models:
                print(f"  • {model['name']}")
                print(f"    Size: {model.get('size', 0) / 1e9:.2f} GB\n")
            return True
        else:
            print(f"✗ Ollama returned status: {response.status_code}")
            return False
    except Exception as e:
        print(f"✗ Could not connect to Ollama: {e}")
        print("\nMake sure to:")
        print("  1. Install Ollama from https://ollama.ai/download")
        print("  2. Run: ollama serve")
        print("  3. Pull models: ollama pull llama3.1:8b")
        return False

test_ollama()

✗ Could not connect to Ollama: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/tags (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x733791c40890>: Failed to establish a new connection: [Errno 111] Connection refused'))

Make sure to:
  1. Install Ollama from https://ollama.ai/download
  2. Run: ollama serve
  3. Pull models: ollama pull llama3.1:8b


False

## Step 2: Initialize LangChain Components

In [3]:
# Initialize Ollama models
llm = ChatOllama(
    model="gemma3:4b",
    temperature=0.2,
)

# Initialize embeddings
embeddings = OllamaEmbeddings(
    model="embeddinggemma:300m",
)

print("✓ LLM initialized: gemma3:4b")
print("✓ Embeddings initialized: embeddinggemma:300m")
print("\nYou can change models by modifying the model parameter")

✓ LLM initialized: gemma3:4b
✓ Embeddings initialized: embeddinggemma:300m

You can change models by modifying the model parameter


## Step 3: Setup Vector Store and MultiVectorRetriever

The MultiVectorRetriever allows us to:
- Store **summaries** in the vector database (for efficient retrieval)
- Store **raw content** separately (for context to the LLM)
- Retrieve using summaries but return full content
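
The pattern described above can be illustrated without any libraries. In this toy sketch a word-overlap score stands in for embedding similarity, and a plain dict stands in for the document store; the function names and sample data are made up for illustration and are not part of LangChain.

```python
import uuid
from typing import Dict, List, Tuple


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_raw(query: str,
                 summary_index: List[Tuple[str, str]],
                 docstore: Dict[str, str],
                 k: int = 1) -> List[str]:
    """Search the summaries, then return the raw content stored under the same ID."""
    ranked = sorted(summary_index,
                    key=lambda item: overlap_score(query, item[1]),
                    reverse=True)
    return [docstore[doc_id] for doc_id, _ in ranked[:k] if doc_id in docstore]


# Index two elements: the summary is searchable, the raw content is what the LLM sees.
raw_items = {
    "table about gene expression in mice": "<table><tr><td>gene</td><td>fold change</td></tr></table>",
    "text describing the PDF parsing setup": "Full paragraph on how the PDFs were parsed ...",
}
summary_index, docstore = [], {}
for summary, raw in raw_items.items():
    doc_id = str(uuid.uuid4())  # shared key linking summary and raw content
    summary_index.append((doc_id, summary))
    docstore[doc_id] = raw

print(retrieve_raw("gene expression table", summary_index, docstore))
```

The query matches the first summary, but the function returns the raw HTML table stored under the same ID, which is the behaviour the retriever provides.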

In [4]:
# Initialize vector store
vectorstore = Chroma(
    collection_name="nasa_multimodal_rag",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db"
)

# Initialize document store for raw content
docstore = InMemoryStore()

# Setup ID key
id_key = "doc_id"

# Create MultiVectorRetriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

print("✓ Vector store initialized (ChromaDB)")
print("✓ Document store initialized (In-Memory)")
print("✓ MultiVectorRetriever ready")


✓ Vector store initialized (ChromaDB)
✓ Document store initialized (In-Memory)
✓ MultiVectorRetriever ready


## List the files

In [7]:
from pathlib import Path

# Get all PDF files in the pdfs folder
pdfs = [str(p) for p in Path("pdfs").glob("*.pdf")]

# Sort them alphabetically (optional)
pdfs.sort()

print(f"Found {len(pdfs)} PDF files:")
for path in pdfs:
    print(f"  - {path}")

Found 15 PDF files:
  - pdfs/PMC11500582.pdf
  - pdfs/PMC11988870.pdf
  - pdfs/PMC3040128.pdf
  - pdfs/PMC3177255.pdf
  - pdfs/PMC3630201.pdf
  - pdfs/PMC4095884.pdf
  - pdfs/PMC4136787.pdf
  - pdfs/PMC5387210.pdf
  - pdfs/PMC5460236.pdf
  - pdfs/PMC5587110.pdf
  - pdfs/PMC5666799.pdf
  - pdfs/PMC6222041.pdf
  - pdfs/PMC6813909.pdf
  - pdfs/PMC7998608.pdf
  - pdfs/PMC8396460.pdf


## Step 4: Document Parsing with Unstructured

Unstructured is a powerful library that can:
- Parse PDFs with high accuracy
- Extract tables and preserve structure
- Extract images embedded in documents
- Handle various document formats

In [9]:
def load_and_parse_pdf(pdf_path: str):
    """
    Parse PDF using Unstructured library.
    Extracts text, tables, and images.
    """
    print(f"Processing: {pdf_path}")
    print("This may take a few minutes depending on PDF size...\n")
    
    # Partition PDF into elements
    raw_pdf_elements = partition_pdf(
        filename=pdf_path,
        
        # Extract tables
        infer_table_structure=True,
        
        # Processing strategy
        strategy="hi_res",  # High resolution for better quality
        
        # Extract images
        extract_images_in_pdf=True,
        extract_image_block_types=["Image", "Table"],
        extract_image_block_to_payload=True,
        
        # Chunking strategy
        chunking_strategy="by_title",
        max_characters=10000,
        combine_text_under_n_chars=2000,
        new_after_n_chars=6000,
    )
    
    print(f"✓ Extracted {len(raw_pdf_elements)} elements from PDF\n")
    
    return raw_pdf_elements

def process_multiple_pdfs(pdf_paths: List[str]):
    """
    Process multiple PDFs and return all elements with source tracking.
    
    Args:
        pdf_paths: List of PDF file paths to process
        
    Returns:
        dict: Dictionary with source filename as key and elements as value
    """
    all_pdf_elements = {}
    
    print(f"Processing {len(pdf_paths)} PDF files...\n")
    print("="*80)
    
    for pdf_path in pdf_paths:
        if Path(pdf_path).exists():
            try:
                elements = load_and_parse_pdf(pdf_path)
                filename = Path(pdf_path).name
                all_pdf_elements[filename] = elements
                print(f"✓ Successfully processed: {filename}")
                print(f"  Elements extracted: {len(elements)}\n")
            except Exception as e:
                print(f"✗ Error processing {pdf_path}: {e}\n")
        else:
            print(f"✗ File not found: {pdf_path}\n")
    
    print("="*80)
    print(f"\n✓ Total PDFs processed: {len(all_pdf_elements)}")
    total_elements = sum(len(elements) for elements in all_pdf_elements.values())
    print(f"✓ Total elements extracted: {total_elements}\n")
    
    return all_pdf_elements

# Example: Single PDF processing (original behavior)
pdf_path = "paper.pdf"  # Change this to your PDF file

# Example: Multiple PDFs processing (new feature)
pdf_paths = pdfs

# Choose single or batch processing
use_batch_mode = True  # Set to False for single PDF

if use_batch_mode:
    # Process multiple PDFs
    all_elements_by_source = process_multiple_pdfs(pdf_paths)
    
    # Flatten all elements for backward compatibility
    elements = []
    for source, source_elements in all_elements_by_source.items():
        # Add source metadata to each element
        for elem in source_elements:
            if hasattr(elem, 'metadata'):
                elem.metadata.source_file = source
            elements.append(elem)
    
    if elements:
        print(f"Element types: {set([str(type(el).__name__) for el in elements])}")
else:
    # Process single PDF (original behavior)
    if Path(pdf_path).exists():
        elements = load_and_parse_pdf(pdf_path)
        # Add source metadata
        for elem in elements:
            if hasattr(elem, 'metadata'):
                elem.metadata.source_file = Path(pdf_path).name
        all_elements_by_source = {Path(pdf_path).name: elements}
        print(f"Element types: {set([str(type(el).__name__) for el in elements])}")
    else:
        print(f"✗ File not found: {pdf_path}")
        print("Please update the pdf_path variable with your PDF file")
        elements = []
        all_elements_by_source = {}


Processing 15 PDF files...

Processing: pdfs/PMC11500582.pdf
This may take a few minutes depending on PDF size...





The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


✓ Extracted 10 elements from PDF

✓ Successfully processed: PMC11500582.pdf
  Elements extracted: 10

Processing: pdfs/PMC11988870.pdf
This may take a few minutes depending on PDF size...

✓ Extracted 16 elements from PDF

✓ Successfully processed: PMC11988870.pdf
  Elements extracted: 16

Processing: pdfs/PMC3040128.pdf
This may take a few minutes depending on PDF size...

✓ Extracted 12 elements from PDF

✓ Successfully processed: PMC3040128.pdf
  Elements extracted: 12

Processing: pdfs/PMC3177255.pdf
This may take a few minutes depending on PDF size...

✓ Extracted 16 elements from PDF

✓ Successfully processed: PMC3177255.pdf
  Elements extracted: 16

Processing: pdfs/PMC3630201.pdf
This may take a few minutes depending on PDF size...

✓ Extracted 19 elements from PDF

✓ Successfully processed: PMC3630201.pdf
  Elements extracted: 19

Processing: pdfs/PMC4095884.pdf
This may take a few minutes depending on PDF size...

✓ Extracted 15 elements from PDF

✓ Successfully processed: PM

## Step 5: Categorize Elements by Type

In [10]:
def categorize_elements(raw_elements):
    """
    Collect the text chunks (CompositeElement objects) from the parsed elements.
    Tables and images are pulled out separately by get_tables() and
    get_images_base64(). Elements are kept as objects to preserve metadata.
    """
    texts = []
    
    for element in raw_elements:
        element_type = type(element).__name__
        
        if element_type == "CompositeElement":
            # Text chunk - keep the object so source_file metadata survives
            texts.append(element)
    
    return texts

# Get the tables from the CompositeElement objects
def get_tables(chunks):
    tables = []
    for chunk in chunks:
        if "CompositeElement" in str(type(chunk)):
            chunk_els = chunk.metadata.orig_elements
            for el in chunk_els:
                if "Table" in str(type(el)):
                    # Store table with source metadata
                    source_file = getattr(chunk.metadata, 'source_file', 'unknown')
                    tables.append({
                        'content': el.metadata.text_as_html,
                        'source': source_file
                    })
    return tables

# Get the images from the CompositeElement objects
def get_images_base64(chunks):
    images_b64 = []
    for chunk in chunks:
        if "CompositeElement" in str(type(chunk)):
            chunk_els = chunk.metadata.orig_elements
            for el in chunk_els:
                if "Image" in str(type(el)):
                    # Store image with source metadata
                    source_file = getattr(chunk.metadata, 'source_file', 'unknown')
                    images_b64.append({
                        'content': el.metadata.image_base64,
                        'source': source_file
                    })
    return images_b64

def filter_unknown_sources(texts, tables, images):
    """
    Filter out all elements with 'unknown' source.
    
    Args:
        texts: List of text elements (CompositeElement objects with metadata)
        tables: List of table dicts with 'content' and 'source' keys
        images: List of image dicts with 'content' and 'source' keys
    
    Returns:
        Tuple of filtered (texts, tables, images)
    """
    # Filter texts - check metadata.source_file
    filtered_texts = [
        text for text in texts 
        if hasattr(text, 'metadata') and 
        getattr(text.metadata, 'source_file', 'unknown') != 'unknown'
    ]
    
    # Filter tables - check 'source' key
    filtered_tables = [
        table for table in tables 
        if isinstance(table, dict) and table.get('source', 'unknown') != 'unknown'
    ]
    
    # Filter images - check 'source' key
    filtered_images = [
        image for image in images 
        if isinstance(image, dict) and image.get('source', 'unknown') != 'unknown'
    ]
    
    return filtered_texts, filtered_tables, filtered_images

if elements:
    texts = categorize_elements(elements)
    tables = get_tables(elements)
    images = get_images_base64(elements)
    
    print(f"✓ Categorized elements (before filtering):")
    print(f"  Text chunks: {len(texts)}")
    print(f"  Tables: {len(tables)}")
    print(f"  Images: {len(images)}")
    
    # Filter out unknown sources
    texts, tables, images = filter_unknown_sources(texts, tables, images)
    
    print(f"\n✓ After filtering out 'unknown' sources:")
    print(f"  Text chunks: {len(texts)}")
    print(f"  Tables: {len(tables)}")
    print(f"  Images: {len(images)}")
    
    # Show breakdown by source file
    if all_elements_by_source:
        print(f"\n📁 Breakdown by source file:")
        for source, source_elements in all_elements_by_source.items():
            source_texts = categorize_elements(source_elements)
            source_tables = get_tables(source_elements)
            source_images = get_images_base64(source_elements)
            # Filter each source's elements
            source_texts, source_tables, source_images = filter_unknown_sources(
                source_texts, source_tables, source_images
            )
            print(f"  {source}:")
            print(f"    - Text chunks: {len(source_texts)}")
            print(f"    - Tables: {len(source_tables)}")
            print(f"    - Images: {len(source_images)}")
else:
    texts, tables, images = [], [], []
    print("No elements to categorize")

✓ Categorized elements (before filtering):
  Text chunks: 213
  Tables: 28
  Images: 184

✓ After filtering out 'unknown' sources:
  Text chunks: 213
  Tables: 28
  Images: 184

📁 Breakdown by source file:
  PMC11500582.pdf:
    - Text chunks: 10
    - Tables: 0
    - Images: 6
  PMC11988870.pdf:
    - Text chunks: 16
    - Tables: 1
    - Images: 6
  PMC3040128.pdf:
    - Text chunks: 12
    - Tables: 2
    - Images: 44
  PMC3177255.pdf:
    - Text chunks: 16
    - Tables: 1
    - Images: 8
  PMC3630201.pdf:
    - Text chunks: 19
    - Tables: 4
    - Images: 7
  PMC4095884.pdf:
    - Text chunks: 15
    - Tables: 0
    - Images: 9
  PMC4136787.pdf:
    - Text chunks: 19
    - Tables: 6
    - Images: 12
  PMC5387210.pdf:
    - Text chunks: 14
    - Tables: 0
    - Images: 9
  PMC5460236.pdf:
    - Text chunks: 12
    - Tables: 2
    - Images: 4
  PMC5587110.pdf:
    - Text chunks: 17
    - Tables: 3
    - Images: 25
  PMC5666799.pdf:
    - Text chunks: 14
    - Tables: 3
    - Images:

## Step 6: Generate Summaries for All Elements

We'll generate summaries for:
- Text chunks
- Tables (describe structure and content)
- Images (use vision model if available)

In [11]:
# Prompt for text and table summarization
prompt_text = """
You are an assistant tasked with summarizing tables and text.
Give a concise comprehensive summary of the table or text.

Respond only with the summary, no additional comment.
Do not start your message by saying "Here is a summary" or anything like that.
Just give the summary as it is.

HTML Table or text chunk: {element}

"""
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain for text and tables
model = ChatOllama(temperature=0.2, model="gemma3:4b")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

print("✓ Text and table summarization chain ready")

# Image description prompt
prompt_template = (
    "Describe the image in detail. For context, the image is part of a "
    "research paper. Be specific about graphs, such as bar plots."
)
messages = [
    (
        "user",
        [
            {"type": "text", "text": prompt_template},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image}"},
            },
        ],
    )
]

image_prompt = ChatPromptTemplate.from_messages(messages)

# Image description chain
image_chain = image_prompt | ChatOllama(model="llava:7b") | StrOutputParser()

print("✓ Image description chain ready")

✓ Text and table summarization chain ready
✓ Image description chain ready


In [13]:
# Generate summaries for all elements using batch processing
print("Generating summaries...\n")

text_summaries = []
table_summaries = []
image_summaries = []

# Summarize texts using batch processing
if texts:
    print(f"Summarizing {len(texts)} text chunks...")
    # Extract text content from CompositeElement objects
    text_contents = [text_elem.text if hasattr(text_elem, 'text') else str(text_elem) for text_elem in texts]
    text_summaries = summarize_chain.batch(text_contents, {"max_concurrency": 3})
    print(f"✓ Generated {len(text_summaries)} text summaries")

# Summarize tables using batch processing
if tables:
    print(f"Summarizing {len(tables)} tables...")
    # Extract table content
    table_contents = [table_dict['content'] if isinstance(table_dict, dict) else table_dict for table_dict in tables]
    table_summaries = summarize_chain.batch(table_contents, {"max_concurrency": 3})
    print(f"✓ Generated {len(table_summaries)} table summaries")

# Summarize images using batch processing
if images:
    print(f"Describing {len(images)} images...")
    try:
        # Extract image content
        image_contents = [image_dict['content'] if isinstance(image_dict, dict) else image_dict for image_dict in images]
        image_summaries = image_chain.batch(image_contents)
        print(f"✓ Generated {len(image_summaries)} image summaries")
    except Exception as e:
        print(f"⚠ Error with vision model: {e}")
        print("Falling back to placeholder descriptions...")
        image_summaries = ["Image content (vision model not available)" for _ in images]

print(f"\n✓ Summarization complete!")
print(f"   Total summaries: {len(text_summaries) + len(table_summaries) + len(image_summaries)}")

Generating summaries...

Summarizing 213 text chunks...
✓ Generated 213 text summaries
Summarizing 28 tables...
✓ Generated 28 table summaries
Describing 184 images...
✓ Generated 184 image summaries

✓ Summarization complete!
   Total summaries: 425


## Step 7: Add to MultiVectorRetriever

Store:
- **Summaries** → Vector store (for retrieval)
- **Raw content** → Document store (for LLM context)
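
The indexing side of this split can be sketched in plain Python: each summary and its raw content get one shared ID, and a length check guards against `zip()` silently dropping items when the two lists differ. The helper name `build_index_records` and the record layout are assumptions for illustration, not the notebook's actual function.

```python
import uuid
from typing import Dict, List, Sequence, Tuple


def build_index_records(raw_contents: Sequence[str],
                        summaries: Sequence[str],
                        doc_type: str,
                        id_key: str = "doc_id") -> Tuple[List[Dict], List[Tuple[str, str]]]:
    """Pair each summary with its raw content under a fresh shared ID."""
    if len(raw_contents) != len(summaries):
        # zip() would silently drop the unmatched tail, so fail loudly instead
        raise ValueError(
            f"{len(raw_contents)} raw items but {len(summaries)} summaries "
            f"for type '{doc_type}'"
        )
    vector_records, docstore_pairs = [], []
    for raw, summary in zip(raw_contents, summaries):
        doc_id = str(uuid.uuid4())
        # Summary goes to the vector store; the ID in its metadata points at the raw item
        vector_records.append({"page_content": summary,
                               "metadata": {id_key: doc_id, "type": doc_type}})
        docstore_pairs.append((doc_id, raw))
    return vector_records, docstore_pairs


records, pairs = build_index_records(["<table>...</table>"], ["Table of results"], "table")
print(records[0]["metadata"]["type"], pairs[0][1])
```

A mismatch can occur if a summarization call fails partway through a batch, so checking lengths before indexing keeps summaries from being attached to the wrong raw content.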

In [14]:
def add_documents_to_retriever(texts, text_summaries, tables, table_summaries, images, image_summaries):
    """
    Add all documents to the MultiVectorRetriever.
    Now includes source file metadata for multi-PDF tracking.
    """
    
    # Add text documents
    text_ids = [str(uuid.uuid4()) for _ in texts]
    
    if texts:
        # Store summaries in vector store with source metadata
        summary_docs = []
        for i, (text_elem, summary) in enumerate(zip(texts, text_summaries)):
            source_file = getattr(text_elem.metadata, 'source_file', 'unknown') if hasattr(text_elem, 'metadata') else 'unknown'
            summary_docs.append(
                Document(
                    page_content=summary, 
                    metadata={
                        id_key: text_ids[i], 
                        "type": "text",
                        "source_file": source_file
                    }
                )
            )
        retriever.vectorstore.add_documents(summary_docs)
        
        # Store raw content in docstore
        text_contents = [elem.text if hasattr(elem, 'text') else str(elem) for elem in texts]
        retriever.docstore.mset(list(zip(text_ids, text_contents)))
        print(f"✓ Added {len(texts)} text documents")
    
    # Add table documents
    table_ids = [str(uuid.uuid4()) for _ in tables]
    
    if tables:
        # Store summaries in vector store with source metadata
        summary_docs = []
        for i, (table_dict, summary) in enumerate(zip(tables, table_summaries)):
            source_file = table_dict.get('source', 'unknown') if isinstance(table_dict, dict) else 'unknown'
            summary_docs.append(
                Document(
                    page_content=summary,
                    metadata={
                        id_key: table_ids[i],
                        "type": "table",
                        "source_file": source_file
                    }
                )
            )
        retriever.vectorstore.add_documents(summary_docs)
        
        # Store raw content in docstore
        table_contents = [t['content'] if isinstance(t, dict) else t for t in tables]
        retriever.docstore.mset(list(zip(table_ids, table_contents)))
        print(f"✓ Added {len(tables)} table documents")
    
    # Add image documents
    image_ids = [str(uuid.uuid4()) for _ in images]
    
    if images:
        # Store summaries in vector store with source metadata
        summary_docs = []
        for i, (image_dict, summary) in enumerate(zip(images, image_summaries)):
            source_file = image_dict.get('source', 'unknown') if isinstance(image_dict, dict) else 'unknown'
            summary_docs.append(
                Document(
                    page_content=summary,
                    metadata={
                        id_key: image_ids[i],
                        "type": "image",
                        "source_file": source_file
                    }
                )
            )
        retriever.vectorstore.add_documents(summary_docs)
        
        # Store raw content (base64) in docstore
        image_contents = [img['content'] if isinstance(img, dict) else img for img in images]
        retriever.docstore.mset(list(zip(image_ids, image_contents)))
        print(f"✓ Added {len(images)} image documents")
    
    # Show summary by source file
    if texts or tables or images:
        print("\n✓ All documents added to retriever!")
        print("\n📊 Documents by source file:")
        
        # Collect source statistics
        source_stats = {}
        
        # Count texts by source
        for text_elem in texts:
            source = getattr(text_elem.metadata, 'source_file', 'unknown') if hasattr(text_elem, 'metadata') else 'unknown'
            source_stats[source] = source_stats.get(source, {'texts': 0, 'tables': 0, 'images': 0})
            source_stats[source]['texts'] += 1
        
        # Count tables by source
        for table_dict in tables:
            source = table_dict.get('source', 'unknown') if isinstance(table_dict, dict) else 'unknown'
            source_stats[source] = source_stats.get(source, {'texts': 0, 'tables': 0, 'images': 0})
            source_stats[source]['tables'] += 1
        
        # Count images by source
        for image_dict in images:
            source = image_dict.get('source', 'unknown') if isinstance(image_dict, dict) else 'unknown'
            source_stats[source] = source_stats.get(source, {'texts': 0, 'tables': 0, 'images': 0})
            source_stats[source]['images'] += 1
        
        # Display statistics
        for source, stats in source_stats.items():
            total = stats['texts'] + stats['tables'] + stats['images']
            print(f"  {source}: {total} documents")
            print(f"    - Texts: {stats['texts']}, Tables: {stats['tables']}, Images: {stats['images']}")

# Add all documents
if texts or tables or images:
    add_documents_to_retriever(texts, text_summaries, tables, table_summaries, images, image_summaries)
else:
    print("No documents to add")


✓ Added 213 text documents
✓ Added 28 table documents
✓ Added 184 image documents

✓ All documents added to retriever!

📊 Documents by source file:
  PMC11500582.pdf: 16 documents
    - Texts: 10, Tables: 0, Images: 6
  PMC11988870.pdf: 23 documents
    - Texts: 16, Tables: 1, Images: 6
  PMC3040128.pdf: 58 documents
    - Texts: 12, Tables: 2, Images: 44
  PMC3177255.pdf: 25 documents
    - Texts: 16, Tables: 1, Images: 8
  PMC3630201.pdf: 30 documents
    - Texts: 19, Tables: 4, Images: 7
  PMC4095884.pdf: 24 documents
    - Texts: 15, Tables: 0, Images: 9
  PMC4136787.pdf: 37 documents
    - Texts: 19, Tables: 6, Images: 12
  PMC5387210.pdf: 23 documents
    - Texts: 14, Tables: 0, Images: 9
  PMC5460236.pdf: 18 documents
    - Texts: 12, Tables: 2, Images: 4
  PMC5587110.pdf: 45 documents
    - Texts: 17, Tables: 3, Images: 25
  PMC5666799.pdf: 40 documents
    - Texts: 14, Tables: 3, Images: 23
  PMC6222041.pdf: 23 documents
    - Texts: 12, Tables: 2, Images: 9
  PMC6813909.pdf: 

## Step 7.5: Query Specific Documents

When working with multiple PDFs, you can filter results by source file.

## Step 8: Create RAG Chain

Build a LangChain RAG pipeline that:
1. Retrieves relevant documents
2. Formats context
3. Generates grounded answers
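
The retrieve, format, generate flow of a RAG chain can be sketched in plain Python with stand-in callables. Here `retrieve` and `generate` are hypothetical placeholders for the retriever and the LLM, not LangChain APIs; the sketch only shows how the three stages compose.

```python
from typing import Callable, List


def format_context(docs: List[str], max_chars: int = 1000) -> str:
    """Number each retrieved document and cap its length to keep the prompt small."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        snippet = doc if len(doc) <= max_chars else doc[:max_chars] + "..."
        blocks.append(f"[Source {i}]\n{snippet}")
    return "\n\n".join(blocks)


def answer(question: str,
           retrieve: Callable[[str], List[str]],
           generate: Callable[[str], str]) -> str:
    """Run the three stages: retrieve, format context, generate."""
    docs = retrieve(question)
    prompt = f"Context:\n{format_context(docs)}\n\nQuestion: {question}\n\nAnswer:"
    return generate(prompt)


# Stand-ins so the sketch runs without a model.
def fake_retrieve(question: str) -> List[str]:
    return ["first passage", "second passage"]


def fake_generate(prompt: str) -> str:
    # A real LLM would write an answer; this only reports how many sources it saw.
    return f"answered using {prompt.count('[Source ')} sources"


print(answer("What was measured?", fake_retrieve, fake_generate))
```

Keeping the stages as separate callables makes each one easy to swap or test in isolation, which is the same idea the LangChain pipe syntax expresses.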

In [15]:
# Create prompt template
template = """Answer the question based on the following context. Don't cite any sources in your response. Don't say "Based on the resource" or any other similar sentences. Respond with Markdown format.

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Create RAG chain
def format_docs(docs):
    """Format retrieved documents for the prompt."""
    formatted = []
    for i, doc in enumerate(docs):
        content = doc if isinstance(doc, str) else str(doc)
        formatted.append(f"[Source {i+1}]\n{content[:1000]}...\n")
    return "\n\n".join(formatted)

def format_docs_with_source(docs, metadata_list):
    """Format retrieved documents with source file information."""
    formatted = []
    for i, (doc, meta) in enumerate(zip(docs, metadata_list)):
        content = doc if isinstance(doc, str) else str(doc)
        source = meta.get('source_file', 'unknown') if meta else 'unknown'
        formatted.append(f"[Source {i+1} - from {source}]\n{content[:1000]}...\n")
    return "\n\n".join(formatted)

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print("✓ RAG chain created!")
print("✓ Ready to answer questions from multiple PDFs")


✓ RAG chain created!
✓ Ready to answer questions from multiple PDFs


## Step 9: Query the System

Now you can ask questions about your documents!

In [16]:
def query(question: str, show_sources: bool = True):
    """Query the RAG system and optionally show source files."""
    print(f"\nQuestion: {question}")
    print("="*80)
    
    # Get answer
    answer = rag_chain.invoke(question)
    
    print(f"\nAnswer:\n{answer}")
    print("\n" + "="*80)
    
    # Show retrieved documents with sources
    docs = retriever.invoke(question)
    print(f"\n📚 Retrieved {len(docs)} relevant documents")
    
    if show_sources:
        # Look up source metadata with a separate summary search (k=5).
        # Note: this is not the same result set as `docs` above, so the counts
        # can differ, and a persisted collection may still hold summaries
        # from earlier runs.
        results = retriever.vectorstore.similarity_search(question, k=5)
        sources = {}
        for r in results:
            source = r.metadata.get('source_file', 'unknown')
            doc_type = r.metadata.get('type', 'unknown')
            sources.setdefault(source, []).append(doc_type)
        
        print("\n📁 Sources used:")
        for source, types in sources.items():
            type_counts = {}
            for t in types:
                type_counts[t] = type_counts.get(t, 0) + 1
            # Avoid shadowing the built-in `type` in the comprehension
            type_str = ", ".join(f"{count} {doc_type}" for doc_type, count in type_counts.items())
            print(f"  • {source}: {type_str}")
    
    return answer

# Example queries
if texts or tables or images:
    print("Try asking questions about your documents!\n")
    print("Examples:")
    print('  query("What is this document about?")')
    print('  query("What are the main findings?")')
    print('  query("Are there any tables or data?")')
    print('  query("Compare information across documents")')
    print('\nFor source-specific queries:')
    print('  query_by_source("What does this paper discuss?", source_file="paper.pdf")')
else:
    print("Add documents first before querying")


Try asking questions about your documents!

Examples:
  query("What is this document about?")
  query("What are the main findings?")
  query("Are there any tables or data?")
  query("Compare information across documents")

For source-specific queries:
  query_by_source("What does this paper discuss?", source_file="paper.pdf")
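
The "Sources used" tally in `query` comes from its own `similarity_search(question, k=5)` call over the summaries, so it can list more (or different) entries than the "Retrieved N relevant documents" line. The counting step can be pulled out into a small helper and tested without a vector store. This is a minimal sketch, assuming metadata dicts with optional `source_file` and `type` keys as used above; `tally_sources`, `format_tally`, and the sample data are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

def tally_sources(metadatas):
    """Group element types by source file and count them.

    `metadatas` is assumed to be a list of dicts (or None entries) with
    optional 'source_file' and 'type' keys, mirroring the metadata above.
    """
    by_source = defaultdict(Counter)
    for meta in metadatas:
        meta = meta or {}  # tolerate missing metadata
        source = meta.get("source_file", "unknown")
        by_source[source][meta.get("type", "unknown")] += 1
    return {source: dict(counts) for source, counts in by_source.items()}

def format_tally(tally):
    """Render the tally in the same bullet style that `query` prints."""
    lines = []
    for source, counts in tally.items():
        type_str = ", ".join(f"{n} {kind}" for kind, n in counts.items())
        lines.append(f"  • {source}: {type_str}")
    return "\n".join(lines)

# Hypothetical metadata, for illustration only
sample = [
    {"source_file": "a.pdf", "type": "table"},
    {"source_file": "a.pdf", "type": "text"},
    {"source_file": "b.pdf", "type": "table"},
    {"type": "image"},  # no source_file -> counted under 'unknown'
]
print(format_tally(tally_sources(sample)))
```

Passing the metadata of the documents that were actually retrieved, rather than the result of a second search, would keep the tally consistent with the retrieved-document count.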


In [18]:
# Example query 1
query("what do you know about Mice experiments?")


Question: what do you know about Mice experiments?



Answer:
The provided sources detail experiments involving RNA sequencing with various mouse groups. Specifically, there are data from “Flight experiment (SF),” “Flight vivarium control (SFV),” and “Ground control experiment (GC)” groups. The “Ground control experiment - vivarium control (GCV)” group is also mentioned. The RNA sequencing utilized different fluorescent labels (FAM, HEX, Cy5) across various experimental conditions, including singleplex, duplex, and triplex configurations. Data includes measurements like “E. coliRNA,” “dnaK-FAM,” “rpoA-HEX,” and “srIR-Cy5,” alongside numerical values representing expression levels.


📚 Retrieved 3 relevant documents

📁 Sources used:
  • PMC4136787.pdf: 1 table
  • paper2.pdf: 1 table
  • PMC8396460.pdf: 1 text
  • PMC5587110.pdf: 1 table
  • PMC7998608.pdf: 1 table



In [20]:
# Example query 2
query("What are the key findings or results or conclusions?")


Question: What are the key findings or results or conclusions?



Answer:
The provided sources do not contain any key findings, results, or conclusions. They consist of gene primer sequences, a gene list enrichment analysis reference, and a table of gene fold regulations, alongside a seemingly nonsensical table with no apparent data.


📚 Retrieved 4 relevant documents

📁 Sources used:
  • PMC7998608.pdf: 1 table
  • PMC8396460.pdf: 1 text, 1 table
  • PMC5460236.pdf: 1 table, 1 text



In [21]:
# Example query 3 - Custom question
query("What are the experiments performed on mice?")


Question: What are the experiments performed on mice?



Answer:
The table in Source 3 details experiments performed on mice, specifically within the “Flight experiment (SF)” and “Flight vivarium control (SFV)” groups. These experiments involved RNA isolation and multiplex quantitative real-time PCR analysis of gene expression.


📚 Retrieved 3 relevant documents

📁 Sources used:
  • PMC4136787.pdf: 1 table, 1 text
  • paper2.pdf: 1 table
  • PMC5587110.pdf: 1 table, 1 text



In [22]:
import csv
import json
import numpy as np
from datetime import datetime

def export_chroma_to_csv(output_file: str = "chroma_export.csv"):
    """
    Export ChromaDB contents to CSV format for PostgreSQL/pgvector import.
    
    Exports:
    - Document ID
    - Content (summary)
    - Embedding vector
    - Metadata (type, source_file)
    - Original content ID (for linking to raw content)
    
    Args:
        output_file: Path to output CSV file
    """
    print(f"Exporting ChromaDB to {output_file}...")
    print("="*80)
    
    try:
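        # Get all documents from vectorstore with embeddings.
        # Note: `_collection` is a private attribute of the LangChain Chroma
        # wrapper and may change between versions.
        collection = retriever.vectorstore._collection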
        
        # Get all data from the collection
        results = collection.get(
            include=['embeddings', 'documents', 'metadatas']
        )
        
        # Prepare data for CSV
        rows = []
        for i in range(len(results['ids'])):
            # Convert numpy array to list for JSON serialization
            embedding = results['embeddings'][i]
            if isinstance(embedding, np.ndarray):
                embedding = embedding.tolist()
            elif not isinstance(embedding, list):
                embedding = list(embedding)
            
            row = {
                'id': results['ids'][i],
                'content': results['documents'][i],
                'embedding': json.dumps(embedding),  # Now serializable
                'metadata': json.dumps(results['metadatas'][i]),  # Store metadata as JSON
                'doc_type': results['metadatas'][i].get('type', 'unknown'),
                'source_file': results['metadatas'][i].get('source_file', 'unknown'),
                'doc_id': results['metadatas'][i].get(id_key, ''),  # Link to raw content
                'created_at': datetime.now().isoformat()
            }
            rows.append(row)
        
        # Write to CSV
        if rows:
            fieldnames = ['id', 'content', 'embedding', 'metadata', 'doc_type', 
                         'source_file', 'doc_id', 'created_at']
            
            with open(output_file, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
                writer.writeheader()
                writer.writerows(rows)
            
            print(f"✓ Exported {len(rows)} documents to {output_file}")
            
            # Report the embedding dimension (rows is non-empty in this branch)
            first_embedding = json.loads(rows[0]['embedding'])
            print(f"✓ Embedding dimension: {len(first_embedding)}")
            
            print(f"\nFile structure:")
            print(f"  - id: Unique document identifier")
            print(f"  - content: Document summary text")
            print(f"  - embedding: Vector embedding (JSON array of floats)")
            print(f"  - metadata: Full metadata (JSON)")
            print(f"  - doc_type: Type (text/table/image)")
            print(f"  - source_file: Original PDF filename")
            print(f"  - doc_id: Link to raw content")
            print(f"  - created_at: Export timestamp")
            
            # Show statistics
            by_type = {}
            by_source = {}
            for row in rows:
                doc_type = row['doc_type']
                source = row['source_file']
                by_type[doc_type] = by_type.get(doc_type, 0) + 1
                by_source[source] = by_source.get(source, 0) + 1
            
            print(f"\n📊 Export Statistics:")
            print(f"  Total documents: {len(rows)}")
            print(f"  By type: {by_type}")
            print(f"  By source: {by_source}")
            
        else:
            print("⚠ No documents found in ChromaDB")
            
    except Exception as e:
        print(f"✗ Error exporting: {e}")
        import traceback
        traceback.print_exc()

def export_raw_content_to_csv(output_file: str = "raw_content_export.csv"):
    """
    Export raw document content from docstore to CSV.
    This includes the full text/tables/images that aren't in the vector DB.
    
    Args:
        output_file: Path to output CSV file
    """
    print(f"Exporting raw content to {output_file}...")
    print("="*80)
    
    try:
        # Get all doc_ids from vectorstore
        collection = retriever.vectorstore._collection
        results = collection.get(include=['metadatas'])
        
        doc_ids = [meta.get(id_key) for meta in results['metadatas'] if meta.get(id_key)]
        
        # Get raw content from docstore
        raw_contents = retriever.docstore.mget(doc_ids)
        
        # Index metadata by doc_id once (first match wins) instead of
        # rescanning the full metadata list for every document
        meta_by_id = {}
        for m in results['metadatas']:
            meta_by_id.setdefault(m.get(id_key), m)
        
        # Prepare data for CSV
        import base64
        rows = []
        for doc_id, content in zip(doc_ids, raw_contents):
            if content:
                meta = meta_by_id.get(doc_id, {})
                
                # Handle different content types
                content_str = content
                if isinstance(content, bytes):
                    # For binary content (images), encode as base64
                    content_str = base64.b64encode(content).decode('utf-8')
                elif not isinstance(content, str):
                    content_str = str(content)
                
                row = {
                    'doc_id': doc_id,
                    'content': content_str,
                    'doc_type': meta.get('type', 'unknown'),
                    'source_file': meta.get('source_file', 'unknown'),
                    'created_at': datetime.now().isoformat()
                }
                # Rows without a known source file are skipped, so this export
                # can contain fewer rows than the vector export
                if row['source_file'] != 'unknown':
                    rows.append(row)
        
        # Write to CSV
        if rows:
            fieldnames = ['doc_id', 'content', 'doc_type', 'source_file', 'created_at']
            
            with open(output_file, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
                writer.writeheader()
                writer.writerows(rows)
            
            print(f"✓ Exported {len(rows)} raw documents to {output_file}")
            print(f"\nFile structure:")
            print(f"  - doc_id: Links to main export")
            print(f"  - content: Full raw content")
            print(f"  - doc_type: Type (text/table/image)")
            print(f"  - source_file: Original PDF filename")
            print(f"  - created_at: Export timestamp")
            
            # Show content size statistics
            total_size = sum(len(row['content']) for row in rows)
            avg_size = total_size / len(rows) if rows else 0
            print(f"\n📏 Content Statistics:")
            print(f"  Total characters: {total_size:,}")
            print(f"  Average size: {avg_size:,.0f} chars")
            
        else:
            print("⚠ No raw content found in docstore")
            
    except Exception as e:
        print(f"✗ Error exporting raw content: {e}")
        import traceback
        traceback.print_exc()

def export_complete_dataset(base_name: str = "nasa_rag_export"):
    """
    Export both vector embeddings and raw content in one go.
    Creates two CSV files that can be imported together.
    
    Args:
        base_name: Base name for output files (will add _vectors.csv and _content.csv)
    """
    print("📦 Exporting Complete Dataset")
    print("="*80 + "\n")
    
    # Export vectors
    vectors_file = f"{base_name}_vectors.csv"
    export_chroma_to_csv(vectors_file)
    
    print("\n")
    
    # Export raw content
    content_file = f"{base_name}_content.csv"
    export_raw_content_to_csv(content_file)
    
    print("\n" + "="*80)
    print("✓ Export complete!")
    print(f"\n📁 Files created:")
    print(f"  1. {vectors_file} - Vector embeddings and summaries")
    print(f"  2. {content_file} - Raw document content")
    print(f"\n💡 Import instructions for backend team:")
    print(f"  - Both files should be imported")
    print(f"  - Link records using 'doc_id' field")
    print(f"  - Parse 'embedding' JSON array for pgvector")
    print(f"  - Parse 'metadata' JSON for additional fields")
    print(f"\n📋 Example Python import code:")
    # Use the actual output filenames so the snippet matches any base_name
    print(f"""
    import json
    import pandas as pd
    
    # Read the CSVs
    vectors_df = pd.read_csv('{vectors_file}')
    content_df = pd.read_csv('{content_file}')
    
    # Parse embeddings
    vectors_df['embedding'] = vectors_df['embedding'].apply(json.loads)
    vectors_df['metadata'] = vectors_df['metadata'].apply(json.loads)
    """)

# Example usage
if texts or tables or images:
    print("Export Functions Available:\n")
    print("1. Export vector embeddings only:")
    print('   export_chroma_to_csv("my_export.csv")\n')
    print("2. Export raw content only:")
    print('   export_raw_content_to_csv("raw_content.csv")\n')
    print("3. Export complete dataset (recommended):")
    print('   export_complete_dataset("nasa_rag_export")\n')
else:
    print("Process documents first before exporting")

Export Functions Available:

1. Export vector embeddings only:
   export_chroma_to_csv("my_export.csv")

2. Export raw content only:
   export_raw_content_to_csv("raw_content.csv")

3. Export complete dataset (recommended):
   export_complete_dataset("nasa_rag_export")
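
The two export files are meant to be linked through `doc_id`. Here is a minimal sketch of that linking step using only the standard library, assuming the column names written by the export functions above (`id`, `content`, `embedding`, `metadata`, `doc_id`). The helpers `load_rows` and `link_exports` and the in-memory sample data are hypothetical and stand in for the real CSV files.

```python
import csv
import io
import json

def load_rows(csv_text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))

def link_exports(vector_rows, content_rows):
    """Attach raw content to each vector row via doc_id.

    Returns (linked, unmatched): linked rows carry a parsed embedding,
    parsed metadata, and the raw content; unmatched holds the ids of
    vector rows that have no raw content in the second file.
    """
    content_by_id = {row["doc_id"]: row["content"] for row in content_rows}
    linked, unmatched = [], []
    for row in vector_rows:
        raw = content_by_id.get(row["doc_id"])
        if raw is None:
            unmatched.append(row["id"])
            continue
        linked.append({
            "id": row["id"],
            "summary": row["content"],
            "embedding": json.loads(row["embedding"]),  # JSON array -> list of floats
            "metadata": json.loads(row["metadata"]),    # JSON object -> dict
            "raw_content": raw,
        })
    return linked, unmatched

# Tiny in-memory stand-ins for the two export files (hypothetical data)
fields = ["id", "content", "embedding", "metadata", "doc_id"]
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerow({"id": "v1", "content": "summary one",
                 "embedding": json.dumps([0.1, 0.2]),
                 "metadata": json.dumps({"type": "text"}), "doc_id": "d1"})
writer.writerow({"id": "v2", "content": "summary two",
                 "embedding": json.dumps([0.3, 0.4]),
                 "metadata": json.dumps({"type": "table"}), "doc_id": "d2"})
content_csv = "doc_id,content\r\nd1,full text one\r\n"

linked, unmatched = link_exports(load_rows(buf.getvalue()), load_rows(content_csv))
print(len(linked), unmatched)
```

Tracking unmatched ids is useful here because the raw-content export skips rows whose source file is unknown, so some vector rows may have no counterpart in the content file.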



In [24]:
# Export the complete dataset
if texts or tables or images:
    export_complete_dataset("nasa_rag_export")
else:
    print("No documents to export. Process PDFs first.")

📦 Exporting Complete Dataset

Exporting ChromaDB to nasa_rag_export_vectors.csv...


✓ Exported 700 documents to nasa_rag_export_vectors.csv
✓ Embedding dimension: 768

File structure:
  - id: Unique document identifier
  - content: Document summary text
  - embedding: Vector embedding (JSON array of floats)
  - metadata: Full metadata (JSON)
  - doc_type: Type (text/table/image)
  - source_file: Original PDF filename
  - doc_id: Link to raw content
  - created_at: Export timestamp

📊 Export Statistics:
  Total documents: 700
  By type: {'text': 365, 'table': 70, 'image': 265}
  By source: {'unknown': 74, 'paper1.pdf': 90, 'paper2.pdf': 111, 'PMC11500582.pdf': 16, 'PMC11988870.pdf': 23, 'PMC3040128.pdf': 58, 'PMC3177255.pdf': 25, 'PMC3630201.pdf': 30, 'PMC4095884.pdf': 24, 'PMC4136787.pdf': 37, 'PMC5387210.pdf': 23, 'PMC5460236.pdf': 18, 'PMC5587110.pdf': 45, 'PMC5666799.pdf': 40, 'PMC6222041.pdf': 23, 'PMC6813909.pdf': 19, 'PMC7998608.pdf': 23, 'PMC8396460.pdf': 21}


Exporting raw content to nasa_rag_export_content.csv...
✓ Exported 425 raw documents to nasa_rag_expo