Document RAG is a production-ready, multimodal Retrieval-Augmented Generation (RAG) system designed to enable intelligent document question-answering through semantic search and LLM-powered synthesis. The system processes PDF documents, extracts both textual and visual content, creates semantic embeddings for multimodal retrieval, and generates contextually accurate answers using GPT-4.1-Mini.
Organizations struggle with extracting actionable insights from large document repositories. Traditional document search relies on keyword matching, which fails to understand semantic meaning and cannot effectively cross-reference visual and textual content. This project solves that by implementing a sophisticated RAG pipeline that understands document semantics, retrieves relevant information across modalities, and synthesizes answers that maintain source attribution.
- Multimodal Document Processing: Extracts text, images, and tables from PDFs with OCR fallback
- Semantic Embeddings: Creates 1536-dimensional text embeddings and 512-dimensional image embeddings
- Hybrid Retrieval: Uses Reciprocal Rank Fusion to intelligently combine text and image search results
- Context-Aware Generation: Leverages GPT-4.1-Mini with source attribution for accurate answer synthesis
- Production-Ready: Includes comprehensive error handling, logging, and health checks
PDF Upload → Text/Image/Table Extraction → Semantic Chunking
↓
Create Dual Embeddings
(Text: OpenAI 1536-dim | Image: CLIP 512-dim)
↓
Pinecone Vector Store
(Namespace-based Organization)
↓
Query Processing
Reciprocal Rank Fusion (fair ranking)
↓
RAG Pipeline & LLM Orchestration
Context Retrieval → Prompt Construction → GPT-4.1-Mini → Answer + Sources
- Document Processor: PDF parsing, text extraction, image extraction, table recognition
- Text Chunker: Semantic chunking with configurable overlap for context preservation
- Embedding Engine: Dual-embedding generation using REST APIs and HuggingFace models
- Vector Store: Pinecone integration with namespace-based multi-document support
- Retriever: Multimodal fusion with RRF algorithm for balanced results
- RAG Pipeline: LLM orchestration with prompt engineering and source tracking
- FastAPI Application: RESTful interface with dependency injection and validation
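The singleton-based dependency injection mentioned above can be sketched as follows. This is a hypothetical, standalone illustration using `functools.lru_cache` (the project's actual `dependencies.py` presumably wires real service classes through FastAPI's `Depends`):

```python
# Hypothetical sketch of singleton-style service initialization; not the
# project's actual dependencies.py.
from functools import lru_cache


class VectorStoreClient:
    """Stand-in for an expensive-to-construct service client."""

    def __init__(self) -> None:
        self.connected = True


@lru_cache(maxsize=1)
def get_vector_store() -> VectorStoreClient:
    # lru_cache makes this a lazy singleton: the client is constructed on
    # the first call, and the same instance is returned to every caller
    # (e.g. every route that declares Depends(get_vector_store)).
    return VectorStoreClient()
```

The same pattern applies to the embedding engine, retriever, and RAG pipeline, so each request handler shares one initialized instance instead of reconnecting per request.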
| Component | Technology | Purpose |
|---|---|---|
| Framework | FastAPI 0.100+ | REST API & async request handling |
| LLM | OpenAI GPT-4.1-mini | Answer generation & synthesis |
| Text Embeddings | OpenAI text-embedding-3-small | Semantic text representation (1536-dim) |
| Image Embeddings | CLIP (openai/clip-vit-base-patch32) | Visual content representation (512-dim) |
| Vector Database | Pinecone | Semantic search & retrieval |
| Document Processing | PyMuPDF, pdfplumber, pytesseract | PDF parsing & OCR |
| Validation | Pydantic | Request/response schemas |
| Async Runtime | asyncio, uvicorn | Concurrent request handling |
| Testing | pytest | Unit & integration testing |
| Deployment | Docker, Docker Compose | Containerization & orchestration |
document-rag/
├── app/ # Main application package
│ ├── main.py # FastAPI app initialization
│ ├── dependencies.py # Dependency injection & singletons
│ │
│ ├── api/ # REST API layer
│ │ ├── models.py # Pydantic request/response schemas
│ │ └── routes/
│ │ ├── documents.py # POST /documents/upload
│ │ ├── search.py # POST /search (RAG query)
│ │ └── health.py # GET /health
│ │
│ ├── core/ # Configuration & constants
│ │ ├── config_loader.py # Environment & settings
│ │ └── constants.py # Model names, parameters
│ │
│ ├── services/ # Business logic layer
│ │ ├── document_processor.py # PDF extraction (text/images/tables)
│ │ ├── text_chunker.py # Semantic chunking with overlap
│ │ ├── embeddings.py # Dual embedding generation
│ │ ├── vector_store.py # Pinecone operations
│ │ ├── retriever.py # Multimodal fusion & ranking
│ │ └── rag_pipeline.py # End-to-end RAG orchestration
│ │
│ └── utils/ # Utilities
│ ├── logging.py # Structured logging
│ └── helpers.py # Common utilities
│
├── tests/ # Test suite
│ ├── test_rag_pipeline.py
│ └── __init__.py
│
├── data/ # Test documents
├── extracted_images/ # Image storage
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── docker-compose.yml # Container orchestration
├── Dockerfile # Container image
└── README.md # This file
- Python 3.10 or higher
- OpenAI API Key (access to GPT-4.1-Mini and the embeddings API required): Get API Keys
- Pinecone Account (free tier available): Create Account
- Optional: Docker & Docker Compose for containerized deployment
git clone <repository-url>
cd document-rag

python3.10 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

pip install --upgrade pip
pip install -r requirements.txt

cp .env.example .env

Edit .env with your credentials:
# Required - Core Services
OPENAI_API_KEY=sk-your-key-here
PINECONE_API_KEY=pk-your-key-here
# Optional - Configuration
PINECONE_ENVIRONMENT=us-east-1-aws
LOG_LEVEL=INFO
DEBUG=False

# Start FastAPI development server
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Or run with Docker Compose:

docker-compose up -d

The API will be available at:
- Base URL: http://localhost:8000
- API Documentation (Swagger UI): http://localhost:8000/docs
- Interactive Docs (ReDoc): http://localhost:8000/redoc
curl http://localhost:8000/health

Expected response:
{
"status": "healthy",
"openai_api": true,
"pinecone_api": true,
"timestamp": "2024-..."
}

curl -X POST http://localhost:8000/documents/upload \
  -F "file=@/path/to/document.pdf"

Response:
{
"document_id": "doc_1234567890",
"filename": "document.pdf",
"status": "completed",
"message": "Successfully processed 25 chunks from 12 pages"
}

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What are the main topics covered?",
"top_k": 5,
"document_id": "doc_1234567890"
  }'

Response:
{
"query": "What are the main topics covered?",
"answer": "Based on the document, the main topics include...",
"sources": [
{
"content": "...",
"content_type": "text",
"metadata": {
"page_number": 1,
"chunk_id": "chunk_001",
"score": 0.95
}
}
],
"total_sources_retrieved": 3
}

- ✓ Multi-format Extraction: PDF text, image, and table extraction with fidelity preservation
- ✓ OCR Support: Automatic OCR fallback (pytesseract) for scanned PDFs
- ✓ Semantic Chunking: Intelligent text segmentation with configurable overlap
- ✓ Metadata Preservation: Maintains page numbers, chunk IDs, and document references
- ✓ Dual-Modal Embeddings:
  - Text: OpenAI text-embedding-3-small (1536-dim, up to 8k tokens)
  - Image: CLIP openai/clip-vit-base-patch32 (512-dim)
- ✓ Batch Processing: Efficient vectorization of large documents
- ✓ Similarity Search: Cosine similarity-based retrieval with configurable K
- ✓ Reciprocal Rank Fusion: Fair multimodal result combination algorithm
- ✓ Pinecone Integration: Scalable, managed vector storage
- ✓ Namespace Support: Multi-document organization and isolation
- ✓ Metadata Indexing: Rich metadata-aware searching
- ✓ Upsert Operations: Efficient bulk insert/update cycles
- ✓ Context-Aware Synthesis: GPT-4.1-Mini powered answer generation
- ✓ Source Attribution: Maintains and returns source references
- ✓ Prompt Engineering: Optimized system and user prompts
- ✓ Token Management: Configurable context window and token limits
- ✓ Streaming Support: Real-time answer generation capability
- ✓ RESTful Endpoints: Clean, documented REST API
- ✓ Pydantic Validation: Strict request/response validation
- ✓ Health Checks: Comprehensive service health monitoring
- ✓ Error Handling: Graceful error responses with detailed messages
- ✓ Async Support: asyncio throughout for concurrent processing
- ✓ Dependency Injection: Singleton pattern for service initialization
- ✓ OpenAPI/Swagger: Auto-generated API documentation
- ✓ CORS Support: Cross-origin request handling
- ✓ Structured Logging: Configurable logging with debug/info/error levels
- ✓ Request Tracking: Traceable request/response flows
- ✓ Performance Metrics: Execution timing for key operations
- ✓ Error Diagnostics: Detailed error logging for troubleshooting
All configuration is managed through the .env file. See .env.example for a template:
# ============ REQUIRED ============
# OpenAI Configuration
OPENAI_API_KEY=sk-... # Your OpenAI API key
# Pinecone Configuration
PINECONE_API_KEY=pk-... # Your Pinecone API key
# ============ OPTIONAL ============
# Pinecone Settings
PINECONE_ENVIRONMENT=us-east-1-aws # Pinecone region (default: us-east-1-aws)
# Application Settings
LOG_LEVEL=INFO # Logging level: DEBUG|INFO|WARNING|ERROR
DEBUG=False # Enable debug mode for detailed logging
# Model Parameters (use constants.py for additional tuning)
MODEL_NAME=gpt-4.1-mini # LLM model name
EMBEDDING_MODEL=text-embedding-3-small
CHUNK_SIZE=512 # Characters per chunk
CHUNK_OVERLAP=50 # Overlap between chunks
TOP_K=5 # Number of results to retrieve

Edit app/core/constants.py for advanced tuning:
# Text Processing
CHUNK_SIZE = 512 # Adjust for context preservation vs. precision
CHUNK_OVERLAP = 50 # Prevent context loss at boundaries
# Embedding Models
TEXT_EMBEDDING_MODEL = "text-embedding-3-small" # 1536-dim
IMAGE_EMBEDDING_MODEL = "openai/clip-vit-base-patch32" # 512-dim
# LLM Parameters
LLM_MODEL = "gpt-4.1-mini"
LLM_TEMPERATURE = 0.3 # 0.0=deterministic, 1.0=creative
MAX_TOKENS = 500 # Response length limit
REQUEST_TIMEOUT = 60 # API timeout in seconds
# Retrieval
TOP_K = 5 # Results to retrieve
RRF_K = 60 # RRF algorithm parameter
# Performance
BATCH_SIZE = 10 # Embedding batch size
MAX_FILE_SIZE_MB = 50 # Maximum PDF size

Complete API documentation is available at /docs (Swagger UI) and /redoc (ReDoc).
Endpoint: GET /health
Verify service availability and dependency health.
Response:
{
"status": "healthy|degraded|unhealthy",
"openai_api": true,
"pinecone_api": true,
"timestamp": "2024-01-15T10:30:00Z"
}

Status Codes:
- 200: Service is healthy
- 503: Service degraded or unavailable
Endpoint: POST /documents/upload
Process and index a PDF document for semantic search.
Request:
Content-Type: multipart/form-data
Body: file (PDF file)
Response (Status: 201):
{
"document_id": "doc_1704283200000",
"filename": "annual_report_2024.pdf",
"status": "completed",
"message": "Successfully processed 127 chunks from 42 pages",
"metadata": {
"total_pages": 42,
"chunks_created": 127,
"text_chunks": 120,
"image_chunks": 7,
"processing_time_seconds": 45.23
}
}

Error Responses:
- 400: Invalid file format or corrupted PDF
- 413: File size exceeds maximum limit (50MB)
- 500: Processing failed - check logs
Endpoint: POST /search
Query documents using semantic search and RAG-powered answer generation.
Request:
{
"query": "What are the key financial metrics?",
"document_id": "doc_1704283200000",
"top_k": 5,
"include_metadata": true
}

Parameters:
- query (string, required): Natural language question
- document_id (string, optional): Specific document to search; searches all documents if omitted
- top_k (integer, optional, default: 5): Number of results to retrieve (1-20)
- include_metadata (boolean, optional, default: true): Include source metadata
Response (Status: 200):
{
"query": "What are the key financial metrics?",
"answer": "Based on the document analysis, the key financial metrics include revenue growth of 15% YoY, EBITDA margin of 32%, and return on equity of 18%...",
"sources": [
{
"content": "Our company achieved revenue of $2.5B in 2024, representing 15% YoY growth. EBITDA margin improved to 32% from 28% in 2023.",
"content_type": "text",
"score": 0.96,
"metadata": {
"document_id": "doc_1704283200000",
"page_number": 12,
"chunk_id": "chunk_0042",
"chunk_position": "2:3"
}
},
{
"content": "[Image: Financial Dashboard]",
"content_type": "image",
"score": 0.87,
"metadata": {
"document_id": "doc_1704283200000",
"page_number": 14,
"chunk_id": "chunk_0048"
}
}
],
"total_sources_retrieved": 2,
"execution_time_seconds": 2.34
}

Error Responses:
- 400: Invalid query or parameters
- 404: Document not found
- 503: Vector database unavailable
Endpoint: POST /documents/batch
Coming Soon: Upload and process multiple documents concurrently.
# Run all tests
pytest -v
# Run with coverage report
pytest --cov=app --cov-report=html tests/
# Run specific test file
pytest tests/test_rag_pipeline.py -v
# Run with markers
pytest -m "integration" -v

tests/
├── test_rag_pipeline.py # End-to-end RAG flow
├── test_document_processor.py # PDF processing
├── test_embeddings.py # Embedding generation
├── test_retriever.py # Retrieval logic
└── conftest.py # Fixtures & setup
import pytest
from app.services.embeddings import EmbeddingService
@pytest.mark.asyncio
async def test_text_embedding_generation():
    service = EmbeddingService()
    result = await service.embed_texts(["Hello world"])
    assert result.shape == (1, 1536)
    assert result.dtype == "float32"

Text Embeddings: Transforms text into 1536-dimensional vectors using OpenAI's text-embedding-3-small model. These vectors capture semantic meaning, enabling similarity-based search.
Image Embeddings: Transforms images into 512-dimensional vectors using OpenAI's CLIP model. Enables visual content retrieval and multimodal search.
Why it matters: Embeddings convert unstructured text/images into structured numerical representations that allow semantic similarity comparison.
Chunks documents intelligently to preserve semantic meaning while maintaining context. Unlike naive token-based splitting, semantic chunking respects sentence and paragraph boundaries.
Parameters:
- Chunk size: 512 characters (roughly 128 tokens at ~4 characters per token)
- Overlap: 50 chars (prevents context loss)
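The sliding-window behavior can be sketched in a few lines. This is a simplified, hypothetical version; the project's actual chunker additionally respects sentence and paragraph boundaries:

```python
# Simplified overlap-based chunking sketch (hypothetical; the real
# text_chunker also honors sentence/paragraph boundaries).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # window advances by 462 chars with defaults
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share their last/first 50 characters, so a sentence that straddles a boundary still appears whole in at least one chunk.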
Retrieves semantically similar content using the cosine similarity metric:

$$\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$
Returns top-K most similar chunks based on distance to query embedding.
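The metric itself is straightforward to express in plain Python (in production the comparison is delegated to Pinecone; this is only an illustration of the computation):

```python
import math

# Cosine similarity between two embedding vectors: the dot product of the
# vectors divided by the product of their magnitudes. Ranges from -1 to 1;
# higher means more semantically similar.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```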
Combines rankings from multiple retrieval sources (text & image) fairly:

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$

Where:
- $k$ = constant (typically 60)
- $r(d)$ = rank of document $d$ in result set $r$
Advantage: Prevents any single modality from dominating results.
Three-step process:
- Retrieval: Fetch relevant context from vector database
- Augmentation: Insert context into LLM prompt
- Generation: Generate answer using LLM with context
Formula:

$$\text{answer} = \text{LLM}\big(\text{prompt}(\text{query},\ \text{retrieve}(\text{query}))\big)$$
Benefits:
- Improves answer accuracy with source documents
- Enables fact-checking against sources
- Reduces hallucinations
- Provides source attribution
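The augmentation step can be sketched as a prompt builder that stitches retrieved chunks into the LLM input. The wording and field names below are illustrative, not the project's actual prompt template:

```python
# Hedged sketch of the "augmentation" step: retrieved chunks become a
# numbered context block so the model can cite its sources.
def build_rag_prompt(query: str, contexts: list[dict]) -> str:
    context_block = "\n\n".join(
        f"[Source {i} | page {c.get('page_number', '?')}]\n{c['content']}"
        for i, c in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by number.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Constraining the model to the supplied context is what enables the source attribution and hallucination reduction listed above.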
- Query Embedding: ~50 tokens
- Retrieved Context: ~400 tokens (5 chunks × 80 tokens)
- System Prompt: ~100 tokens
- Answer Generation: ~500 tokens allowed
Total: ~1000 tokens (well within the model's context window)
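The budget above is simple arithmetic (figures are the estimates from this section):

```python
# Rough per-query token budget, using the estimates listed above.
QUERY_TOKENS = 50
CHUNKS = 5
TOKENS_PER_CHUNK = 80
SYSTEM_PROMPT_TOKENS = 100
MAX_ANSWER_TOKENS = 500  # matches MAX_TOKENS in constants.py

total = QUERY_TOKENS + CHUNKS * TOKENS_PER_CHUNK + SYSTEM_PROMPT_TOKENS + MAX_ANSWER_TOKENS
print(total)  # 1050
```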
- FastAPI Documentation - Web framework
- OpenAI API Reference - LLM API
- Pinecone Documentation - Vector database
- Pydantic Docs - Data validation
- PyMuPDF (fitz) - PDF text extraction
- pdfplumber - Table extraction
- pytesseract - OCR
- Attention is All You Need - Transformers
- CLIP: Learning Transferable Models for Image Classification
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Assuming 1000 document uploads + 5000 queries:
| Service | Cost | Notes |
|---|---|---|
| OpenAI Embeddings | $1-2 | ~6M tokens |
| OpenAI GPT-4.1-Mini | $10-20 | ~500K input + 100K output tokens |
| Pinecone (Free Tier) | $0 | Up to 1M vectors |
| Hosting (AWS t3.medium) | $40 | 2 vCPU, 4GB RAM |
| Total | ~$50-65 | Excluding any storage overage |
This project is licensed under the MIT License - see the LICENSE file for details.