Document RAG: Multimodal Retrieval-Augmented Generation


Overview

Document RAG is a production-ready, multimodal Retrieval-Augmented Generation (RAG) system designed to enable intelligent document question-answering through semantic search and LLM-powered synthesis. The system processes PDF documents, extracts both textual and visual content, creates semantic embeddings for multimodal retrieval, and generates contextually accurate answers using GPT-4.1-mini.

Problem Statement

Organizations struggle with extracting actionable insights from large document repositories. Traditional document search relies on keyword matching, which fails to understand semantic meaning and cannot effectively cross-reference visual and textual content. This project solves that by implementing a sophisticated RAG pipeline that understands document semantics, retrieves relevant information across modalities, and synthesizes answers that maintain source attribution.

Key Capabilities

  • Multimodal Document Processing: Extracts text, images, and tables from PDFs with OCR fallback
  • Semantic Embeddings: Creates 1536-dimensional text embeddings and 512-dimensional image embeddings
  • Hybrid Retrieval: Uses Reciprocal Rank Fusion to intelligently combine text and image search results
  • Context-Aware Generation: Leverages GPT-4.1-Mini with source attribution for accurate answer synthesis
  • Production-Ready: Includes comprehensive error handling, logging, and health checks

Architecture

PDF Upload → Text/Image/Table Extraction → Semantic Chunking
                                    ↓
                        Create Dual Embeddings
                  (Text: OpenAI 1536-dim | Image: CLIP 512-dim)
                                    ↓
                        Pinecone Vector Store
                      (Namespace-based Organization)
                                    ↓
                          Query Processing
                  Reciprocal Rank Fusion (fair ranking)
                                    ↓
                    RAG Pipeline & LLM Orchestration
          Context Retrieval → Prompt Construction → GPT-4.1-Mini → Answer + Sources

System Components

  1. Document Processor: PDF parsing, text extraction, image extraction, table recognition
  2. Text Chunker: Semantic chunking with configurable overlap for context preservation
  3. Embedding Engine: Dual-embedding generation using REST APIs and HuggingFace models
  4. Vector Store: Pinecone integration with namespace-based multi-document support
  5. Retriever: Multimodal fusion with RRF algorithm for balanced results
  6. RAG Pipeline: LLM orchestration with prompt engineering and source tracking
  7. FastAPI Application: RESTful interface with dependency injection and validation

Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Framework | FastAPI 0.100+ | REST API & async request handling |
| LLM | OpenAI GPT-4.1-mini | Answer generation & synthesis |
| Text Embeddings | OpenAI text-embedding-3-small | Semantic text representation (1536-dim) |
| Image Embeddings | CLIP (openai/clip-vit-base-patch32) | Visual content representation (512-dim) |
| Vector Database | Pinecone | Semantic search & retrieval |
| Document Processing | PyMuPDF, pdfplumber, pytesseract | PDF parsing & OCR |
| Validation | Pydantic | Request/response schemas |
| Async Runtime | asyncio, uvicorn | Concurrent request handling |
| Testing | pytest | Unit & integration testing |
| Deployment | Docker, Docker Compose | Containerization & orchestration |

Project Structure

document-rag/
├── app/                                    # Main application package
│   ├── main.py                            # FastAPI app initialization
│   ├── dependencies.py                    # Dependency injection & singletons
│   │
│   ├── api/                               # REST API layer
│   │   ├── models.py                     # Pydantic request/response schemas
│   │   └── routes/
│   │       ├── documents.py              # POST /documents/upload
│   │       ├── search.py                 # POST /search (RAG query)
│   │       └── health.py                 # GET /health
│   │
│   ├── core/                              # Configuration & constants
│   │   ├── config_loader.py              # Environment & settings
│   │   └── constants.py                  # Model names, parameters
│   │
│   ├── services/                          # Business logic layer
│   │   ├── document_processor.py         # PDF extraction (text/images/tables)
│   │   ├── text_chunker.py               # Semantic chunking with overlap
│   │   ├── embeddings.py                 # Dual embedding generation
│   │   ├── vector_store.py               # Pinecone operations
│   │   ├── retriever.py                  # Multimodal fusion & ranking
│   │   └── rag_pipeline.py               # End-to-end RAG orchestration
│   │
│   └── utils/                             # Utilities
│       ├── logging.py                    # Structured logging
│       └── helpers.py                    # Common utilities
│
├── tests/                                 # Test suite
│   ├── test_rag_pipeline.py
│   └── __init__.py
│
├── data/                                  # Test documents
├── extracted_images/                      # Image storage
├── requirements.txt                       # Python dependencies
├── .env.example                           # Environment template
├── docker-compose.yml                     # Container orchestration
├── Dockerfile                             # Container image
└── README.md                              # This file

Getting Started

Prerequisites

  • Python 3.10 or higher
  • OpenAI API Key (GPT-4 access required)
  • Pinecone Account (free tier available)
  • Optional: Docker & Docker Compose for containerized deployment

Installation

1. Clone Repository

git clone <repository-url>
cd document-rag

2. Create Virtual Environment

python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Setup Environment Variables

cp .env.example .env

Edit .env with your credentials:

# Required - Core Services
OPENAI_API_KEY=sk-your-key-here
PINECONE_API_KEY=pk-your-key-here

# Optional - Configuration
PINECONE_ENVIRONMENT=us-east-1-aws
LOG_LEVEL=INFO
DEBUG=False

Running the Application

Option A: Direct Execution

# Start FastAPI development server
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Option B: Docker Compose

docker-compose up -d

The API will be available at http://localhost:8000, with interactive documentation at http://localhost:8000/docs.

Quick Start

1. Health Check

curl http://localhost:8000/health

Expected response:

{
  "status": "healthy",
  "openai_api": true,
  "pinecone_api": true,
  "timestamp": "2024-..."
}

2. Upload Document

curl -X POST http://localhost:8000/documents/upload \
  -F "file=@/path/to/document.pdf"

Response:

{
  "document_id": "doc_1234567890",
  "filename": "document.pdf",
  "status": "completed",
  "message": "Successfully processed 25 chunks from 12 pages"
}

3. Query the Document

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the main topics covered?",
    "top_k": 5,
    "document_id": "doc_1234567890"
  }'

Response:

{
  "query": "What are the main topics covered?",
  "answer": "Based on the document, the main topics include...",
  "sources": [
    {
      "content": "...",
      "content_type": "text",
      "metadata": {
        "page_number": 1,
        "chunk_id": "chunk_001",
        "score": 0.95
      }
    }
  ],
  "total_sources_retrieved": 3
}

Features

Document Processing

  • Multi-format Extraction: PDF text, image, and table extraction with fidelity preservation
  • OCR Support: Automatic OCR fallback (pytesseract) for scanned PDFs
  • Semantic Chunking: Intelligent text segmentation with configurable overlap
  • Metadata Preservation: Maintains page numbers, chunk IDs, and document references

Embedding & Retrieval

  • Dual-Modal Embeddings:
    • Text: OpenAI text-embedding-3-small (1536-dim, up to 8k tokens)
    • Image: CLIP openai/clip-vit-base-patch32 (512-dim)
  • Batch Processing: Efficient vectorization of large documents
  • Similarity Search: Cosine similarity-based retrieval with configurable K
  • Reciprocal Rank Fusion: Fair multimodal result combination algorithm

Vector Database

  • Pinecone Integration: Scalable, managed vector storage
  • Namespace Support: Multi-document organization and isolation
  • Metadata Indexing: Rich metadata-aware searching
  • Upsert Operations: Efficient bulk insert/update cycles

RAG Pipeline

  • Context-Aware Synthesis: GPT-4.1-Mini powered answer generation
  • Source Attribution: Maintains and returns source references
  • Prompt Engineering: Optimized system and user prompts
  • Token Management: Configurable context window and token limits
  • Streaming Support: Real-time answer generation capability

API & Application

  • RESTful Endpoints: Clean, documented REST API
  • Pydantic Validation: Strict request/response validation
  • Health Checks: Comprehensive service health monitoring
  • Error Handling: Graceful error responses with detailed messages
  • Async Support: asyncio throughout for concurrent processing
  • Dependency Injection: Singleton pattern for service initialization
  • OpenAPI/Swagger: Auto-generated API documentation
  • CORS Support: Cross-origin request handling
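The dependency-injection singleton pattern mentioned above can be sketched as follows. This is an illustrative sketch, not the repository's actual code; the `EmbeddingService` stand-in and the provider name `get_embedding_service` are assumptions:

```python
from functools import lru_cache

class EmbeddingService:
    """Stand-in for a real service whose initialization is expensive."""
    def __init__(self):
        self.model_name = "text-embedding-3-small"

@lru_cache(maxsize=1)
def get_embedding_service() -> EmbeddingService:
    # lru_cache turns this provider into a process-wide singleton:
    # the service is constructed once and reused for every request.
    return EmbeddingService()

# In FastAPI, such a provider is typically wired in with Depends:
#   @app.post("/search")
#   async def search(svc: EmbeddingService = Depends(get_embedding_service)): ...
```

Caching the provider rather than the instance keeps construction lazy: nothing is initialized until the first request that needs it.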

Observability

  • Structured Logging: Configurable logging with debug/info/error levels
  • Request Tracking: Traceable request/response flows
  • Performance Metrics: Execution timing for key operations
  • Error Diagnostics: Detailed error logging for troubleshooting

Configuration

Environment Variables

All configuration is managed through .env file. See .env.example for template:

# ============ REQUIRED ============
# OpenAI Configuration
OPENAI_API_KEY=sk-...              # Your OpenAI API key

# Pinecone Configuration  
PINECONE_API_KEY=pk-...            # Your Pinecone API key

# ============ OPTIONAL ============
# Pinecone Settings
PINECONE_ENVIRONMENT=us-east-1-aws # Pinecone region (default: us-east-1-aws)

# Application Settings
LOG_LEVEL=INFO                      # Logging level: DEBUG|INFO|WARNING|ERROR
DEBUG=False                         # Enable debug mode for detailed logging

# Model Parameters (use constants.py for additional tuning)
MODEL_NAME=gpt-4.1-mini            # LLM model name
EMBEDDING_MODEL=text-embedding-3-small
CHUNK_SIZE=512                     # Characters per chunk
CHUNK_OVERLAP=50                   # Overlap between chunks
TOP_K=5                            # Number of results to retrieve

Advanced Configuration

Edit app/core/constants.py for advanced tuning:

# Text Processing
CHUNK_SIZE = 512                    # Adjust for context preservation vs. precision
CHUNK_OVERLAP = 50                  # Prevent context loss at boundaries

# Embedding Models
TEXT_EMBEDDING_MODEL = "text-embedding-3-small"  # 1536-dim
IMAGE_EMBEDDING_MODEL = "openai/clip-vit-base-patch32"  # 512-dim

# LLM Parameters
LLM_MODEL = "gpt-4.1-mini"
LLM_TEMPERATURE = 0.3               # 0.0=deterministic, 1.0=creative
MAX_TOKENS = 500                    # Response length limit
REQUEST_TIMEOUT = 60                # API timeout in seconds

# Retrieval
TOP_K = 5                           # Results to retrieve
RRF_K = 60                          # RRF algorithm parameter

# Performance
BATCH_SIZE = 10                     # Embedding batch size
MAX_FILE_SIZE_MB = 50               # Maximum PDF size

API Reference

Complete API documentation available at /docs (Swagger UI) and /redoc (ReDoc).

Health Check

Endpoint: GET /health

Verify service availability and dependency health.

Response:

{
  "status": "healthy|degraded|unhealthy",
  "openai_api": true,
  "pinecone_api": true,
  "timestamp": "2024-01-15T10:30:00Z"
}

Status Codes:

  • 200: Service is healthy
  • 503: Service degraded or unavailable

Upload Document

Endpoint: POST /documents/upload

Process and index a PDF document for semantic search.

Request:

Content-Type: multipart/form-data
Body: file (PDF file)

Response (Status: 201):

{
  "document_id": "doc_1704283200000",
  "filename": "annual_report_2024.pdf",
  "status": "completed",
  "message": "Successfully processed 127 chunks from 42 pages",
  "metadata": {
    "total_pages": 42,
    "chunks_created": 127,
    "text_chunks": 120,
    "image_chunks": 7,
    "processing_time_seconds": 45.23
  }
}

Error Responses:

  • 400: Invalid file format or corrupted PDF
  • 413: File size exceeds maximum limit (50MB)
  • 500: Processing failed - check logs

Search & Query

Endpoint: POST /search

Query documents using semantic search and RAG-powered answer generation.

Request:

{
  "query": "What are the key financial metrics?",
  "document_id": "doc_1704283200000",
  "top_k": 5,
  "include_metadata": true
}

Parameters:

  • query (string, required): Natural language question
  • document_id (string, optional): Specific document to search, or searches all if omitted
  • top_k (integer, optional, default: 5): Number of results to retrieve (1-20)
  • include_metadata (boolean, optional, default: true): Include source metadata

Response (Status: 200):

{
  "query": "What are the key financial metrics?",
  "answer": "Based on the document analysis, the key financial metrics include revenue growth of 15% YoY, EBITDA margin of 32%, and return on equity of 18%...",
  "sources": [
    {
      "content": "Our company achieved revenue of $2.5B in 2024, representing 15% YoY growth. EBITDA margin improved to 32% from 28% in 2023.",
      "content_type": "text",
      "score": 0.96,
      "metadata": {
        "document_id": "doc_1704283200000",
        "page_number": 12,
        "chunk_id": "chunk_0042",
        "chunk_position": "2:3"
      }
    },
    {
      "content": "[Image: Financial Dashboard]",
      "content_type": "image",
      "score": 0.87,
      "metadata": {
        "document_id": "doc_1704283200000",
        "page_number": 14,
        "chunk_id": "chunk_0048"
      }
    }
  ],
  "total_sources_retrieved": 2,
  "execution_time_seconds": 2.34
}

Error Responses:

  • 400: Invalid query or parameters
  • 404: Document not found
  • 503: Vector database unavailable

Batch Upload (Future)

Endpoint: POST /documents/batch

Coming Soon: Upload and process multiple documents concurrently.


Testing

Running Tests

# Run all tests
pytest -v

# Run with coverage report
pytest --cov=app --cov-report=html tests/

# Run specific test file
pytest tests/test_rag_pipeline.py -v

# Run with markers
pytest -m "integration" -v

Test Structure

tests/
├── test_rag_pipeline.py          # End-to-end RAG flow
├── test_document_processor.py    # PDF processing
├── test_embeddings.py            # Embedding generation
├── test_retriever.py             # Retrieval logic
└── conftest.py                   # Fixtures & setup

Writing Tests

import pytest
from app.services.embeddings import EmbeddingService

@pytest.mark.asyncio
async def test_text_embedding_generation():
    service = EmbeddingService()
    result = await service.embed_texts(["Hello world"])
    assert result.shape == (1, 1536)
    assert result.dtype == "float32"

Core Concepts

Embeddings

Text Embeddings: Transforms text into 1536-dimensional vectors using OpenAI's text-embedding-3-small model. These vectors capture semantic meaning, enabling similarity-based search.

Image Embeddings: Transforms images into 512-dimensional vectors using OpenAI's CLIP model. Enables visual content retrieval and multimodal search.

Why it matters: Embeddings convert unstructured text/images into structured numerical representations that allow semantic similarity comparison.

Semantic Chunking

Chunks documents intelligently to preserve semantic meaning while maintaining context. Unlike naive token-based splitting, semantic chunking respects sentence and paragraph boundaries.

Parameters:

  • Chunk size: 512 chars (~130 tokens of context, at roughly 4 characters per token)
  • Overlap: 50 chars (prevents context loss)
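The repository's chunker is semantic (it respects sentence and paragraph boundaries); the mechanics of the overlap alone can be shown with a minimal character-window sketch, using the default parameters above:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks whose tails overlap, so content
    spanning a chunk boundary appears intact in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance less than a full chunk each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults, the last 50 characters of each chunk reappear as the first 50 characters of the next, which is what prevents context loss at boundaries.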

Vector Search

Retrieves semantically similar content using cosine similarity metric: $$\text{similarity}(a, b) = \frac{a \cdot b}{||a|| \cdot ||b||}$$

Returns top-K most similar chunks based on distance to query embedding.
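The similarity formula above is straightforward to compute directly (Pinecone does this server-side; this standalone sketch is only for illustration):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """similarity(a, b) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The result ranges from -1 (opposite) through 0 (orthogonal, unrelated) to 1 (identical direction); retrieval keeps the K chunks with the highest scores against the query embedding.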

Reciprocal Rank Fusion (RRF)

Combines rankings from multiple retrieval sources (text & image) fairly: $$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$

Where:

  • $k$ = constant (typically 60)
  • $r(d)$ = rank of document in result set

Advantage: Prevents any single modality from dominating results.
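The RRF formula above can be sketched as a short standalone function (the repository's retriever implements this inside `retriever.py`; the function name here is illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists. Each document accumulates
    1 / (k + rank) per list it appears in (1-based ranks); the fused
    order is by descending total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, not raw scores, a modality with systematically higher similarity scores (e.g. text vs. image) cannot drown out the other; a document ranked well in both lists rises to the top.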

Retrieval-Augmented Generation (RAG)

Three-step process:

  1. Retrieval: Fetch relevant context from vector database
  2. Augmentation: Insert context into LLM prompt
  3. Generation: Generate answer using LLM with context

Formula: $\text{Answer} = \text{LLM}(\text{Prompt} + \text{Context} + \text{Query})$

Benefits:

  • Improves answer accuracy with source documents
  • Enables fact-checking against sources
  • Reduces hallucinations
  • Provides source attribution
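The augmentation step (context into prompt) can be sketched as follows. This is a simplified illustration of the idea, not the prompt template in `rag_pipeline.py`; the chunk dictionary shape mirrors the metadata fields shown in the API responses above:

```python
def build_rag_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks into a grounded prompt, tagging each
    passage with its source so the model can attribute its answer."""
    context = "\n\n".join(
        f"[Source: page {c['page_number']}, {c['chunk_id']}]\n{c['content']}"
        for c in chunks
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite the sources you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The resulting string is sent as the user message to the LLM; instructing the model to answer only from the supplied context is what reduces hallucinations and enables source attribution.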

Context Window & Token Management

  • Query Embedding: ~50 tokens
  • Retrieved Context: ~650 tokens (5 chunks × ~130 tokens)
  • System Prompt: ~100 tokens
  • Answer Generation: up to ~500 tokens

Total: ~1,300 tokens per request, comfortably within the model's context window



Cost Estimation (Monthly)

Assuming 1000 document uploads + 5000 queries:

| Service | Cost | Notes |
|---|---|---|
| OpenAI Embeddings | $1-2 | ~6M tokens |
| OpenAI GPT-4.1-mini | $10-20 | ~500K input + 100K output tokens |
| Pinecone (Free Tier) | $0 | Up to 1M vectors |
| Hosting (AWS t3.medium) | $40 | 2 vCPU, 4 GB RAM |
| **Total** | ~$50-65 | Excluding any storage overage |

License

This project is licensed under the MIT License - see the LICENSE file for details.

