Document RAG is a production-ready, multimodal Retrieval-Augmented Generation (RAG) system designed to enable intelligent document question-answering through semantic search and LLM-powered synthesis. The system processes PDF documents, extracts both textual and visual content, creates semantic embeddings for multimodal retrieval, and generates contextually accurate answers using GPT-4.1-Mini.
Organizations struggle with extracting actionable insights from large document repositories. Traditional document search relies on keyword matching, which fails to understand semantic meaning and cannot effectively cross-reference visual and textual content. This project solves that by implementing a sophisticated RAG pipeline that understands document semantics, retrieves relevant information across modalities, and synthesizes answers that maintain source attribution.
- Multimodal Document Processing: Extracts text, images, and tables from PDFs with OCR fallback
- Semantic Embeddings: Creates 1536-dimensional text embeddings and 512-dimensional image embeddings
- Hybrid Retrieval: Uses Reciprocal Rank Fusion to intelligently combine text and image search results
- Context-Aware Generation: Leverages GPT-4.1-Mini with source attribution for accurate answer synthesis
- Production-Ready: Includes comprehensive error handling, logging, and health checks
PDF Upload → Text/Image/Table Extraction → Semantic Chunking
↓
Create Dual Embeddings
(Text: OpenAI 1536-dim | Image: CLIP 512-dim)
↓
Pinecone Vector Store
(Namespace-based Organization)
↓
Query Processing
Reciprocal Rank Fusion (fair ranking)
↓
RAG Pipeline & LLM Orchestration
Context Retrieval → Prompt Construction → GPT-4.1-Mini → Answer + Sources
- Document Processor: PDF parsing, text extraction, image extraction, table recognition
- Text Chunker: Semantic chunking with configurable overlap for context preservation
- Embedding Engine: Dual-embedding generation using REST APIs and HuggingFace models
- Vector Store: Pinecone integration with namespace-based multi-document support
- Retriever: Multimodal fusion with RRF algorithm for balanced results
- RAG Pipeline: LLM orchestration with prompt engineering and source tracking
- FastAPI Application: RESTful interface with dependency injection and validation
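The singleton-based dependency injection mentioned above can be sketched as follows. This is a hypothetical, standalone illustration using `functools.lru_cache` (the project's actual `dependencies.py` presumably wires real service classes through FastAPI's `Depends`):

```python
# Hypothetical sketch of singleton-style service initialization; not the
# project's actual dependencies.py.
from functools import lru_cache


class VectorStoreClient:
    """Stand-in for an expensive-to-construct service client."""

    def __init__(self) -> None:
        self.connected = True


@lru_cache(maxsize=1)
def get_vector_store() -> VectorStoreClient:
    # lru_cache makes this a lazy singleton: the client is constructed on
    # the first call, and the same instance is returned to every caller
    # (e.g. every route that declares Depends(get_vector_store)).
    return VectorStoreClient()
```

The same pattern applies to the embedding engine, retriever, and RAG pipeline, so each request handler shares one initialized instance instead of reconnecting per request.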
| Component | Technology | Purpose |
|---|---|---|
| Framework | FastAPI 0.100+ | REST API & async request handling |
| LLM | OpenAI GPT-4.1-mini | Answer generation & synthesis |
| Text Embeddings | OpenAI text-embedding-3-small | Semantic text representation (1536-dim) |
| Image Embeddings | CLIP (openai/clip-vit-base-patch32) | Visual content representation (512-dim) |
| Vector Database | Pinecone | Semantic search & retrieval |
| Document Processing | PyMuPDF, pdfplumber, pytesseract | PDF parsing & OCR |
| Validation | Pydantic | Request/response schemas |
| Async Runtime | asyncio, uvicorn | Concurrent request handling |
| Testing | pytest | Unit & integration testing |
| Deployment | Docker, Docker Compose | Containerization & orchestration |
document-rag/
├── app/ # Main application package
│ ├── main.py # FastAPI app initialization
│ ├── dependencies.py # Dependency injection & singletons
│ │
│ ├── api/ # REST API layer
│ │ ├── models.py # Pydantic request/response schemas
│ │ └── routes/
│ │ ├── documents.py # POST /documents/upload
│ │ ├── search.py # POST /search (RAG query)
│ │ └── health.py # GET /health
│ │
│ ├── core/ # Configuration & constants
│ │ ├── config_loader.py # Environment & settings
│ │ └── constants.py # Model names, parameters
│ │
│ ├── services/ # Business logic layer
│ │ ├── document_processor.py # PDF extraction (text/images/tables)
│ │ ├── text_chunker.py # Semantic chunking with overlap
│ │ ├── embeddings.py # Dual embedding generation
│ │ ├── vector_store.py # Pinecone operations
│ │ ├── retriever.py # Multimodal fusion & ranking
│ │ └── rag_pipeline.py # End-to-end RAG orchestration
│ │
│ └── utils/ # Utilities
│ ├── logging.py # Structured logging
│ └── helpers.py # Common utilities
│
├── tests/ # Test suite
│ ├── test_rag_pipeline.py
│ └── __init__.py
│
├── data/ # Test documents
├── extracted_images/ # Image storage
├── requirements.txt # Python dependencies
├── .env.example # Environment template
├── docker-compose.yml # Container orchestration
├── Dockerfile # Container image
└── README.md # This file
- Python 3.10 or higher
- OpenAI API Key (access to GPT-4.1-Mini and the embeddings API required): Get API Keys
- Pinecone Account (free tier available): Create Account
- Optional: Docker & Docker Compose for containerized deployment
git clone <repository-url>
cd document-rag

python3.10 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

pip install --upgrade pip
pip install -r requirements.txt

cp .env.example .env

Edit .env with your credentials:
# Required - Core Services
OPENAI_API_KEY=sk-your-key-here
PINECONE_API_KEY=pk-your-key-here
# Optional - Configuration
PINECONE_ENVIRONMENT=us-east-1-aws
LOG_LEVEL=INFO
DEBUG=False

# Start FastAPI development server
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Or run with Docker Compose:

docker-compose up -d

The API will be available at:
- Base URL: http://localhost:8000
- API Documentation (Swagger UI): http://localhost:8000/docs
- Interactive Docs (ReDoc): http://localhost:8000/redoc
curl http://localhost:8000/health

Expected response:
{
"status": "healthy",
"openai_api": true,
"pinecone_api": true,
"timestamp": "2024-..."
}

curl -X POST http://localhost:8000/documents/upload \
  -F "file=@/path/to/document.pdf"

Response:
{
"document_id": "doc_1234567890",
"filename": "document.pdf",
"status": "completed",
"message": "Successfully processed 25 chunks from 12 pages"
}

curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What are the main topics covered?",
"top_k": 5,
"document_id": "doc_1234567890"
  }'

Response:
{
"query": "What are the main topics covered?",
"answer": "Based on the document, the main topics include...",
"sources": [
{
"content": "...",
"content_type": "text",
"metadata": {
"page_number": 1,
"chunk_id": "chunk_001",
"score": 0.95
}
}
],
"total_sources_retrieved": 3
}

- ✓ Multi-format Extraction: PDF text, image, and table extraction with fidelity preservation
- ✓ OCR Support: Automatic OCR fallback (pytesseract) for scanned PDFs
- ✓ Semantic Chunking: Intelligent text segmentation with configurable overlap
- ✓ Metadata Preservation: Maintains page numbers, chunk IDs, and document references
- ✓ Dual-Modal Embeddings:
  - Text: OpenAI text-embedding-3-small (1536-dim, up to 8k tokens)
  - Image: CLIP openai/clip-vit-base-patch32 (512-dim)
- ✓ Batch Processing: Efficient vectorization of large documents
- ✓ Similarity Search: Cosine similarity-based retrieval with configurable K
- ✓ Reciprocal Rank Fusion: Fair multimodal result combination algorithm
- ✓ Pinecone Integration: Scalable, managed vector storage
- ✓ Namespace Support: Multi-document organization and isolation
- ✓ Metadata Indexing: Rich metadata-aware searching
- ✓ Upsert Operations: Efficient bulk insert/update cycles
- ✓ Context-Aware Synthesis: GPT-4.1-Mini powered answer generation
- ✓ Source Attribution: Maintains and returns source references
- ✓ Prompt Engineering: Optimized system and user prompts
- ✓ Token Management: Configurable context window and token limits
- ✓ Streaming Support: Real-time answer generation capability
- ✓ RESTful Endpoints: Clean, documented REST API
- ✓ Pydantic Validation: Strict request/response validation
- ✓ Health Checks: Comprehensive service health monitoring
- ✓ Error Handling: Graceful error responses with detailed messages
- ✓ Async Support: asyncio throughout for concurrent processing
- ✓ Dependency Injection: Singleton pattern for service initialization
- ✓ OpenAPI/Swagger: Auto-generated API documentation
- ✓ CORS Support: Cross-origin request handling
- ✓ Structured Logging: Configurable logging with debug/info/error levels
- ✓ Request Tracking: Traceable request/response flows
- ✓ Performance Metrics: Execution timing for key operations
- ✓ Error Diagnostics: Detailed error logging for troubleshooting
All configuration is managed through the .env file. See .env.example for a template:
# ============ REQUIRED ============
# OpenAI Configuration
OPENAI_API_KEY=sk-... # Your OpenAI API key
# Pinecone Configuration
PINECONE_API_KEY=pk-... # Your Pinecone API key
# ============ OPTIONAL ============
# Pinecone Settings
PINECONE_ENVIRONMENT=us-east-1-aws # Pinecone region (default: us-east-1-aws)
# Application Settings
LOG_LEVEL=INFO # Logging level: DEBUG|INFO|WARNING|ERROR
DEBUG=False # Enable debug mode for detailed logging
# Model Parameters (use constants.py for additional tuning)
MODEL_NAME=gpt-4.1-mini # LLM model name
EMBEDDING_MODEL=text-embedding-3-small
CHUNK_SIZE=512 # Characters per chunk
CHUNK_OVERLAP=50 # Overlap between chunks
TOP_K=5 # Number of results to retrieve

Edit app/core/constants.py for advanced tuning:
# Text Processing
CHUNK_SIZE = 512 # Adjust for context preservation vs. precision
CHUNK_OVERLAP = 50 # Prevent context loss at boundaries
# Embedding Models
TEXT_EMBEDDING_MODEL = "text-embedding-3-small" # 1536-dim
IMAGE_EMBEDDING_MODEL = "openai/clip-vit-base-patch32" # 512-dim
# LLM Parameters
LLM_MODEL = "gpt-4.1-mini"
LLM_TEMPERATURE = 0.3 # 0.0=deterministic, 1.0=creative
MAX_TOKENS = 500 # Response length limit
REQUEST_TIMEOUT = 60 # API timeout in seconds
# Retrieval
TOP_K = 5 # Results to retrieve
RRF_K = 60 # RRF algorithm parameter
# Performance
BATCH_SIZE = 10 # Embedding batch size
MAX_FILE_SIZE_MB = 50 # Maximum PDF size

Complete API documentation is available at /docs (Swagger UI) and /redoc (ReDoc).
Endpoint: GET /health
Verify service availability and dependency health.
Response:
{
"status": "healthy|degraded|unhealthy",
"openai_api": true,
"pinecone_api": true,
"timestamp": "2024-01-15T10:30:00Z"
}

Status Codes:
- 200: Service is healthy
- 503: Service degraded or unavailable
Endpoint: POST /documents/upload
Process and index a PDF document for semantic search.
Request:
Content-Type: multipart/form-data
Body: file (PDF file)
Response (Status: 201):
{
"document_id": "doc_1704283200000",
"filename": "annual_report_2024.pdf",
"status": "completed",
"message": "Successfully processed 127 chunks from 42 pages",
"metadata": {
"total_pages": 42,
"chunks_created": 127,
"text_chunks": 120,
"image_chunks": 7,
"processing_time_seconds": 45.23
}
}

Error Responses:
- 400: Invalid file format or corrupted PDF
- 413: File size exceeds maximum limit (50MB)
- 500: Processing failed - check logs
Endpoint: POST /search
Query documents using semantic search and RAG-powered answer generation.
Request:
{
"query": "What are the key financial metrics?",
"document_id": "doc_1704283200000",
"top_k": 5,
"include_metadata": true
}

Parameters:
- query (string, required): Natural language question
- document_id (string, optional): Specific document to search; searches all documents if omitted
- top_k (integer, optional, default: 5): Number of results to retrieve (1-20)
- include_metadata (boolean, optional, default: true): Include source metadata
Response (Status: 200):
{
"query": "What are the key financial metrics?",
"answer": "Based on the document analysis, the key financial metrics include revenue growth of 15% YoY, EBITDA margin of 32%, and return on equity of 18%...",
"sources": [
{
"content": "Our company achieved revenue of $2.5B in 2024, representing 15% YoY growth. EBITDA margin improved to 32% from 28% in 2023.",
"content_type": "text",
"score": 0.96,
"metadata": {
"document_id": "doc_1704283200000",
"page_number": 12,
"chunk_id": "chunk_0042",
"chunk_position": "2:3"
}
},
{
"content": "[Image: Financial Dashboard]",
"content_type": "image",
"score": 0.87,
"metadata": {
"document_id": "doc_1704283200000",
"page_number": 14,
"chunk_id": "chunk_0048"
}
}
],
"total_sources_retrieved": 2,
"execution_time_seconds": 2.34
}

Error Responses:
- 400: Invalid query or parameters
- 404: Document not found
- 503: Vector database unavailable
Endpoint: POST /documents/batch
Coming Soon: Upload and process multiple documents concurrently.
# Run all tests
pytest -v
# Run with coverage report
pytest --cov=app --cov-report=html tests/
# Run specific test file
pytest tests/test_rag_pipeline.py -v
# Run with markers
pytest -m "integration" -v

tests/
├── test_rag_pipeline.py # End-to-end RAG flow
├── test_document_processor.py # PDF processing
├── test_embeddings.py # Embedding generation
├── test_retriever.py # Retrieval logic
└── conftest.py # Fixtures & setup
import pytest
from app.services.embeddings import EmbeddingService
@pytest.mark.asyncio
async def test_text_embedding_generation():
    service = EmbeddingService()
    result = await service.embed_texts(["Hello world"])
    assert result.shape == (1, 1536)
    assert result.dtype == "float32"

Text Embeddings: Transforms text into 1536-dimensional vectors using OpenAI's text-embedding-3-small model. These vectors capture semantic meaning, enabling similarity-based search.
Image Embeddings: Transforms images into 512-dimensional vectors using OpenAI's CLIP model. Enables visual content retrieval and multimodal search.
Why it matters: Embeddings convert unstructured text/images into structured numerical representations that allow semantic similarity comparison.
Chunks documents intelligently to preserve semantic meaning while maintaining context. Unlike naive token-based splitting, semantic chunking respects sentence and paragraph boundaries.
Parameters:
- Chunk size: 512 characters (roughly 128 tokens at ~4 characters per token)
- Overlap: 50 chars (prevents context loss)
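The sliding-window behavior can be sketched in a few lines. This is a simplified, hypothetical version; the project's actual chunker additionally respects sentence and paragraph boundaries:

```python
# Simplified overlap-based chunking sketch (hypothetical; the real
# text_chunker also honors sentence/paragraph boundaries).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # window advances by 462 chars with defaults
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share their last/first 50 characters, so a sentence that straddles a boundary still appears whole in at least one chunk.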
Retrieves semantically similar content using the cosine similarity metric:

$$\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$
Returns top-K most similar chunks based on distance to query embedding.
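The metric itself is straightforward to express in plain Python (in production the comparison is delegated to Pinecone; this is only an illustration of the computation):

```python
import math

# Cosine similarity between two embedding vectors: the dot product of the
# vectors divided by the product of their magnitudes. Ranges from -1 to 1;
# higher means more semantically similar.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```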
Combines rankings from multiple retrieval sources (text & image) fairly:

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$

Where:
- $k$ = constant (typically 60)
- $r(d)$ = rank of document $d$ in result set $r$
Advantage: Prevents any single modality from dominating results.
Three-step process:
- Retrieval: Fetch relevant context from vector database
- Augmentation: Insert context into LLM prompt
- Generation: Generate answer using LLM with context
Formula:

$$\text{answer} = \text{LLM}\big(\text{prompt}(\text{query},\ \text{retrieve}(\text{query}))\big)$$
Benefits:
- Improves answer accuracy with source documents
- Enables fact-checking against sources
- Reduces hallucinations
- Provides source attribution
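The augmentation step can be sketched as a prompt builder that stitches retrieved chunks into the LLM input. The wording and field names below are illustrative, not the project's actual prompt template:

```python
# Hedged sketch of the "augmentation" step: retrieved chunks become a
# numbered context block so the model can cite its sources.
def build_rag_prompt(query: str, contexts: list[dict]) -> str:
    context_block = "\n\n".join(
        f"[Source {i} | page {c.get('page_number', '?')}]\n{c['content']}"
        for i, c in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by number.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Constraining the model to the supplied context is what enables the source attribution and hallucination reduction listed above.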
- Query Embedding: ~50 tokens
- Retrieved Context: ~400 tokens (5 chunks × 80 tokens)
- System Prompt: ~100 tokens
- Answer Generation: ~500 tokens allowed
Total: ~1000 tokens (well within the model's context window)
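The budget above is simple arithmetic (figures are the estimates from this section):

```python
# Rough per-query token budget, using the estimates listed above.
QUERY_TOKENS = 50
CHUNKS = 5
TOKENS_PER_CHUNK = 80
SYSTEM_PROMPT_TOKENS = 100
MAX_ANSWER_TOKENS = 500  # matches MAX_TOKENS in constants.py

total = QUERY_TOKENS + CHUNKS * TOKENS_PER_CHUNK + SYSTEM_PROMPT_TOKENS + MAX_ANSWER_TOKENS
print(total)  # 1050
```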
- FastAPI Documentation - Web framework
- OpenAI API Reference - LLM API
- Pinecone Documentation - Vector database
- Pydantic Docs - Data validation
- PyMuPDF (fitz) - PDF text extraction
- pdfplumber - Table extraction
- pytesseract - OCR
- Attention is All You Need - Transformers
- CLIP: Learning Transferable Models for Image Classification
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Assuming 1000 document uploads + 5000 queries:
| Service | Cost | Notes |
|---|---|---|
| OpenAI Embeddings | $1-2 | ~6M tokens |
| OpenAI GPT-4.1-Mini | $10-20 | ~500K input + 100K output tokens |
| Pinecone (Free Tier) | $0 | Up to 1M vectors |
| Hosting (AWS t3.medium) | $40 | 2 vCPU, 4GB RAM |
| Total | ~$50-65 | Excluding any storage overage |
This project is licensed under the MIT License - see the LICENSE file for details.