A comprehensive data processing pipeline for analyzing PDF documents with OCR, text extraction, Named Entity Recognition (NER), embeddings generation, and vector search capabilities.
- Documentation Index - Complete documentation catalog
- Quick Start Guide - Get started in minutes
- Agent Capability Matrix - AI agent system overview
- MCP Server - API access to all functionality
- Repository Structure - Project organization
The Epstein project provides a complete pipeline for processing government documents, with capabilities including:
- OCR Processing: Convert scanned PDFs to searchable text
- Text Extraction: Extract and clean text from documents
- Entity Recognition: Identify people, organizations, locations, dates
- Vector Embeddings: Generate semantic embeddings for search
- Database Storage: PostgreSQL for structured data, Qdrant for vector search
- Multi-Agent System: Specialized AI agents for different tasks
- MCP Servers: RESTful APIs for programmatic access
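To make the "vector search" capability concrete, here is a minimal sketch of similarity ranking over embeddings in plain Python. It is not the project's implementation: the project delegates this work to Qdrant, and the document IDs and three-dimensional vectors below are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}

if __name__ == "__main__":
    for doc_id, score in search([1.0, 0.0, 0.0], INDEX):
        print(f"{doc_id}: {score:.3f}")
```

A vector database performs the same ranking with approximate nearest-neighbor indexes, so it does not have to compare the query against every stored vector.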
- /agents/ - AI agent implementations (9 specialized agents)
- /epstein/ - Core pipeline code for document processing
- /mcp_servers/ - Model Context Protocol servers
- /tools/ - Reusable tools and Mission Control dashboard
- /scripts/ - Utility scripts for operations
- /docs/ - Comprehensive documentation
- /knowledge_base/ - Knowledge base for AI agents
- /tests/ - Test suite
- Python 3.10+
- Docker & Docker Compose
- PostgreSQL 15+
- Qdrant vector database
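Before running anything, part of the list above can be checked locally. The sketch below is an illustration only and says nothing about what `scripts/doctor.py` actually does. It assumes the Docker CLI is on `PATH` as `docker`, and it does not probe PostgreSQL or Qdrant, since those may run inside containers.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)
# Assumption: the Docker CLI is installed under the name "docker".
REQUIRED_COMMANDS = ["docker"]


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a dict mapping each requirement to True (met) or False."""
    if version_info is None:
        version_info = sys.version_info
    results = {"python>=3.10": tuple(version_info[:2]) >= MIN_PYTHON}
    for cmd in REQUIRED_COMMANDS:
        # shutil.which returns None when the command is not found on PATH
        results[cmd] = which(cmd) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.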
# 1. Health check
python scripts/doctor.py
# 2. Bootstrap environment
make bootstrap
# 3. Start services
make vectordb-up
# 4. Initialize pipeline
make pipeline-init
# 5. Run pipeline
make pipeline-run
# 6. Load results
make db-load
For detailed setup instructions, see the Quick Start Guide.
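The six quick-start commands can also be driven from one script that stops at the first failure. This is a sketch, not part of the repository: it assumes it is run from the repository root and that each command signals failure through a non-zero exit code.

```python
import subprocess

# Commands copied from the quick-start steps, in order.
STEPS = [
    ["python", "scripts/doctor.py"],
    ["make", "bootstrap"],
    ["make", "vectordb-up"],
    ["make", "pipeline-init"],
    ["make", "pipeline-run"],
    ["make", "db-load"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            break
        completed.append(cmd)
    return completed


if __name__ == "__main__":
    done = run_steps(STEPS)
    print(f"{len(done)}/{len(STEPS)} steps completed")
```

Stopping early matters here because later steps, such as loading results, depend on the output of earlier ones.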
Automated GitHub Actions workflow for document processing:
- Download from DOJ, FBI, House Oversight sources
- OCR processing with Tesseract
- Text extraction and manifest generation
- Optional Cloudflare R2 upload
- GitHub releases for datasets
Quick Start Guide | Full Documentation
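The manifest generation step in the workflow above can be pictured as follows. The manifest layout (`count`, `files`, `sha256`) and the `*.txt` naming are assumptions made for illustration; the project's actual manifest format is not described here.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(text_dir):
    """Describe every .txt file in text_dir by name, size, and SHA-256."""
    entries = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        data = path.read_bytes()
        entries.append({
            "file": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"count": len(entries), "files": entries}


if __name__ == "__main__":
    # Hypothetical output directory; adjust to wherever extracted text lands.
    print(json.dumps(build_manifest("output/text"), indent=2))
```

Recording a checksum per file lets a consumer of a released dataset confirm that a download is complete and unmodified.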
9 specialized agents for different tasks:
- Epstein Data Processor - Core document processing
- Entity Extraction Agent - NER and relationship extraction
- Vector DB Analyzer - Semantic search and analysis
- Database Troubleshooter - PostgreSQL optimization
- Pipeline Monitor - Health monitoring and alerts
- Document Analysis Agent - Content analysis
- Codex Agent - Code generation and explanation
- GovInfo Downloader - Government document retrieval
- Multi-Agent Orchestrator - Task coordination
Agent Documentation | Agent README
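One way to picture the Multi-Agent Orchestrator's task coordination is a registry that routes each task to a handler by type. The class, the task-type strings, and the handler below are all hypothetical; the project's real agent interfaces may look quite different.

```python
from typing import Callable, Dict


class Orchestrator:
    """Routes a task dict to whichever handler is registered for its type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], dict]] = {}

    def register(self, task_type: str, handler: Callable[[dict], dict]) -> None:
        self._handlers[task_type] = handler

    def dispatch(self, task: dict) -> dict:
        handler = self._handlers.get(task["type"])
        if handler is None:
            raise KeyError(f"no agent registered for task type {task['type']!r}")
        return handler(task)


def extract_entities(task: dict) -> dict:
    # Placeholder for an NER agent; a real handler would call a model.
    return {"agent": "entity-extraction", "document": task["document"]}


if __name__ == "__main__":
    orchestrator = Orchestrator()
    orchestrator.register("ner", extract_entities)
    print(orchestrator.dispatch({"type": "ner", "document": "example.pdf"}))
```

Keeping routing in one place means a new agent can be added by registering a handler, without touching the callers.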
RESTful API servers for programmatic access:
- Comprehensive MCP Server - Complete API for all functionality
- Files Downloader MCP - Document download management
┌─────────────────────────────────────────────┐
│         AI Agent System (9 Agents)          │
│   Document Processing | Entity Extraction   │
│    Vector Search | Database | Monitoring    │
└─────────────────────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
   ┌──────────┐   ┌──────────┐   ┌──────────┐
   │ Pipeline │   │   MCP    │   │  Tools   │
   │  Engine  │   │ Servers  │   │  & UI    │
   └──────────┘   └──────────┘   └──────────┘
        │              │              │
        └──────────────┼──────────────┘
                       ▼
            ┌──────────────────────┐
            │     Data Storage     │
            │ PostgreSQL | Qdrant  │
            └──────────────────────┘
make bootstrap # Setup environment
make doctor # Health checks
make lint # Code quality checks
make test # Run tests
make format # Format code
make pipeline-run # Run pipeline
make db-load # Load data to database
See Makefile for all commands.
epstein/
├── agents/          # AI agent implementations
├── mcp_servers/     # MCP protocol servers
├── tools/           # Reusable tools
├── epstein/         # Core pipeline code
├── scripts/         # Utility scripts
├── docs/            # Documentation
├── tests/           # Test suite
└── knowledge_base/  # AI agent knowledge
See Repository Structure for details.
- Documentation Index - Complete catalog
- Repository Structure - Organization guide
- Agent Capability Matrix - Agent overview
- MCP Server API - API reference
- Knowledge Base - Technical knowledge
- User Manual - Complete user guide
- Issues: GitHub Issues
- Documentation: docs/ directory
- Examples: examples/ directory
See repository for license information.
Version: 2.0.0
Last Updated: 2026-01-15
Maintainer: Epstein Project Team