Epstein Files Document Analysis Pipeline - Complete OCR, NER, and Vector Database Processing System

cbwinslow/epstein

Epstein Project

A data processing pipeline for analyzing PDF documents: OCR, text extraction, Named Entity Recognition (NER), embedding generation, and vector search.

Quick Links

Overview

The Epstein project provides a complete pipeline for processing government documents, with capabilities including:

  • OCR Processing: Convert scanned PDFs to searchable text
  • Text Extraction: Extract and clean text from documents
  • Entity Recognition: Identify people, organizations, locations, dates
  • Vector Embeddings: Generate semantic embeddings for search
  • Database Storage: PostgreSQL for structured data, Qdrant for vector search
  • Multi-Agent System: Specialized AI agents for different tasks
  • MCP Servers: RESTful APIs for programmatic access
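The vector search capability listed above rests on comparing embedding vectors by similarity. The following is a minimal, standard-library sketch of that idea, not the project's implementation: the function names, the toy three-dimensional vectors, and the document IDs are all hypothetical, and a real deployment would delegate this ranking to Qdrant rather than scan vectors in Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query, index, k=3):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: document ID -> embedding vector.
TOY_INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}
```

Calling `top_k([1.0, 0.0, 0.0], TOY_INDEX, k=2)` ranks `doc-a` first and `doc-c` second, because `doc-c` points partly in the query's direction while `doc-b` is orthogonal to it.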

Main Components

Getting Started

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose
  • PostgreSQL 15+
  • Qdrant vector database

Quick Start

# 1. Health check
python scripts/doctor.py

# 2. Bootstrap environment
make bootstrap

# 3. Start services
make vectordb-up

# 4. Initialize pipeline
make pipeline-init

# 5. Run pipeline
make pipeline-run

# 6. Load results
make db-load
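
The six steps above can also be chained from a small script that stops at the first failing command. This is a sketch under assumptions: it uses only the commands shown in the Quick Start, the `run_steps` helper and its `runner` parameter are hypothetical names introduced here, and it assumes the script is run from the repository root.

```python
import subprocess

# Commands copied from the Quick Start steps, in order.
QUICK_START_STEPS = [
    ["python", "scripts/doctor.py"],
    ["make", "bootstrap"],
    ["make", "vectordb-up"],
    ["make", "pipeline-init"],
    ["make", "pipeline-run"],
    ["make", "db-load"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    `runner` defaults to subprocess.run; it is injectable so the sequence
    can be exercised without actually invoking make.
    """
    completed = []
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            raise RuntimeError(
                f"step failed: {' '.join(cmd)} (exit code {result.returncode})"
            )
        completed.append(cmd)
    return completed
```

Calling `run_steps(QUICK_START_STEPS)` runs the full sequence. Failing fast matters here because later steps, such as loading results, presumably depend on the earlier ones having succeeded.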

For detailed setup instructions, see:

Key Features

OCR Workflow

Automated GitHub Actions workflow for document processing:

  • Download from DOJ, FBI, House Oversight sources
  • OCR processing with Tesseract
  • Text extraction and manifest generation
  • Optional Cloudflare R2 upload
  • GitHub releases for datasets
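
To illustrate the manifest generation step, here is a minimal sketch that records a size and SHA-256 checksum for each extracted text file. The manifest layout (`count`, `files`, `sha256`), the `*.txt` naming, and the function names are assumptions made for this example. The workflow's actual manifest format is not described here and may differ.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(text_dir):
    """Describe every .txt file in text_dir by name, size, and SHA-256."""
    entries = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        data = path.read_bytes()
        entries.append({
            "file": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"count": len(entries), "files": entries}

def write_manifest(text_dir, out_path):
    """Write the manifest as indented JSON and return it."""
    manifest = build_manifest(text_dir)
    Path(out_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

A checksum per file lets downstream consumers of a dataset release verify that what they downloaded matches what the OCR run produced.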

Quick Start Guide | Full Documentation

AI Agent System

Nine specialized agents, each responsible for a distinct task:

  • Epstein Data Processor - Core document processing
  • Entity Extraction Agent - NER and relationship extraction
  • Vector DB Analyzer - Semantic search and analysis
  • Database Troubleshooter - PostgreSQL optimization
  • Pipeline Monitor - Health monitoring and alerts
  • Document Analysis Agent - Content analysis
  • Codex Agent - Code generation and explanation
  • GovInfo Downloader - Government document retrieval
  • Multi-Agent Orchestrator - Task coordination
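
One way to picture the task coordination role of the Multi-Agent Orchestrator is a registry that routes each task to the agent registered for its type. The sketch below is purely illustrative: the `Orchestrator` class, its methods, and the task dictionary shape are hypothetical and are not taken from the repository's agent code.

```python
from typing import Callable, Dict

class Orchestrator:
    """Route tasks to handlers by task type (hypothetical design)."""

    def __init__(self) -> None:
        self._agents: Dict[str, Callable[[dict], dict]] = {}

    def register(self, task_type: str, handler: Callable[[dict], dict]) -> None:
        """Associate a task type with the agent callable that handles it."""
        self._agents[task_type] = handler

    def dispatch(self, task: dict) -> dict:
        """Send the task to its registered agent, or fail loudly if none exists."""
        task_type = task.get("type")
        handler = self._agents.get(task_type)
        if handler is None:
            raise KeyError(f"no agent registered for task type {task_type!r}")
        return handler(task)
```

For example, after `orch.register("ner", some_ner_function)`, a call to `orch.dispatch({"type": "ner", "text": "..."})` would invoke that function with the task. Raising on unknown task types keeps misrouted work visible instead of silently dropping it.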

Agent Documentation | Agent README

MCP Servers

RESTful API servers for programmatic access:

  • Comprehensive MCP Server - Complete API for all functionality
  • Files Downloader MCP - Document download management

API Documentation

Architecture

┌─────────────────────────────────────────────┐
│         AI Agent System (9 Agents)          │
│  Document Processing | Entity Extraction    │
│  Vector Search | Database | Monitoring      │
└─────────────────────────────────────────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Pipeline │  │   MCP    │  │  Tools   │
│  Engine  │  │  Servers │  │  & UI    │
└──────────┘  └──────────┘  └──────────┘
      │              │              │
      └──────────────┼──────────────┘
                     ▼
         ┌──────────────────────┐
         │   Data Storage       │
         │ PostgreSQL | Qdrant  │
         └──────────────────────┘

Development

Available Commands

make bootstrap       # Setup environment
make doctor          # Health checks
make lint            # Code quality checks
make test            # Run tests
make format          # Format code
make pipeline-run    # Run pipeline
make db-load         # Load data to database

See Makefile for all commands.

Project Structure

epstein/
├── agents/          # AI agent implementations
├── mcp_servers/     # MCP protocol servers
├── tools/           # Reusable tools
├── epstein/         # Core pipeline code
├── scripts/         # Utility scripts
├── docs/            # Documentation
├── tests/           # Test suite
└── knowledge_base/  # AI agent knowledge

See Repository Structure for details.

Documentation

Support

License

See repository for license information.


Version: 2.0.0
Last Updated: 2026-01-15
Maintainer: Epstein Project Team
