Epstein Project

A data processing pipeline for analyzing PDF documents. It covers OCR, text extraction, Named Entity Recognition (NER), embedding generation, and vector search.

Quick Links

📖 Documentation Index - Complete documentation catalog
🚀 Quick Start Guide - Get started in minutes
🤖 Agent Capability Matrix - AI agent system overview
🔌 MCP Server - API access to all functionality
📂 Repository Structure - Project organization

Overview

The Epstein project provides a complete pipeline for processing government documents, with capabilities including:

OCR Processing: Convert scanned PDFs to searchable text
Text Extraction: Extract and clean text from documents
Entity Recognition: Identify people, organizations, locations, dates
Vector Embeddings: Generate semantic embeddings for search
Database Storage: PostgreSQL for structured data, Qdrant for vector search
Multi-Agent System: Specialized AI agents for different tasks
MCP Servers: RESTful APIs for programmatic access
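
To make the vector search capability concrete, here is a minimal sketch that ranks a few toy embeddings by cosine similarity against a query vector. The document IDs and three-dimensional vectors are invented for illustration, and the function names are hypothetical. In the pipeline itself, Qdrant is the component that performs this kind of nearest-neighbour lookup over real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Rank (doc_id, vector) pairs by similarity to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical document IDs with made-up 3-dimensional embeddings.
TOY_INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.1, 0.0], TOY_INDEX):
        print(f"{doc_id}: {score:.3f}")
```

Real embeddings have hundreds of dimensions and a vector database uses approximate indexes rather than a full scan, but the ranking idea is the same.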

Main Components

/agents/ - AI agent implementations (9 specialized agents)
/epstein/ - Core pipeline code for document processing
/mcp_servers/ - Model Context Protocol servers
/tools/ - Reusable tools and Mission Control dashboard
/scripts/ - Utility scripts for operations
/docs/ - Comprehensive documentation
/knowledge_base/ - Knowledge base for AI agents
/tests/ - Test suite

Getting Started

Prerequisites

Python 3.10+
Docker & Docker Compose
PostgreSQL 15+
Qdrant vector database
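
The prerequisites can be checked with a few lines of Python before running anything else. The sketch below verifies the interpreter version and looks for `docker` and `make` on `PATH`. The function names are hypothetical, and this does not reproduce `scripts/doctor.py`, whose contents are not shown here. It also does not check that PostgreSQL or Qdrant are reachable.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version from the prerequisites list
REQUIRED_TOOLS = ["docker", "make"]  # assumed executable names; adjust locally


def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not check_python():
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    for tool in missing_tools():
        print(f"missing executable: {tool}")
```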

Quick Start

# 1. Health check
python scripts/doctor.py

# 2. Bootstrap environment
make bootstrap

# 3. Start services
make vectordb-up

# 4. Initialize pipeline
make pipeline-init

# 5. Run pipeline
make pipeline-run

# 6. Load results
make db-load
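
The six steps above are order-dependent, so a small wrapper that stops at the first failure can be convenient. The following is a sketch under that assumption: the commands are copied from the Quick Start, while `run_steps` and its `dry_run` flag are hypothetical helpers, not part of the repository.

```python
import subprocess

# Commands copied from the Quick Start steps, in order.
QUICK_START_STEPS = [
    ["python", "scripts/doctor.py"],
    ["make", "bootstrap"],
    ["make", "vectordb-up"],
    ["make", "pipeline-init"],
    ["make", "pipeline-run"],
    ["make", "db-load"],
]


def run_steps(steps, runner=subprocess.run, dry_run=True):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the commands that completed (or would run, in dry-run mode).
    """
    completed = []
    for command in steps:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            result = runner(command)
            if result.returncode != 0:
                print("step failed:", " ".join(command))
                break
        completed.append(command)
    return completed


if __name__ == "__main__":
    # Dry run by default so nothing is executed by accident.
    run_steps(QUICK_START_STEPS, dry_run=True)
```

Pass `dry_run=False` from the repository root to execute the commands for real.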

For detailed setup instructions, see the Quick Start Guide and the User Manual.

Key Features

OCR Workflow

Automated GitHub Actions workflow for document processing:

Download from DOJ, FBI, House Oversight sources
OCR processing with Tesseract
Text extraction and manifest generation
Optional Cloudflare R2 upload
GitHub releases for datasets

Quick Start Guide | Full Documentation
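
As an illustration of the manifest generation step, the sketch below walks a directory of extracted text files and records each file's relative path, size, and SHA-256 digest in a JSON manifest. The manifest fields and function names are assumptions made for this example; the workflow's actual manifest schema is not documented here.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(text_dir):
    """Describe every .txt file under text_dir with its size and SHA-256."""
    entries = []
    for path in sorted(Path(text_dir).rglob("*.txt")):
        data = path.read_bytes()
        entries.append(
            {
                "path": path.relative_to(text_dir).as_posix(),
                "bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
        )
    return {"file_count": len(entries), "files": entries}


def write_manifest(text_dir, manifest_path):
    """Build the manifest and write it to manifest_path as indented JSON."""
    manifest = build_manifest(text_dir)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Recording a content hash per file lets a later run detect which documents changed without repeating OCR on everything.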

AI Agent System

9 specialized agents for different tasks:

Epstein Data Processor - Core document processing
Entity Extraction Agent - NER and relationship extraction
Vector DB Analyzer - Semantic search and analysis
Database Troubleshooter - PostgreSQL optimization
Pipeline Monitor - Health monitoring and alerts
Document Analysis Agent - Content analysis
Codex Agent - Code generation and explanation
GovInfo Downloader - Government document retrieval
Multi-Agent Orchestrator - Task coordination

Agent Documentation | Agent README
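
One simple way to picture task coordination is a lookup table from task type to agent. The sketch below uses the agent names from the list above, but the task-type labels, the `route` and `plan` helpers, and the fallback behaviour are hypothetical. The real orchestrator may work quite differently.

```python
# Task-type labels are invented for illustration; agent names come from the list above.
ROUTES = {
    "process_document": "Epstein Data Processor",
    "extract_entities": "Entity Extraction Agent",
    "semantic_search": "Vector DB Analyzer",
    "tune_database": "Database Troubleshooter",
    "check_health": "Pipeline Monitor",
    "analyze_content": "Document Analysis Agent",
    "generate_code": "Codex Agent",
    "download_govinfo": "GovInfo Downloader",
}
FALLBACK_AGENT = "Multi-Agent Orchestrator"


def route(task_type):
    """Return the agent name for a task type, or the orchestrator if unknown."""
    return ROUTES.get(task_type, FALLBACK_AGENT)


def plan(task_types):
    """Pair each task type with its agent, preserving the input order."""
    return [(task_type, route(task_type)) for task_type in task_types]


if __name__ == "__main__":
    for task_type, agent in plan(["extract_entities", "semantic_search", "something_else"]):
        print(f"{task_type} -> {agent}")
```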

MCP Servers

RESTful API servers for programmatic access:

Comprehensive MCP Server - Complete API for all functionality
Files Downloader MCP - Document download management

API Documentation

Architecture

┌─────────────────────────────────────────────┐
│         AI Agent System (9 Agents)          │
│  Document Processing | Entity Extraction    │
│  Vector Search | Database | Monitoring      │
└─────────────────────────────────────────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Pipeline │  │   MCP    │  │  Tools   │
│  Engine  │  │  Servers │  │  & UI    │
└──────────┘  └──────────┘  └──────────┘
      │              │              │
      └──────────────┼──────────────┘
                     ▼
         ┌──────────────────────┐
         │   Data Storage       │
         │ PostgreSQL | Qdrant  │
         └──────────────────────┘

Development

Available Commands

make bootstrap       # Setup environment
make doctor          # Health checks
make lint            # Code quality checks
make test            # Run tests
make format          # Format code
make pipeline-run    # Run pipeline
make db-load         # Load data to database

See Makefile for all commands.
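
To see every target without reading the Makefile by hand, a short script can scan it for rule lines. This is a deliberately simple sketch: the regular expression and the `list_targets` helper are assumptions of this example, and they ignore pattern rules, includes, and other advanced Make syntax.

```python
import re

# A rule line starts with a target name followed by ":" but not ":=" (an assignment).
TARGET_PATTERN = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)")


def list_targets(makefile_text):
    """Return target names in order of first appearance, skipping special targets."""
    targets = []
    for line in makefile_text.splitlines():
        match = TARGET_PATTERN.match(line)
        if not match:
            continue
        name = match.group(1)
        # Names starting with "." (such as .PHONY) are Make directives, not commands.
        if not name.startswith(".") and name not in targets:
            targets.append(name)
    return targets


if __name__ == "__main__":
    with open("Makefile", encoding="utf-8") as handle:
        for target in list_targets(handle.read()):
            print(target)
```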

Project Structure

epstein/
├── agents/          # AI agent implementations
├── mcp_servers/     # Model Context Protocol (MCP) servers
├── tools/           # Reusable tools
├── epstein/         # Core pipeline code
├── scripts/         # Utility scripts
├── docs/            # Documentation
├── tests/           # Test suite
└── knowledge_base/  # AI agent knowledge

See Repository Structure for details.

Documentation

📖 Documentation Index - Complete catalog
🏗️ Repository Structure - Organization guide
🤖 Agent Capability Matrix - Agent overview
🔌 MCP Server API - API reference
📚 Knowledge Base - Technical knowledge
🔧 User Manual - Complete user guide

Support

Issues: GitHub Issues
Documentation: docs/ directory
Examples: examples/ directory

License

See repository for license information.

Version: 2.0.0
Last Updated: 2026-01-15
Maintainer: Epstein Project Team

