An intelligent agent-based system for discovering and downloading Epstein-related documents from justice.gov.
This project uses a smart agent architecture that separates:
- Mechanical work (CLI tools: curl, htmlq) - crawling and extraction
- Cognitive work (Human/AI review) - deciding which links to follow
- Automation (Playwright) - browser-based downloads
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   CLI Tools     │─────→│ Human/AI Review  │─────→│   Playwright    │
│  (curl, htmlq)  │      │   Select links   │      │  Download PDFs  │
│                 │      │                  │      │                 │
│ • Extract links │      │ • Relevance      │      │ • Handle Akamai │
│ • Crawl pages   │      │ • Priority       │      │ • Fetch files   │
│ • Get content   │      │ • Depth          │      │ • Save to disk  │
└─────────────────┘      └──────────────────┘      └─────────────────┘
```
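The mechanical/cognitive/automated split above can be illustrated in miniature. This is only a sketch of the idea, not the project's actual API: `extract_links`, `select_links`, and the sample page are all illustrative stand-ins.

```python
# Sketch of the three-stage pipeline (mechanical -> cognitive -> automated).
# All names here are illustrative, not the project's real API.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Mechanical stage: pull every href out of a page (what curl + htmlq do)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def select_links(links):
    """Cognitive stage: a human or model decides what matters.
    A trivial heuristic stands in for that review here."""
    return [l for l in links if l.lower().endswith(".pdf")]

# Demo on a small inline page (no network access needed):
page = '<a href="/press-release">News</a> <a href="/files/exhibit-1.pdf">Exhibit 1</a>'
links = extract_links(page, "https://www.justice.gov/usao-sdny")
print(select_links(links))  # -> ['https://www.justice.gov/files/exhibit-1.pdf']
```

Selected PDF URLs would then be handed to the Playwright stage for the actual download.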
Run the complete workflow with a single command:

```bash
# Full workflow: discover → download → process
python3 scripts/agent_orchestrator.py workflow

# Or step-by-step:
python3 scripts/agent_orchestrator.py discover   # Find PDFs
python3 scripts/agent_orchestrator.py download   # Download them
python3 scripts/agent_orchestrator.py process    # Organize files

# Check status:
python3 scripts/agent_orchestrator.py report
```

For hands-on discovery with human review:

```bash
python3 scripts/spider_agent.py "https://www.justice.gov/usao-sdny"
```

The spider will:
- Crawl the page using CLI tools
- Show you discovered PDFs and links
- Ask which links to follow next
- Recurse until PDFs are found
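The crawl-show-ask-recurse loop above can be sketched as a small recursive function. This is a simplified model, not `spider_agent.py` itself: `fetch` and `choose` are stand-ins for the real spider's CLI extraction and human/AI link selection, and the fake `site` dict replaces live justice.gov pages.

```python
# Minimal sketch of the interactive crawl loop. fetch() and choose() are
# stand-ins: the real spider shells out to curl/htmlq for extraction and
# asks a human (or model) which links to follow.
def spider(url, fetch, choose, seen=None):
    """Crawl url, collect PDF links, recurse into whatever choose() picks."""
    seen = seen or set()
    if url in seen:
        return []
    seen.add(url)
    links = fetch(url)                                   # mechanical: extract links
    pdfs = [l for l in links if l.endswith(".pdf")]
    pages = [l for l in links if not l.endswith(".pdf")]
    for nxt in choose(pages):                            # cognitive: pick follow-ups
        pdfs += spider(nxt, fetch, choose, seen)         # recurse until PDFs found
    return pdfs

# Demo with a fake site instead of live justice.gov pages:
site = {
    "/start": ["/press", "/about"],
    "/press": ["/press/exhibit-1.pdf", "/about"],
    "/about": [],
}
found = spider("/start", fetch=site.get, choose=lambda pages: pages)
print(found)  # -> ['/press/exhibit-1.pdf']
```

In the real tool, `choose` would be an `input()` prompt (interactive mode) or a relevance heuristic (batch mode).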
Non-interactive discovery for automation:

```bash
python3 scripts/discovery_agent.py "https://www.justice.gov/usao-sdny"
python3 scripts/download_orchestrator.py   # Downloads discovered PDFs
bash scripts/process.sh                    # Organizes files
```

Setup:

```bash
# Install CLI tools
brew install htmlq

# Install Python dependencies
pip3 install playwright
python3 -m playwright install chromium

# Make scripts executable
chmod +x scripts/*.py scripts/*.sh
```

Testing:

```bash
# Test syntax
bash -n scripts/download.sh
python3 -m py_compile scripts/agent_orchestrator.py

# Test agent
python3 scripts/agent_orchestrator.py report
```

Project structure:

```
epstein-files/
├── scripts/
│   ├── agent_orchestrator.py     # Universal agent interface ⭐
│   ├── spider_agent.py           # Interactive crawler
│   ├── discovery_agent.py        # Batch discovery tool
│   ├── download_orchestrator.py  # PDF download manager
│   ├── download_playwright.py    # URL-based downloader
│   ├── download.sh               # Bash/curl downloader (legacy)
│   ├── process.sh                # File organization
│   └── debug_*.sh                # Debug utilities
├── reference/                    # Downloaded files (gitignored)
├── data/
│   ├── epstein_pdfs/             # Organized PDFs
│   ├── epstein_text/             # Text files
│   └── epstein_images/           # Images
└── docs/
    ├── EXECUTION_SUMMARY.md      # Project status
    └── AGENT_INTEGRATION.md      # AI agent integration guide
```
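The scripts lean on htmlq for link extraction but also work without it via a regex fallback. A minimal version of such a fallback might look like this sketch (not the scripts' actual code):

```python
# Hypothetical regex fallback for href extraction when htmlq is absent.
# Regex is brittle against unusual markup, but adequate for simple pages.
import re

HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def links_from_html(html):
    """Return every quoted href value found in the raw HTML."""
    return HREF_RE.findall(html)

sample = '<a href="/a.pdf">A</a><A HREF=\'/b\'>B</A>'
print(links_from_html(sample))  # -> ['/a.pdf', '/b']
```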
```bash
# Discover from SDNY page
python3 scripts/agent_orchestrator.py discover --url https://www.justice.gov/usao-sdny

# Download all discovered PDFs
python3 scripts/agent_orchestrator.py download

# Organize files
python3 scripts/agent_orchestrator.py process

# Full workflow
python3 scripts/agent_orchestrator.py workflow
```

The orchestrator can also be used from Python:

```python
from scripts.agent_orchestrator import EpsteinAgent

# Create agent
agent = EpsteinAgent("/path/to/project")

# Run discovery
result = agent.discover("https://www.justice.gov/usao-sdny")
print(result.output)

# Check status
print(agent.get_report())

# Download and process
agent.download()
agent.process()
```

The system is designed to be called by AI agents. See docs/AGENT_INTEGRATION.md for:
- OpenCode slash commands
- Claude Desktop MCP server
- Claude Code tool integration
- GitHub Actions automation
- Custom agent frameworks
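For a framework not covered by that guide, one option is simply shelling out to the orchestrator CLI. A sketch of such a wrapper, assuming it runs from the project root (the `--url` flag and command names come from the usage above; everything else is an assumption):

```python
# Hypothetical wrapper an external agent framework could use to drive
# the orchestrator CLI via subprocess.
import subprocess
import sys

def build_command(command, url=None):
    """Assemble the argv for scripts/agent_orchestrator.py."""
    cmd = [sys.executable, "scripts/agent_orchestrator.py", command]
    if url:
        cmd += ["--url", url]
    return cmd

def run_agent_command(command, url=None):
    """Run the orchestrator command and return its stdout; raises on failure."""
    result = subprocess.run(build_command(command, url),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example (only works from inside the project checkout):
# print(run_agent_command("report"))
# print(run_agent_command("discover", url="https://www.justice.gov/usao-sdny"))
```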
| Command | Description | Model Calls? |
|---|---|---|
| `discover` | Interactive spider crawler | Yes - link selection |
| `batch_discover` | Non-interactive discovery | No |
| `download` | Download discovered PDFs | No |
| `process` | Organize files | No |
| `report` | Show status | No |
| `workflow` | Full pipeline | Optional |
Downloads blocked by Akamai? The bash/curl scripts cannot bypass Akamai. Use the Playwright-based tools:

```bash
# Use this instead:
python3 scripts/agent_orchestrator.py workflow
```

Discovery finding nothing? The justice.gov/epstein URL may have changed. Try:

```bash
# Southern District NY (handled the case)
python3 scripts/agent_orchestrator.py discover --url https://www.justice.gov/usao-sdny

# Or search the main DOJ site
python3 scripts/spider_agent.py "https://www.justice.gov"
```

Playwright missing?

```bash
pip3 install playwright
python3 -m playwright install chromium
```

htmlq missing?

```bash
brew install htmlq
# Or use the regex fallback (the scripts work without htmlq)
```

Current State: All syntax errors fixed, agent architecture complete
- ✅ All scripts pass syntax validation
- ✅ Agent orchestrator ready
- ✅ Spider agent with interactive mode
- ✅ Playwright integration for Akamai bypass
- ✅ File processing pipeline fixed
Known Issues:
- The justice.gov/epstein URL redirects to main DOJ homepage
- Use SDNY page (usao-sdny) or spider to find current location
Documentation:
- docs/EXECUTION_SUMMARY.md - Project status and completion notes
- docs/AGENT_INTEGRATION.md - Integration with AI agent frameworks
- docs/in_progress.kilo.md - Original work-in-progress summary
This is a research tool for accessing public DOJ documents.
Last Updated: 2026-02-03