Web data infrastructure platform for AI developers. Scraping, crawling, search, extraction, agent workflows, pipelines, and persistent storage -- with credit-based billing and full self-host parity.
- Scraping -- render JavaScript with Playwright, extract clean markdown + metadata, browser actions, proxy support, geolocation routing
- Crawling -- BFS crawl with configurable depth, async job tracking, per-page results
- Search -- web, news, images, GitHub, and research sources via SerpAPI
- Extraction -- LLM-powered structured data extraction with JSON schema
- Agent -- autonomous web agent with multi-model support (Claude, GPT-4o, Llama 3, Gemini)
- URL Mapping -- discover all URLs on a domain without full content extraction
- Batch Scraping -- submit up to 1000 URLs in a single request
- Sessions -- persistent browser sessions with stateful actions (click, fill, scroll, screenshot)
- Browser Profiles -- save cookies/localStorage across scrapes for auth persistence
- Change Tracking -- monitor pages for changes with diff history and webhook notifications
- Pipelines -- DAG-based multi-step workflows with sinks (R2, Postgres, Pinecone, Qdrant)
- Pipeline Marketplace -- community-shared pipeline templates with one-click import
- Scheduling -- cron-based recurring scrapes with dead letter queues
- Quality Scoring -- rubric-based scoring (completeness, noise ratio, structure fidelity, paywall detection)
- Quality Gates -- auto-retry with proxy escalation when quality is below threshold
- Document Parsing -- PDF (with OCR fallback), Excel, and Word documents to markdown
- Proxy System -- basic, enhanced (residential), or auto mode with health-scored rotation
- Persistent Storage -- every result stored in Postgres with Cloudflare R2 for large assets
- Credit System -- pre-pay model with per-endpoint costs and transaction logging
- Zero Data Retention -- optional mode that returns results inline without persisting
- Real-time Events -- SSE and WebSocket streaming for job progress
- Webhook Triggers -- fire pipelines from external events with secret-based auth
- SDKs -- Node.js, Python, Go, and Rust clients
- CLI -- syphon command-line tool for all operations
- MCP Server -- Model Context Protocol server for AI tool integration
- Integrations -- LangChain (Python + Node), LlamaIndex, n8n, CrewAI
| Component | Technology |
| --- | --- |
| Monorepo | Turborepo + npm workspaces |
| API Gateway | Fastify 5, TypeScript, ESM |
| Scraper Worker | Node.js + Playwright + BullMQ |
| Pipeline Worker | Node.js + BullMQ |
| Quality Scorer | Python + FastAPI |
| Database | PostgreSQL 16 (Prisma ORM) |
| Queue | Redis 7 (BullMQ) |
| Cache | Redis 7 (separate instance) |
| Object Storage | Cloudflare R2 (or local filesystem) |
| Dashboard | Next.js 15 App Router |
| MCP Server | @modelcontextprotocol/sdk |
| CI | GitHub Actions |
| Containerization | Docker + Docker Compose |
```
syphon/
  apps/
    api-gateway/       Fastify API server (port 3000)
    dashboard/         Next.js 15 dashboard (port 3001)
    quality-scorer/    Python FastAPI scoring service (port 8000)
    cli/               syphon CLI tool
    mcp-server/        MCP server for AI integrations
    docs/              Documentation site
  workers/
    scraper/           Playwright scraper + BullMQ processors
    pipeline/          Pipeline execution worker
  packages/
    database/          Prisma schema + singleton client
    shared/            Types, errors, constants
    pipeline-engine/   DAG execution engine
    storage/           R2 + local storage abstraction
  sdks/
    node/              Official Node.js SDK
    python/            Official Python SDK
    go/                Official Go SDK
    rust/              Official Rust SDK
  integrations/
    langchain-python/  LangChain Python loader
    langchain-node/    LangChain Node.js loader
    llamaindex/        LlamaIndex connector
    n8n/               n8n community node
    crewai/            CrewAI tools
```
- Node.js >= 20
- npm >= 10.8
- Docker and Docker Compose (for local infrastructure)
1. Clone and install dependencies:

   ```bash
   git clone <repo-url>
   cd syphon
   npm install
   ```

2. Start local infrastructure:

   ```bash
   docker compose up -d postgres redis-queue redis-cache
   ```

3. Configure environment (see the environment variables table below).

4. Push the database schema:

   ```bash
   npm run db:push
   ```

5. Start all services in dev mode:

   ```bash
   npm run dev
   ```

The API gateway will be available at http://localhost:3000.
All endpoints are versioned under /v1/ and require an x-api-key header unless noted otherwise.
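For example, a minimal Node 18+ client for POST /v1/scrape (the endpoint path, credit cost, and x-api-key header come from the tables below; the helper name and the assumption that the response body is JSON are illustrative):

```typescript
// Sketch of calling POST /v1/scrape with the global fetch API (Node 18+).
// buildScrapeRequest is an illustrative helper, not part of any syphon SDK.
function buildScrapeRequest(apiUrl: string, apiKey: string, payload: object) {
  return {
    url: `${apiUrl}/v1/scrape`,
    init: {
      method: "POST",
      headers: { "content-type": "application/json", "x-api-key": apiKey },
      body: JSON.stringify(payload),
    },
  };
}

async function scrape(apiUrl: string, apiKey: string, payload: object) {
  const { url, init } = buildScrapeRequest(apiUrl, apiKey, payload);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`scrape failed: HTTP ${res.status}`);
  return res.json();
}
```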
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/scrape | 1 | Scrape a single URL |
| POST | /v1/crawl | 1/page | Start async multi-page crawl |
| GET | /v1/crawl/:id | -- | Check crawl status |
| DELETE | /v1/crawl/:id | -- | Cancel a running crawl |
| POST | /v1/map | 1 | Discover URLs on a domain |
| POST | /v1/batch | 1/url | Batch scrape up to 1000 URLs |
| POST | /v1/search | 1 | Search the web (multiple sources) |
| POST | /v1/extract | 2 | LLM-powered structured extraction |
| POST | /v1/agent | 5 | Autonomous web agent |
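Crawls are asynchronous: POST /v1/crawl starts the job and GET /v1/crawl/:id reports its progress. A client-side polling loop might look like the sketch below; the `status` field values and the backoff policy are assumptions for illustration, not documented behavior.

```typescript
// Illustrative polling helper for the async crawl endpoints (Node 18+, global fetch).
// Exponential backoff capped at maxMs; the cap and base are arbitrary choices.
function nextDelay(attempt: number, baseMs = 1000, maxMs = 15000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

async function waitForCrawl(apiUrl: string, apiKey: string, crawlId: string) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(`${apiUrl}/v1/crawl/${crawlId}`, {
      headers: { "x-api-key": apiKey },
    });
    const job = await res.json();
    // "completed" / "failed" are assumed terminal states.
    if (job.status === "completed" || job.status === "failed") return job;
    await new Promise((r) => setTimeout(r, nextDelay(attempt)));
  }
}
```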
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| GET | /v1/results/:id | -- | Fetch a stored result |
| GET | /v1/results | -- | List results (paginated) |
| GET | /v1/results/:id/diff | -- | Diff two result versions |
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/sessions | -- | Create a browser session |
| POST | /v1/sessions/:id/action | 2 | Execute a session action |
| DELETE | /v1/sessions/:id | -- | Close a session |
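A session action body could look like the following; the `type`/`selector` fields mirror the browser actions shown in the scrape request example further down, while the `value` field for fill actions is an assumption:

```json
{ "type": "fill", "selector": "#search", "value": "playwright" }
```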
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/profiles | -- | Create a profile |
| GET | /v1/profiles | -- | List profiles |
| DELETE | /v1/profiles/:id | -- | Delete a profile |
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/change-tracking | 1/check | Create a change tracker |
| GET | /v1/change-tracking | -- | List trackers |
| GET | /v1/change-tracking/:id/history | -- | Get change history |
| DELETE | /v1/change-tracking/:id | -- | Delete a tracker |
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/pipelines | -- | Create a pipeline |
| GET | /v1/pipelines | -- | List pipelines |
| POST | /v1/pipelines/:id/run | varies | Trigger a pipeline run |
| POST | /v1/schedules | -- | Create a cron schedule |
| GET | /v1/schedules | -- | List schedules |
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| GET | /v1/marketplace | -- | Browse pipeline templates |
| POST | /v1/marketplace | -- | Publish a template |
| POST | /v1/marketplace/:id/import | -- | Import a template |
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/webhook-triggers | -- | Create a webhook trigger |
| POST | /v1/webhooks/trigger/:id | -- | Fire a pipeline via webhook |
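Webhook triggers use secret-based auth, and WEBHOOK_SECRET (see environment variables) is described as a signature-verification secret. A plausible verification sketch, assuming an HMAC-SHA256 hex digest of the raw request body (the actual header name and encoding are not documented here, and these helper names are illustrative):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative only: signs the raw webhook body with HMAC-SHA256 and compares
// digests in constant time. syphon's real scheme may differ.
function signPayload(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

function verifySignature(secret: string, rawBody: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```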
| Method | Endpoint | Credits | Description |
| --- | --- | --- | --- |
| POST | /v1/keys | -- | Create an API key |
| GET | /v1/keys | -- | List API keys |
| GET | /v1/usage | -- | Get credit usage |
| GET | /v1/score | 0 | Score content quality |
| GET | /v1/events | -- | SSE event stream |
| GET | /v1/events/ws | -- | WebSocket event stream |
| GET | /health | -- | Health check |
| GET | /ready | -- | Readiness check |
Example /v1/scrape request body:

```json
{
  "url": "https://example.com",
  "formats": ["markdown", "html", "links", "images", "summary", "json", "rawHtml"],
  "actions": [{ "type": "click", "selector": ".load-more" }],
  "proxy": "auto",
  "location": "US",
  "profile": "my-profile",
  "zero_data_retention": false,
  "timeout": 30000
}
```
Error responses use a consistent envelope:

```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "url is required and must be a string",
    "details": {}
  }
}
```
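Because the envelope is consistent, clients can surface `code` and `message` directly; a small sketch (the helper name is illustrative):

```typescript
interface ApiErrorEnvelope {
  error: { code: string; message: string; details: Record<string, unknown> };
}

// Turns an error response body into a single human-readable line.
function formatApiError(body: string): string {
  const { error } = JSON.parse(body) as ApiErrorEnvelope;
  return `${error.code}: ${error.message}`;
}
```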
```bash
npm run dev          # Start all services in dev mode
npm run build        # Build all packages
npm run typecheck    # Type-check all packages
npm run test         # Run all tests
npm run db:generate  # Generate Prisma client
npm run db:push      # Push schema to database
npm run db:migrate   # Run Prisma migrations
```
Run the full stack with Docker Compose:

```bash
docker compose up -d
```
| Service | Port |
| --- | --- |
| PostgreSQL | 5432 |
| Redis (queue) | 6379 |
| Redis (cache) | 6380 |
| API Gateway | 3000 |
| Scraper Worker | -- |
| Pipeline Worker | -- |
| Quality Scorer | 8000 |
| Dashboard | 3001 |
| Variable | Description |
| --- | --- |
| DATABASE_URL | PostgreSQL connection string |
| REDIS_URL | Redis connection (fallback for both queue and cache) |
| REDIS_QUEUE_URL | Redis connection for BullMQ queues |
| REDIS_CACHE_URL | Redis connection for caching/rate limiting |
| PORT | API server port (default: 3000) |
| HOST | API server host (default: 0.0.0.0) |
| BASE_URL | Public base URL |
| LOG_LEVEL | Logging level (default: info) |
| CORS_ORIGIN | Allowed CORS origin |
| NODE_ENV | Environment (development/production) |
| ANTHROPIC_API_KEY | Anthropic API key (for extraction and agent) |
| OPENAI_API_KEY | OpenAI API key (optional, for multi-model agent) |
| GOOGLE_AI_API_KEY | Google AI API key (optional, for Gemini agent) |
| OLLAMA_BASE_URL | Ollama base URL (optional, default: http://localhost:11434) |
| SERP_API_KEY | SerpAPI key (for search endpoint) |
| QUALITY_SCORER_URL | Quality scorer service URL |
| R2_ACCOUNT_ID | Cloudflare R2 account ID |
| R2_ACCESS_KEY_ID | R2 access key |
| R2_SECRET_ACCESS_KEY | R2 secret key |
| R2_BUCKET_NAME | R2 bucket (default: syphon-storage) |
| R2_PUBLIC_URL | R2 public URL |
| STORAGE_BACKEND | Storage backend: r2 or local |
| STORAGE_LOCAL_ROOT | Local storage root directory |
| CREDENTIAL_ENCRYPTION_KEY | Key for encrypting stored credentials |
| WEBHOOK_SECRET | Secret for webhook signature verification |
| SCRAPE_CONCURRENCY | Concurrent scrape jobs (default: 3) |
| CRAWL_CONCURRENCY | Concurrent crawl jobs (default: 2) |
| MAP_CONCURRENCY | Concurrent map jobs (default: 5) |
| AGENT_CONCURRENCY | Concurrent agent jobs (default: 2) |
| PIPELINE_CONCURRENCY | Concurrent pipeline jobs (default: 3) |
| NEXT_PUBLIC_API_URL | Dashboard API URL |
| SYPHON_API_KEY | API key for CLI tool |
| SYPHON_API_URL | API URL for CLI tool |
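Tying a few of these together, a minimal local-development configuration might look like this (values are illustrative; the ports match the Docker Compose table above, while the database name and credentials are assumptions):

```bash
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/syphon
REDIS_QUEUE_URL=redis://localhost:6379
REDIS_CACHE_URL=redis://localhost:6380
STORAGE_BACKEND=local
STORAGE_LOCAL_ROOT=./storage
ANTHROPIC_API_KEY=sk-ant-...
```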
```
Client Request
      |
      v
API Gateway (Fastify)
  -> Auth middleware (x-api-key, SHA-256 hash lookup)
  -> Rate limiting (Redis-backed, per-user)
  -> Credit check & deduction (pre-pay)
  -> Dispatch job to BullMQ queue
      |
      v
Redis (BullMQ)
      |
      v
Workers (Scraper / Pipeline)
  -> Render page with Playwright (+ proxy / geolocation)
  -> Execute browser actions
  -> Extract content (Readability + Turndown -> markdown)
  -> Parse documents (PDF, Excel, Word)
  -> Score quality (Quality Scorer service)
  -> Store results in Postgres + R2
  -> Deliver webhooks
  -> Emit SSE/WebSocket events
      |
      v
Response returned to client
```
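The auth step above stores only a SHA-256 hash of each API key and looks the presented key up by digest, so raw keys never sit in the database. A minimal sketch of that idea (function and store names are illustrative, not the gateway's actual code):

```typescript
import { createHash } from "node:crypto";

// Hash the raw x-api-key value; only this digest is persisted.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

// Authenticate by comparing the digest against the set of stored hashes.
function authenticate(rawKey: string, storedHashes: Set<string>): boolean {
  return storedHashes.has(hashApiKey(rawKey));
}
```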
MIT