WhenLabs-org/syphon

Syphon

Web data infrastructure platform for AI developers. Scraping, crawling, search, extraction, agent workflows, pipelines, and persistent storage -- with credit-based billing and full self-host parity.

Features

  • Scraping -- render JavaScript with Playwright, extract clean markdown + metadata, browser actions, proxy support, geolocation routing
  • Crawling -- BFS crawl with configurable depth, async job tracking, per-page results
  • Search -- web, news, images, GitHub, and research sources via SerpAPI
  • Extraction -- LLM-powered structured data extraction with JSON schema
  • Agent -- autonomous web agent with multi-model support (Claude, GPT-4o, Llama 3, Gemini)
  • URL Mapping -- discover all URLs on a domain without full content extraction
  • Batch Scraping -- submit up to 1000 URLs in a single request
  • Sessions -- persistent browser sessions with stateful actions (click, fill, scroll, screenshot)
  • Browser Profiles -- save cookies/localStorage across scrapes for auth persistence
  • Change Tracking -- monitor pages for changes with diff history and webhook notifications
  • Pipelines -- DAG-based multi-step workflows with sinks (R2, Postgres, Pinecone, Qdrant)
  • Pipeline Marketplace -- community-shared pipeline templates with one-click import
  • Scheduling -- cron-based recurring scrapes with dead letter queues
  • Quality Scoring -- rubric-based scoring (completeness, noise ratio, structure fidelity, paywall detection)
  • Quality Gates -- auto-retry with proxy escalation when quality is below threshold
  • Document Parsing -- PDF (with OCR fallback), Excel, and Word documents to markdown
  • Proxy System -- basic, enhanced (residential), or auto mode with health-scored rotation
  • Persistent Storage -- every result stored in Postgres with Cloudflare R2 for large assets
  • Credit System -- pre-pay model with per-endpoint costs and transaction logging
  • Zero Data Retention -- optional mode that returns results inline without persisting
  • Real-time Events -- SSE and WebSocket streaming for job progress
  • Webhook Triggers -- fire pipelines from external events with secret-based auth
  • SDKs -- Node.js, Python, Go, and Rust clients
  • CLI -- syphon command-line tool for all operations
  • MCP Server -- Model Context Protocol server for AI tool integration
  • Integrations -- LangChain (Python + Node), LlamaIndex, n8n, CrewAI

Tech Stack

Component         Technology
Monorepo          Turborepo + npm workspaces
API Gateway       Fastify 5, TypeScript, ESM
Scraper Worker    Node.js + Playwright + BullMQ
Pipeline Worker   Node.js + BullMQ
Quality Scorer    Python + FastAPI
Database          PostgreSQL 16 (Prisma ORM)
Queue             Redis 7 (BullMQ)
Cache             Redis 7 (separate instance)
Object Storage    Cloudflare R2 (or local filesystem)
Dashboard         Next.js 15 App Router
MCP Server        @modelcontextprotocol/sdk
CI                GitHub Actions
Containerization  Docker + Docker Compose

Project Structure

syphon/
  apps/
    api-gateway/          Fastify API server (port 3000)
    dashboard/            Next.js 15 dashboard (port 3001)
    quality-scorer/       Python FastAPI scoring service (port 8000)
    cli/                  syphon CLI tool
    mcp-server/           MCP server for AI integrations
    docs/                 Documentation site
  workers/
    scraper/              Playwright scraper + BullMQ processors
    pipeline/             Pipeline execution worker
  packages/
    database/             Prisma schema + singleton client
    shared/               Types, errors, constants
    pipeline-engine/      DAG execution engine
    storage/              R2 + local storage abstraction
  sdks/
    node/                 Official Node.js SDK
    python/               Official Python SDK
    go/                   Official Go SDK
    rust/                 Official Rust SDK
  integrations/
    langchain-python/     LangChain Python loader
    langchain-node/       LangChain Node.js loader
    llamaindex/           LlamaIndex connector
    n8n/                  n8n community node
    crewai/               CrewAI tools

Prerequisites

  • Node.js >= 20
  • npm >= 10.8
  • Docker and Docker Compose (for local infrastructure)

Getting Started

  1. Clone and install dependencies:

    git clone <repo-url>
    cd syphon
    npm install
  2. Start local infrastructure:

    docker compose up -d postgres redis-queue redis-cache
  3. Configure environment:

    cp .env.example .env
  4. Push the database schema:

    npm run db:push
  5. Start all services in dev mode:

    npm run dev

    The API gateway will be available at http://localhost:3000.

API Endpoints

All endpoints are versioned under /v1/ and require an x-api-key header unless noted otherwise.
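A minimal request-builder sketch for the x-api-key scheme described above. Assumptions: the gateway runs locally on port 3000 (see Getting Started) and accepts JSON bodies; `build_request` is an illustrative helper, not part of any SDK. Swap in any HTTP client to actually send the request.

```python
import json

API_URL = "http://localhost:3000"  # assumption: local dev gateway

def build_request(path: str, api_key: str, body: dict) -> dict:
    """Assemble method, URL, headers, and JSON body for a /v1/ endpoint."""
    return {
        "method": "POST",
        "url": f"{API_URL}{path}",
        "headers": {
            "x-api-key": api_key,                  # required on /v1/ routes
            "content-type": "application/json",
        },
        "body": json.dumps(body),
    }

req = build_request("/v1/scrape", "YOUR_API_KEY", {"url": "https://example.com"})
```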

Core

Method Endpoint Credits Description
POST /v1/scrape 1 Scrape a single URL
POST /v1/crawl 1/page Start async multi-page crawl
GET /v1/crawl/:id -- Check crawl status
DELETE /v1/crawl/:id -- Cancel a running crawl
POST /v1/map 1 Discover URLs on a domain
POST /v1/batch 1/url Batch scrape up to 1000 URLs
POST /v1/search 1 Search the web (multiple sources)
POST /v1/extract 2 LLM-powered structured extraction
POST /v1/agent 5 Autonomous web agent
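Crawls are asynchronous: POST /v1/crawl returns a job, and GET /v1/crawl/:id reports its status. A polling sketch under stated assumptions -- `get_status` is any callable that fetches and parses the status JSON, and the terminal `status` values checked here are illustrative, not documented field names:

```python
import time

def wait_for_crawl(get_status, crawl_id, poll_seconds=1.0, timeout=60.0):
    """Poll GET /v1/crawl/:id until the job reaches a terminal state.

    Terminal 'status' values ('completed', 'failed', 'cancelled') are
    assumptions; check the actual response shape before relying on them."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(crawl_id)
        if status.get("status") in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {crawl_id} did not finish within {timeout}s")
```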

Results and Diff

Method Endpoint Credits Description
GET /v1/results/:id -- Fetch a stored result
GET /v1/results -- List results (paginated)
GET /v1/results/:id/diff -- Diff two result versions

Sessions

Method Endpoint Credits Description
POST /v1/sessions -- Create a browser session
POST /v1/sessions/:id/action 2 Execute a session action
DELETE /v1/sessions/:id -- Close a session

Browser Profiles

Method Endpoint Credits Description
POST /v1/profiles -- Create a profile
GET /v1/profiles -- List profiles
DELETE /v1/profiles/:id -- Delete a profile

Change Tracking

Method Endpoint Credits Description
POST /v1/change-tracking 1/check Create a change tracker
GET /v1/change-tracking -- List trackers
GET /v1/change-tracking/:id/history -- Get change history
DELETE /v1/change-tracking/:id -- Delete a tracker

Pipelines and Scheduling

Method Endpoint Credits Description
POST /v1/pipelines -- Create a pipeline
GET /v1/pipelines -- List pipelines
POST /v1/pipelines/:id/run varies Trigger a pipeline run
POST /v1/schedules -- Create a cron schedule
GET /v1/schedules -- List schedules

Marketplace

Method Endpoint Credits Description
GET /v1/marketplace -- Browse pipeline templates
POST /v1/marketplace -- Publish a template
POST /v1/marketplace/:id/import -- Import a template

Webhook Triggers

Method Endpoint Credits Description
POST /v1/webhook-triggers -- Create a webhook trigger
POST /v1/webhooks/trigger/:id -- Fire a pipeline via webhook
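The README says triggers use secret-based auth (see WEBHOOK_SECRET below), but does not document the signing scheme. One common scheme, shown here purely as an assumption, is a hex-encoded HMAC-SHA256 over the raw request body, compared in constant time:

```python
import hashlib
import hmac

def verify_webhook(secret: str, raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time check of a hex-encoded HMAC-SHA256 over the raw body.

    Assumption: this mirrors the server's signing scheme; confirm the actual
    header name and encoding against the gateway before relying on it."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```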

Other

Method Endpoint Credits Description
POST /v1/keys -- Create an API key
GET /v1/keys -- List API keys
GET /v1/usage -- Get credit usage
GET /v1/score 0 Score content quality
GET /v1/events -- SSE event stream
GET /v1/events/ws -- WebSocket event stream
GET /health -- Health check
GET /ready -- Readiness check
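The GET /v1/events endpoint streams job progress as Server-Sent Events. A tiny parser sketch for consuming such a stream -- it handles only the standard `event:` and `data:` fields, which is enough for simple progress streams; the payload's field names are server-defined:

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines,
    e.g. the body of GET /v1/events read line by line."""
    event, data = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # blank line terminates one event
            if data:
                yield (event or "message", "\n".join(data))
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```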

Scrape Options

{
  "url": "https://example.com",
  "formats": ["markdown", "html", "links", "images", "summary", "json", "rawHtml"],
  "actions": [{ "type": "click", "selector": ".load-more" }],
  "proxy": "auto",
  "location": "US",
  "profile": "my-profile",
  "zero_data_retention": false,
  "timeout": 30000
}

Error Format

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "url is required and must be a string",
    "details": {}
  }
}
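Because every error uses this envelope, client-side handling can be generic. A small sketch (`parse_error` is an illustrative helper, not an SDK function):

```python
import json

def parse_error(body: str):
    """Return (code, message) from the documented error envelope,
    or None when the response body is not an error."""
    payload = json.loads(body)
    err = payload.get("error")
    if err is None:
        return None
    return err["code"], err["message"]
```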

Available Scripts

npm run dev          # Start all services in dev mode
npm run build        # Build all packages
npm run typecheck    # Type-check all packages
npm run test         # Run all tests
npm run db:generate  # Generate Prisma client
npm run db:push      # Push schema to database
npm run db:migrate   # Run Prisma migrations

Docker

Run the full stack with Docker Compose:

docker compose up
Service Port
PostgreSQL 5432
Redis (queue) 6379
Redis (cache) 6380
API Gateway 3000
Scraper Worker --
Pipeline Worker --
Quality Scorer 8000
Dashboard 3001

Environment Variables

Variable Description
DATABASE_URL PostgreSQL connection string
REDIS_URL Redis connection (fallback for both queue and cache)
REDIS_QUEUE_URL Redis connection for BullMQ queues
REDIS_CACHE_URL Redis connection for caching/rate limiting
PORT API server port (default: 3000)
HOST API server host (default: 0.0.0.0)
BASE_URL Public base URL
LOG_LEVEL Logging level (default: info)
CORS_ORIGIN Allowed CORS origin
NODE_ENV Environment (development/production)
ANTHROPIC_API_KEY Anthropic API key (for extraction and agent)
OPENAI_API_KEY OpenAI API key (optional, for multi-model agent)
GOOGLE_AI_API_KEY Google AI API key (optional, for Gemini agent)
OLLAMA_BASE_URL Ollama base URL (optional, default: http://localhost:11434)
SERP_API_KEY SerpAPI key (for search endpoint)
QUALITY_SCORER_URL Quality scorer service URL
R2_ACCOUNT_ID Cloudflare R2 account ID
R2_ACCESS_KEY_ID R2 access key
R2_SECRET_ACCESS_KEY R2 secret key
R2_BUCKET_NAME R2 bucket (default: syphon-storage)
R2_PUBLIC_URL R2 public URL
STORAGE_BACKEND Storage backend: r2 or local
STORAGE_LOCAL_ROOT Local storage root directory
CREDENTIAL_ENCRYPTION_KEY Key for encrypting stored credentials
WEBHOOK_SECRET Secret for webhook signature verification
SCRAPE_CONCURRENCY Concurrent scrape jobs (default: 3)
CRAWL_CONCURRENCY Concurrent crawl jobs (default: 2)
MAP_CONCURRENCY Concurrent map jobs (default: 5)
AGENT_CONCURRENCY Concurrent agent jobs (default: 2)
PIPELINE_CONCURRENCY Concurrent pipeline jobs (default: 3)
NEXT_PUBLIC_API_URL Dashboard API URL
SYPHON_API_KEY API key for CLI tool
SYPHON_API_URL API URL for CLI tool

Architecture

Client Request
     |
     v
API Gateway (Fastify)
  -> Auth middleware (x-api-key, SHA-256 hash lookup)
  -> Rate limiting (Redis-backed, per-user)
  -> Credit check & deduction (pre-pay)
  -> Dispatch job to BullMQ queue
     |
     v
Redis (BullMQ)
     |
     v
Workers (Scraper / Pipeline)
  -> Render page with Playwright (+ proxy / geolocation)
  -> Execute browser actions
  -> Extract content (Readability + Turndown -> markdown)
  -> Parse documents (PDF, Excel, Word)
  -> Score quality (Quality Scorer service)
  -> Store results in Postgres + R2
  -> Deliver webhooks
  -> Emit SSE/WebSocket events
     |
     v
Response returned to client

License

MIT
