WhenLabs-org/syphon

Syphon

Web data infrastructure platform for AI developers. Scraping, crawling, search, extraction, agent workflows, pipelines, and persistent storage -- with credit-based billing and full self-host parity.

Features

  • Scraping -- render JavaScript with Playwright, extract clean markdown + metadata, browser actions, proxy support, geolocation routing
  • Crawling -- BFS crawl with configurable depth, async job tracking, per-page results
  • Search -- web, news, images, GitHub, and research sources via SerpAPI
  • Extraction -- LLM-powered structured data extraction with JSON schema
  • Agent -- autonomous web agent with multi-model support (Claude, GPT-4o, Llama 3, Gemini)
  • URL Mapping -- discover all URLs on a domain without full content extraction
  • Batch Scraping -- submit up to 1000 URLs in a single request
  • Sessions -- persistent browser sessions with stateful actions (click, fill, scroll, screenshot)
  • Browser Profiles -- save cookies/localStorage across scrapes for auth persistence
  • Change Tracking -- monitor pages for changes with diff history and webhook notifications
  • Pipelines -- DAG-based multi-step workflows with sinks (R2, Postgres, Pinecone, Qdrant)
  • Pipeline Marketplace -- community-shared pipeline templates with one-click import
  • Scheduling -- cron-based recurring scrapes with dead letter queues
  • Quality Scoring -- rubric-based scoring (completeness, noise ratio, structure fidelity, paywall detection)
  • Quality Gates -- auto-retry with proxy escalation when quality is below threshold
  • Document Parsing -- PDF (with OCR fallback), Excel, and Word documents to markdown
  • Proxy System -- basic, enhanced (residential), or auto mode with health-scored rotation
  • Persistent Storage -- every result stored in Postgres with Cloudflare R2 for large assets
  • Credit System -- pre-pay model with per-endpoint costs and transaction logging
  • Zero Data Retention -- optional mode that returns results inline without persisting
  • Real-time Events -- SSE and WebSocket streaming for job progress
  • Webhook Triggers -- fire pipelines from external events with secret-based auth
  • SDKs -- Node.js, Python, Go, and Rust clients
  • CLI -- syphon command-line tool for all operations
  • MCP Server -- Model Context Protocol server for AI tool integration
  • Integrations -- LangChain (Python + Node), LlamaIndex, n8n, CrewAI

Tech Stack

Component         Technology
Monorepo          Turborepo + npm workspaces
API Gateway       Fastify 5, TypeScript, ESM
Scraper Worker    Node.js + Playwright + BullMQ
Pipeline Worker   Node.js + BullMQ
Quality Scorer    Python + FastAPI
Database          PostgreSQL 16 (Prisma ORM)
Queue             Redis 7 (BullMQ)
Cache             Redis 7 (separate instance)
Object Storage    Cloudflare R2 (or local filesystem)
Dashboard         Next.js 15 App Router
MCP Server        @modelcontextprotocol/sdk
CI                GitHub Actions
Containerization  Docker + Docker Compose

Project Structure

syphon/
  apps/
    api-gateway/          Fastify API server (port 3000)
    dashboard/            Next.js 15 dashboard (port 3001)
    quality-scorer/       Python FastAPI scoring service (port 8000)
    cli/                  syphon CLI tool
    mcp-server/           MCP server for AI integrations
    docs/                 Documentation site
  workers/
    scraper/              Playwright scraper + BullMQ processors
    pipeline/             Pipeline execution worker
  packages/
    database/             Prisma schema + singleton client
    shared/               Types, errors, constants
    pipeline-engine/      DAG execution engine
    storage/              R2 + local storage abstraction
  sdks/
    node/                 Official Node.js SDK
    python/               Official Python SDK
    go/                   Official Go SDK
    rust/                 Official Rust SDK
  integrations/
    langchain-python/     LangChain Python loader
    langchain-node/       LangChain Node.js loader
    llamaindex/           LlamaIndex connector
    n8n/                  n8n community node
    crewai/               CrewAI tools

Prerequisites

  • Node.js >= 20
  • npm >= 10.8
  • Docker and Docker Compose (for local infrastructure)

Getting Started

  1. Clone and install dependencies:

    git clone <repo-url>
    cd syphon
    npm install
  2. Start local infrastructure:

    docker compose up -d postgres redis-queue redis-cache
  3. Configure environment:

    cp .env.example .env
  4. Push the database schema:

    npm run db:push
  5. Start all services in dev mode:

    npm run dev

    The API gateway will be available at http://localhost:3000.

API Endpoints

All endpoints are versioned under /v1/ and require an x-api-key header unless noted otherwise.
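A minimal request-builder sketch for the x-api-key scheme described above. Assumptions: the gateway runs locally on port 3000 (see Getting Started) and accepts JSON bodies; `build_request` is an illustrative helper, not part of any SDK. Swap in any HTTP client to actually send the request.

```python
import json

API_URL = "http://localhost:3000"  # assumption: local dev gateway

def build_request(path: str, api_key: str, body: dict) -> dict:
    """Assemble method, URL, headers, and JSON body for a /v1/ endpoint."""
    return {
        "method": "POST",
        "url": f"{API_URL}{path}",
        "headers": {
            "x-api-key": api_key,                  # required on /v1/ routes
            "content-type": "application/json",
        },
        "body": json.dumps(body),
    }

req = build_request("/v1/scrape", "YOUR_API_KEY", {"url": "https://example.com"})
```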

Core

Method Endpoint Credits Description
POST /v1/scrape 1 Scrape a single URL
POST /v1/crawl 1/page Start async multi-page crawl
GET /v1/crawl/:id -- Check crawl status
DELETE /v1/crawl/:id -- Cancel a running crawl
POST /v1/map 1 Discover URLs on a domain
POST /v1/batch 1/url Batch scrape up to 1000 URLs
POST /v1/search 1 Search the web (multiple sources)
POST /v1/extract 2 LLM-powered structured extraction
POST /v1/agent 5 Autonomous web agent
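Crawls are asynchronous: POST /v1/crawl returns a job, and GET /v1/crawl/:id reports its status. A polling sketch under stated assumptions -- `get_status` is any callable that fetches and parses the status JSON, and the terminal `status` values checked here are illustrative, not documented field names:

```python
import time

def wait_for_crawl(get_status, crawl_id, poll_seconds=1.0, timeout=60.0):
    """Poll GET /v1/crawl/:id until the job reaches a terminal state.

    Terminal 'status' values ('completed', 'failed', 'cancelled') are
    assumptions; check the actual response shape before relying on them."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(crawl_id)
        if status.get("status") in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawl {crawl_id} did not finish within {timeout}s")
```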

Results and Diff

Method Endpoint Credits Description
GET /v1/results/:id -- Fetch a stored result
GET /v1/results -- List results (paginated)
GET /v1/results/:id/diff -- Diff two result versions

Sessions

Method Endpoint Credits Description
POST /v1/sessions -- Create a browser session
POST /v1/sessions/:id/action 2 Execute a session action
DELETE /v1/sessions/:id -- Close a session

Browser Profiles

Method Endpoint Credits Description
POST /v1/profiles -- Create a profile
GET /v1/profiles -- List profiles
DELETE /v1/profiles/:id -- Delete a profile

Change Tracking

Method Endpoint Credits Description
POST /v1/change-tracking 1/check Create a change tracker
GET /v1/change-tracking -- List trackers
GET /v1/change-tracking/:id/history -- Get change history
DELETE /v1/change-tracking/:id -- Delete a tracker

Pipelines and Scheduling

Method Endpoint Credits Description
POST /v1/pipelines -- Create a pipeline
GET /v1/pipelines -- List pipelines
POST /v1/pipelines/:id/run varies Trigger a pipeline run
POST /v1/schedules -- Create a cron schedule
GET /v1/schedules -- List schedules

Marketplace

Method Endpoint Credits Description
GET /v1/marketplace -- Browse pipeline templates
POST /v1/marketplace -- Publish a template
POST /v1/marketplace/:id/import -- Import a template

Webhook Triggers

Method Endpoint Credits Description
POST /v1/webhook-triggers -- Create a webhook trigger
POST /v1/webhooks/trigger/:id -- Fire a pipeline via webhook
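The README says triggers use secret-based auth (see WEBHOOK_SECRET below), but does not document the signing scheme. One common scheme, shown here purely as an assumption, is a hex-encoded HMAC-SHA256 over the raw request body, compared in constant time:

```python
import hashlib
import hmac

def verify_webhook(secret: str, raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time check of a hex-encoded HMAC-SHA256 over the raw body.

    Assumption: this mirrors the server's signing scheme; confirm the actual
    header name and encoding against the gateway before relying on it."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```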

Other

Method Endpoint Credits Description
POST /v1/keys -- Create an API key
GET /v1/keys -- List API keys
GET /v1/usage -- Get credit usage
GET /v1/score 0 Score content quality
GET /v1/events -- SSE event stream
GET /v1/events/ws -- WebSocket event stream
GET /health -- Health check
GET /ready -- Readiness check
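The GET /v1/events endpoint streams job progress as Server-Sent Events. A tiny parser sketch for consuming such a stream -- it handles only the standard `event:` and `data:` fields, which is enough for simple progress streams; the payload's field names are server-defined:

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines,
    e.g. the body of GET /v1/events read line by line."""
    event, data = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # blank line terminates one event
            if data:
                yield (event or "message", "\n".join(data))
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```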

Scrape Options

{
  "url": "https://example.com",
  "formats": ["markdown", "html", "links", "images", "summary", "json", "rawHtml"],
  "actions": [{ "type": "click", "selector": ".load-more" }],
  "proxy": "auto",
  "location": "US",
  "profile": "my-profile",
  "zero_data_retention": false,
  "timeout": 30000
}

Error Format

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "url is required and must be a string",
    "details": {}
  }
}
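Because every error uses this envelope, client-side handling can be generic. A small sketch (`parse_error` is an illustrative helper, not an SDK function):

```python
import json

def parse_error(body: str):
    """Return (code, message) from the documented error envelope,
    or None when the response body is not an error."""
    payload = json.loads(body)
    err = payload.get("error")
    if err is None:
        return None
    return err["code"], err["message"]
```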

Available Scripts

npm run dev          # Start all services in dev mode
npm run build        # Build all packages
npm run typecheck    # Type-check all packages
npm run test         # Run all tests
npm run db:generate  # Generate Prisma client
npm run db:push      # Push schema to database
npm run db:migrate   # Run Prisma migrations

Docker

Run the full stack with Docker Compose:

docker compose up
Service Port
PostgreSQL 5432
Redis (queue) 6379
Redis (cache) 6380
API Gateway 3000
Scraper Worker --
Pipeline Worker --
Quality Scorer 8000
Dashboard 3001

Environment Variables

Variable Description
DATABASE_URL PostgreSQL connection string
REDIS_URL Redis connection (fallback for both queue and cache)
REDIS_QUEUE_URL Redis connection for BullMQ queues
REDIS_CACHE_URL Redis connection for caching/rate limiting
PORT API server port (default: 3000)
HOST API server host (default: 0.0.0.0)
BASE_URL Public base URL
LOG_LEVEL Logging level (default: info)
CORS_ORIGIN Allowed CORS origin
NODE_ENV Environment (development/production)
ANTHROPIC_API_KEY Anthropic API key (for extraction and agent)
OPENAI_API_KEY OpenAI API key (optional, for multi-model agent)
GOOGLE_AI_API_KEY Google AI API key (optional, for Gemini agent)
OLLAMA_BASE_URL Ollama base URL (optional, default: http://localhost:11434)
SERP_API_KEY SerpAPI key (for search endpoint)
QUALITY_SCORER_URL Quality scorer service URL
R2_ACCOUNT_ID Cloudflare R2 account ID
R2_ACCESS_KEY_ID R2 access key
R2_SECRET_ACCESS_KEY R2 secret key
R2_BUCKET_NAME R2 bucket (default: syphon-storage)
R2_PUBLIC_URL R2 public URL
STORAGE_BACKEND Storage backend: r2 or local
STORAGE_LOCAL_ROOT Local storage root directory
CREDENTIAL_ENCRYPTION_KEY Key for encrypting stored credentials
WEBHOOK_SECRET Secret for webhook signature verification
SCRAPE_CONCURRENCY Concurrent scrape jobs (default: 3)
CRAWL_CONCURRENCY Concurrent crawl jobs (default: 2)
MAP_CONCURRENCY Concurrent map jobs (default: 5)
AGENT_CONCURRENCY Concurrent agent jobs (default: 2)
PIPELINE_CONCURRENCY Concurrent pipeline jobs (default: 3)
NEXT_PUBLIC_API_URL Dashboard API URL
SYPHON_API_KEY API key for CLI tool
SYPHON_API_URL API URL for CLI tool

Architecture

Client Request
     |
     v
API Gateway (Fastify)
  -> Auth middleware (x-api-key, SHA-256 hash lookup)
  -> Rate limiting (Redis-backed, per-user)
  -> Credit check & deduction (pre-pay)
  -> Dispatch job to BullMQ queue
     |
     v
Redis (BullMQ)
     |
     v
Workers (Scraper / Pipeline)
  -> Render page with Playwright (+ proxy / geolocation)
  -> Execute browser actions
  -> Extract content (Readability + Turndown -> markdown)
  -> Parse documents (PDF, Excel, Word)
  -> Score quality (Quality Scorer service)
  -> Store results in Postgres + R2
  -> Deliver webhooks
  -> Emit SSE/WebSocket events
     |
     v
Response returned to client

License

MIT
