AnakinScraper OSS


The open-source web scraping API for AI. Turn any website into LLM-ready markdown or structured data.

Self-host with a single command. No cloud dependencies. Powers RAG pipelines, AI agents, and data extraction at scale.

git clone https://github.com/Anakin-Inc/anakinscraper-oss.git && cd anakinscraper-oss && make up

# Scrape any website — one curl, full result:
curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .markdown

Why AnakinScraper?

|                        | AnakinScraper          | Firecrawl       | Crawlee     | Scrapy      |
| ---------------------- | ---------------------- | --------------- | ----------- | ----------- |
| Anti-detect browser    | Camoufox (Firefox)     | Headless Chrome | Playwright  | No          |
| Smart proxy selection  | Thompson Sampling (ML) | Round-robin     | Manual      | Manual      |
| Zero-config start      | go run — no DB needed  | Docker required | npm install | pip install |
| Single binary          | Go — one 30MB binary   | Node.js         | Node.js     | Python      |
| Handler chain fallback | HTTP → Browser → API   | Single mode     | Single mode | Single mode |
| Structured JSON (AI)   | Gemini extraction      | LLM extraction  | No          | No          |

Features

  • Handler chain with fallback — HTTP fetch → anti-detect browser → external API. Handlers are tried in order; if one fails, the next picks up automatically. Most pages resolve on the free local HTTP handler — paid APIs are only called for the ~5% that actually need them. Docs →
  • Custom API handlers — plug in any third-party scraping service as a chain fallback. Only invoked when local handlers fail — saves 90%+ on API costs vs routing everything through a paid service. Built-in anakin.io handler included. How to add your own →
  • Domain configs — per-domain scraping strategies: choose which handlers to use, set timeouts, retries, custom headers, block domains, and validate content with pattern matching (see the sketch after this list). Docs →
  • Failure detection — define failure patterns and required patterns per domain. If the scraped content matches a failure pattern (e.g. CAPTCHA page) or misses a required pattern, the job auto-retries with the next handler. Docs →
  • Anti-detect browser — Camoufox (anti-detect Firefox) with realistic fingerprints, not headless Chrome. Docs →
  • Proxy auto-select — Thompson Sampling picks the best proxy per domain, learning from success/failure in real time. Docs →
  • Structured JSON extraction — use Gemini AI to extract structured data from any page (bring your own API key)
  • Sync + async + batch API — POST /v1/scrape for instant results, /v1/url-scraper for async with polling, batch up to 10 URLs
  • LLM-ready markdown — automatic boilerplate removal, clean content extraction. Feed directly into RAG pipelines, Claude, GPT, or any LLM without preprocessing
  • Web dashboard — built-in React UI for scraping, job tracking, domain config management, and proxy monitoring
  • Zero-config mode — run with just Go, no database needed. Or use Docker for the full stack
  • Self-contained — no Redis, no AWS, no message queues. Single Go binary. Optional PostgreSQL for persistence
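
To make domain configs and failure detection concrete, here is a sketch of creating a config through the REST API. Every JSON field name in the request body (domain, handlers, timeoutSeconds, failurePatterns, requiredPatterns) is an illustrative assumption, not the confirmed schema; see docs/API.md for the real one.

// create_config.go: illustrative sketch of creating a per-domain config
// with failure detection via POST /v1/domain-configs. All JSON field names
// are assumptions; consult docs/API.md for the actual schema.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// DomainConfig is a hypothetical request body for POST /v1/domain-configs.
type DomainConfig struct {
    Domain           string   `json:"domain"`
    Handlers         []string `json:"handlers"`
    TimeoutSeconds   int      `json:"timeoutSeconds"`
    FailurePatterns  []string `json:"failurePatterns"`
    RequiredPatterns []string `json:"requiredPatterns"`
}

func main() {
    cfg := DomainConfig{
        Domain:           "example.com",
        Handlers:         []string{"browser", "api"},       // skip the plain HTTP handler for this domain
        TimeoutSeconds:   60,
        FailurePatterns:  []string{"Verify you are human"}, // matching content triggers retry with the next handler
        RequiredPatterns: []string{"<article"},             // content that must be present for success
    }
    body, err := json.Marshal(cfg)
    if err != nil {
        panic(err)
    }
    resp, err := http.Post("http://localhost:8080/v1/domain-configs",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("created:", resp.Status)
}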

Quick Start (no Docker, no database)

Just Go 1.25+. Two commands:

cd server && go run cmd/server/main.go

# In another terminal:
curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .markdown

Jobs are stored in memory (lost on restart). For persistence, set DATABASE_URL. For JavaScript-heavy sites, add the browser service via Docker.

Self-Host (Docker — full stack)

Prerequisites

Docker (with Docker Compose) and make.

Start

git clone https://github.com/Anakin-Inc/anakinscraper-oss.git
cd anakinscraper-oss
make up

That's it. Three containers start:

| Service         | Port | Description                               |
| --------------- | ---- | ----------------------------------------- |
| Server          | 8080 | REST API + worker pool                    |
| Browser Service | 9222 | Camoufox anti-detect browser (WebSocket)  |
| PostgreSQL      | 5432 | Job storage                               |

Web Dashboard

A built-in web UI is included for visual scraping, job tracking, and configuration:

cd webapp && npm install && npm run dev

Open http://localhost:3000 — the dashboard proxies API calls to the server on port 8080.

Pages: Dashboard (health + quick scrape) | Scrape (sync/async/batch with live results) | Jobs (tracked history with status filters) | Domain Configs (CRUD with handler chain management) | Proxy Scores (Thompson Sampling performance)

Scrape a URL

Synchronous (recommended for getting started):

curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .

One request, full result back. No polling. Timeout: 30 seconds.

Asynchronous (for long-running scrapes):

# Submit
curl -s -X POST http://localhost:8080/v1/url-scraper \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Poll for result
curl -s http://localhost:8080/v1/url-scraper/JOB_UUID | jq .
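
In a real client you loop on the poll call until the job finishes. A minimal Go sketch follows: the id and status fields match the response shown under API Reference below, while the "failed" terminal status is an assumption.

// poll.go: submit an async job, then poll until it reaches a terminal
// status. "completed" matches the response example in the API Reference;
// "failed" is an assumed failure status.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Submit the job.
    resp, err := http.Post("http://localhost:8080/v1/url-scraper",
        "application/json", bytes.NewReader([]byte(`{"url": "https://example.com"}`)))
    if err != nil {
        panic(err)
    }
    var job struct {
        ID string `json:"id"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
        panic(err)
    }
    resp.Body.Close()

    // Poll every two seconds until done.
    for {
        r, err := http.Get("http://localhost:8080/v1/url-scraper/" + job.ID)
        if err != nil {
            panic(err)
        }
        var result struct {
            Status   string `json:"status"`
            Markdown string `json:"markdown"`
        }
        if err := json.NewDecoder(r.Body).Decode(&result); err != nil {
            panic(err)
        }
        r.Body.Close()
        if result.Status == "completed" || result.Status == "failed" {
            fmt.Printf("%s: %d bytes of markdown\n", result.Status, len(result.Markdown))
            return
        }
        time.Sleep(2 * time.Second)
    }
}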

With AI-powered JSON extraction (requires GEMINI_API_KEY):

curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generateJson": true}' | jq .generatedJson

No API keys required for the scraper itself. Just JSON in, results out.

Architecture

                    ┌─────────────────┐
                    │   Your App      │
                    │   (cURL / CLI)  │
                    └────────┬────────┘
                             │ HTTP
                             ▼
                    ┌─────────────────┐         ┌──────────┐
                    │     Server      │────────▶│  Gemini  │
                    │   (Go/Fiber)    │ optional│  (JSON)  │
                    │   Port 8080     │         └──────────┘
                    └──┬──────┬───┬──┘
                       │      │   │
            ┌──────────┘      │   └────────────┐
            ▼                 ▼                 ▼
      ┌──────────┐   ┌──────────────┐   ┌──────────────┐
      │ Storage  │   │   Browser    │   │ API Handler  │
      │ Postgres │   │   Service    │   │ (anakin.io   │
      │ or memory│   │  (Camoufox)  │   │  or custom)  │
      │(optional)│   │  (optional)  │   │  (optional)  │
      └──────────┘   └──────────────┘   └──────────────┘

The server is a single Go binary that runs with zero dependencies. Optionally add PostgreSQL for persistence, the browser service for JavaScript-heavy sites, and API handlers for hard-to-scrape sites. Workers execute the handler chain (HTTP → browser → API fallback), convert HTML to markdown, and optionally extract structured JSON via Gemini.

API Reference

See docs/API.md for the complete API reference. Quick overview:

| Method | Endpoint                  | Description                                                         |
| ------ | ------------------------- | ------------------------------------------------------------------- |
| POST   | /v1/scrape                | Sync — scrape a URL and get the result back directly (30s timeout)  |
| POST   | /v1/url-scraper           | Async — submit a scrape job, returns job ID                         |
| GET    | /v1/url-scraper/:id       | Poll for async job result                                           |
| POST   | /v1/url-scraper/batch     | Batch scrape up to 10 URLs (see the sketch below this table)        |
| GET    | /v1/url-scraper/batch/:id | Poll for batch result                                               |
| POST   | /v1/domain-configs        | Create a per-domain scraping config                                 |
| GET    | /v1/domain-configs        | List all domain configs                                             |
| GET    | /v1/proxy/scores          | View proxy Thompson Sampling scores                                 |
| GET    | /v1/telemetry/status      | View telemetry state and next payload (details)                     |
| GET    | /health                   | Health check                                                        |
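
The batch request body is not spelled out above; the sketch below assumes a urls array, which is a guess rather than the documented field name (check docs/API.md).

// batch.go: illustrative batch submission. The "urls" field name is an
// assumption; see docs/API.md for the real request format.
package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

func main() {
    // Submit up to 10 URLs in one job.
    body := strings.NewReader(`{"urls": ["https://example.com", "https://httpbin.org/html"]}`)
    resp, err := http.Post("http://localhost:8080/v1/url-scraper/batch", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out)) // response carries the batch job ID; poll GET /v1/url-scraper/batch/:id
}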

Request Fields

| Field        | Type   | Default  | Description                                                     |
| ------------ | ------ | -------- | --------------------------------------------------------------- |
| url          | string | required | URL to scrape                                                   |
| useBrowser   | bool   | false    | Skip HTTP handler, go straight to browser                       |
| generateJson | bool   | false    | Extract structured JSON via Gemini AI (requires GEMINI_API_KEY) |

Response

{
  "id": "550e8400-...",
  "status": "completed",
  "url": "https://example.com",
  "html": "<html>...</html>",
  "cleanedHtml": "<main>...</main>",
  "markdown": "# Page Title\n\nContent...",
  "generatedJson": {
    "status": "success",
    "data": {"title": "Page Title", "content": "..."}
  },
  "durationMs": 1234
}

generatedJson is only present when generateJson: true and GEMINI_API_KEY is configured.
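
For typed clients, the response maps directly onto a struct. A Go sketch mirroring the JSON above, with generatedJson kept raw since its shape depends on the page:

// ScrapeResponse mirrors the response JSON shown above (uses encoding/json).
type ScrapeResponse struct {
    ID            string          `json:"id"`
    Status        string          `json:"status"`
    URL           string          `json:"url"`
    HTML          string          `json:"html"`
    CleanedHTML   string          `json:"cleanedHtml"`
    Markdown      string          `json:"markdown"`
    GeneratedJSON json.RawMessage `json:"generatedJson,omitempty"` // only set when generateJson: true
    DurationMs    int64           `json:"durationMs"`
}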

Handler Chain

Each scrape job goes through the handler chain. On failure, it falls back to the next handler:

HTTP Handler (fast, ~200ms) ──fail──▶ Browser Handler (Camoufox) ──fail──▶ API Handler (optional)

HTTP Handler — direct HTTP GET with a browser user-agent. Handles static HTML, server-rendered pages. No browser overhead.

Browser Handler — connects to Camoufox (anti-detect Firefox) over WebSocket via Playwright protocol. Full JavaScript rendering, network-idle detection, realistic browser fingerprints. Handles SPAs, lazy-loaded content, and sites with anti-bot protection.

API Handler (optional) — delegates to an external scraping API when local handlers fail. Set ANAKIN_API_KEY to enable the built-in anakin.io fallback for hard-to-scrape sites (Cloudflare, DataDome, etc.). See Adding Custom API Handlers below.
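
Conceptually the chain is an ordered loop: skip handlers that are unhealthy or opt out, return the first success, fall through on error. The sketch below expresses that in terms of the ScrapingHandler interface shown under Extending the Chain; it is not the project's actual implementation.

// runChain: conceptual sketch of the fallback loop (imports: context, fmt).
// Not the project's actual code.
func runChain(ctx context.Context, chain []ScrapingHandler, req *HandlerRequest) (*ScrapeResult, error) {
    var lastErr error
    for _, h := range chain {
        if !h.IsHealthy() || !h.CanHandle(ctx, req) {
            continue // handler opts out, e.g. browser service not running
        }
        result, err := h.Scrape(ctx, req)
        if err == nil {
            return result, nil // first success wins; later handlers never run
        }
        lastErr = err // record the failure and fall through to the next handler
    }
    return nil, fmt.Errorf("all handlers failed, last error: %w", lastErr)
}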

Adding Custom API Handlers

The API handler pattern makes it easy to integrate any third-party scraping service as a chain fallback. The built-in anakin.io handler (server/internal/handler/api.go) is a working example — copy and modify it for your provider:

  1. Copy api.go to my_provider.go
  2. Add a constructor like NewAnakinHandler — set your provider's URL, auth header name, and response format
  3. Register in main.go:
    if cfg.MyProviderAPIKey != "" {
        handlers = append(handlers, handler.NewAPIHandler(handler.APIHandlerConfig{
            Name:       "my-provider",
            APIURL:     "https://api.my-provider.com/scrape",
            APIKey:     cfg.MyProviderAPIKey,
            AuthHeader: "Authorization",  // or "X-API-Key", "Bearer", etc.
        }))
    }
  4. Add env var to config.go: MyProviderAPIKey: os.Getenv("MY_PROVIDER_API_KEY")

API keys always come from environment variables — never hardcoded. The handler only activates when its key is set.

Extending the Chain

Implement the ScrapingHandler interface:

type ScrapingHandler interface {
    Name() string
    CanHandle(ctx context.Context, req *HandlerRequest) bool
    Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error)
    IsHealthy() bool
}
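
As a sketch, a minimal handler that fetches pages with a plain GET could look like the following. The req.URL and ScrapeResult.HTML field names are assumptions for illustration; examples/custom-handler/ has the real shape.

// EchoHandler: illustrative ScrapingHandler (imports: context, io, net/http).
// req.URL and ScrapeResult.HTML are assumed field names.
type EchoHandler struct{}

func (h *EchoHandler) Name() string { return "echo" }

func (h *EchoHandler) IsHealthy() bool { return true }

func (h *EchoHandler) CanHandle(ctx context.Context, req *HandlerRequest) bool {
    return true // accept every job; real handlers can filter by scheme or domain
}

func (h *EchoHandler) Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error) {
    httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, req.URL, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return &ScrapeResult{HTML: string(body)}, nil
}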

See examples/custom-handler/.

Configuration

All configuration via environment variables:

| Variable             | Default                                | Description                                                                    |
| -------------------- | -------------------------------------- | ------------------------------------------------------------------------------ |
| PORT                 | 8080                                   | Server port                                                                     |
| DATABASE_URL         |                                        | PostgreSQL connection string (optional — uses in-memory storage when not set)  |
| BROWSER_WS_URL       | ws://localhost:9222/camoufox           | Browser service WebSocket URL                                                   |
| BROWSER_TIMEOUT      | 60                                     | Page navigation timeout (seconds)                                               |
| BROWSER_LOAD_WAIT    | 2                                      | Extra wait after page load (seconds)                                            |
| WORKER_POOL_SIZE     | 5                                      | Concurrent scrape workers                                                       |
| JOB_BUFFER_SIZE      | 100                                    | Job queue buffer size                                                           |
| JOB_TIMEOUT          | 120                                    | Max job duration (seconds)                                                      |
| PROXY_URL            |                                        | Default HTTP proxy for the HTTP handler                                         |
| PROXY_URLS           |                                        | Comma-separated proxy pool for auto-selection via Thompson Sampling (see the sketch below this table) |
| GEMINI_API_KEY       |                                        | Google Gemini API key for structured JSON extraction (get one free)             |
| ANAKIN_API_KEY       |                                        | anakin.io API key — enables hosted API as chain fallback for hard-to-scrape sites |
| LOG_LEVEL            | INFO                                   | Log level (DEBUG, INFO, WARN, ERROR)                                            |
| TELEMETRY            | on                                     | Anonymous usage telemetry (off to disable — see TELEMETRY.md)                   |
| TELEMETRY_URL        | https://telemetry.anakin.io/v1/collect | Custom telemetry endpoint                                                       |
| DISABLE_HOSTED_HINTS |                                        | Set to true to suppress hosted service tips in error messages                   |
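
PROXY_URLS feeds the Thompson Sampling selection: the server tracks successes and failures per proxy, samples from each proxy's Beta(successes+1, failures+1) posterior, and uses the proxy with the highest draw, so proven proxies win most requests while uncertain ones still get explored. A standalone sketch using gonum for Beta sampling; illustrative only, not the project's actual code:

// proxy_pick.go: conceptual Thompson Sampling sketch, not the project's
// actual code. Requires gonum: go get gonum.org/v1/gonum/stat/distuv
package main

import (
    "fmt"

    "gonum.org/v1/gonum/stat/distuv"
)

type proxyStats struct {
    url                 string
    successes, failures float64 // observed outcomes for this proxy
}

// pick samples each proxy's Beta(successes+1, failures+1) posterior and
// returns the proxy with the highest draw.
func pick(pool []proxyStats) string {
    best, bestDraw := "", -1.0
    for _, p := range pool {
        draw := distuv.Beta{Alpha: p.successes + 1, Beta: p.failures + 1}.Rand()
        if draw > bestDraw {
            best, bestDraw = p.url, draw
        }
    }
    return best
}

func main() {
    pool := []proxyStats{
        {url: "http://proxy-a:8000", successes: 40, failures: 2}, // strong record, usually wins
        {url: "http://proxy-b:8000", successes: 3, failures: 3},  // uncertain, still explored sometimes
    }
    fmt.Println("selected:", pick(pool))
}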

Project Structure

anakinscraper-oss/
├── server/                     # Go server (API + workers)
│   ├── cmd/server/             # Entry point
│   └── internal/
│       ├── config/             # Environment config
│       ├── models/             # Data types
│       ├── worker/             # Channel-based worker pool
│       ├── handler/            # Scraping handlers (HTTP, Browser)
│       ├── converter/          # HTML → Markdown
│       ├── gemini/             # Gemini AI JSON extraction
│       ├── domain/             # Domain configs + failure detection
│       ├── store/              # Job storage (PostgreSQL or in-memory)
│       ├── proxy/              # Proxy pool + Thompson Sampling
│       ├── telemetry/          # Anonymous usage telemetry
│       ├── processor/          # Job processing
│       └── http/
│           ├── handlers/       # API request handlers
│           └── router/         # Route registration
├── browser-service/            # Camoufox anti-detect browser server
├── webapp/                     # React web dashboard (Vite + Tailwind)
├── openclaw-skill/             # OpenClaw skill wrapper
├── examples/                   # Usage examples
├── docker-compose.yml          # Full stack (3 containers)
├── scripts/init-db.sql         # Database schema
└── .env.example                # Config template

Development

Running Locally (without Docker)

Minimal (Go only):

cd server && go run cmd/server/main.go

No database, no browser service. HTTP handler scrapes static sites. Jobs stored in memory.

Full local stack (Go + Python + PostgreSQL):

# Terminal 1: PostgreSQL
docker compose up postgres -d

# Terminal 2: Browser Service (for JS-heavy sites)
cd browser-service && pip install -r requirements.txt && python server.py

# Terminal 3: Server with persistence
cd server && DATABASE_URL="postgres://postgres:postgres@localhost:5432/anakinscraper?sslmode=disable" go run cmd/server/main.go

Running Tests

cd server && go test ./...

Building

cd server && go build -o server ./cmd/server

CLI

Use the Anakin CLI to scrape from your terminal. It works against both self-hosted and the hosted API:

# Install
pip install anakin-cli

# Scrape via your local instance (no API key needed)
anakin scrape "https://example.com" --api-url http://localhost:8080

# Or set it as your default
export ANAKIN_API_URL="http://localhost:8080"
anakin scrape "https://example.com"

# Extract structured JSON
anakin scrape "https://example.com" --format json --api-url http://localhost:8080

# Batch scrape
anakin scrape-batch "https://example.com" "https://httpbin.org/html" --api-url http://localhost:8080

The same CLI also supports AI web search and deep research via the hosted API — get a free API key to unlock those features.

See the anakin-cli repo for full usage.

Integrations

OpenClaw Skill

Use AnakinScraper as an OpenClaw skill:

cp -r openclaw-skill ~/.openclaw/workspace/skills/anakinscraper

See openclaw-skill/SKILL.md.

Self-Hosted vs Hosted

This repo gives you the full scraping engine. anakin.io adds the infrastructure you'd otherwise build yourself:

| Feature                    | Self-Hosted (this repo)         | Hosted (anakin.io)           |
| -------------------------- | ------------------------------- | ---------------------------- |
| Sync + async scraping      | Yes                             | Yes                          |
| Batch scraping             | Yes                             | Yes                          |
| Anti-detect browser        | Yes                             | Yes                          |
| Structured JSON extraction | Yes (bring your own Gemini key) | Yes (built-in)               |
| Domain configs             | Yes                             | Yes                          |
| Proxy auto-selection       | Yes (bring your own proxies)    | Yes (195 countries included) |
| Geo-targeted proxies       | No                              | 195 countries                |
| AI web search              | No                              | Yes                          |
| Deep agentic research      | No                              | Yes                          |
| Zero infrastructure        | No                              | Yes                          |

Already self-hosting? Switch to hosted with one line — same API, same CLI:

anakin login --api-key "ak-xxx"   # get your key at anakin.io/dashboard
anakin scrape "https://example.com"  # now routes through hosted

Try it free →

Community

Join our Discord for questions, discussion, and support.

Contributing

See CONTRIBUTING.md.

License

AGPL-3.0 — free to self-host. If you modify the code and offer it as a hosted service, you must open-source your changes.


Built by Anakin-Inc. If you find this useful, give us a star!
