AnakinScraper OSS


The open-source web scraping API for AI. Turn any website into LLM-ready markdown or structured data.

Self-host with a single command. No cloud dependencies. Powers RAG pipelines, AI agents, and data extraction at scale.

git clone https://github.com/Anakin-Inc/anakinscraper-oss.git && cd anakinscraper-oss && make up

# Scrape any website — one curl, full result:
curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .markdown

Why AnakinScraper?

|                        | AnakinScraper          | Firecrawl       | Crawlee     | Scrapy      |
| ---------------------- | ---------------------- | --------------- | ----------- | ----------- |
| Anti-detect browser    | Camoufox (Firefox)     | Headless Chrome | Playwright  | No          |
| Smart proxy selection  | Thompson Sampling (ML) | Round-robin     | Manual      | Manual      |
| Zero-config start      | go run — no DB needed  | Docker required | npm install | pip install |
| Single binary          | Go — one 30MB binary   | Node.js         | Node.js     | Python      |
| Handler chain fallback | HTTP → Browser → API   | Single mode     | Single mode | Single mode |
| Structured JSON (AI)   | Gemini extraction      | LLM extraction  | No          | No          |

Features

  • Handler chain with fallback — HTTP fetch → anti-detect browser → external API. Handlers are tried in order; if one fails, the next picks up automatically. Most pages resolve on the free local HTTP handler — paid APIs are only called for the ~5% that actually need them. Docs →
  • Custom API handlers — plug in any third-party scraping service as a chain fallback. Only invoked when local handlers fail — saves 90%+ on API costs vs routing everything through a paid service. Built-in anakin.io handler included. How to add your own →
  • Domain configs — per-domain scraping strategies: choose which handlers to use, set timeouts, retries, custom headers, block domains, and validate content with pattern matching (see the sketch after this list). Docs →
  • Failure detection — define failure patterns and required patterns per domain. If the scraped content matches a failure pattern (e.g. CAPTCHA page) or misses a required pattern, the job auto-retries with the next handler. Docs →
  • Anti-detect browser — Camoufox (anti-detect Firefox) with realistic fingerprints, not headless Chrome. Docs →
  • Proxy auto-select — Thompson Sampling picks the best proxy per domain, learning from success/failure in real time. Docs →
  • Structured JSON extraction — use Gemini AI to extract structured data from any page (bring your own API key)
  • Sync + async + batch API — POST /v1/scrape for instant results, /v1/url-scraper for async with polling, batch up to 10 URLs
  • LLM-ready markdown — automatic boilerplate removal, clean content extraction. Feed directly into RAG pipelines, Claude, GPT, or any LLM without preprocessing
  • Web dashboard — built-in React UI for scraping, job tracking, domain config management, and proxy monitoring
  • Zero-config mode — run with just Go, no database needed. Or use Docker for the full stack
  • Self-contained — no Redis, no AWS, no message queues. Single Go binary. Optional PostgreSQL for persistence
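
To make domain configs and failure detection concrete, here is a sketch of creating a config through the REST API. Every JSON field name in the request body (domain, handlers, timeoutSeconds, failurePatterns, requiredPatterns) is an illustrative assumption, not the confirmed schema; see docs/API.md for the real one.

// create_config.go: illustrative sketch of creating a per-domain config
// with failure detection via POST /v1/domain-configs. All JSON field names
// are assumptions; consult docs/API.md for the actual schema.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// DomainConfig is a hypothetical request body for POST /v1/domain-configs.
type DomainConfig struct {
    Domain           string   `json:"domain"`
    Handlers         []string `json:"handlers"`
    TimeoutSeconds   int      `json:"timeoutSeconds"`
    FailurePatterns  []string `json:"failurePatterns"`
    RequiredPatterns []string `json:"requiredPatterns"`
}

func main() {
    cfg := DomainConfig{
        Domain:           "example.com",
        Handlers:         []string{"browser", "api"},       // skip the plain HTTP handler for this domain
        TimeoutSeconds:   60,
        FailurePatterns:  []string{"Verify you are human"}, // matching content triggers retry with the next handler
        RequiredPatterns: []string{"<article"},             // content that must be present for success
    }
    body, err := json.Marshal(cfg)
    if err != nil {
        panic(err)
    }
    resp, err := http.Post("http://localhost:8080/v1/domain-configs",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("created:", resp.Status)
}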

Quick Start (no Docker, no database)

Just Go 1.25+. Two commands:

cd server && go run cmd/server/main.go

# In another terminal:
curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .markdown

Jobs are stored in memory (lost on restart). For persistence, set DATABASE_URL. For JavaScript-heavy sites, add the browser service via Docker.

Self-Host (Docker — full stack)

Prerequisites

Docker (with Docker Compose) and make.

Start

git clone https://github.com/Anakin-Inc/anakinscraper-oss.git
cd anakinscraper-oss
make up

That's it. Three containers start:

| Service         | Port | Description                               |
| --------------- | ---- | ----------------------------------------- |
| Server          | 8080 | REST API + worker pool                    |
| Browser Service | 9222 | Camoufox anti-detect browser (WebSocket)  |
| PostgreSQL      | 5432 | Job storage                               |

Web Dashboard

A built-in web UI is included for visual scraping, job tracking, and configuration:

cd webapp && npm install && npm run dev

Open http://localhost:3000 — the dashboard proxies API calls to the server on port 8080.

Pages: Dashboard (health + quick scrape) | Scrape (sync/async/batch with live results) | Jobs (tracked history with status filters) | Domain Configs (CRUD with handler chain management) | Proxy Scores (Thompson Sampling performance)

Scrape a URL

Synchronous (recommended for getting started):

curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .

One request, full result back. No polling. Timeout: 30 seconds.

Asynchronous (for long-running scrapes):

# Submit
curl -s -X POST http://localhost:8080/v1/url-scraper \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Poll for result
curl -s http://localhost:8080/v1/url-scraper/JOB_UUID | jq .
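
In a real client you loop on the poll call until the job finishes. A minimal Go sketch follows: the id and status fields match the response shown under API Reference below, while the "failed" terminal status is an assumption.

// poll.go: submit an async job, then poll until it reaches a terminal
// status. "completed" matches the response example in the API Reference;
// "failed" is an assumed failure status.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Submit the job.
    resp, err := http.Post("http://localhost:8080/v1/url-scraper",
        "application/json", bytes.NewReader([]byte(`{"url": "https://example.com"}`)))
    if err != nil {
        panic(err)
    }
    var job struct {
        ID string `json:"id"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
        panic(err)
    }
    resp.Body.Close()

    // Poll every two seconds until done.
    for {
        r, err := http.Get("http://localhost:8080/v1/url-scraper/" + job.ID)
        if err != nil {
            panic(err)
        }
        var result struct {
            Status   string `json:"status"`
            Markdown string `json:"markdown"`
        }
        if err := json.NewDecoder(r.Body).Decode(&result); err != nil {
            panic(err)
        }
        r.Body.Close()
        if result.Status == "completed" || result.Status == "failed" {
            fmt.Printf("%s: %d bytes of markdown\n", result.Status, len(result.Markdown))
            return
        }
        time.Sleep(2 * time.Second)
    }
}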

With AI-powered JSON extraction (requires GEMINI_API_KEY):

curl -s -X POST http://localhost:8080/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generateJson": true}' | jq .generatedJson

No API keys required for the scraper itself. Just JSON in, results out.

Architecture

                    ┌─────────────────┐
                    │   Your App      │
                    │   (cURL / CLI)  │
                    └────────┬────────┘
                             │ HTTP
                             ▼
                    ┌─────────────────┐         ┌──────────┐
                    │     Server      │────────▶│  Gemini  │
                    │   (Go/Fiber)    │ optional│  (JSON)  │
                    │   Port 8080     │         └──────────┘
                    └──┬──────┬───┬──┘
                       │      │   │
            ┌──────────┘      │   └────────────┐
            ▼                 ▼                 ▼
      ┌──────────┐   ┌──────────────┐   ┌──────────────┐
      │ Storage  │   │   Browser    │   │ API Handler  │
      │ Postgres │   │   Service    │   │ (anakin.io   │
      │ or memory│   │  (Camoufox)  │   │  or custom)  │
      │(optional)│   │  (optional)  │   │  (optional)  │
      └──────────┘   └──────────────┘   └──────────────┘

The server is a single Go binary that runs with zero dependencies. Optionally add PostgreSQL for persistence, the browser service for JavaScript-heavy sites, and API handlers for hard-to-scrape sites. Workers execute the handler chain (HTTP → browser → API fallback), convert HTML to markdown, and optionally extract structured JSON via Gemini.

API Reference

See docs/API.md for the complete API reference. Quick overview:

| Method | Endpoint                  | Description                                                         |
| ------ | ------------------------- | ------------------------------------------------------------------- |
| POST   | /v1/scrape                | Sync — scrape a URL and get the result back directly (30s timeout)  |
| POST   | /v1/url-scraper           | Async — submit a scrape job, returns job ID                         |
| GET    | /v1/url-scraper/:id       | Poll for async job result                                           |
| POST   | /v1/url-scraper/batch     | Batch scrape up to 10 URLs (see the sketch below this table)        |
| GET    | /v1/url-scraper/batch/:id | Poll for batch result                                               |
| POST   | /v1/domain-configs        | Create a per-domain scraping config                                 |
| GET    | /v1/domain-configs        | List all domain configs                                             |
| GET    | /v1/proxy/scores          | View proxy Thompson Sampling scores                                 |
| GET    | /v1/telemetry/status      | View telemetry state and next payload (details)                     |
| GET    | /health                   | Health check                                                        |
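
The batch request body is not spelled out above; the sketch below assumes a urls array, which is a guess rather than the documented field name (check docs/API.md).

// batch.go: illustrative batch submission. The "urls" field name is an
// assumption; see docs/API.md for the real request format.
package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

func main() {
    // Submit up to 10 URLs in one job.
    body := strings.NewReader(`{"urls": ["https://example.com", "https://httpbin.org/html"]}`)
    resp, err := http.Post("http://localhost:8080/v1/url-scraper/batch", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out)) // response carries the batch job ID; poll GET /v1/url-scraper/batch/:id
}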

Request Fields

| Field        | Type   | Default  | Description                                                     |
| ------------ | ------ | -------- | --------------------------------------------------------------- |
| url          | string | required | URL to scrape                                                   |
| useBrowser   | bool   | false    | Skip HTTP handler, go straight to browser                       |
| generateJson | bool   | false    | Extract structured JSON via Gemini AI (requires GEMINI_API_KEY) |

Response

{
  "id": "550e8400-...",
  "status": "completed",
  "url": "https://example.com",
  "html": "<html>...</html>",
  "cleanedHtml": "<main>...</main>",
  "markdown": "# Page Title\n\nContent...",
  "generatedJson": {
    "status": "success",
    "data": {"title": "Page Title", "content": "..."}
  },
  "durationMs": 1234
}

generatedJson is only present when generateJson: true and GEMINI_API_KEY is configured.
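
For typed clients, the response maps directly onto a struct. A Go sketch mirroring the JSON above, with generatedJson kept raw since its shape depends on the page:

// ScrapeResponse mirrors the response JSON shown above (uses encoding/json).
type ScrapeResponse struct {
    ID            string          `json:"id"`
    Status        string          `json:"status"`
    URL           string          `json:"url"`
    HTML          string          `json:"html"`
    CleanedHTML   string          `json:"cleanedHtml"`
    Markdown      string          `json:"markdown"`
    GeneratedJSON json.RawMessage `json:"generatedJson,omitempty"` // only set when generateJson: true
    DurationMs    int64           `json:"durationMs"`
}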

Handler Chain

Each scrape job goes through the handler chain. On failure, it falls back to the next handler:

HTTP Handler (fast, ~200ms) ──fail──▶ Browser Handler (Camoufox) ──fail──▶ API Handler (optional)

HTTP Handler — direct HTTP GET with a browser user-agent. Handles static HTML, server-rendered pages. No browser overhead.

Browser Handler — connects to Camoufox (anti-detect Firefox) over WebSocket via Playwright protocol. Full JavaScript rendering, network-idle detection, realistic browser fingerprints. Handles SPAs, lazy-loaded content, and sites with anti-bot protection.

API Handler (optional) — delegates to an external scraping API when local handlers fail. Set ANAKIN_API_KEY to enable the built-in anakin.io fallback for hard-to-scrape sites (Cloudflare, DataDome, etc.). See Adding Custom API Handlers below.
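
Conceptually the chain is an ordered loop: skip handlers that are unhealthy or opt out, return the first success, fall through on error. The sketch below expresses that in terms of the ScrapingHandler interface shown under Extending the Chain; it is not the project's actual implementation.

// runChain: conceptual sketch of the fallback loop (imports: context, fmt).
// Not the project's actual code.
func runChain(ctx context.Context, chain []ScrapingHandler, req *HandlerRequest) (*ScrapeResult, error) {
    var lastErr error
    for _, h := range chain {
        if !h.IsHealthy() || !h.CanHandle(ctx, req) {
            continue // handler opts out, e.g. browser service not running
        }
        result, err := h.Scrape(ctx, req)
        if err == nil {
            return result, nil // first success wins; later handlers never run
        }
        lastErr = err // record the failure and fall through to the next handler
    }
    return nil, fmt.Errorf("all handlers failed, last error: %w", lastErr)
}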

Adding Custom API Handlers

The API handler pattern makes it easy to integrate any third-party scraping service as a chain fallback. The built-in anakin.io handler (server/internal/handler/api.go) is a working example — copy and modify it for your provider:

  1. Copy api.go to my_provider.go
  2. Add a constructor like NewAnakinHandler — set your provider's URL, auth header name, and response format
  3. Register in main.go:
    if cfg.MyProviderAPIKey != "" {
        handlers = append(handlers, handler.NewAPIHandler(handler.APIHandlerConfig{
            Name:       "my-provider",
            APIURL:     "https://api.my-provider.com/scrape",
            APIKey:     cfg.MyProviderAPIKey,
            AuthHeader: "Authorization",  // or "X-API-Key", "Bearer", etc.
        }))
    }
  4. Add env var to config.go: MyProviderAPIKey: os.Getenv("MY_PROVIDER_API_KEY")

API keys always come from environment variables — never hardcoded. The handler only activates when its key is set.

Extending the Chain

Implement the ScrapingHandler interface:

type ScrapingHandler interface {
    Name() string
    CanHandle(ctx context.Context, req *HandlerRequest) bool
    Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error)
    IsHealthy() bool
}
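
As a sketch, a minimal handler that fetches pages with a plain GET could look like the following. The req.URL and ScrapeResult.HTML field names are assumptions for illustration; examples/custom-handler/ has the real shape.

// EchoHandler: illustrative ScrapingHandler (imports: context, io, net/http).
// req.URL and ScrapeResult.HTML are assumed field names.
type EchoHandler struct{}

func (h *EchoHandler) Name() string { return "echo" }

func (h *EchoHandler) IsHealthy() bool { return true }

func (h *EchoHandler) CanHandle(ctx context.Context, req *HandlerRequest) bool {
    return true // accept every job; real handlers can filter by scheme or domain
}

func (h *EchoHandler) Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error) {
    httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, req.URL, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return &ScrapeResult{HTML: string(body)}, nil
}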

See examples/custom-handler/.

Configuration

All configuration via environment variables:

| Variable             | Default                                | Description                                                                    |
| -------------------- | -------------------------------------- | ------------------------------------------------------------------------------ |
| PORT                 | 8080                                   | Server port                                                                     |
| DATABASE_URL         |                                        | PostgreSQL connection string (optional — uses in-memory storage when not set)  |
| BROWSER_WS_URL       | ws://localhost:9222/camoufox           | Browser service WebSocket URL                                                   |
| BROWSER_TIMEOUT      | 60                                     | Page navigation timeout (seconds)                                               |
| BROWSER_LOAD_WAIT    | 2                                      | Extra wait after page load (seconds)                                            |
| WORKER_POOL_SIZE     | 5                                      | Concurrent scrape workers                                                       |
| JOB_BUFFER_SIZE      | 100                                    | Job queue buffer size                                                           |
| JOB_TIMEOUT          | 120                                    | Max job duration (seconds)                                                      |
| PROXY_URL            |                                        | Default HTTP proxy for the HTTP handler                                         |
| PROXY_URLS           |                                        | Comma-separated proxy pool for auto-selection via Thompson Sampling (see the sketch below this table) |
| GEMINI_API_KEY       |                                        | Google Gemini API key for structured JSON extraction (get one free)             |
| ANAKIN_API_KEY       |                                        | anakin.io API key — enables hosted API as chain fallback for hard-to-scrape sites |
| LOG_LEVEL            | INFO                                   | Log level (DEBUG, INFO, WARN, ERROR)                                            |
| TELEMETRY            | on                                     | Anonymous usage telemetry (off to disable — see TELEMETRY.md)                   |
| TELEMETRY_URL        | https://telemetry.anakin.io/v1/collect | Custom telemetry endpoint                                                       |
| DISABLE_HOSTED_HINTS |                                        | Set to true to suppress hosted service tips in error messages                   |
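
PROXY_URLS feeds the Thompson Sampling selection: the server tracks successes and failures per proxy, samples from each proxy's Beta(successes+1, failures+1) posterior, and uses the proxy with the highest draw, so proven proxies win most requests while uncertain ones still get explored. A standalone sketch using gonum for Beta sampling; illustrative only, not the project's actual code:

// proxy_pick.go: conceptual Thompson Sampling sketch, not the project's
// actual code. Requires gonum: go get gonum.org/v1/gonum/stat/distuv
package main

import (
    "fmt"

    "gonum.org/v1/gonum/stat/distuv"
)

type proxyStats struct {
    url                 string
    successes, failures float64 // observed outcomes for this proxy
}

// pick samples each proxy's Beta(successes+1, failures+1) posterior and
// returns the proxy with the highest draw.
func pick(pool []proxyStats) string {
    best, bestDraw := "", -1.0
    for _, p := range pool {
        draw := distuv.Beta{Alpha: p.successes + 1, Beta: p.failures + 1}.Rand()
        if draw > bestDraw {
            best, bestDraw = p.url, draw
        }
    }
    return best
}

func main() {
    pool := []proxyStats{
        {url: "http://proxy-a:8000", successes: 40, failures: 2}, // strong record, usually wins
        {url: "http://proxy-b:8000", successes: 3, failures: 3},  // uncertain, still explored sometimes
    }
    fmt.Println("selected:", pick(pool))
}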

Project Structure

anakinscraper-oss/
├── server/                     # Go server (API + workers)
│   ├── cmd/server/             # Entry point
│   └── internal/
│       ├── config/             # Environment config
│       ├── models/             # Data types
│       ├── worker/             # Channel-based worker pool
│       ├── handler/            # Scraping handlers (HTTP, Browser)
│       ├── converter/          # HTML → Markdown
│       ├── gemini/             # Gemini AI JSON extraction
│       ├── domain/             # Domain configs + failure detection
│       ├── store/              # Job storage (PostgreSQL or in-memory)
│       ├── proxy/              # Proxy pool + Thompson Sampling
│       ├── telemetry/          # Anonymous usage telemetry
│       ├── processor/          # Job processing
│       └── http/
│           ├── handlers/       # API request handlers
│           └── router/         # Route registration
├── browser-service/            # Camoufox anti-detect browser server
├── webapp/                     # React web dashboard (Vite + Tailwind)
├── openclaw-skill/             # OpenClaw skill wrapper
├── examples/                   # Usage examples
├── docker-compose.yml          # Full stack (3 containers)
├── scripts/init-db.sql         # Database schema
└── .env.example                # Config template

Development

Running Locally (without Docker)

Minimal (Go only):

cd server && go run cmd/server/main.go

No database, no browser service. HTTP handler scrapes static sites. Jobs stored in memory.

Full local stack (Go + Python + PostgreSQL):

# Terminal 1: PostgreSQL
docker compose up postgres -d

# Terminal 2: Browser Service (for JS-heavy sites)
cd browser-service && pip install -r requirements.txt && python server.py

# Terminal 3: Server with persistence
cd server && DATABASE_URL="postgres://postgres:postgres@localhost:5432/anakinscraper?sslmode=disable" go run cmd/server/main.go

Running Tests

cd server && go test ./...

Building

cd server && go build -o server ./cmd/server

CLI

Use the Anakin CLI to scrape from your terminal. It works against both self-hosted and the hosted API:

# Install
pip install anakin-cli

# Scrape via your local instance (no API key needed)
anakin scrape "https://example.com" --api-url http://localhost:8080

# Or set it as your default
export ANAKIN_API_URL="http://localhost:8080"
anakin scrape "https://example.com"

# Extract structured JSON
anakin scrape "https://example.com" --format json --api-url http://localhost:8080

# Batch scrape
anakin scrape-batch "https://example.com" "https://httpbin.org/html" --api-url http://localhost:8080

The same CLI also supports AI web search and deep research via the hosted API — get a free API key to unlock those features.

See the anakin-cli repo for full usage.

Integrations

OpenClaw Skill

Use AnakinScraper as an OpenClaw skill:

cp -r openclaw-skill ~/.openclaw/workspace/skills/anakinscraper

See openclaw-skill/SKILL.md.

Self-Hosted vs Hosted

This repo gives you the full scraping engine. anakin.io adds the infrastructure you'd otherwise build yourself:

| Feature                    | Self-Hosted (this repo)         | Hosted (anakin.io)           |
| -------------------------- | ------------------------------- | ---------------------------- |
| Sync + async scraping      | Yes                             | Yes                          |
| Batch scraping             | Yes                             | Yes                          |
| Anti-detect browser        | Yes                             | Yes                          |
| Structured JSON extraction | Yes (bring your own Gemini key) | Yes (built-in)               |
| Domain configs             | Yes                             | Yes                          |
| Proxy auto-selection       | Yes (bring your own proxies)    | Yes (195 countries included) |
| Geo-targeted proxies       | No                              | 195 countries                |
| AI web search              | No                              | Yes                          |
| Deep agentic research      | No                              | Yes                          |
| Zero infrastructure        | No                              | Yes                          |

Already self-hosting? Switch to hosted with one line — same API, same CLI:

anakin login --api-key "ak-xxx"   # get your key at anakin.io/dashboard
anakin scrape "https://example.com"  # now routes through hosted

Try it free →

Community

Join our Discord for questions, discussion, and support.

Contributing

See CONTRIBUTING.md.

License

AGPL-3.0 — free to self-host. If you modify the code and offer it as a hosted service, you must open-source your changes.


Built by Anakin-Inc. If you find this useful, give us a star!
