The open-source web scraping API for AI. Turn any website into LLM-ready markdown or structured data.
Self-host with a single command. No cloud dependencies. Powers RAG pipelines, AI agents, and data extraction at scale.
git clone https://github.com/Anakin-Inc/anakinscraper-oss.git && cd anakinscraper-oss && make up
# Scrape any website — one curl, full result:
curl -s -X POST http://localhost:8080/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | jq .markdown| AnakinScraper | Firecrawl | Crawlee | Scrapy | |
|---|---|---|---|---|
| Anti-detect browser | Camoufox (Firefox) | Headless Chrome | Playwright | No |
| Smart proxy selection | Thompson Sampling (ML) | Round-robin | Manual | Manual |
| Zero-config start | go run — no DB needed | Docker required | npm install | pip install |
| Single binary | Go — one 30MB binary | Node.js | Node.js | Python |
| Handler chain fallback | HTTP → Browser → API | Single mode | Single mode | Single mode |
| Structured JSON (AI) | Gemini extraction | LLM extraction | No | No |
- Handler chain with fallback — HTTP fetch → anti-detect browser → external API. Handlers are tried in order; when one fails, the next picks up automatically. Most pages resolve on the free local HTTP handler — paid APIs are only called for the ~5% that actually need them. Docs →
- Custom API handlers — plug in any third-party scraping service as a chain fallback. Only invoked when local handlers fail — saves 90%+ on API costs vs routing everything through a paid service. Built-in anakin.io handler included. How to add your own →
- Domain configs — per-domain scraping strategies: choose which handlers to use, set timeouts, retries, custom headers, block domains, and validate content with pattern matching. Docs →
- Failure detection — define failure patterns and required patterns per domain. If the scraped content matches a failure pattern (e.g. CAPTCHA page) or misses a required pattern, the job auto-retries with the next handler. Docs →
- Anti-detect browser — Camoufox (anti-detect Firefox) with realistic fingerprints, not headless Chrome. Docs →
- Proxy auto-select — Thompson Sampling picks the best proxy per domain, learning from success/failure in real time (see the sketch after this list). Docs →
- Structured JSON extraction — use Gemini AI to extract structured data from any page (bring your own API key)
- Sync + async + batch API — POST /v1/scrape for instant results, /v1/url-scraper for async with polling, batch up to 10 URLs
- LLM-ready markdown — automatic boilerplate removal, clean content extraction. Feed directly into RAG pipelines, Claude, GPT, or any LLM without preprocessing
- Web dashboard — built-in React UI for scraping, job tracking, domain config management, and proxy monitoring
- Zero-config mode — run with just Go, no database needed. Or use Docker for the full stack
- Self-contained — no Redis, no AWS, no message queues. Single Go binary. Optional PostgreSQL for persistence
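For intuition, the proxy auto-selection above can be sketched as a Beta-Bernoulli bandit: each proxy is an arm, each scrape outcome updates its posterior, and the next request goes to whichever arm draws the highest sample. A minimal illustrative sketch in Go, using gonum for Beta sampling (this shows the technique, not the repo's actual code):

package main

import (
	"fmt"

	"golang.org/x/exp/rand"
	"gonum.org/v1/gonum/stat/distuv"
)

// proxyArm tracks observed outcomes for one proxy.
type proxyArm struct {
	url       string
	successes float64
	failures  float64
}

// pickProxy draws one sample from Beta(successes+1, failures+1) per proxy and
// returns the proxy with the highest draw. Uncertain proxies occasionally win
// (exploration); reliably good ones win most of the time (exploitation).
func pickProxy(arms []proxyArm, src rand.Source) *proxyArm {
	best, bestSample := &arms[0], -1.0
	for i := range arms {
		d := distuv.Beta{Alpha: arms[i].successes + 1, Beta: arms[i].failures + 1, Src: src}
		if s := d.Rand(); s > bestSample {
			best, bestSample = &arms[i], s
		}
	}
	return best
}

func main() {
	src := rand.NewSource(1)
	arms := []proxyArm{{url: "http://proxy-a:8000"}, {url: "http://proxy-b:8000"}}
	p := pickProxy(arms, src)
	fmt.Println("selected:", p.url)
	p.successes++ // on a failed scrape, increment p.failures instead
}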
Just Go 1.25+. Two commands:
cd server && go run cmd/server/main.go
# In another terminal:
curl -s -X POST http://localhost:8080/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | jq .markdownJobs are stored in memory (lost on restart). For persistence, set DATABASE_URL. For JavaScript-heavy sites, add the browser service via Docker.
- Docker and Docker Compose
git clone https://github.com/Anakin-Inc/anakinscraper-oss.git
cd anakinscraper-oss
make up

That's it. Three containers start:
| Service | Port | Description |
|---|---|---|
| Server | 8080 | REST API + worker pool |
| Browser Service | 9222 | Camoufox anti-detect browser (WebSocket) |
| PostgreSQL | 5432 | Job storage |
A built-in web UI is included for visual scraping, job tracking, and configuration:
cd webapp && npm install && npm run dev

Open http://localhost:3000 — the dashboard proxies API calls to the server on port 8080.
Pages: Dashboard (health + quick scrape) | Scrape (sync/async/batch with live results) | Jobs (tracked history with status filters) | Domain Configs (CRUD with handler chain management) | Proxy Scores (Thompson Sampling performance)
Synchronous (recommended for getting started):
curl -s -X POST http://localhost:8080/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | jq .One request, full result back. No polling. Timeout: 30 seconds.
Asynchronous (for long-running scrapes):
# Submit
curl -s -X POST http://localhost:8080/v1/url-scraper \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Poll for result
curl -s http://localhost:8080/v1/url-scraper/JOB_UUID | jq .

With AI-powered JSON extraction (requires GEMINI_API_KEY):
curl -s -X POST http://localhost:8080/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "generateJson": true}' | jq .generatedJsonNo API keys required for the scraper itself. Just JSON in, results out.
┌─────────────────┐
│ Your App │
│ (cURL / CLI) │
└────────┬────────┘
│ HTTP
▼
┌─────────────────┐ ┌──────────┐
│ Server │────────▶│ Gemini │
│ (Go/Fiber) │ optional│ (JSON) │
│ Port 8080 │ └──────────┘
└──┬──────┬───┬──┘
│ │ │
┌──────────┘ │ └────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────────┐
│ Storage │ │ Browser │ │ API Handler │
│ Postgres │ │ Service │ │ (anakin.io │
│ or memory│ │ (Camoufox) │ │ or custom) │
│(optional)│ │ (optional) │ │ (optional) │
└──────────┘ └──────────────┘ └──────────────┘
The server is a single Go binary that runs with zero dependencies. Optionally add PostgreSQL for persistence, the browser service for JavaScript-heavy sites, and API handlers for hard-to-scrape sites. Workers execute the handler chain (HTTP → browser → API fallback), convert HTML to markdown, and optionally extract structured JSON via Gemini.
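As a rough sketch of that worker model (hypothetical names; the real implementation lives in server/internal/worker), a channel-based pool in Go looks like this:

package worker

import (
	"context"
	"time"
)

// Job is a queued scrape request, simplified for illustration.
type Job struct {
	ID  string
	URL string
}

// Pool fans jobs out to a fixed number of goroutines over a buffered channel,
// mirroring the WORKER_POOL_SIZE and JOB_BUFFER_SIZE settings documented below.
type Pool struct {
	jobs chan Job
}

// NewPool starts `workers` goroutines that drain the queue. Each job runs
// under its own deadline, mirroring JOB_TIMEOUT.
func NewPool(workers, buffer int, timeout time.Duration, process func(context.Context, Job)) *Pool {
	p := &Pool{jobs: make(chan Job, buffer)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range p.jobs {
				ctx, cancel := context.WithTimeout(context.Background(), timeout)
				process(ctx, job)
				cancel()
			}
		}()
	}
	return p
}

// Submit enqueues a job; it blocks once the buffer is full, giving natural backpressure.
func (p *Pool) Submit(j Job) { p.jobs <- j }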
See docs/API.md for the complete API reference. Quick overview:
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/scrape | Sync — scrape a URL and get the result back directly (30s timeout) |
| POST | /v1/url-scraper | Async — submit a scrape job, returns job ID |
| GET | /v1/url-scraper/:id | Poll for async job result |
| POST | /v1/url-scraper/batch | Batch scrape up to 10 URLs |
| GET | /v1/url-scraper/batch/:id | Poll for batch result |
| POST | /v1/domain-configs | Create a per-domain scraping config |
| GET | /v1/domain-configs | List all domain configs |
| GET | /v1/proxy/scores | View proxy Thompson Sampling scores |
| GET | /v1/telemetry/status | View telemetry state and next payload (details) |
| GET | /health | Health check |
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | required | URL to scrape |
| useBrowser | bool | false | Skip HTTP handler, go straight to browser |
| generateJson | bool | false | Extract structured JSON via Gemini AI (requires GEMINI_API_KEY) |
{
"id": "550e8400-...",
"status": "completed",
"url": "https://example.com",
"html": "<html>...</html>",
"cleanedHtml": "<main>...</main>",
"markdown": "# Page Title\n\nContent...",
"generatedJson": {
"status": "success",
"data": {"title": "Page Title", "content": "..."}
},
"durationMs": 1234
}

generatedJson is only present when generateJson: true and GEMINI_API_KEY is configured.
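Because the response shape above is plain JSON, calling the API from your own code takes a few lines. A minimal Go client sketch against a local instance, decoding only the fields it uses:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// ScrapeResult mirrors a subset of the response fields documented above.
type ScrapeResult struct {
	Status     string `json:"status"`
	Markdown   string `json:"markdown"`
	DurationMs int    `json:"durationMs"`
}

func main() {
	body, _ := json.Marshal(map[string]any{"url": "https://example.com"})
	resp, err := http.Post("http://localhost:8080/v1/scrape", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var result ScrapeResult
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s duration=%dms\n%s\n", result.Status, result.DurationMs, result.Markdown)
}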
Each scrape job goes through the handler chain. On failure, it falls back to the next handler:
HTTP Handler (fast, ~200ms) ──fail──▶ Browser Handler (Camoufox) ──fail──▶ API Handler (optional)
HTTP Handler — direct HTTP GET with a browser user-agent. Handles static HTML, server-rendered pages. No browser overhead.
Browser Handler — connects to Camoufox (anti-detect Firefox) over WebSocket via Playwright protocol. Full JavaScript rendering, network-idle detection, realistic browser fingerprints. Handles SPAs, lazy-loaded content, and sites with anti-bot protection.
API Handler (optional) — delegates to an external scraping API when local handlers fail. Set ANAKIN_API_KEY to enable the built-in anakin.io fallback for hard-to-scrape sites (Cloudflare, DataDome, etc.). See Adding Custom API Handlers below.
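Conceptually, the chain is just an ordered slice of handlers tried until one succeeds. A simplified sketch of that loop (it assumes the ScrapingHandler, HandlerRequest, and ScrapeResult types shown in the interface section below; the production logic also applies domain configs and failure patterns):

package handler

import (
	"context"
	"fmt"
)

// runChain tries each handler in order (HTTP, then browser, then API) and
// returns the first successful result. Illustrative sketch only.
func runChain(ctx context.Context, chain []ScrapingHandler, req *HandlerRequest) (*ScrapeResult, error) {
	var lastErr error
	for _, h := range chain {
		if !h.IsHealthy() || !h.CanHandle(ctx, req) {
			continue // skip handlers that are down or opted out for this request
		}
		result, err := h.Scrape(ctx, req)
		if err == nil {
			return result, nil // first success wins; costlier handlers never run
		}
		lastErr = err // fall through to the next handler in the chain
	}
	return nil, fmt.Errorf("all handlers failed, last error: %w", lastErr)
}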
The API handler pattern makes it easy to integrate any third-party scraping service as a chain fallback. The built-in anakin.io handler (server/internal/handler/api.go) is a working example — copy and modify it for your provider:
- Copy api.go to my_provider.go
- Add a constructor like NewAnakinHandler — set your provider's URL, auth header name, and response format
- Register it in main.go:

if cfg.MyProviderAPIKey != "" {
    handlers = append(handlers, handler.NewAPIHandler(handler.APIHandlerConfig{
        Name:       "my-provider",
        APIURL:     "https://api.my-provider.com/scrape",
        APIKey:     cfg.MyProviderAPIKey,
        AuthHeader: "Authorization", // or "X-API-Key", "Bearer", etc.
    }))
}

- Add the env var to config.go: MyProviderAPIKey: os.Getenv("MY_PROVIDER_API_KEY")
API keys always come from environment variables — never hardcoded. The handler only activates when its key is set.
Implement the ScrapingHandler interface:
type ScrapingHandler interface {
Name() string
CanHandle(ctx context.Context, req *HandlerRequest) bool
Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error)
IsHealthy() bool
}
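A minimal custom handler that satisfies the interface might look like the following. This is a hypothetical example: the field names req.URL and ScrapeResult.HTML are assumptions, so check the real structs in server/internal/handler before copying.

package handler

import (
	"context"
	"io"
	"net/http"
)

// EchoHandler is a minimal, hypothetical ScrapingHandler: a plain HTTP GET
// with no retries or anti-bot logic. It exists only to show the interface shape.
type EchoHandler struct{}

func (EchoHandler) Name() string { return "echo" }

// CanHandle can gate on URL scheme, domain config, request flags, etc.
func (EchoHandler) CanHandle(ctx context.Context, req *HandlerRequest) bool { return true }

func (EchoHandler) Scrape(ctx context.Context, req *HandlerRequest) (*ScrapeResult, error) {
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, req.URL, nil) // req.URL is an assumed field
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &ScrapeResult{HTML: string(body)}, nil // HTML is an assumed field name
}

// IsHealthy lets the chain skip handlers whose backing service is down.
func (EchoHandler) IsHealthy() bool { return true }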
All configuration via environment variables:

| Variable | Default | Description |
|---|---|---|
| PORT | 8080 | Server port |
| DATABASE_URL | — | PostgreSQL connection string (optional — uses in-memory storage when not set) |
| BROWSER_WS_URL | ws://localhost:9222/camoufox | Browser service WebSocket URL |
| BROWSER_TIMEOUT | 60 | Page navigation timeout (seconds) |
| BROWSER_LOAD_WAIT | 2 | Extra wait after page load (seconds) |
| WORKER_POOL_SIZE | 5 | Concurrent scrape workers |
| JOB_BUFFER_SIZE | 100 | Job queue buffer size |
| JOB_TIMEOUT | 120 | Max job duration (seconds) |
| PROXY_URL | — | Default HTTP proxy for the HTTP handler |
| PROXY_URLS | — | Comma-separated proxy pool for auto-selection (Thompson Sampling) |
| GEMINI_API_KEY | — | Google Gemini API key for structured JSON extraction (get one free) |
| ANAKIN_API_KEY | — | anakin.io API key — enables hosted API as chain fallback for hard-to-scrape sites |
| LOG_LEVEL | INFO | Log level (DEBUG, INFO, WARN, ERROR) |
| TELEMETRY | on | Anonymous usage telemetry (off to disable — see TELEMETRY.md) |
| TELEMETRY_URL | — | Custom telemetry endpoint (defaults to https://telemetry.anakin.io/v1/collect) |
| DISABLE_HOSTED_HINTS | — | Set to true to suppress hosted service tips in error messages |
anakinscraper-oss/
├── server/ # Go server (API + workers)
│ ├── cmd/server/ # Entry point
│ └── internal/
│ ├── config/ # Environment config
│ ├── models/ # Data types
│ ├── worker/ # Channel-based worker pool
│ ├── handler/ # Scraping handlers (HTTP, Browser)
│ ├── converter/ # HTML → Markdown
│ ├── gemini/ # Gemini AI JSON extraction
│ ├── domain/ # Domain configs + failure detection
│ ├── store/ # Job storage (PostgreSQL or in-memory)
│ ├── proxy/ # Proxy pool + Thompson Sampling
│ ├── telemetry/ # Anonymous usage telemetry
│ ├── processor/ # Job processing
│ └── http/
│ ├── handlers/ # API request handlers
│ └── router/ # Route registration
├── browser-service/ # Camoufox anti-detect browser server
├── webapp/ # React web dashboard (Vite + Tailwind)
├── openclaw-skill/ # OpenClaw skill wrapper
├── examples/ # Usage examples
├── docker-compose.yml # Full stack (3 containers)
├── scripts/init-db.sql # Database schema
└── .env.example # Config template
Minimal (Go only):
cd server && go run cmd/server/main.go

No database, no browser service. HTTP handler scrapes static sites. Jobs stored in memory.
Full local stack (Go + Python + PostgreSQL):
# Terminal 1: PostgreSQL
docker compose up postgres -d
# Terminal 2: Browser Service (for JS-heavy sites)
cd browser-service && pip install -r requirements.txt && python server.py
# Terminal 3: Server with persistence
cd server && DATABASE_URL="postgres://postgres:postgres@localhost:5432/anakinscraper?sslmode=disable" go run cmd/server/main.go

Run the tests:

cd server && go test ./...

Build a binary:

cd server && go build -o server ./cmd/server

Use the Anakin CLI to scrape from your terminal. It works against both self-hosted instances and the hosted API:
# Install
pip install anakin-cli
# Scrape via your local instance (no API key needed)
anakin scrape "https://example.com" --api-url http://localhost:8080
# Or set it as your default
export ANAKIN_API_URL="http://localhost:8080"
anakin scrape "https://example.com"
# Extract structured JSON
anakin scrape "https://example.com" --format json --api-url http://localhost:8080
# Batch scrape
anakin scrape-batch "https://example.com" "https://httpbin.org/html" --api-url http://localhost:8080

The same CLI also supports AI web search and deep research via the hosted API — get a free API key to unlock those features.
See the anakin-cli repo for full usage.
Use AnakinScraper as an OpenClaw skill:
cp -r openclaw-skill ~/.openclaw/workspace/skills/anakinscraper

This repo gives you the full scraping engine. anakin.io adds the infrastructure you'd otherwise build yourself:
| Feature | Self-Hosted (this repo) | Hosted (anakin.io) |
|---|---|---|
| Sync + async scraping | Yes | Yes |
| Batch scraping | Yes | Yes |
| Anti-detect browser | Yes | Yes |
| Structured JSON extraction | Yes (bring your own Gemini key) | Yes (built-in) |
| Domain configs | Yes | Yes |
| Proxy auto-selection | Yes (bring your own proxies) | Yes (195 countries included) |
| Geo-targeted proxies | — | 195 countries |
| AI web search | — | Yes |
| Deep agentic research | — | Yes |
| Zero infrastructure | — | Yes |
Already self-hosting? Switch to hosted with one line — same API, same CLI:
anakin login --api-key "ak-xxx" # get your key at anakin.io/dashboard
anakin scrape "https://example.com" # now routes through hostedJoin our Discord for questions, discussion, and support.
See CONTRIBUTING.md.
AGPL-3.0 — free to self-host. If you modify the code and offer it as a hosted service, you must open-source your changes.
Built by Anakin-Inc. If you find this useful, give us a star!