A lightweight, high-performance web scraping and crawling engine — a modern alternative to Firecrawl. Converts websites to LLM-ready markdown, extracts structured data with AI, and uses Chrome DevTools Protocol (CDP) for zero-dependency browser automation.
- Features
- Quick Start
- Installation
- Configuration
- API Endpoints
- Comparison with Firecrawl
- Architecture
- Development
- License
- Lightweight: No Playwright/Puppeteer — uses Chrome DevTools Protocol (CDP) directly
- Smart Fetching: Auto-fallback to CDP browser rendering for SPAs and bot-protected sites (see the sketch after this list)
- Clean Markdown: Mozilla Readability for semantic content extraction
- AI Extraction: Structured data extraction via LLMs (OpenAI, Anthropic, Ollama, NVIDIA Nemotron via OpenRouter)
- Site Mapping: Discover all URLs via sitemaps and link crawling
- Async Crawling: SQLite-backed job queue for scalable crawling
- Zero External Search: Automate search engines (Google, DuckDuckGo, Bing) via CDP — no third-party APIs
- MCP Server: Integrate with AI tools like Cursor and KiloCode
- Webhooks: Real-time crawl completion notifications
- PDF Support: Automatic PDF text extraction to markdown
- Agent Mode: AI-driven browser automation for complex tasks
- Browser Actions: Click, type, scroll, press keys via CDP (no Playwright needed)
- Screenshots: Full-page PNG/JPEG captures
- Multi-Format Output: Markdown, HTML, or screenshots from any scrape
- Batch Scraping: Async multi-URL scraping via `/v1/batch/scrape`
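The Smart Fetching feature above is the core engine idea: try a cheap static HTTP fetch first, and fall back to a full CDP browser render only when the response looks like an SPA shell or a bot wall. Here is a minimal TypeScript sketch of that strategy using the `chrome-remote-interface` package; the heuristic, function names, and debugging-port assumption are illustrative, not AeroCrawl's actual code:

```typescript
import CDP from "chrome-remote-interface";

// Heuristic (illustrative): tiny pages or explicit JS challenges suggest
// the static HTML is not the real content.
function looksRendered(html: string): boolean {
  return html.length > 2048 && !/enable javascript|challenge/i.test(html);
}

async function renderWithCDP(url: string): Promise<string> {
  // Assumes Chrome/Edge was started with --remote-debugging-port=9222.
  const client = await CDP();
  try {
    const { Page, Runtime } = client;
    await Page.enable();
    await Page.navigate({ url });
    await Page.loadEventFired(); // resolves once the page's load event fires
    const { result } = await Runtime.evaluate({
      expression: "document.documentElement.outerHTML",
    });
    return result.value as string;
  } finally {
    await client.close();
  }
}

// Static fetch first; browser render only when needed.
export async function smartFetch(url: string): Promise<string> {
  const res = await fetch(url);
  const html = await res.text();
  return res.ok && looksRendered(html) ? html : renderWithCDP(url);
}
```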
As an AI agent, you can use AeroCrawl in two ways:
PATH A: Use Directly via CLI. If you need to scrape something immediately in the terminal:

```bash
# 1. Install
# Windows (PowerShell)
iwr https://raw.githubusercontent.com/deviprasadshetty-dev/aerocrawl/main/install.ps1 -OutFile install.ps1; .\install.ps1

# Mac/Linux
curl -fsSL https://raw.githubusercontent.com/deviprasadshetty-dev/aerocrawl/main/install.sh -o install.sh && bash install.sh
# 2. Scrape
aerocrawl https://example.com
```

PATH B: Install as MCP Server for Yourself. To add AeroCrawl's tools natively to your own context window (like Cursor or KiloCode):

```bash
# 1. Install (same as above)
# 2. Generate configuration
aerocrawl --generate-mcp-config
```

(The AI editor will run the server automatically. No need to keep it running in the terminal.)
Option 1: Use install scripts (Recommended)

```bash
# Windows (PowerShell)
iwr https://raw.githubusercontent.com/deviprasadshetty-dev/aerocrawl/main/install.ps1 -OutFile install.ps1; .\install.ps1
# Mac/Linux
curl -fsSL https://raw.githubusercontent.com/deviprasadshetty-dev/aerocrawl/main/install.sh -o install.sh && bash install.sh
```

Option 2: Manual install

```bash
# Windows (PowerShell)
git clone https://github.com/deviprasadshetty-dev/aerocrawl.git; cd aerocrawl; npm install; npm run build; npm link

# Mac/Linux
git clone https://github.com/deviprasadshetty-dev/aerocrawl.git && cd aerocrawl && npm install && npm run build && npm link
```

Verify:

```bash
aerocrawl --help
```

The easiest way to use AeroCrawl — no server setup required:

```bash
# Scrape a single URL to markdown
aerocrawl https://example.com
# Extract structured data with AI (uses OpenRouter free models by default)
aerocrawl https://example.com -m extract
# Map all URLs on a site
aerocrawl https://example.com -m map
# Crawl an entire website
aerocrawl https://example.com -m crawl
# AI Agent Mode: automate tasks with natural language
aerocrawl https://example.com --goal "Extract all product prices and features"
# Search the web (zero external API)
aerocrawl "best web scraping tools" -m search
# Capture screenshots
aerocrawl https://example.com --screenshot
# Batch scrape from file
aerocrawl --batch urls.txt --output results.json
```
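For `--batch`, `urls.txt` is a plain list of targets, presumably one URL per line (the exact file format isn't documented in this README):

```text
https://example.com/page-1
https://example.com/page-2
https://example.com/docs/intro
```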
Integrate with Cursor, KiloCode, and other MCP-compatible tools. The AI assistant will run the server automatically when needed — you do not need to run it manually in your terminal.

Auto-Generate Configuration:

```bash
aerocrawl --generate-mcp-config
```

Manual MCP Configuration JSON:
Add this to your MCP settings file:

```json
{
"mcpServers": {
"aerocrawl": {
"command": "aerocrawl",
"args": ["-m", "mcp"]
}
}
}
```

- Cursor: Add to `~/.cursor/mcp.json` or `.cursor/mcp.json` in your project
- KiloCode: Add to KiloCode's MCP settings file
Available MCP Tools:
`scrape`, `extract`, `crawl`, `search`, `agent`
Usage in Cursor:

```text
@aerocrawl Can you scrape https://example.com and extract the main heading?
```
Start the HTTP API server for programmatic access:

```bash
# Start server (defaults to port 3000)
aerocrawl -m serve
# Or via Node.js
node dist/cli.js serve
```

AeroCrawl is not published to the npm registry. Install directly from GitHub:

Prerequisites:
- Node.js ≥ 18
- Git
- Chrome or Edge (CDP uses your system browser)
- Clone the repo: `git clone https://github.com/deviprasadshetty-dev/aerocrawl.git`
- Install dependencies: `cd aerocrawl && npm install`
- Build: `npm run build`
- Link globally (optional): `npm link`
Or install directly from GitHub with npm:

```bash
npm install github:deviprasadshetty-dev/aerocrawl
```

Prerequisites:
- Node.js ≥ 18
- Chrome/Edge installed (CDP uses your system browser)
Create a `.env` file in your working directory:

```bash
# LLM Provider (openai, anthropic, ollama, openrouter)
LLM_PROVIDER=openrouter
# OpenRouter (recommended - free models available)
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_MODEL=nvidia/nemotron-3-super-120b-a12b:free
# OpenAI
# OPENAI_API_KEY=sk-...
# OPENAI_MODEL=gpt-4o-mini
# Anthropic
# ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_MODEL=claude-3-5-haiku-20241022
# Ollama (local)
# OLLAMA_BASE_URL=http://localhost:11434
# OLLAMA_MODEL=llama3.2
# Server
PORT=3000
```
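These are ordinary process-environment variables. If you are scripting around the codebase, they would typically be loaded like this (a generic dotenv sketch, not AeroCrawl's internal config loader):

```typescript
import "dotenv/config"; // populate process.env from .env

const provider = process.env.LLM_PROVIDER ?? "openrouter";
const port = Number(process.env.PORT ?? 3000);
console.log(`LLM provider: ${provider}, API port: ${port}`);
```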
Base URL: `http://localhost:3000`

Scrape a single URL with optional CDP actions and multi-format output:

```bash
curl -X POST http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown", "html", "screenshot"]}'AI-driven browser automation:
AI-driven browser automation:

```bash
curl -X POST http://localhost:3000/v1/agent \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/pricing", "goal": "Extract all product prices"}'Batch scrape multiple URLs asynchronously:
Batch scrape multiple URLs asynchronously:

```bash
curl -X POST http://localhost:3000/v1/batch/scrape \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com/1", "https://example.com/2"]}'| Feature | Firecrawl | AeroCrawl |
| Feature | Firecrawl | AeroCrawl |
|---|---|---|
| Lightweight (No Playwright) | No | Yes |
| CDP Browser Automation | No | Yes |
| LLM-Ready Markdown | Yes | Yes |
| AI Structured Extraction | Yes | Yes |
| Site Mapping & Crawling | Yes | Yes |
| Browser Actions (Click, Type, etc.) | Yes | Yes |
| Screenshots & Multi-Format Output | Yes | Yes |
| Batch Scraping | Yes | Yes |
| AI Agent Mode | Yes | Yes |
| Zero-External-API Search | No | Yes |
| MCP Server Integration | Yes | Yes |
| Webhooks | Yes | Yes |
| PDF Support | No | Yes |
| Free AI Models (OpenRouter) | No | Yes |
Why choose AeroCrawl:

- 70% lighter than Firecrawl (no 300MB+ Playwright dependency)
- Local-first: SQLite instead of Redis/Postgres
- Zero-config: Works with your system Chrome/Edge out of the box
- Free AI: OpenRouter integration with free tier models
AeroCrawl uses a modular three-layer architecture:
- Engine Layer: SmartFetcher with CDP browser fallback
- Transformation Layer: MarkdownPipeline (JSDOM + Readability + Turndown; sketched below)
- Routing Layer: Express API + SQLite job queue
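As a rough illustration of the Transformation Layer, this is how a JSDOM + Readability + Turndown chain typically fits together (a generic sketch of the pattern, not AeroCrawl's actual MarkdownPipeline):

```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

// HTML in, LLM-ready markdown out: parse -> extract main article -> convert.
export function htmlToMarkdown(html: string, url: string): string {
  const dom = new JSDOM(html, { url }); // url lets relative links resolve
  const article = new Readability(dom.window.document).parse();
  const turndown = new TurndownService({ headingStyle: "atx" });
  // Fall back to the full page if Readability finds no main content.
  return turndown.turndown(article?.content ?? html);
}
```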
See ARCHITECTURE.md for full technical details.
Supported LLM providers:

- OpenRouter (recommended - free models available):
  - `nvidia/nemotron-3-super-120b-a12b:free`
  - `meta-llama/llama-3-8b-instruct:free`
  - `google/gemma-7b-it:free`
- OpenAI: GPT-4o, GPT-4o-mini, etc.
- Anthropic: Claude 3.5 Haiku, Sonnet, Opus, etc.
- Ollama: Local LLMs like Llama 3.2, Mistral, etc.
Development:

```bash
# Install dependencies
npm install
# Run in development mode (hot reload)
npm run dev
# Build production bundle
npm run build
# Type check
npx tsc --noEmit
```

License: ISC