A production-ready web crawler written in Rust, capable of handling billions of URLs with advanced features like content deduplication, distributed crawling, and JavaScript rendering.
Install with Cargo, Homebrew, Snap, or Chocolatey:

```sh
cargo install argus-crawler

brew tap dedsecrattle/argus
brew install argus-crawler

snap install argus

choco install argus
```

Or run it straight from Docker:

```sh
docker run dedsecrattle/argus crawl --seed-url https://example.com
```

```sh
# Simple crawl
argus crawl --seed-url https://example.com --storage-dir ./data

# Distributed crawling with Redis
argus crawl --redis-url redis://localhost:6379 --workers 8

# JavaScript rendering (build with js-render feature)
argus crawl --seed-url https://spa-example.com --js-render
```

- ✅ Robust Error Handling - Automatic retry with exponential backoff
- ✅ Robots.txt Compliance - Full respect for crawl rules
- ✅ Graceful Shutdown - Clean interruption on SIGTERM/SIGINT
- ✅ Rate Limiting - Configurable delays per domain
- ✅ Content Limits - Size limits for HTML, text, and binary content
- 🔄 Content Deduplication - Simhash-based near-duplicate detection
- 🌐 JavaScript Rendering - Headless Chrome support for SPAs
- 📊 Metadata Extraction - Canonical URLs, hreflang, meta tags
- 🗺️ Sitemap Parsing - Auto-discovery and parsing of sitemaps
- 📦 Multiple Storage Backends - File system or S3-compatible storage
- 🧠 Bloom Filter Deduplication - 1B URLs in only 1.2GB RAM
- 🔀 Distributed Crawling - Redis-based coordination
- 🌊 Redis Streams - High-throughput job distribution
- ☁️ Object Storage - Unlimited scaling with S3/MinIO
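Several of these features rest on small, standard techniques. As one example, retry with exponential backoff can be sketched in a few lines; the function name, base delay, and cap below are illustrative assumptions, not the argus-fetcher API:

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): doubles from `base`, capped at `cap`.
/// Illustrative sketch only; the real policy lives in the fetcher crate.
fn backoff_delay(base: Duration, cap: Duration, attempt: u32) -> Duration {
    base.checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(cap) // overflow means we are far past the cap anyway
        .min(cap)
}

fn main() {
    for attempt in 0..6 {
        let d = backoff_delay(Duration::from_millis(500), Duration::from_secs(30), attempt);
        println!("attempt {attempt}: wait {d:?}");
    }
}
```

Production retry loops usually add random jitter on top of this so that many workers hitting the same failing host do not retry in lockstep.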
| Metric | Single Node | Distributed (10 nodes) |
|---|---|---|
| URLs/second | 100-1000 | 1000-10000 |
| Memory (1B URLs) | 1.2GB (Bloom) | 1.2GB per node |
| Storage | Local disk | S3 (unlimited) |
| Network | 1 Gbps | 10 Gbps+ |
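The 1.2 GB memory figure follows from the standard Bloom filter sizing formula m = -n·ln(p)/(ln 2)² bits for n items at false-positive rate p. A quick sanity check for 1B URLs at a 1% false-positive rate:

```rust
/// Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2.
fn bloom_bits(n: f64, p: f64) -> f64 {
    -n * p.ln() / (2f64.ln().powi(2))
}

fn main() {
    let bits = bloom_bits(1e9, 0.01);
    let gb = bits / 8.0 / 1e9;        // bits -> bytes -> GB
    let k = (bits / 1e9) * 2f64.ln(); // optimal number of hash functions
    println!("{gb:.2} GB, k = {k:.1}");
}
```

This works out to roughly 1.20 GB with about 7 hash functions, matching the table above.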
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontier     │    │     Fetcher     │    │     Parser      │
│                 │    │                 │    │                 │
│ • URL Queue     │───▶│ • HTTP Client   │───▶│ • HTML Parser   │
│ • Prioritization│    │ • Retry Logic   │    │ • Link Extract  │
│ • Deduplication │    │ • Rate Limit    │    │ • Metadata      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Deduplication  │    │     Storage     │    │   Robots.txt    │
│                 │    │                 │    │                 │
│ • Bloom Filter  │    │ • File System   │    │ • Parser        │
│ • Simhash       │    │ • S3/MinIO      │    │ • Cache         │
│ • Redis         │    │ • Metadata      │    │ • Rules         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
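The fetcher's per-domain rate limiting amounts to remembering, per host, the earliest time the next request may go out. A minimal sketch of the idea (the type and method names here are hypothetical, not the argus-fetcher API):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the earliest time each host may be fetched again.
struct DomainLimiter {
    delay: Duration,
    next_ok: HashMap<String, Instant>,
}

impl DomainLimiter {
    fn new(delay: Duration) -> Self {
        Self { delay, next_ok: HashMap::new() }
    }

    /// Returns how long the caller should wait before fetching `host`,
    /// and books the next slot for that host.
    fn reserve(&mut self, host: &str) -> Duration {
        let now = Instant::now();
        let entry = self.next_ok.entry(host.to_string()).or_insert(now);
        let wait = entry.saturating_duration_since(now);
        *entry = (*entry).max(now) + self.delay;
        wait
    }
}

fn main() {
    let mut limiter = DomainLimiter::new(Duration::from_millis(500));
    println!("first:  {:?}", limiter.reserve("example.com")); // no wait
    println!("second: {:?}", limiter.reserve("example.com")); // ~500ms
    println!("other:  {:?}", limiter.reserve("other.com"));   // no wait
}
```

A real crawler would additionally honor `Crawl-delay` from robots.txt and evict idle hosts from the map.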
Argus can also be embedded as a library:

```rust
use argus_cli::run_crawl;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    run_crawl(&[
        "crawl",
        "--seed-url", "https://example.com",
        "--max-depth", "3",
        "--storage-dir", "./data",
    ]).await
}
```

```rust
use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;

async fn distributed_setup() -> anyhow::Result<()> {
    // Redis Streams for job distribution
    let frontier = StreamFrontier::new(
        "redis://localhost:6379",
        Some("argus:jobs".to_string()),
        Some("workers".to_string()),
        "worker-1".to_string(),
    ).await?;

    // Bloom filter + Redis for deduplication
    let seen = HybridSeenSet::new(
        "redis://localhost:6379",
        None,
        1_000_000_000, // 1B URLs
        0.01,          // 1% false-positive rate
    ).await?;

    // S3 for unlimited storage
    let storage = S3Storage::new(
        "my-crawl-bucket".to_string(),
        Some("crawl/".to_string()),
    ).await?;

    Ok(())
}
```

```sh
# Build with JS support
cargo build --release --features js-render

# Crawl SPA sites
argus crawl \
  --seed-url https://react-app.com \
  --js-render \
  --wait-for-selector "#content"
```

To build from source:

```sh
git clone https://github.com/dedsecrattle/argus.git
cd argus
cargo build
cargo test
```

Available feature flags:

- `redis` - Enable Redis support (default)
- `s3` - Enable S3 storage
- `js-render` - Enable JavaScript rendering
- `all-features` - Enable everything

```sh
# Build with all features
cargo build --all-features

# Run tests with all features
cargo test --all-features
```

This is a workspace with the following crates:
- `argus-crawler` - Command-line interface
- `argus-common` - Common types and utilities
- `argus-fetcher` - HTTP fetching with retry logic
- `argus-parser` - HTML and sitemap parsing
- `argus-dedupe` - Content deduplication with Simhash
- `argus-storage` - Storage backends
- `argus-frontier` - URL frontier implementations
- `argus-robots` - Robots.txt parsing
- `argus-worker` - Worker implementation
- `argus-config` - Configuration management
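To make the argus-dedupe idea concrete: simhash maps a document to a 64-bit fingerprint such that near-duplicate documents get fingerprints with a small Hamming distance. Below is a toy version over whitespace tokens, showing the general technique rather than the crate's actual API or tokenization:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy 64-bit simhash over whitespace tokens. Each token's hash "votes"
/// on every bit position; the sign of the tally sets the fingerprint bit.
fn simhash(text: &str) -> u64 {
    let mut weights = [0i64; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let hv = hasher.finish();
        for (i, w) in weights.iter_mut().enumerate() {
            if (hv >> i) & 1 == 1 { *w += 1; } else { *w -= 1; }
        }
    }
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &w)| if w > 0 { acc | (1u64 << i) } else { acc })
}

/// Two documents are near-duplicates when this distance is small.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    let a = simhash("the quick brown fox jumps over the lazy dog");
    let b = simhash("the quick brown fox jumped over the lazy dog");
    let c = simhash("an entirely different page about rust web crawlers");
    println!("near-duplicate distance: {}", hamming(a, b));
    println!("unrelated distance:      {}", hamming(a, c));
}
```

A deduplicator built on this treats two pages as near-duplicates when the distance falls below a threshold (3 is a common choice for 64-bit simhash).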
```sh
# Pull image
docker pull dedsecrattle/argus:latest

# Run crawl
docker run -v $(pwd)/data:/data dedsecrattle/argus \
  crawl --seed-url https://example.com --storage-dir /data
```

Or with Docker Compose:

```yaml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  argus:
    image: dedsecrattle/argus:latest
    command: crawl --redis-url redis://redis:6379
    volumes:
      - ./data:/data
    depends_on:
      - redis
```

Contributions are welcome! Please read our Contributing Guide.
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Rust
- Inspired by Scrapy and Nutch
- Icons by Feather Icons
Made with ❤️ by the Argus contributors