
# Argus Web Crawler


A production-ready web crawler written in Rust, capable of handling billions of URLs with advanced features like content deduplication, distributed crawling, and JavaScript rendering.

## ⚡ Quick Start

### Installation

#### 📦 Cargo (Recommended)

```bash
cargo install argus-crawler
```

#### 🍺 Homebrew (macOS)

```bash
brew tap dedsecrattle/argus
brew install argus-crawler
```

#### 🐧 Snap (Linux)

```bash
snap install argus
```

#### 🪟 Chocolatey (Windows)

```bash
choco install argus
```

#### 🐳 Docker

```bash
docker run dedsecrattle/argus crawl --seed-url https://example.com
```

### Basic Usage

```bash
# Simple crawl
argus crawl --seed-url https://example.com --storage-dir ./data

# Distributed crawling with Redis
argus crawl --redis-url redis://localhost:6379 --workers 8

# JavaScript rendering (build with the js-render feature)
argus crawl --seed-url https://spa-example.com --js-render
```

## 🚀 Features

### Core Features

- **Robust Error Handling** - Automatic retries with exponential backoff
- **Robots.txt Compliance** - Full respect for crawl rules
- **Graceful Shutdown** - Clean interruption on SIGTERM/SIGINT
- **Rate Limiting** - Configurable per-domain delays
- **Content Limits** - Size limits for HTML, text, and binary content
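The retry behavior above amounts to capped exponential backoff. A minimal sketch, assuming a 500 ms base delay and a 30 s cap (illustrative constants, not Argus defaults, which may also add jitter):

```rust
use std::time::Duration;

/// Delay before retry `attempt` (0-based): the base delay doubles each
/// attempt and is capped. The constants in `main` are illustrative only.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    let delay = base_ms.saturating_mul(2u64.saturating_pow(attempt));
    Duration::from_millis(delay.min(cap_ms))
}

fn main() {
    // Delays grow 500 ms, 1 s, 2 s, 4 s, ... until the 30 s cap.
    for attempt in 0..6 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, 500, 30_000));
    }
}
```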

### Advanced Features

- 🔄 **Content Deduplication** - Simhash-based near-duplicate detection
- 🌐 **JavaScript Rendering** - Headless Chrome support for SPAs
- 📊 **Metadata Extraction** - Canonical URLs, hreflang, meta tags
- 🗺️ **Sitemap Parsing** - Auto-discovery and parsing of sitemaps
- 📦 **Multiple Storage Backends** - File system or S3-compatible storage
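To illustrate the simhash idea behind near-duplicate detection: each token's hash votes per bit, and two documents are near-duplicates when their fingerprints differ in few bits. This sketch tokenizes on whitespace and uses the standard-library hasher; Argus's actual implementation likely uses shingling and a different hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit simhash: each token's hash adds +1/-1 votes per bit position;
/// the fingerprint keeps a 1 wherever the vote total is positive.
fn simhash(text: &str) -> u64 {
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let hv = h.finish();
        for (bit, w) in weights.iter_mut().enumerate() {
            if (hv >> bit) & 1 == 1 { *w += 1 } else { *w -= 1 }
        }
    }
    weights
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &w)| if w > 0 { acc | (1u64 << bit) } else { acc })
}

/// Near-duplicate test: Hamming distance between fingerprints under a threshold.
fn near_duplicate(a: &str, b: &str, max_dist: u32) -> bool {
    (simhash(a) ^ simhash(b)).count_ones() <= max_dist
}

fn main() {
    let a = "the quick brown fox jumps over the lazy dog";
    let b = "the quick brown fox jumped over the lazy dog";
    println!("hamming distance: {}", (simhash(a) ^ simhash(b)).count_ones());
}
```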

### Scalability Features

- 🧠 **Bloom Filter Deduplication** - 1B URLs in only 1.2GB RAM
- 🔀 **Distributed Crawling** - Redis-based coordination
- 🌊 **Redis Streams** - High-throughput job distribution
- ☁️ **Object Storage** - Unlimited scaling with S3/MinIO
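The 1.2GB figure follows from the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits, assuming 1B URLs at a 1% false-positive rate:

```rust
/// Optimal Bloom filter size in bits for `n` items at false-positive
/// rate `p`: m = -n * ln(p) / (ln 2)^2.
fn bloom_bits(n: f64, p: f64) -> f64 {
    -(n * p.ln()) / std::f64::consts::LN_2.powi(2)
}

fn main() {
    let bits = bloom_bits(1e9, 0.01); // 1B URLs, 1% false positives
    println!("{:.2} GB", bits / 8.0 / 1e9); // ~1.20 GB, matching the figure quoted above
}
```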

## 📈 Performance

| Metric           | Single Node    | Distributed (10 nodes) |
|------------------|----------------|------------------------|
| URLs/second      | 100-1,000      | 1,000-10,000           |
| Memory (1B URLs) | 1.2 GB (Bloom) | 1.2 GB per node        |
| Storage          | Local disk     | S3 (unlimited)         |
| Network          | 1 Gbps         | 10 Gbps+               |

## 🏗️ Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontier      │    │    Fetcher      │    │    Parser       │
│                 │    │                 │    │                 │
│ • URL Queue     │───▶│ • HTTP Client   │───▶│ • HTML Parser   │
│ • Prioritization│    │ • Retry Logic   │    │ • Link Extract  │
│ • Deduplication │    │ • Rate Limit    │    │ • Metadata      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Deduplication │    │    Storage      │    │   Robots.txt    │
│                 │    │                 │    │                 │
│ • Bloom Filter  │    │ • File System   │    │ • Parser        │
│ • Simhash       │    │ • S3/MinIO      │    │ • Cache         │
│ • Redis         │    │ • Metadata      │    │ • Rules         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
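The Frontier → Fetcher → Parser flow can be mimicked with ordinary channels. This is a conceptual sketch with stubbed stages, not Argus's actual (async, multi-worker) implementation:

```rust
use std::sync::mpsc;
use std::thread;

/// Toy three-stage pipeline: frontier -> fetcher -> parser, joined by
/// channels. The "fetch" and "parse" steps are stubs standing in for
/// real HTTP and HTML work.
fn run_pipeline(seeds: Vec<String>) -> Vec<String> {
    let (frontier_tx, fetch_rx) = mpsc::channel::<String>();
    let (parse_tx, parse_rx) = mpsc::channel::<String>();

    // Fetcher stage on its own thread: "download" each queued URL.
    let fetcher = thread::spawn(move || {
        for url in fetch_rx {
            parse_tx.send(format!("<html>{url}</html>")).unwrap();
        }
    });

    // Frontier stage: enqueue the seed URLs, then close the queue so
    // the pipeline drains and the threads exit.
    for url in seeds {
        frontier_tx.send(url).unwrap();
    }
    drop(frontier_tx);

    // Parser stage: drain fetched pages (link extraction would go here).
    let pages: Vec<String> = parse_rx.iter().collect();
    fetcher.join().unwrap();
    pages
}

fn main() {
    for page in run_pipeline(vec!["https://example.com".to_string()]) {
        println!("parsed: {page}");
    }
}
```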


## 💡 Examples

### Basic Crawling

```rust
use argus_cli::run_crawl;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    run_crawl(&[
        "crawl",
        "--seed-url", "https://example.com",
        "--max-depth", "3",
        "--storage-dir", "./data",
    ]).await
}
```

### Distributed Crawling

```rust
use argus_frontier::StreamFrontier;
use argus_dedupe::HybridSeenSet;
use argus_storage::S3Storage;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Redis Streams for job distribution
    let frontier = StreamFrontier::new(
        "redis://localhost:6379",
        Some("argus:jobs".to_string()),
        Some("workers".to_string()),
        "worker-1".to_string(),
    ).await?;

    // Bloom filter + Redis for deduplication
    let seen = HybridSeenSet::new(
        "redis://localhost:6379",
        None,
        1_000_000_000, // 1B URLs
        0.01,          // 1% false-positive rate
    ).await?;

    // S3 for unlimited storage
    let storage = S3Storage::new(
        "my-crawl-bucket".to_string(),
        Some("crawl/".to_string()),
    ).await?;

    let _ = (frontier, seen, storage); // wire these into your crawl loop
    Ok(())
}
```

### JavaScript Rendering

```bash
# Build with JS support
cargo build --release --features js-render

# Crawl SPA sites
argus crawl \
  --seed-url https://react-app.com \
  --js-render \
  --wait-for-selector "#content"
```

## 🛠️ Development

### Setup

```bash
git clone https://github.com/dedsecrattle/argus.git
cd argus
cargo build
cargo test
```

### Features

- `redis` - Enable Redis support (default)
- `s3` - Enable S3 storage
- `js-render` - Enable JavaScript rendering
- `all-features` - Enable everything

```bash
# Build with all features
cargo build --all-features

# Run tests with all features
cargo test --all-features
```

## 📦 Crates

This is a workspace with the following crates:

## 🐳 Docker

### Basic Usage

```bash
# Pull image
docker pull dedsecrattle/argus:latest

# Run crawl
docker run -v $(pwd)/data:/data dedsecrattle/argus \
  crawl --seed-url https://example.com --storage-dir /data
```

### With Docker Compose

```yaml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  argus:
    image: dedsecrattle/argus:latest
    command: crawl --redis-url redis://redis:6379
    volumes:
      - ./data:/data
    depends_on:
      - redis
```

## 🤝 Contributing

Contributions are welcome! Please read our Contributing Guide.

### Quick Start

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ Star us on GitHub!

Made with ❤️ by the Argus contributors
