Skip to content

ah2727/spiderly

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

24 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ•·οΈ Spiderly

A high-performance, concurrent web crawler and scraper built in Go.

Spiderly is a fast, flexible command-line web crawler designed for deep site exploration at scale. It leverages Colly for efficient crawling, GoQuery for HTML parsing, and supports headless browser interactions via ChromeDP. Spiderly can discover pages through XML sitemap parsing or recursive link following, process URLs in parallel chunks with a multi-worker pool, and export structured results to JSON or Markdown β€” all with a rich, colorized console experience.


✨ Features

Crawling & discovery

  • Guide Spiderly with sitemap seeds or recursive link-following while tuning depth, timeout, and delay to suit each target URL. Optional priority and URL filters plus the -recursive override help you balance auto-discovery with deep traversal. [cmd/main.go:23-52]

Parallel chunked processing

  • Chunked mode splits the crawl into batches so multiple workers (--chunk-size, --max-workers) can fetch pages concurrently while respecting per-worker concurrency, delays, and timeouts.

Proxy support

  • Route crawl traffic through one or more HTTP proxies via --proxy or SPIDERLY_PROXY. In chunked mode, workers receive sticky proxy assignments (worker index modulo proxy list).

Product intelligence

  • Enable -product-mode (or provide -product-pattern) to focus on ecommerce listings. Product mode auto-promotes chunked processing, raises the default page/chunk/worker counts, and lets you customize the regex that identifies product URLs. [cmd/main.go:43-118]
  • Markdown reports surface product summaries, price distributions, stock counts, and per-product metadata whenever product data is available, while JSON output and the console summary keep crawl stats, status codes, and keyword insights organized for automation. [cmd/main.go:269-405][cmd/main.go:500-580]

News intelligence

  • Enable --news-mode (or provide --news-pattern) to focus on article/news pages. News mode prioritizes news-like sitemaps, supports custom sitemap keyword hints via --news-sitemaps, and extracts headline/author/publish date/tags using DOM + meta tag parsing (no JSON-LD path).

Output & observability

  • Save indented JSON or illustrated Markdown reports, and rely on Spiderly's summary box, verbose logging, and ANSI-safe mode to understand throughput, HTTP statuses, and any errors that arise during the run. [cmd/main.go:269-365][cmd/main.go:539-670]

πŸ“¦ Installation

Prerequisites

  • Go 1.25 or later

From source

# Clone the repository
git clone https://github.com/your-org/spiderly.git
cd spiderly

# Download dependencies
go mod download

# Build the binary
go build -o spiderly cmd/main.go

Using go install

go install spiderly/cmd@latest

πŸš€ Usage

spiderly -url <target> [options]

CLI Flags

Flag Default Description
-url (required) Target URL to crawl
-pages 100 Maximum number of pages to scrape
-depth 10 Maximum crawl depth
-concurrency 5 Concurrent requests per worker
-timeout 30s Request timeout (Go duration, e.g. 10s, 1m)
-delay 200ms Delay between requests (Go duration)
-chunked false Enable parallel chunked processing
-chunk-size 50 Number of URLs per chunk
--max-workers 4 Number of parallel chunk workers (new crawl subcommand flag)
--proxy β€” Proxy URL(s), supports repeated or comma-separated values
-sitemap β€” Direct sitemap URL (skip auto-discovery)
-min-priority 0 Minimum sitemap priority filter (0.0 - 1.0)
-url-pattern β€” Regex to filter sitemap URLs
-product-mode false Enable product-only crawl mode (auto-enables chunked)
-product-pattern β€” Custom regex for product URLs
--news-mode false Enable news/article extraction mode
--news-pattern β€” Custom regex for news/article URLs
--news-sitemaps β€” Preferred sitemap keywords for news mode (e.g. news,press)
-output β€” File path for JSON output
-markdown β€” File path for Markdown report output
-recursive false Force recursive crawl (skip sitemap discovery)
-verbose false Enable verbose / debug logging
-no-color false Disable colored terminal output

Examples

Basic crawl with defaults:

spiderly -url https://example.com

High-throughput parallel crawl:

spiderly crawl https://example.com --max-pages 500 --chunked --chunk-size 100 --max-workers 8

Crawl through a single proxy:

spiderly crawl https://example.com --proxy http://127.0.0.1:8080

Rotate proxies across chunk workers:

SPIDERLY_PROXY=http://p1:8080,http://p2:8080 spiderly crawl https://example.com --chunked --max-workers 4

**News crawl with tag parsing:**

```bash
spiderly crawl https://news.example.com --news-mode --news-pattern "/news|article/" --news-sitemaps news,press,headline

**Export results to JSON and Markdown:**

```bash
spiderly -url https://example.com -output results.json -markdown report.md -verbose

Recursive crawl with custom depth and timeout:

spiderly -url https://example.com -recursive -depth 5 -timeout 15s -concurrency 10

πŸ—οΈ Project Structure

.
β”œβ”€β”€ cmd/
β”‚   └── main.go              # CLI entry point β€” flag parsing, orchestration, output saving
β”œβ”€β”€ internal/
β”‚   β”œβ”€β”€ chunker/
β”‚   β”‚   └── chunker.go       # Parallel chunk processing engine (worker pool)
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   └──        # Core configuration, main run loop, console UI
β”‚   β”œβ”€β”€ crawler/              # Colly-based web crawling engine
β”‚   β”œβ”€β”€ models/               # Shared data structures (ScrapedPage, CrawlStats, etc.)
β”‚   β”œβ”€β”€ scraper/              # Extraction layer binding crawler to data pipeline
β”‚   β”œβ”€β”€ sitemap/              # Sitemap discovery and XML parsing
β”‚   └── ui/                   # Terminal UI components
β”œβ”€β”€ go.mod                    # Go module definition (module spiderly, go 1.25.7)
β”œβ”€β”€ go.sum                    # Dependency checksums
└── README.md                 # This file
Path Description
cmd/main.go Application entry point (507 lines). Parses all CLI flags, initialises the Core, triggers the crawl, and writes JSON / Markdown output files.
internal/chunker/ Parallel chunk processor (534 lines). Splits a URL list into fixed-size chunks and dispatches them to a configurable pool of workers for concurrent scraping.
internal/core/ Core orchestrator (944 lines). Holds the CoreConfig, manages the crawl lifecycle, coordinates the crawler and chunker, and renders the rich colorized console summary.
internal/crawler/ Colly-powered crawling engine. Handles HTTP requests, link extraction, and robots.txt compliance.
internal/models/ Shared data models β€” ScrapedPage, CrawlStats, sitemap entries, and WebSocket message types.
internal/sitemap/ Sitemap discovery β€” fetches and parses XML sitemaps to seed the URL queue.
internal/scraper/ Extraction glue layer connecting the crawler output to the data pipeline and optional dashboard.
internal/ui/ Terminal and web dashboard UI components for real-time monitoring.
go.mod / go.sum Go module files tracking dependencies (Colly, GoQuery, ChromeDP, Lipgloss, and more).

βš™οΈ Configuration

Product mode & URL filters

When -product-mode is enabled (or a custom -product-pattern is provided), Spiderly auto-enables chunked processing, increases default page, chunk, and worker counts, and uses a regex to identify product pages. You can also supply a direct sitemap (-sitemap), filter by minimum priority (-min-priority), or apply a URL filter (-url-pattern), or force recursive link-following with -recursive. [cmd/main.go:44-51][cmd/main.go:68-96]

# Crawl products from sitemap with custom pattern and save JSON
spiderly -url https://example.com -product-mode -product-pattern "/product/" -output products.json

News mode & tag parsing

When --news-mode is enabled (or a custom --news-pattern is provided), Spiderly prioritizes news-like sitemap files and parses article metadata from HTML/meta tags and page tag links. This mode does not rely on JSON-LD extraction.

spiderly crawl https://news.example.com --news-mode --news-pattern "/news|article/" --news-sitemaps news,press

Chunked processing

Split large crawls into batches of --chunk-size URLs processed concurrently across --max-workers. Each chunk worker still respects --concurrency, --delay, and --timeout to maintain throughput and politeness.

# Process 1000 pages in chunks of 200 across 5 workers
spiderly crawl https://example.com --max-pages 1000 --chunked --chunk-size 200 --max-workers 5

Proxy configuration

Use --proxy for explicit proxy URLs, or SPIDERLY_PROXY for environment-based configuration.

# Single proxy
spiderly crawl https://example.com --proxy http://127.0.0.1:8080

# Multiple proxies (comma-separated)
SPIDERLY_PROXY=http://p1:8080,http://p2:8080 spiderly crawl https://example.com --chunked --max-workers 4

Output formats & logging

Use -output for JSON export or -markdown for a rich Markdown report. Both can be combined. [cmd/main.go:269-365]

Format Flag Description
JSON -output results.json Pretty-printed JSON array of scraped pages.
Markdown -markdown report.md Human-readable Markdown report with stats, tables, and per-page details.

Verbosity & colors

  • -verbose prints detailed per-request logs.
  • -no-color removes ANSI colors for compatibility with text outputs. [cmd/main.go:31-33]

🀝 Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository and create a new branch from main.
  2. Make your changes β€” please keep commits focused and well-described.
  3. Run tests and ensure the build passes:
    go build ./...
    go vet ./...
  4. Open a Pull Request with a clear description of what you changed and why.

Guidelines

  • Follow standard Go conventions and formatting (gofmt / goimports).
  • Keep public API changes backward-compatible where possible.
  • Add or update documentation for any new flags or features.
  • Be respectful and constructive in reviews and discussions.

πŸ“„ License

This project is licensed under the MIT License. See the LICENSE file for details.


Built with ❀️ in Go β€” happy crawling!

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages