A high-performance, concurrent web crawler and scraper built in Go.
Spiderly is a fast, flexible command-line web crawler designed for deep site exploration at scale. It leverages Colly for efficient crawling, GoQuery for HTML parsing, and supports headless browser interactions via ChromeDP. Spiderly can discover pages through XML sitemap parsing or recursive link following, process URLs in parallel chunks with a multi-worker pool, and export structured results to JSON or Markdown, all with a rich, colorized console experience.
- Guide Spiderly with sitemap seeds or recursive link-following while tuning depth, timeout, and delay to suit each target URL. Optional priority and URL filters plus the `-recursive` override help you balance auto-discovery with deep traversal. [cmd/main.go:23-52]
- Chunked mode splits the crawl into batches so multiple workers (`--chunk-size`, `--max-workers`) can fetch pages concurrently while respecting per-worker concurrency, delays, and timeouts.
- Route crawl traffic through one or more HTTP proxies via `--proxy` or `SPIDERLY_PROXY`. In chunked mode, each worker receives a sticky proxy assignment (worker index modulo the number of proxies in the list).
- Enable `-product-mode` (or provide `-product-pattern`) to focus on ecommerce listings. Product mode auto-promotes chunked processing, raises the default page/chunk/worker counts, and lets you customize the regex that identifies product URLs. [cmd/main.go:43-118]
- Markdown reports surface product summaries, price distributions, stock counts, and per-product metadata whenever product data is available, while JSON output and the console summary keep crawl stats, status codes, and keyword insights organized for automation. [cmd/main.go:269-405][cmd/main.go:500-580]
- Enable `--news-mode` (or provide `--news-pattern`) to focus on article/news pages. News mode prioritizes news-like sitemaps, supports custom sitemap keyword hints via `--news-sitemaps`, and extracts headline, author, publish date, and tags using DOM and meta tag parsing (no JSON-LD path).
- Save indented JSON or illustrated Markdown reports, and rely on Spiderly's summary box, verbose logging, and ANSI-safe mode to understand throughput, HTTP statuses, and any errors that arise during the run. [cmd/main.go:269-365][cmd/main.go:539-670]
- Go 1.25 or later
```bash
# Clone the repository
git clone https://github.com/your-org/spiderly.git
cd spiderly
# Download dependencies
go mod download
# Build the binary
go build -o spiderly cmd/main.go
```

Alternatively, install with `go install`:

```bash
go install spiderly/cmd@latest
```

Usage:

```bash
spiderly -url <target> [options]
```

| Flag | Default | Description |
|---|---|---|
| `-url` | (required) | Target URL to crawl |
| `-pages` | `100` | Maximum number of pages to scrape |
| `-depth` | `10` | Maximum crawl depth |
| `-concurrency` | `5` | Concurrent requests per worker |
| `-timeout` | `30s` | Request timeout (Go duration, e.g. `10s`, `1m`) |
| `-delay` | `200ms` | Delay between requests (Go duration) |
| `-chunked` | `false` | Enable parallel chunked processing |
| `-chunk-size` | `50` | Number of URLs per chunk |
| `--max-workers` | `4` | Number of parallel chunk workers (new crawl subcommand flag) |
| `--proxy` | (none) | Proxy URL(s), supports repeated or comma-separated values |
| `-sitemap` | (none) | Direct sitemap URL (skip auto-discovery) |
| `-min-priority` | `0` | Minimum sitemap priority filter (0.0 - 1.0) |
| `-url-pattern` | (none) | Regex to filter sitemap URLs |
| `-product-mode` | `false` | Enable product-only crawl mode (auto-enables chunked) |
| `-product-pattern` | (none) | Custom regex for product URLs |
| `--news-mode` | `false` | Enable news/article extraction mode |
| `--news-pattern` | (none) | Custom regex for news/article URLs |
| `--news-sitemaps` | (none) | Preferred sitemap keywords for news mode (e.g. `news,press`) |
| `-output` | (none) | File path for JSON output |
| `-markdown` | (none) | File path for Markdown report output |
| `-recursive` | `false` | Force recursive crawl (skip sitemap discovery) |
| `-verbose` | `false` | Enable verbose / debug logging |
| `-no-color` | `false` | Disable colored terminal output |
**Basic crawl with defaults:**
```bash
spiderly -url https://example.com
```

**High-throughput parallel crawl:**
```bash
spiderly crawl https://example.com --max-pages 500 --chunked --chunk-size 100 --max-workers 8
```

**Crawl through a single proxy:**
```bash
spiderly crawl https://example.com --proxy http://127.0.0.1:8080
```

**Rotate proxies across chunk workers:**
```bash
SPIDERLY_PROXY=http://p1:8080,http://p2:8080 spiderly crawl https://example.com --chunked --max-workers 4
```

**News crawl with tag parsing:**
```bash
spiderly crawl https://news.example.com --news-mode --news-pattern "/news|article/" --news-sitemaps news,press,headline
```

**Export results to JSON and Markdown:**
```bash
spiderly -url https://example.com -output results.json -markdown report.md -verbose
```

**Recursive crawl with custom depth and timeout:**
```bash
spiderly -url https://example.com -recursive -depth 5 -timeout 15s -concurrency 10
```

```
.
├── cmd/
│   └── main.go            # CLI entry point: flag parsing, orchestration, output saving
├── internal/
│   ├── chunker/
│   │   └── chunker.go     # Parallel chunk processing engine (worker pool)
│   ├── core/
│   │   └──                # Core configuration, main run loop, console UI
│   ├── crawler/           # Colly-based web crawling engine
│   ├── models/            # Shared data structures (ScrapedPage, CrawlStats, etc.)
│   ├── scraper/           # Extraction layer binding crawler to data pipeline
│   ├── sitemap/           # Sitemap discovery and XML parsing
│   └── ui/                # Terminal UI components
├── go.mod                 # Go module definition (module spiderly, go 1.25.7)
├── go.sum                 # Dependency checksums
└── README.md              # This file
```
| Path | Description |
|---|---|
| `cmd/main.go` | Application entry point (507 lines). Parses all CLI flags, initialises the Core, triggers the crawl, and writes JSON / Markdown output files. |
| `internal/chunker/` | Parallel chunk processor (534 lines). Splits a URL list into fixed-size chunks and dispatches them to a configurable pool of workers for concurrent scraping. |
| `internal/core/` | Core orchestrator (944 lines). Holds the `CoreConfig`, manages the crawl lifecycle, coordinates the crawler and chunker, and renders the rich colorized console summary. |
| `internal/crawler/` | Colly-powered crawling engine. Handles HTTP requests, link extraction, and robots.txt compliance. |
| `internal/models/` | Shared data models: `ScrapedPage`, `CrawlStats`, sitemap entries, and WebSocket message types. |
| `internal/sitemap/` | Sitemap discovery: fetches and parses XML sitemaps to seed the URL queue. |
| `internal/scraper/` | Extraction glue layer connecting the crawler output to the data pipeline and optional dashboard. |
| `internal/ui/` | Terminal and web dashboard UI components for real-time monitoring. |
| `go.mod` / `go.sum` | Go module files tracking dependencies (Colly, GoQuery, ChromeDP, Lipgloss, and more). |
When `-product-mode` is enabled (or a custom `-product-pattern` is provided), Spiderly auto-enables chunked processing, increases the default page, chunk, and worker counts, and uses a regex to identify product pages. You can also supply a direct sitemap (`-sitemap`), filter by minimum priority (`-min-priority`), apply a URL filter (`-url-pattern`), or force recursive link-following with `-recursive`. [cmd/main.go:44-51][cmd/main.go:68-96]
```bash
# Crawl products from sitemap with custom pattern and save JSON
spiderly -url https://example.com -product-mode -product-pattern "/product/" -output products.json
```

When `--news-mode` is enabled (or a custom `--news-pattern` is provided), Spiderly prioritizes news-like sitemap files and parses article metadata from HTML/meta tags and page tag links. This mode does not rely on JSON-LD extraction.
```bash
spiderly crawl https://news.example.com --news-mode --news-pattern "/news|article/" --news-sitemaps news,press
```

Split large crawls into batches of `--chunk-size` URLs processed concurrently across `--max-workers`. Each chunk worker still respects `--concurrency`, `--delay`, and `--timeout` to maintain throughput and politeness.
```bash
# Process 1000 pages in chunks of 200 across 5 workers
spiderly crawl https://example.com --max-pages 1000 --chunked --chunk-size 200 --max-workers 5
```

Use `--proxy` for explicit proxy URLs, or `SPIDERLY_PROXY` for environment-based configuration.
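The sticky assignment mentioned in the feature list (worker index modulo proxy list) can be illustrated with a short sketch. `parseProxyList` and `proxyForWorker` are hypothetical helpers written for this example; how Spiderly combines `--proxy` with `SPIDERLY_PROXY` when both are set is not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProxyList splits a comma-separated value, such as the one shown for
// SPIDERLY_PROXY, into trimmed, non-empty proxy URLs.
func parseProxyList(raw string) []string {
	var proxies []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			proxies = append(proxies, p)
		}
	}
	return proxies
}

// proxyForWorker returns the sticky proxy for a non-negative worker index:
// the index modulo the number of proxies. It returns "" when none are set.
func proxyForWorker(worker int, proxies []string) string {
	if len(proxies) == 0 {
		return ""
	}
	return proxies[worker%len(proxies)]
}

func main() {
	proxies := parseProxyList("http://p1:8080, http://p2:8080")
	for w := 0; w < 4; w++ {
		fmt.Printf("worker %d -> %s\n", w, proxyForWorker(w, proxies))
	}
}
```

With two proxies and four workers, workers 0 and 2 share the first proxy and workers 1 and 3 share the second, so each worker keeps the same exit address for the whole run.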
```bash
# Single proxy
spiderly crawl https://example.com --proxy http://127.0.0.1:8080
# Multiple proxies (comma-separated)
SPIDERLY_PROXY=http://p1:8080,http://p2:8080 spiderly crawl https://example.com --chunked --max-workers 4
```

Use `-output` for JSON export or `-markdown` for a rich Markdown report. Both can be combined. [cmd/main.go:269-365]
| Format | Flag | Description |
|---|---|---|
| JSON | `-output results.json` | Pretty-printed JSON array of scraped pages. |
| Markdown | `-markdown report.md` | Human-readable Markdown report with stats, tables, and per-page details. |
`-verbose` prints detailed per-request logs. `-no-color` removes ANSI colors for compatibility with plain-text output. [cmd/main.go:31-33]
Contributions are welcome! Here's how to get started:
- Fork the repository and create a new branch from `main`.
- Make your changes; please keep commits focused and well-described.
- Run tests and ensure the build passes: `go build ./...` and `go vet ./...`
- Open a Pull Request with a clear description of what you changed and why.
- Follow standard Go conventions and formatting (`gofmt` / `goimports`).
- Keep public API changes backward-compatible where possible.
- Add or update documentation for any new flags or features.
- Be respectful and constructive in reviews and discussions.
This project is licensed under the MIT License. See the LICENSE file for details.
Built with ❤️ in Go. Happy crawling!