Skip to content

v1.1.1 — Chunked Crawling: Resume Huge Crawls Across Sessions

Choose a tag to compare

@adityaarsharma adityaarsharma released this 28 May 05:34
· 41 commits to main since this release

New: librecrawl_resume_from_crawl_id(crawl_id) — the crawl a 50,000-page site politely, across multiple days, from any session feature.

The flow

Day 1:  Audit https://huge-site.com (max 5000 pages)
        → runs, returns crawl_id 42
        → pause anytime with librecrawl_pause_crawl()

Day 2:  list_crawls() → see crawl_id 42 with status "paused" at 1,847 pages
        librecrawl_resume_from_crawl_id(42)
        → picks up at page 1,848 — every previously-crawled URL is re-used
          from SQLite, ZERO duplicate HTTP requests to the target

Day 3:  Same thing until done → generate_report(42) for the full audit

State persists across PM2 restarts, server reboots, and entirely different MCP client sessions. The crawl is keyed by crawl_id, not by the MCP session.

How it stays polite

  • Token-bucket rate limiter — smooth distribution, no bursting. Default 2 req/sec.
  • Respects Crawl-Delay directive in robots.txt
  • Single connection, sequential by default (configurable up to 5 workers)
  • Identifiable User-Agent (LibreCrawl/1.0 (Web Crawler))
  • Idempotent resume — URLs already in the DB are never re-fetched

Implementation note

Tries two resume paths transparently:

  1. In-session resume (/api/resume_crawl) — when the upstream crawler instance is still alive with is_paused=True. Just flips the flag.
  2. Cross-session DB resume (/api/crawls/<id>/resume) — when the crawler has been GCed. Rebuilds from the SQLite snapshot.

The tool returns {\"mode\": \"in_session_resume\" | \"db_resume\"} so you can see which path fired.

Also in v1.1.1

  • Better auto-recovery_ensure_crawler_ready() now uses crawl speed (instead of status string alone) to detect paused/zombie crawlers. Upstream LibreCrawl reports paused crawls as status=\"running\", which the old detection missed.
  • Tool count: 20 (was 19) — librecrawl_resume_from_crawl_id added.

Upgrade

```bash
cd ~/librecrawl-mcp && git pull && pm2 restart librecrawl-mcp
```

Or rerun the installer — it'll update in place.


Verified live on nexterwp.com: paused at 19 pages from session A, resumed from session B (fresh MCP session), continued cleanly to 35 pages with no duplicate requests.