v1.1.1 — Chunked Crawling: Resume Huge Crawls Across Sessions
New: librecrawl_resume_from_crawl_id(crawl_id) — crawl a 50,000-page site politely, across multiple days, from any session.
The flow
Day 1: Audit https://huge-site.com (max 5000 pages)
→ runs, returns crawl_id 42
→ pause anytime with librecrawl_pause_crawl()
Day 2: list_crawls() → see crawl_id 42 with status "paused" at 1,847 pages
librecrawl_resume_from_crawl_id(42)
→ picks up at page 1,848 — every previously-crawled URL is re-used
from SQLite, ZERO duplicate HTTP requests to the target
Day 3: Same thing until done → generate_report(42) for the full audit
State persists across PM2 restarts, server reboots, and entirely different MCP client sessions. The crawl is keyed by crawl_id, not by the MCP session.
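To illustrate why keying state by crawl_id makes the resume idempotent, here is a minimal sketch. The table name `crawled_urls`, its columns, and the helper functions are hypothetical and are not LibreCrawl's actual schema; the sketch only assumes that crawled URLs are stored in SQLite per crawl.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a store of crawled URLs. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawled_urls ("
        " crawl_id INTEGER NOT NULL,"
        " url TEXT NOT NULL,"
        " PRIMARY KEY (crawl_id, url))"
    )
    return conn


def record_page(conn, crawl_id, url):
    """Mark a URL as crawled; recording the same URL twice is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO crawled_urls (crawl_id, url) VALUES (?, ?)",
        (crawl_id, url),
    )
    conn.commit()


def pending_urls(conn, crawl_id, frontier):
    """Return the frontier URLs that this crawl has not fetched yet."""
    rows = conn.execute(
        "SELECT url FROM crawled_urls WHERE crawl_id = ?", (crawl_id,)
    )
    done = {row[0] for row in rows}
    return [u for u in frontier if u not in done]
```

Because the lookup depends only on crawl_id and the database file, any process or session that opens the same file would see the same set of finished URLs.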
How it stays polite
- Token-bucket rate limiter — smooth distribution, no bursting. Default 2 req/sec.
- Respects Crawl-Delay directive in robots.txt
- Single connection, sequential by default (configurable up to 5 workers)
- Identifiable User-Agent (LibreCrawl/1.0 (Web Crawler))
- Idempotent resume — URLs already in the DB are never re-fetched
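The rate-limiting idea in the list above can be sketched as a token bucket with a capacity of one token, which is one way to get smooth spacing with no bursts. This is an illustration under assumptions, not the project's implementation: the class `TokenBucket`, the helper `effective_rate`, and the rule of taking the slower of the default rate and the Crawl-Delay are all hypothetical.

```python
import time


def effective_rate(default_rate, crawl_delay=None):
    """Pick the slower of the default rate and 1 / Crawl-Delay (assumed policy)."""
    if crawl_delay and crawl_delay > 0:
        return min(default_rate, 1.0 / crawl_delay)
    return default_rate


class TokenBucket:
    """Token bucket; capacity=1 means requests are evenly spaced, never bursty."""

    def __init__(self, rate=2.0, capacity=1.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens that can accumulate
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def wait_time(self):
        """Seconds until one token is available (0.0 if one is ready now)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate

    def acquire(self, sleep=time.sleep):
        """Block until a token is available, then consume it."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
            self._refill()
        self.tokens -= 1.0
```

A crawler loop would call `bucket.acquire()` before each HTTP request. At the default of 2 req/sec, consecutive requests end up roughly 0.5 seconds apart.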
Implementation note
The tool tries two resume paths transparently:
- In-session resume (/api/resume_crawl) — when the upstream crawler instance is still alive with is_paused=True. Just flips the flag.
- Cross-session DB resume (/api/crawls/<id>/resume) — when the crawler instance has been garbage-collected. Rebuilds from the SQLite snapshot.
The tool returns {"mode": "in_session_resume" | "db_resume"} so you can see which path fired.
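The two-path fallback might look roughly like the sketch below. The endpoint paths and the "mode" values come from the notes above; everything else is an assumption, including the function name `resume_crawl`, the injected `post` callable (a stand-in for whatever HTTP client is used), and the order in which the paths are attempted.

```python
def resume_crawl(crawl_id, post):
    """Try in-session resume first, then fall back to DB resume.

    `post(path)` is a hypothetical transport that returns (ok, payload),
    where ok is True when the upstream API accepted the request.
    """
    # Path 1: the crawler instance is still alive and merely paused.
    ok, _ = post("/api/resume_crawl")
    if ok:
        return {"mode": "in_session_resume"}

    # Path 2: the instance is gone; rebuild from the stored snapshot.
    ok, _ = post(f"/api/crawls/{crawl_id}/resume")
    if ok:
        return {"mode": "db_resume"}

    raise RuntimeError(f"could not resume crawl {crawl_id}")
```

Passing the transport in as a parameter keeps the sketch free of network code, so the fallback order can be exercised with a fake `post`.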
Also in v1.1.1
- Better auto-recovery — _ensure_crawler_ready() now uses crawl speed (instead of status string alone) to detect paused/zombie crawlers. Upstream LibreCrawl reports paused crawls as status="running", which the old detection missed.
- Tool count: 20 (was 19) — librecrawl_resume_from_crawl_id added.
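As a rough illustration of the speed-based detection in the auto-recovery item, the sketch below treats a crawl as stalled when it reports "running" but recent speed samples show no progress. The function name `looks_stalled`, the sampling approach, and the `min_speed` threshold are hypothetical; the actual logic inside _ensure_crawler_ready() is not shown in these notes.

```python
def looks_stalled(status, pages_per_sec_samples, min_speed=0.01):
    """Return True when a crawl appears paused or zombie.

    status: status string reported by the crawler.
    pages_per_sec_samples: recent crawl-speed readings (assumed available).
    min_speed: threshold at or below which a reading counts as no progress.
    """
    if status == "paused":
        return True
    # A status other than "running", or no samples yet, gives no evidence of a stall.
    if status != "running" or not pages_per_sec_samples:
        return False
    # "running" with every recent sample at or below the threshold is suspicious.
    return all(speed <= min_speed for speed in pages_per_sec_samples)
```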
Upgrade
```bash
cd ~/librecrawl-mcp && git pull && pm2 restart librecrawl-mcp
```
Or rerun the installer — it'll update in place.
Verified live on nexterwp.com: paused at 19 pages from session A, resumed from session B (fresh MCP session), continued cleanly to 35 pages with no duplicate requests.