v2.1.1 — Full word-by-word audit of every page, heavy-site-safe

@adityaarsharma adityaarsharma released this 15 Jun 04:14

The reliability release. librecrawl-mcp now does a complete, word-by-word audit of every page on any site — however large or heavy — without dropping pages, overloading the origin, or crashing. Proven on real 700–1,900 page production sites.

What's in this release

✅ Word-by-word audit of every page

Content analysis (readability, AI-tells, boilerplate, punctuation) and extended SEO checks (hreflang, schema.org, security headers, image performance) run across the crawl rather than a 50-page sample, up to the deep-check cap (500 pages by default, tunable up).

✅ All pages — full coverage by default

sitemap_fill_cap defaults to 0, which means the entire sitemap. No caps to remember, no flags to pass: a plain librecrawl_start_chunked_audit(url=...) crawls the whole site, covering both internally linked pages and sitemap-only orphans.

✅ Broken-link detection across every domain

Every outbound link — yours, third-party, social, CDN — is HTTP-validated and classified into 17 status classes. Dedup-then-validate (same as Screaming Frog): thousands of link appearances → unique URLs → each checked once.
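
The dedup-then-validate idea can be sketched in a few lines of Python. This is a minimal illustration, not librecrawl-mcp's actual code: `validate_links`, `fake_check`, and the two-value status are hypothetical stand-ins (the real tool performs HTTP checks and uses 17 status classes).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def validate_links(
    appearances: Iterable[Tuple[str, str]],
    check: Callable[[str], str],
) -> Dict[str, dict]:
    """Collapse (page, link) appearances to unique URLs, then check each once."""
    pages_by_url: Dict[str, List[str]] = defaultdict(list)
    for page, link in appearances:
        pages_by_url[link].append(page)

    # One check per unique URL, however many pages link to it.
    return {
        url: {"status": check(url), "found_on": pages}
        for url, pages in pages_by_url.items()
    }


# Hypothetical stand-in for a real HTTP check; records how often it is called.
calls: List[str] = []


def fake_check(url: str) -> str:
    calls.append(url)
    return "broken" if url.endswith("/gone") else "ok"


appearances = [
    ("/a", "https://example.com/x"),
    ("/b", "https://example.com/x"),
    ("/b", "https://example.com/gone"),
]
report = validate_links(appearances, fake_check)
```

Three link appearances collapse to two unique URLs, so the checker runs twice, and each result still records every page the link was found on.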

✅ No laziness — nothing silently dropped

Per-page core checks + external-link validation cover 100% of crawled pages. Deep content/extended checks cap at 500 pages for memory safety (tunable up), and the report says exactly what was covered.
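
As a rough sketch of how a capped deep pass can still report its coverage honestly, the snippet below runs core checks on every page and deep checks on at most `deep_cap` pages. The function name, the report keys, and the choice of "first N pages in crawl order" are all assumptions for illustration; the release does not say how the capped pages are selected.

```python
from typing import Callable, List


def run_checks(
    pages: List[str],
    core: Callable[[str], None],
    deep: Callable[[str], None],
    deep_cap: int = 500,  # default taken from the release notes
) -> dict:
    """Run core checks on all pages and deep checks on a capped subset."""
    for page in pages:
        core(page)

    # Assumption: the capped subset is simply the first deep_cap pages.
    deep_pages = pages[:deep_cap]
    for page in deep_pages:
        deep(page)

    # The coverage report states exactly what was and was not covered.
    return {
        "pages_crawled": len(pages),
        "core_checked": len(pages),
        "deep_checked": len(deep_pages),
        "deep_cap": deep_cap,
        "deep_truncated": len(deep_pages) < len(pages),
    }


pages = [f"/page-{i}" for i in range(1942)]
coverage = run_checks(pages, core=lambda p: None, deep=lambda p: None)
```

Keeping `deep_truncated` in the report means a reader never has to guess whether the deep checks saw the whole site.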

✅ Heavy / large sites crawlable

4–5 MB pages (Elementor / heavy WordPress) now fetch successfully: the 25s per-page timeout gives heavy pages time to finish instead of timing out and being recorded as status 0.

✅ Screaming-Frog-grade politeness — never overloads an origin

4 concurrent workers + 500ms jittered delay + 25s timeout. Heavy pages get more time, not more parallelism. Validated: full audits of 1,900-page sites with the origin staying healthy throughout.
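
The politeness settings above map naturally onto a bounded-concurrency loop. The following is a sketch using Python's asyncio, not the project's implementation: `polite_fetch_all` and `fake_fetch` are hypothetical, and the jitter range (0.5x to 1.5x of the base delay) is an assumption, since the release only says the 500ms delay is jittered.

```python
import asyncio
import random

MAX_WORKERS = 4      # concurrent workers, per the release notes
BASE_DELAY_S = 0.5   # 500ms base delay
TIMEOUT_S = 25.0     # per-page timeout


async def polite_fetch_all(urls, fetch, workers=MAX_WORKERS,
                           base_delay=BASE_DELAY_S, timeout=TIMEOUT_S):
    """Fetch urls with capped concurrency, jittered delay, and a timeout."""
    sem = asyncio.Semaphore(workers)
    results = {}

    async def one(url):
        async with sem:  # never more than `workers` requests in flight
            # Assumed jitter: sleep 0.5x to 1.5x of the base delay.
            await asyncio.sleep(base_delay * random.uniform(0.5, 1.5))
            try:
                results[url] = await asyncio.wait_for(fetch(url), timeout)
            except asyncio.TimeoutError:
                results[url] = 0  # a timed-out page is recorded as status 0

    await asyncio.gather(*(one(u) for u in urls))
    return results


# Hypothetical fetcher that tracks peak concurrency instead of using the network.
state = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.01)
    state["in_flight"] -= 1
    return 200


urls = [f"https://example.com/{i}" for i in range(12)]
results = asyncio.run(polite_fetch_all(urls, fake_fetch, base_delay=0.01))
```

Because the timeout is per page and the semaphore is global, a slow page holds its own slot longer but never causes extra requests to be opened against the origin.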

✅ Re-scan anytime — zero history on the server (ephemeral)

After the client downloads the zip, the server wipes the session, every artifact file, and the upstream crawl record. 0 bytes, 0 rows per-audit footprint. Re-scan as often as you like; nothing persists.

✅ OOM-safe (v2.1.1 fix)

v2.1.0's "check every page" exhausted memory on huge heavy sites and looped. Deep-checks now cap at 500 pages by default — full audits of 1,900+ page sites complete cleanly.

The fix arc (v2.0.5 → v2.1.1)

  • v2.0.5 — hreflang false positives eliminated
  • v2.0.7 — finalize works under force_advance (8 files always; event-loop fix)
  • v2.0.8 — heavy 4–5 MB pages actually fetch
  • v2.0.9 — Screaming-Frog politeness; never overload the origin
  • v2.1.0 — full audit by default (every page, every text, every link)
  • v2.1.1 — OOM-safe on very large heavy sites

Verified on real sites

| Site | Pages | Result |
|---|---|---|
| theplusaddons.com | 1,942 | all 8 files, full coverage, origin safe |
| nexterwp.com | 709 | 100% 200-OK, every external domain validated |

Output

Single zip, 8 files: branded PDF + Markdown + per-page CSV + extended-checks CSV + content-audit CSV + external-links CSV + sitemap-recon CSV + SUMMARY.

Roadmap

  • Concurrent multi-site audits (3+): the 3-backend infrastructure is provisioned; pool routing is the next release (currently one audit at a time).
  • JavaScript rendering for SPA sites.

MIT · self-hosted · ephemeral · built on LibreCrawl.