Releases: adityaarsharma/librecrawl-technical-seo-audit-mcp

v2.1.1 — Full word-by-word audit of every page, heavy-site-safe

15 Jun 04:14

The reliability release. librecrawl-mcp now does a complete, word-by-word audit of every page on any site — however large or heavy — without dropping pages, overloading the origin, or crashing. Proven on real 700–1,900 page production sites.

What's in this release

✅ Word-by-word audit of every page

Content analysis (readability, AI-tells, boilerplate, punctuation) + extended SEO checks (hreflang, schema.org, security headers, image performance) run across the whole crawl (deep checks up to the 500-page default cap), not a 50-page sample.

✅ All pages — full coverage by default

sitemap_fill_cap defaults to 0 = entire sitemap. No caps to remember, no flags to pass. A plain librecrawl_start_chunked_audit(url=...) crawls the whole site, covering both internally linked pages and sitemap-only orphans.

✅ Broken-link detection across every domain

Every outbound link — yours, third-party, social, CDN — is HTTP-validated and classified into 17 status classes. Dedup-then-validate (same as Screaming Frog): thousands of link appearances → unique URLs → each checked once.
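The dedup-then-validate idea can be sketched as follows. This is a minimal illustration, not the project's actual code: `dedup_then_validate` and `fake_check` are hypothetical names, and the checker is a stub so the example runs without network access.

```python
from collections import defaultdict
from typing import Callable, Iterable


def dedup_then_validate(
    appearances: Iterable[tuple[str, str]],
    check: Callable[[str], int],
) -> dict[str, dict]:
    """Group (source_page, target_url) pairs by target, then check each target once."""
    sources_by_url: dict[str, list[str]] = defaultdict(list)
    for source_page, target_url in appearances:
        sources_by_url[target_url].append(source_page)

    results = {}
    for url, sources in sources_by_url.items():
        status = check(url)  # exactly one validation per unique URL
        results[url] = {"status": status, "sources": sources}
    return results


# Stub checker (assumption): records calls instead of doing real HTTP.
calls = []


def fake_check(url: str) -> int:
    calls.append(url)
    return 404 if "gone" in url else 200


appearances = [
    ("/a", "https://example.com/x"),
    ("/b", "https://example.com/x"),
    ("/b", "https://example.com/gone"),
]
report = dedup_then_validate(appearances, fake_check)
```

Three link appearances collapse to two checks, while every source page is still attached to the result for reporting.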

✅ No laziness — nothing silently dropped

Per-page core checks + external-link validation cover 100% of crawled pages. Deep content/extended checks cap at 500 pages for memory safety (tunable up), and the report says exactly what was covered.

✅ Heavy / large sites crawlable

4–5 MB pages (Elementor / heavy-WP) now fetch successfully — 25s per-page timeout gives heavy pages TIME instead of timing out to status 0.

✅ Screaming-Frog-grade politeness — never overloads an origin

4 concurrent workers + 500ms jittered delay + 25s timeout. Heavy pages get more time, not more parallelism. Validated: full audits of 1,900-page sites with the origin staying healthy throughout.
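A rough sketch of how those three numbers could fit together with asyncio. The worker count, delay, and timeout come from the notes above; the function name, the +/-50% jitter spread, and the injected `fetch` coroutine are assumptions for illustration only.

```python
import asyncio
import random

MAX_WORKERS = 4      # concurrent workers (from the release notes)
BASE_DELAY_S = 0.5   # 500 ms delay (from the release notes)
TIMEOUT_S = 25.0     # per-page timeout (from the release notes)


async def polite_fetch_all(urls, fetch, *, workers=MAX_WORKERS,
                           base_delay=BASE_DELAY_S, timeout=TIMEOUT_S):
    """Fetch every URL with bounded concurrency, jittered delay, and a timeout."""
    sem = asyncio.Semaphore(workers)
    results = {}

    async def one(url):
        async with sem:
            # Jitter spread of +/-50% is an assumption, not a documented value.
            await asyncio.sleep(base_delay * random.uniform(0.5, 1.5))
            try:
                results[url] = await asyncio.wait_for(fetch(url), timeout)
            except asyncio.TimeoutError:
                results[url] = 0  # status 0 marks a timed-out page

    await asyncio.gather(*(one(u) for u in urls))
    return results


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(url):  # stand-in for a real HTTP request
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return 200

    urls = [f"https://example.com/{i}" for i in range(12)]
    results = await polite_fetch_all(urls, fake_fetch, base_delay=0.001)
    return results, peak


demo_results, demo_peak = asyncio.run(_demo())
```

Raising the timeout lengthens how long a slow page may take, while the semaphore keeps the number of simultaneous requests fixed, which is the "more time, not more parallelism" trade-off.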

✅ Re-scan anytime — zero history on the server (ephemeral)

After the client downloads the zip, the server wipes the session, every artifact file, and the upstream crawl record. 0 bytes, 0 rows per-audit footprint. Re-scan as often as you like; nothing persists.

✅ OOM-safe (v2.1.1 fix)

v2.1.0's "check every page" exhausted memory on huge heavy sites and looped. Deep-checks now cap at 500 pages by default — full audits of 1,900+ page sites complete cleanly.

The fix arc (v2.0.5 → v2.1.1)

  • v2.0.5 — hreflang false positives eliminated
  • v2.0.7 — finalize works under force_advance (8 files always; event-loop fix)
  • v2.0.8 — heavy 4–5 MB pages actually fetch
  • v2.0.9 — Screaming-Frog politeness; never overload the origin
  • v2.1.0 — full audit by default (every page, every text, every link)
  • v2.1.1 — OOM-safe on very large heavy sites

Verified on real sites

| Site | Pages | Result |
| --- | --- | --- |
| theplusaddons.com | 1,942 | all 8 files, full coverage, origin safe |
| nexterwp.com | 709 | 100% 200-OK, every external domain validated |

Output

Single zip, 8 files: branded PDF + Markdown + per-page CSV + extended-checks CSV + content-audit CSV + external-links CSV + sitemap-recon CSV + SUMMARY.

Roadmap

  • Concurrent multi-site audits (3+): the 3-backend infrastructure is provisioned; pool routing is the next release (currently one audit at a time).
  • JavaScript rendering for SPA sites.

MIT · self-hosted · ephemeral · built on LibreCrawl.

v2.0.6 — event-loop fix (content_audit + extended_checks)

10 Jun 05:43

Hotfix: content_audit and extended_checks modules silently failed during the audit finalize chain on multi-page crawls with "event-loop error", leaving the audit zip without content-audit.csv and extended-checks.csv. Root cause was racy reuse of asyncio.new_event_loop() across sequential module calls in Python 3.12. Replaced with idiomatic asyncio.run() in all 3 affected modules (content_audit.py, extended_checks.py, external_links.py).
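A simplified sketch of the pattern the fix describes: each sequential module call gets its own event loop through asyncio.run(). The coroutine bodies here are placeholders that only borrow the module names; the real modules are not shown in these notes.

```python
import asyncio


async def content_audit(pages):
    await asyncio.sleep(0)  # placeholder for real async I/O
    return {"module": "content_audit", "pages": len(pages)}


async def extended_checks(pages):
    await asyncio.sleep(0)  # placeholder for real async I/O
    return {"module": "extended_checks", "pages": len(pages)}


def run_finalize_chain(pages):
    """Run the async modules one after another from synchronous code."""
    results = []
    for module in (content_audit, extended_checks):
        # asyncio.run() creates a fresh event loop for this call and closes it
        # afterwards, so no loop object is shared between modules.
        results.append(asyncio.run(module(pages)))
    return results


chain_results = run_finalize_chain(["/", "/about", "/blog"])
```

Note that asyncio.run() cannot be called from code that is already running inside an event loop, so this shape only suits a synchronous finalize chain.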

Verified on a real 90-page adityaarsharma.com audit: 185 extended_checks findings + 50 pages content-audited, both CSVs in the zip with real content.

No API change. No version bump needed in server.json/Glama listings (still 2.0.5 there until v2.1 ships).

v2.0.5 — hreflang false positives fixed (logic-lens scan)

05 Jun 04:35

Two false-positive sources in the hreflang audit caught by an end-to-end logic-lens scan on cloudflare.com.

Bug 1 — VALID_HREFLANG_RE rejected lowercase region codes

BCP 47 recommends uppercase region codes only as a convention (tags are case-insensitive), Google accepts both cases, and major sites (Cloudflare, etc.) use lowercase: de-de, fr-fr, zh-cn, pt-br. 262 false positives on cloudflare.com alone. Fix: re.IGNORECASE + accept both cases. en-USA and en_US still correctly rejected as invalid.
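For illustration, a case-insensitive pattern with the behaviour described above might look like this. The pattern is an assumption, not the project's actual VALID_HREFLANG_RE; it checks only the shape of the code (language, optional script, optional two-letter region), not whether the subtags are registered.

```python
import re

# Hypothetical pattern: 2-3 letter language, optional 4-letter script,
# optional 2-letter region, hyphen-separated, any letter case.
HREFLANG_RE = re.compile(
    r"[a-z]{2,3}(?:-[a-z]{4})?(?:-[a-z]{2})?",
    re.IGNORECASE,
)


def is_valid_hreflang(code: str) -> bool:
    """Return True for x-default or a well-formed language[-script][-region] code."""
    return code.lower() == "x-default" or HREFLANG_RE.fullmatch(code) is not None


accepted = [c for c in ("de-de", "fr-FR", "zh-cn", "pt-br", "en", "x-default")
            if is_valid_hreflang(c)]
rejected = [c for c in ("en-USA", "en_US") if not is_valid_hreflang(c)]
```

Using fullmatch rather than match avoids accepting a code with trailing characters such as a newline.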

Bug 2 — hreflang_conflicts_lang_attr matched x-default as self-entry

On home pages, the canonical-lang entry AND x-default often point at the same URL. The next() iterator was sometimes picking x-default as self_entry, then "x-default".startswith("en") = False triggered a bogus conflict. 261 false positives on cloudflare.com alone. Fix: iterate self-matches, skip x-default, pick first real-language entry. Skip the conflict check when only x-default matches.
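The corrected selection logic can be sketched roughly as below. The function names and the (code, href) tuple shape are hypothetical, and comparing against the primary subtag of the HTML lang attribute is one plausible reading of the check.

```python
def pick_self_entry(entries, page_url):
    """Return the first real-language hreflang code pointing at page_url, or None.

    entries is a list of (hreflang_code, href) tuples; x-default is skipped.
    """
    for code, href in entries:
        if href == page_url and code.lower() != "x-default":
            return code
    return None


def conflicts_with_lang_attr(entries, page_url, html_lang):
    """Flag a conflict only when a real-language self-entry disagrees with <html lang>."""
    self_code = pick_self_entry(entries, page_url)
    if self_code is None:
        return False  # only x-default (or nothing) matched: skip the check
    primary = html_lang.lower().split("-")[0]
    return not self_code.lower().startswith(primary)


home = "https://example.com/"
entries = [
    ("x-default", home),  # listed first, as on many home pages
    ("en", home),
    ("de", "https://example.com/de/"),
]
no_conflict = conflicts_with_lang_attr(entries, home, "en")
only_default = conflicts_with_lang_attr([("x-default", home)], home, "en")
real_conflict = conflicts_with_lang_attr([("de", home)], home, "en")
```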

Result

cloudflare.com 534-page audit:

  • v2.0.4: 936 findings, 23 check classes (hreflang_invalid_codes: 262, hreflang_conflicts_lang_attr: 261)
  • v2.0.5: ~400 findings, 21 check classes (hreflang_invalid_codes: 0, hreflang_conflicts_lang_attr: 0)

Remaining ~400 findings are all genuine: real sitemap-noindex entries, real JS redirects, real performance issues. The zero-false-positives target is met for hreflang.

v2.0.4 — install.sh fixed: one-line installer now produces a working server

04 Jun 12:08

Critical installer fix. Pre-v2.0.4, the one-line installer (curl -fsSL https://raw.githubusercontent.com/adityaarsharma/librecrawl-mcp/main/install.sh | bash) only downloaded server.py. But v1.4.0+ added 9 more required Python modules. Any fresh install crashed on first import.

What's fixed

  • ✅ Downloads all 10 Python modules in a loop (server.py + state.py + libreclient.py + runner.py + external_links.py + content_audit.py + extended_checks.py + schema_validator.py + sitemap_fill.py + pdf_report.py)
  • ✅ pip-installs weasyprint + markdown alongside the existing mcp + httpx + uvicorn
  • ✅ apt-installs libpango / libcairo / libharfbuzz non-interactively for WeasyPrint to render the PDF (skipped gracefully on non-Debian or no-sudo)
  • ✅ Drops the Claude Code skill into ~/.claude/skills/librecrawl-audit/ as a developer-experience convenience (correctness is covered by v2.0.3's server-side instructions either way)
  • ✅ Post-install tool listing updated to reflect the actual 37 tools in v2.0.3

Net result

curl -fsSL https://raw.githubusercontent.com/adityaarsharma/librecrawl-mcp/main/install.sh | bash

…now produces a fully-working v2.0.3 install with all modules + deps + skill. Anyone running this gets the same server-side mandatory rules, the same 37 tools, the same ephemeral-mode behaviour, the same PDF + 7-CSV bundle output.

v2.0.3 — Mandatory rules baked into the MCP server itself

04 Jun 11:57

What changed

When operators connect to the librecrawl-mcp server without the local Claude Code skill installed (e.g. through a fresh Cursor/Codex/Windsurf install or directly via Claude Desktop's MCP config), their LLM previously had no visibility into the "always save zip locally + auto_cleanup=True + never report zip_path" rules. That mandate lived only in .claude/skills/librecrawl-audit/SKILL.md.

v2.0.3 fixes this by passing a 2,786-character instructions payload to FastMCP at server-init time. Every MCP-compatible client receives it during the initialize handshake and passes it to the LLM's system context.

Instructions payload covers

  • 3-step audit workflow (start_chunked_audit → poll status → audit_zip)
  • 3 mandatory rules for handling the zip response:
    1. Save locally: base64-decode content_base64 to a local file using the filename field
    2. Never report zip_path: it's the REMOTE path, useless to the operator
    3. auto_cleanup=True is mandatory
  • 8-file zip contents reference
  • Final response shape to mirror back to the user
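On the client side, the first two rules amount to something like the sketch below. The content_base64, filename, and zip_path fields are named in these notes; the helper name and the rest of the response shape are assumptions.

```python
import base64
import tempfile
from pathlib import Path


def save_audit_zip(response: dict, out_dir: str = ".") -> Path:
    """Decode the zip payload and write it to a local file; return the LOCAL path."""
    data = base64.b64decode(response["content_base64"])
    # Keep only the basename so a server-supplied name cannot escape out_dir.
    local_path = Path(out_dir) / Path(response["filename"]).name
    local_path.write_bytes(data)
    return local_path


# Fake response for demonstration (an empty zip archive is 22 bytes).
fake_response = {
    "filename": "audit-example.zip",
    "content_base64": base64.b64encode(b"PK\x05\x06" + b"\x00" * 18).decode(),
    "zip_path": "/srv/remote/audit-example.zip",  # remote path: never report this
}
saved = save_audit_zip(fake_response, tempfile.mkdtemp())
```

The value to show the user is `saved`, never `fake_response["zip_path"]`.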

Plus

librecrawl_audit_zip docstring rewritten to lead with the same 3 rules — visible to any LLM that reads the tool schema even if the server-level instructions are ignored.

Verified

Initialize handshake against brain.posimyth.com/librecrawl/mcp returns the instructions payload (length: 2,786 chars). The local skill at .claude/skills/librecrawl-audit/ remains as a developer-experience fallback but is no longer required for correctness.

v2.0.2 — Broken external links now in the PDF report

04 Jun 11:50

Hotfix for a UX gap caught on theculinarypeace.com: the external-link validator was finding broken outbound URLs correctly and writing them to .external-links.csv — but the operator opening the PDF/MD audit report saw zero mention of them. Buried, not surfaced.

What changed

Summary scorecard adds 4 external-link rows

| External links audited                                   | 142 |
| External links — broken (4xx/5xx/timeout/DNS/SSL)        |  23 | 🔴
| External links — followed redirect (3xx)                 |  18 | ⚠️
| External links — skipped (mailto/tel/javascript)         |   0 |

Dedicated 'Broken External Links' section in the Critical block

Table with target URL · status · source page · anchor text. Capped at first 25 rows in the report; full list in .external-links.csv as before.

Verified

Live smoke test on theculinarypeace.com — 23 broken externals surfaced including swayampaaka.com 500 (the exact failure flagged in the original Google Sheet review), 7× NDTV/Mayo/Indian Express/Scielo 403s, nih.gov 404, tabletwise.com SSL error, malformed Gordon Ramsay URL, 11× Pinterest 429s.

v2.0.1 — Skill rule: always save zip locally

04 Jun 11:43

Hotfix for an agent-side gap caught in production: a fresh audit with auto_cleanup=False was reporting the remote server path as the deliverable instead of saving the base64 zip locally. User couldn't open the file.

Two-part fix

1. .claude/skills/librecrawl-audit/SKILL.md — mandatory rules block

Adds a "⚠️ MANDATORY RULES" header any compliant agent must follow:

  • Rule 1: ALWAYS base64-decode content_base64 and write to a LOCAL file (use the response's filename field). NEVER report zip_path as the deliverable — that path is on the remote server.
  • Rule 2: ALWAYS auto_cleanup=True (default and only sane choice).
  • Rule 3: Final user-facing message reports the LOCAL path, not the remote path.

2. librecrawl_audit_zip response

Tightened to make the contract explicit:

  • New field zip_path_is_remote: true — makes remote-ness unambiguous.
  • New field save_to_local_filename — echoes the local filename for clarity.
  • Rewritten note text stating the agent MUST decode + save locally.

Any future agent reading the response cannot miss the requirement.

v2.0.0 — Feature-complete technical SEO audit MCP server

04 Jun 11:21

librecrawl-mcp v2.0.0 is the feature-complete release for technical SEO auditing. Self-hosted SEO crawler for Claude / Cursor / Codex / Windsurf / Continue.dev. 50+ technical checks, chunked-progressive crawling, ephemeral by default, PDF + 7 CSV sidecars per audit.

Highlights

  • Chunked-progressive crawling that never hits the MCP client timeout — background runner thread, SQLite WAL state, polling API. Enterprise sites with 10,000+ pages work the same as 50-page blogs.
  • WAF / bot-block detection during the crawl — Cloudflare · Akamai · DataDome · Imperva · PerimeterX challenge pages fingerprinted as bot_block_challenge_detected. No other open-source SEO crawler does this.
  • Ephemeral by default — after the client downloads the zip bundle, the server deletes the session row, all artifact files on disk, AND the upstream LibreCrawl crawl record. The local client is the only memory of any audited site.
  • AIMD adaptive crawl-delay — additive-increase / multiplicative-decrease controller tunes delay live from target's p95 latency + 5xx rate. Polite by construction. Honours robots.txt Crawl-Delay floor.
  • Sitemap-orphan fill — URLs in sitemap not reachable via internal-link traversal get a lightweight HTTP fetch and join the inbound-link graph. Closes the LibreCrawl maxDepth coverage gap.
  • PDF + 7 CSV sidecars per audit — branded WeasyPrint PDF, Markdown source, per-page CSV with 30 columns, sitemap-recon CSV, external-links CSV, content-audit CSV, extended-checks CSV.
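As an illustration of the AIMD crawl-delay highlight, here is one possible shape for such a controller. It applies AIMD to the request rate and derives the delay from it, which is only one interpretation; the class name, thresholds, and step sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AimdDelay:
    rate: float = 2.0                 # requests/sec (assumed starting point)
    min_rate: float = 0.1
    max_rate: float = 4.0
    increase: float = 0.1             # additive step when the origin is healthy
    decrease: float = 0.5             # multiplicative factor when it is stressed
    p95_limit_s: float = 2.0          # assumed latency threshold
    error_limit: float = 0.05         # assumed 5xx-rate threshold
    crawl_delay_floor_s: float = 0.0  # robots.txt Crawl-Delay, if any

    @property
    def delay(self) -> float:
        """Seconds to wait between requests, never below the robots.txt floor."""
        return max(1.0 / self.rate, self.crawl_delay_floor_s)

    def update(self, p95_latency_s: float, error_rate_5xx: float) -> float:
        """Adjust the rate from the latest measurements and return the new delay."""
        stressed = (p95_latency_s > self.p95_limit_s
                    or error_rate_5xx > self.error_limit)
        if stressed:
            self.rate = max(self.min_rate, self.rate * self.decrease)
        else:
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.delay


ctl = AimdDelay()
d0 = ctl.delay                    # 0.5 s at 2 req/sec
d1 = ctl.update(3.0, 0.0)         # slow origin: rate halves, delay doubles
d2 = ctl.update(0.5, 0.0)         # healthy again: rate creeps back up
floored = AimdDelay(crawl_delay_floor_s=5.0).delay
```

Backing off quickly and recovering slowly is what makes the controller err on the side of the origin's health.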

Compatible AI agents

Claude Code · Claude Desktop · Cursor · OpenAI Codex CLI · Windsurf · Continue.dev · any MCP-compatible client over stdio or streamable-HTTP transport.

What it checks (50+ technical SEO checks)

Security headers · mixed content · WAF detection · sitemap cross-checks · hreflang full audit · canonical chain depth + relative + → 3xx · redirect chains with destination · meta-refresh · JS-redirect · http-refresh · schema.org validation (16 types, schema.org spec + Google Rich Results) · URL quality · anchor text quality · broken bookmarks · internal nofollow patterns · image performance + CLS · HTML structure pathologies · accessibility / metadata · crawl-budget killers · dev leaks · content quality (Flesch · AI-tell tokens · missing punctuation · boilerplate) · external link validation (17 status classes).

Full check inventory in README.md.

Install

```bash
curl -fsSL https://raw.githubusercontent.com/adityaarsharma/librecrawl-mcp/main/install.sh | bash
```

Release-gate smoke test

Full audit on theculinarypeace.com — 460 pages crawled (200 LibreCrawl + 260 sitemap-fill), 8-file zip bundle (320 KB, sha256 verified), 620 findings across 15 distinct check classes, server returned to zero-memory baseline after client downloaded the bundle. Every check class verified to fire on real production HTML or correctly absent on clean pages.

Full CHANGELOG

See CHANGELOG.md for the v1.2.0 → v2.0.0 arc:
v1.2.0 Screaming-Frog parity · v1.4.0 chunked engine · v1.4.1 external-link validator · v1.5.0 PDF + content audit + extended checks + GSC + schema validation · v1.5.1 audit_complete respects sitemap coverage · v1.6.0 sitemap-orphan fill · v1.6.1 false-positive orphans fix · v1.6.2 in-content link extraction · v1.7.0 Tier 1 "fix broken" · v1.8.0 Tier 2 30+ technical checks · v1.9.0 ephemeral mode · v1.9.1 polish.

Reusable Claude Code skill

.claude/skills/librecrawl-audit/SKILL.md — drops into any project's .claude/skills/ directory (or ~/.claude/skills/ globally). Auto-loads when the user asks for site audit / SEO check / broken link / schema validation work.

License

MIT. Built on top of LibreCrawl (MIT).


By Aditya Sharma — github.com/adityaarsharma/librecrawl-mcp

v1.1.1 — Chunked Crawling: Resume Huge Crawls Across Sessions

28 May 05:34

New: librecrawl_resume_from_crawl_id(crawl_id), the "crawl a 50,000-page site politely, across multiple days, from any session" feature.

The flow

Day 1:  Audit https://huge-site.com (max 5000 pages)
        → runs, returns crawl_id 42
        → pause anytime with librecrawl_pause_crawl()

Day 2:  list_crawls() → see crawl_id 42 with status "paused" at 1,847 pages
        librecrawl_resume_from_crawl_id(42)
        → picks up at page 1,848 — every previously-crawled URL is re-used
          from SQLite, ZERO duplicate HTTP requests to the target

Day 3:  Same thing until done → generate_report(42) for the full audit

State persists across PM2 restarts, server reboots, and entirely different MCP client sessions. The crawl is keyed by crawl_id, not by the MCP session.

How it stays polite

  • Token-bucket rate limiter — smooth distribution, no bursting. Default 2 req/sec.
  • Respects Crawl-Delay directive in robots.txt
  • Single connection, sequential by default (configurable up to 5 workers)
  • Identifiable User-Agent (LibreCrawl/1.0 (Web Crawler))
  • Idempotent resume — URLs already in the DB are never re-fetched
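A token bucket with a capacity of one token gives the "smooth, no bursting" behaviour listed above. The sketch below is a generic implementation under that assumption, not the project's limiter; the injectable clock exists only to make the example deterministic.

```python
import time


class TokenBucket:
    """Minimal token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate: float = 2.0, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity  # capacity 1 means no bursts are possible
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return whether a request may be sent now."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=1.0, clock=lambda: fake_now[0])
first = bucket.try_acquire()
second = bucket.try_acquire()
wait = bucket.wait_time()
fake_now[0] += 0.5
third = bucket.try_acquire()
```

At the default 2 req/sec, a second request issued immediately after the first has to wait half a second.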

Implementation note

Tries two resume paths transparently:

  1. In-session resume (/api/resume_crawl) — when the upstream crawler instance is still alive with is_paused=True. Just flips the flag.
  2. Cross-session DB resume (/api/crawls/<id>/resume) — when the crawler has been GCed. Rebuilds from the SQLite snapshot.

The tool returns {"mode": "in_session_resume" | "db_resume"} so you can see which path fired.

Also in v1.1.1

  • Better auto-recovery: _ensure_crawler_ready() now uses crawl speed (instead of status string alone) to detect paused/zombie crawlers. Upstream LibreCrawl reports paused crawls as status="running", which the old detection missed.
  • Tool count: 20 (was 19) — librecrawl_resume_from_crawl_id added.

Upgrade

```bash
cd ~/librecrawl-mcp && git pull && pm2 restart librecrawl-mcp
```

Or rerun the installer — it'll update in place.


Verified live on nexterwp.com: paused at 19 pages from session A, resumed from session B (fresh MCP session), continued cleanly to 35 pages with no duplicate requests.

v1.1.0 — Always Works: Auto-recover + GSC Top Queries

28 May 05:08

Three fixes that turn LibreCrawl MCP into an "always works" tool for people who run audits back-to-back across many sites.

🔧 Auto-recovery from stuck crawler state

New _ensure_crawler_ready() helper runs before every start_crawl / audit. Resets three stale states that were silently blocking fresh audits:

  1. Previous crawl still running: "Crawl already in progress" 409 errors
  2. Previous crawl paused: resume button dead when launched from a new MCP session
  3. Zombie thread: is_running=True but 0 progress for >60s

Audit tool also has a force-stop-and-retry fallback if it still hits a race. Net effect: run a new crawl anytime — it just works.
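The three stale states could be distinguished along these lines. This is a hypothetical sketch: the snapshot fields and function names are invented for illustration, and only the 60-second stall threshold comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class CrawlerSnapshot:
    is_running: bool
    is_paused: bool
    seconds_since_progress: float


def classify_state(snap: CrawlerSnapshot, stall_limit_s: float = 60.0) -> str:
    """Label the crawler as paused, zombie, running, or idle."""
    if snap.is_paused:
        return "paused"
    if snap.is_running and snap.seconds_since_progress > stall_limit_s:
        return "zombie"  # claims to run but has made no progress
    if snap.is_running:
        return "running"
    return "idle"


def needs_reset(snap: CrawlerSnapshot) -> bool:
    """Anything other than idle would block a fresh audit and must be reset first."""
    return classify_state(snap) != "idle"


states = [
    classify_state(CrawlerSnapshot(True, False, 5.0)),
    classify_state(CrawlerSnapshot(True, False, 61.0)),
    classify_state(CrawlerSnapshot(True, True, 0.0)),
    classify_state(CrawlerSnapshot(False, False, 0.0)),
]
```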

🔍 GSC section: top queries + quick wins

librecrawl_append_gsc_section() now renders, in addition to indexing errors:

  • Top 25 search queries table — clicks, impressions, CTR, position
  • 🎯 Quick wins — page-2 keywords (positions 6-20) with >50 impressions, auto-surfaced as page-1 optimisation opportunities

Input is also more forgiving: pass performance metrics either nested under performance: {...} or flat at the top level. Whichever shape your GSC MCP returns, it just works.
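The quick-wins filter and the forgiving input handling can be approximated as follows. Only the thresholds (positions 6-20, more than 50 impressions) and the `performance` key come from these notes; the row keys and function names are assumptions.

```python
def normalize_performance(payload: dict) -> dict:
    """Accept metrics nested under 'performance' or flat at the top level."""
    return payload.get("performance", payload)


def quick_wins(rows, min_pos=6.0, max_pos=20.0, min_impressions=50):
    """Return queries ranking in positions 6-20 with more than 50 impressions."""
    wins = [
        r for r in rows
        if min_pos <= r["position"] <= max_pos and r["impressions"] > min_impressions
    ]
    # Highest-impression opportunities first (ordering is an assumption).
    return sorted(wins, key=lambda r: r["impressions"], reverse=True)


rows = [
    {"query": "a", "position": 8.2, "impressions": 120},
    {"query": "b", "position": 3.1, "impressions": 900},   # already near the top
    {"query": "c", "position": 15.0, "impressions": 50},   # not above the threshold
    {"query": "d", "position": 19.9, "impressions": 300},
]
wins = quick_wins(rows)
nested = normalize_performance({"performance": {"clicks": 5}})
flat = normalize_performance({"clicks": 5})
```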

📚 README / installer

  • GSC property-type warning (sc-domain: vs URL-prefix) — the #1 reason "User does not have sufficient permission" errors happen on first GSC use. Now called out in both README and installer.
  • What's new in v1.1.0 callout at the top of README.

Upgrade

If you already installed via install.sh, just pull and restart:

```bash
cd ~/librecrawl-mcp && git pull && pm2 restart librecrawl-mcp
```

Or rerun the one-line installer — it'll update in place.


Compatibility: all 19 MCP tools unchanged at the signature level. Pure additive release.