scrapalyser

Pre-scraping intelligence tool. Scan any website before writing a single line of scraper code.

Install

# Core (curl_cffi engine)
pip install scrapalyser

# With Playwright support
pip install "scrapalyser[playwright]"
playwright install chromium

Usage

import scrapalyser

# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)

# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",           # "json" (default) or "txt"
    lang="fr",              # "en" (default), "fr", "es", "br"
    save="report.txt",      # optional — write output to file
    engine="curl",          # "curl" (default) or "playwright"
    headless=True,          # True (default) or False — playwright only
    screenshot="shot.png",  # optional — playwright only
)

Example output

JSON

{
  "url": "https://example.com",
  "scanned_at": "2026-04-30T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
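
Because the JSON report is a plain dict, it can drive decisions in code. The sketch below uses only fields that appear in the sample above. `plan_from_report` is a hypothetical helper, not part of scrapalyser, and the suggested steps are illustrative rather than anything the tool prescribes.

```python
import json

# Trimmed copy of the sample report above; a real one would come from scrapalyser.scan().
SAMPLE_REPORT = json.loads("""
{
  "url": "https://example.com",
  "antibot": {"detected": true, "name": "Cloudflare Turnstile"},
  "js_required": true,
  "api": {"detected": true, "endpoints": ["https://api.example.com/v1/search"]},
  "login_wall": {"detected": false, "type": null}
}
""")


def plan_from_report(report: dict) -> list[str]:
    """Turn a report dict into suggested next steps (hypothetical helper)."""
    steps = []
    api = report.get("api", {})
    if api.get("detected"):
        # A discovered endpoint is often easier to consume than rendered HTML.
        steps.append("try API: " + ", ".join(api.get("endpoints", [])))
    if report.get("js_required"):
        steps.append("use a browser engine such as Playwright")
    antibot = report.get("antibot", {})
    if antibot.get("detected"):
        steps.append("expect anti-bot challenges: " + str(antibot.get("name")))
    if report.get("login_wall", {}).get("detected"):
        steps.append("plan for authentication")
    return steps


if __name__ == "__main__":
    for step in plan_from_report(SAMPLE_REPORT):
        print("-", step)
```

Missing keys are treated as "not detected", so the helper also tolerates partial reports.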

TXT (French)

═══════════════════════════════════════════
        SCRAPALYSER RAPPORT
        https://example.com
        Scanné le : 2026-04-30T12:32:00Z
        Status : 200
═══════════════════════════════════════════

[ANTI-BOT]
  ⚠️  Détecté : Cloudflare Turnstile

[TECHNOLOGIE]
  🖥️  Type : React

[JAVASCRIPT]
  ⚠️  JS requis : Oui → utiliser Playwright/Selenium

[API DÉTECTÉES]
  ✅  https://api.example.com/v1/search

[ROBOTS.TXT]
  ✅  https://example.com/robots.txt

[SITEMAP]
  ✅  https://example.com/sitemap.xml

[LOGIN WALL]
  ✅  Aucun login requis

═══════════════════════════════════════════

Features

🛡️ Anti-bot detection — Cloudflare, DataDome, PerimeterX, Akamai, Kasada, reCAPTCHA, hCaptcha
🖥️ Technology detection — React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
⚡ JS requirement detection — know before you write whether requests is enough or Playwright is needed
🌐 API endpoint discovery — via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
🤖 robots.txt parser — URL extraction with common path variants
🗺️ Sitemap detection — via robots.txt or direct probe
🔐 Login wall detection — form, redirect, button, OAuth
📸 Screenshot — capture the page as seen by the browser (Playwright mode)
🌍 Multi-language output — fr, en, es, br
📄 JSON & TXT export — machine-readable or human-readable
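
To illustrate the idea behind sitemap detection via robots.txt, here is a minimal sketch built on Python's standard `urllib.robotparser`. It is not scrapalyser's implementation, which may work differently. The robots.txt content and the `inspect_robots` helper are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""


def inspect_robots(robots_txt: str) -> dict:
    """Parse robots.txt text; return sitemap URLs and two sample permission checks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present (Python 3.8+).
    sitemaps = parser.site_maps() or []
    return {
        "sitemaps": sitemaps,
        "admin_allowed": parser.can_fetch("*", "https://example.com/admin/users"),
        "home_allowed": parser.can_fetch("*", "https://example.com/"),
    }


if __name__ == "__main__":
    print(inspect_robots(ROBOTS_TXT))
```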

Engines

Feature	`curl`	`playwright`
Speed	Fast	Slower
JS execution	❌	✅
XHR/Fetch API detection	❌	✅
Screenshot	❌	✅
Bot detection bypass	Partial	Better with `headless=False`
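
One way to combine the two engines is to scan with `curl` first and rescan with `playwright` only when the report says JS is required. The sketch below assumes the `url`, `engine`, and `headless` keyword arguments from the Usage section and the `js_required` field from the sample report. `scan_with_fallback` is a hypothetical wrapper, and the scan function is passed in as a parameter so the logic can be exercised without network access.

```python
from typing import Any, Callable

Report = dict[str, Any]


def scan_with_fallback(url: str, scan_fn: Callable[..., Report]) -> Report:
    """Scan with the fast engine first; rescan in a browser only if JS is required."""
    report = scan_fn(url=url, engine="curl")
    if report.get("js_required"):
        # The curl engine cannot execute JS or intercept XHR/Fetch calls,
        # so repeat the scan with the slower browser engine.
        report = scan_fn(url=url, engine="playwright", headless=True)
    return report
```

In real use, `scan_fn` would be `scrapalyser.scan`, for example `scan_with_fallback("https://example.com", scrapalyser.scan)`.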

Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b feat/my-feature`)
3. Commit your changes
4. Open a pull request

License

MIT — see LICENSE.
