akelechi/Nav-tree-crawler

Automatically take screenshots along a nav chain of a website in a DFS manner
🌳 NavTree Crawler v2.0

A recursive web crawler that captures full-page screenshots along navigation chains, producing a visual tree of your website's structure. Now with authentication, stealth mode, rate limiting, and click tracking for JS-heavy sites.

What it does

Given any website, NavTree Crawler will:

  1. Start at the homepage (root)
  2. Find all clickable navigation elements (CTAs) — including JavaScript-only buttons
  3. Visit each one, take a full-page screenshot
  4. Recursively discover CTAs on each new page
  5. Build a tree structure representing the navigation hierarchy
  6. Generate an HTML report with thumbnails and links

The result is a complete visual map of how users navigate through your site.

Homepage
├── About Us (🔗 link)
│   ├── Team
│   └── History
├── Products (🖱️ JS click)
│   ├── Category A
│   │   ├── Product 1
│   │   └── Product 2
│   └── Category B
└── Contact
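The depth-first walk described above can be sketched in miniature. Here `visit` is a hypothetical stand-in for the real Playwright work (navigate, screenshot, collect CTAs); only the recursion, depth limit, and cycle handling are shown:

```javascript
// Minimal DFS sketch (illustrative, not the crawler's actual internals).
// `visit(url, depth)` navigates, screenshots, and returns the page title
// plus the child URLs discovered on that page.
async function crawlDFS(url, visit, depth = 0, maxDepth = 5, seen = new Set()) {
  if (depth > maxDepth || seen.has(url)) return null; // respect depth limit, skip revisits
  seen.add(url);
  const { title, childUrls } = await visit(url, depth);
  const node = { url, title, depth, children: [] };
  for (const childUrl of childUrls) {
    // Depth-first: finish the whole subtree before moving to the next sibling
    const childNode = await crawlDFS(childUrl, visit, depth + 1, maxDepth, seen);
    if (childNode) node.children.push(childNode);
  }
  return node;
}
```

Because `seen` is shared across the whole walk, links back to already-captured pages (e.g. a footer "Home" link) become leaves rather than infinite loops.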

What's New in v2.0

| Feature | Description |
| --- | --- |
| 🔐 Authentication | Cookies, localStorage, sessionStorage, HTTP headers |
| 🥷 Stealth Mode | Bypass bot detection with the playwright-extra stealth plugin |
| ⏱️ Rate Limiting | Fixed or randomized delays between requests |
| 🖱️ Click Tracking | Detect navigation from JavaScript button clicks, not just <a> links |
| 📁 Auto-named Output | Screenshots folder automatically named after the website (e.g., screenshots_example_com) |

Installation

# Clone or download this repository
cd nav-tree-crawler

# Install dependencies
npm install

# Install Playwright browsers
npx playwright install chromium

Quick Start

CLI Usage

# Basic crawl
node src/cli.js https://example.com

# With rate limiting (be polite!)
node src/cli.js https://example.com --random-delay 1000-3000

# Authenticated crawl
node src/cli.js https://dashboard.example.com --cookies ./cookies.json

# Full options
node src/cli.js https://spa.example.com \
  --random-delay 500-1500 \
  --cookies ./auth.json \
  --depth 4 \
  --pages 200

# See all options
node src/cli.js --help

Programmatic Usage

const NavTreeCrawler = require('./src/crawler');

(async () => {
  const crawler = new NavTreeCrawler({
    baseUrl: 'https://example.com',
    // outputDir defaults to './screenshots_example_com' based on the URL
    maxDepth: 5,
    maxPages: 100,

    // Authentication
    cookies: [{ name: 'session', value: 'abc123', domain: '.example.com', path: '/' }],

    // Rate limiting
    randomDelay: [1000, 3000],

    // Click tracking for SPAs
    clickTracking: true,
  });

  await crawler.crawl();
  await crawler.generateReport();
})();

🔐 Authentication

Crawl login-protected sites by injecting authentication credentials.

Cookies

// Direct cookie injection
new NavTreeCrawler({
  cookies: [
    {
      name: 'session_id',
      value: 'your-session-token',
      domain: '.example.com',
      path: '/',
      httpOnly: true,
      secure: true,
    }
  ],
});

// Or load from file
new NavTreeCrawler({
  cookiesFile: './cookies.json',
});

Cookies file format (same as browser export):

[
  {
    "name": "session",
    "value": "abc123",
    "domain": ".example.com",
    "path": "/"
  }
]

HTTP Headers

new NavTreeCrawler({
  extraHeaders: {
    'Authorization': 'Bearer your-jwt-token',
    'X-API-Key': 'your-api-key',
  },
});

localStorage / sessionStorage

Useful for SPAs that store auth tokens in browser storage:

new NavTreeCrawler({
  localStorage: {
    'auth_token': 'your-token',
    'user_id': '12345',
  },
  sessionStorage: {
    'temp_data': 'value',
  },
});
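One plausible way such storage injection could work (an assumption about the implementation, not the library's documented internals) is Playwright's `addInitScript`, which runs in every new document before any page script:

```javascript
// Sketch: seed localStorage/sessionStorage before the SPA boots, so its
// auth check sees the tokens on first load. `injectStorage` is an
// illustrative helper name, not part of the crawler's public API.
async function injectStorage(context, { localStorage = {}, sessionStorage = {} }) {
  await context.addInitScript(([local, session]) => {
    // Runs inside the page, before any of the site's own scripts
    for (const [key, value] of Object.entries(local)) window.localStorage.setItem(key, value);
    for (const [key, value] of Object.entries(session)) window.sessionStorage.setItem(key, value);
  }, [localStorage, sessionStorage]);
}
```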

🥷 Stealth Mode

Avoid bot detection via playwright-extra and its stealth plugin (puppeteer-extra-plugin-stealth).

Stealth mode is enabled by default and helps bypass:

  • Cloudflare bot protection
  • DataDome
  • PerimeterX
  • Basic headless browser detection

new NavTreeCrawler({
  stealth: true, // default
  
  // Custom user agent (optional)
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...',
});

To disable stealth mode:

new NavTreeCrawler({
  stealth: false,
});

⏱️ Rate Limiting

Be polite to servers and avoid getting blocked.

Fixed Delay

new NavTreeCrawler({
  requestDelay: 1000, // 1 second between each request
});

Randomized Delay (Recommended)

More human-like behavior, harder to detect as a bot:

new NavTreeCrawler({
  randomDelay: [1000, 3000], // Random 1-3 seconds between requests
});
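Internally, a randomized delay amounts to drawing a uniform value from the range and sleeping between requests; a minimal sketch (helper names are illustrative, not the crawler's actual internals):

```javascript
// Pick a uniform integer delay in [min, max] milliseconds
function randomDelayMs([min, max]) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Promise-based sleep, awaited between page visits
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage between requests:
//   await sleep(randomDelayMs([1000, 3000]));
```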

CLI

# Fixed delay
node src/cli.js https://example.com --delay 1000

# Random delay
node src/cli.js https://example.com --random-delay 1000-3000

🖱️ Click Tracking

Detect navigation from JavaScript button clicks, not just <a href="..."> links. Essential for SPAs (React, Vue, Angular).

How it Works

  1. Finds all clickable elements (buttons, [role="button"], [onclick], etc.)
  2. Clicks each element
  3. Detects if navigation occurred (URL change OR content change)
  4. Uses content hashing to detect SPA state changes even when URL doesn't change

new NavTreeCrawler({
  clickTracking: true, // default
  
  // How long to wait for navigation after click
  clickNavigationTimeout: 5000, // 5 seconds
  
  // Extra wait for SPA transitions
  additionalWait: 1500,
  
  // Custom clickable selectors
  clickableSelectors: [
    'button:not([disabled])',
    '[role="button"]',
    '[onclick]',
    '[data-href]',
    '.nav-item',
    '.menu-trigger',
  ],
});

CLI

# Disable click tracking (links only)
node src/cli.js https://example.com --no-click-tracking

# Custom click timeout
node src/cli.js https://spa.example.com --click-timeout 8000

Configuration Reference

Basic Options

| Option | Default | Description |
| --- | --- | --- |
| baseUrl | - | Starting URL for the crawl |
| outputDir | ./screenshots_<hostname> | Where to save screenshots and reports (auto-generated from URL) |
| maxDepth | 5 | Maximum navigation depth |
| maxPages | 100 | Maximum pages to crawl (safety limit) |
| viewport | {width: 1920, height: 1080} | Browser viewport size |
| waitUntil | 'networkidle' | Page load wait strategy |
| additionalWait | 500 | Extra wait time (ms) after page load |
| sameDomainOnly | true | Only crawl pages on the same domain |
| includeSubdomains | false | Include subdomains when sameDomainOnly is true |

Authentication Options

| Option | Default | Description |
| --- | --- | --- |
| cookies | [] | Array of cookie objects to inject |
| cookiesFile | null | Path to JSON file with cookies |
| extraHeaders | {} | HTTP headers to add to all requests |
| localStorage | {} | Items to set in localStorage |
| sessionStorage | {} | Items to set in sessionStorage |

Stealth Options

| Option | Default | Description |
| --- | --- | --- |
| stealth | true | Enable stealth mode |
| userAgent | null | Custom user agent string |

Rate Limiting Options

| Option | Default | Description |
| --- | --- | --- |
| requestDelay | 0 | Fixed delay between requests (ms) |
| randomDelay | null | Random delay range [min, max] in ms |

Click Tracking Options

| Option | Default | Description |
| --- | --- | --- |
| clickTracking | true | Enable JS click tracking |
| clickNavigationTimeout | 5000 | Timeout waiting for navigation after click (ms) |
| clickableSelectors | (see below) | Selectors for clickable elements |

Default Selectors

CTA Selectors:

[
  'nav a',
  'header a',
  '[role="navigation"] a',
  'a[href]:not([href^="mailto:"]):not([href^="tel:"]):not([href^="#"]):not([href^="javascript:"])',
  'button',
  '[role="button"]',
  '[onclick]',
  '.btn',
  '.button',
]

Clickable Selectors:

[
  'button:not([disabled])',
  '[role="button"]',
  '[onclick]',
  '[data-href]',
  '[data-link]',
  '[data-nav]',
  '.clickable',
  '[tabindex="0"]',
]

Exclude Selectors:

[
  'a[href*="login"]',
  'a[href*="logout"]',
  'a[href*="signin"]',
  'a[href*="signup"]',
  'a[rel="external"]',
]
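One way the include and exclude lists might be combined (a sketch, not the crawler's actual code): query all CTA candidates in one selector, then drop anything matching an exclude selector.

```javascript
// Collect CTA candidates and filter out excluded ones. Runs against a
// Document (e.g. inside page.evaluate in Playwright); shown here as a
// plain function for clarity.
function findCtas(doc, ctaSelectors, excludeSelectors) {
  const candidates = doc.querySelectorAll(ctaSelectors.join(', '));
  const excluded = excludeSelectors.join(', ');
  return [...candidates].filter((el) => !el.matches(excluded));
}
```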

Output

After crawling, you'll find screenshots in a directory named after the website (e.g., ./screenshots_example_com for https://example.com). The folder contains:

  • *.png - Full-page screenshots
  • tree.json - JSON tree structure
  • metadata.json - Crawl metadata and options
  • report.html - Visual HTML report

Tree Structure

{
  "url": "https://example.com",
  "title": "Example Domain",
  "screenshot": "d0_001_home_abc123.png",
  "depth": 0,
  "contentHash": "a1b2c3d4...",
  "children": [
    {
      "url": "https://example.com/about",
      "ctaText": "About Us",
      "ctaType": "link",
      "screenshot": "d1_002_about_def456.png",
      "depth": 1,
      "children": []
    },
    {
      "url": "https://example.com/products",
      "ctaText": "View Products",
      "ctaType": "click",
      "screenshot": "d1_003_products_ghi789.png",
      "depth": 1,
      "children": []
    }
  ]
}
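A small post-processing sketch built on this structure: walk the parsed tree.json and list every captured page with its depth (assumes only the schema shown above):

```javascript
// Recursively flatten the navigation tree into a page list
function listPages(node, acc = []) {
  acc.push({ url: node.url, depth: node.depth, screenshot: node.screenshot });
  for (const child of node.children || []) listPages(child, acc);
  return acc;
}

// Example usage:
//   const fs = require('fs');
//   const tree = JSON.parse(fs.readFileSync('./screenshots_example_com/tree.json', 'utf8'));
//   listPages(tree).forEach(({ url, depth }) => console.log('  '.repeat(depth) + url));
```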

Use Cases

  • QA Testing - Visual regression testing across your entire site
  • Site Audits - Understand your site's navigation structure
  • Documentation - Generate visual documentation of user flows
  • Competitive Analysis - Map out competitor site structures
  • Accessibility Review - Screenshot all pages for manual review
  • Archive/Backup - Visual snapshot of your entire site
  • SPA Testing - Verify all JS-navigated routes are working

Limitations

  1. CAPTCHAs - Can't solve CAPTCHAs (though stealth helps avoid them)
  2. 2FA - Can't handle two-factor authentication flows
  3. Infinite Scroll - Captures only the content present after initial load; scroll-triggered content is missed
  4. Complex SPAs - Some edge cases with JS navigation may be missed
  5. iframes - Doesn't crawl into iframes

License

MIT
