akelechi/Nav-tree-crawler

Automatically take screenshots along a nav chain of a website in a DFS manner
🌳 NavTree Crawler v2.0

A recursive web crawler that captures full-page screenshots along navigation chains, producing a visual tree of your website's structure. Now with authentication, stealth mode, rate limiting, and click tracking for JS-heavy sites.

What it does

Given any website, NavTree Crawler will:

  1. Start at the homepage (root)
  2. Find all clickable navigation elements (CTAs) — including JavaScript-only buttons
  3. Visit each one, take a full-page screenshot
  4. Recursively discover CTAs on each new page
  5. Build a tree structure representing the navigation hierarchy
  6. Generate an HTML report with thumbnails and links

The result is a complete visual map of how users navigate through your site.

Homepage
├── About Us (🔗 link)
│   ├── Team
│   └── History
├── Products (🖱️ JS click)
│   ├── Category A
│   │   ├── Product 1
│   │   └── Product 2
│   └── Category B
└── Contact
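The depth-first walk described above can be sketched in miniature. Here `visit` is a hypothetical stand-in for the real Playwright work (navigate, screenshot, collect CTAs); only the recursion, depth limit, and cycle handling are shown:

```javascript
// Minimal DFS sketch (illustrative, not the crawler's actual internals).
// `visit(url, depth)` navigates, screenshots, and returns the page title
// plus the child URLs discovered on that page.
async function crawlDFS(url, visit, depth = 0, maxDepth = 5, seen = new Set()) {
  if (depth > maxDepth || seen.has(url)) return null; // respect depth limit, skip revisits
  seen.add(url);
  const { title, childUrls } = await visit(url, depth);
  const node = { url, title, depth, children: [] };
  for (const childUrl of childUrls) {
    // Depth-first: finish the whole subtree before moving to the next sibling
    const childNode = await crawlDFS(childUrl, visit, depth + 1, maxDepth, seen);
    if (childNode) node.children.push(childNode);
  }
  return node;
}
```

Because `seen` is shared across the whole walk, links back to already-captured pages (e.g. a footer "Home" link) become leaves rather than infinite loops.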

What's New in v2.0

| Feature | Description |
| --- | --- |
| 🔐 Authentication | Cookies, localStorage, sessionStorage, HTTP headers |
| 🥷 Stealth Mode | Bypass bot detection with the playwright-extra stealth plugin |
| ⏱️ Rate Limiting | Fixed or randomized delays between requests |
| 🖱️ Click Tracking | Detect navigation from JavaScript button clicks, not just <a> links |
| 📁 Auto-named Output | Screenshots folder automatically named after the website (e.g., screenshots_example_com) |

Installation

# Clone or download this repository
cd nav-tree-crawler

# Install dependencies
npm install

# Install Playwright browsers
npx playwright install chromium

Quick Start

CLI Usage

# Basic crawl
node src/cli.js https://example.com

# With rate limiting (be polite!)
node src/cli.js https://example.com --random-delay 1000-3000

# Authenticated crawl
node src/cli.js https://dashboard.example.com --cookies ./cookies.json

# Full options
node src/cli.js https://spa.example.com \
  --random-delay 500-1500 \
  --cookies ./auth.json \
  --depth 4 \
  --pages 200

# See all options
node src/cli.js --help

Programmatic Usage

const NavTreeCrawler = require('./src/crawler');

(async () => {
  const crawler = new NavTreeCrawler({
    baseUrl: 'https://example.com',
    // outputDir defaults to './screenshots_example_com' based on the URL
    maxDepth: 5,
    maxPages: 100,

    // Authentication
    cookies: [{ name: 'session', value: 'abc123', domain: '.example.com', path: '/' }],

    // Rate limiting
    randomDelay: [1000, 3000],

    // Click tracking for SPAs
    clickTracking: true,
  });

  await crawler.crawl();
  await crawler.generateReport();
})();

🔐 Authentication

Crawl login-protected sites by injecting authentication credentials.

Cookies

// Direct cookie injection
new NavTreeCrawler({
  cookies: [
    {
      name: 'session_id',
      value: 'your-session-token',
      domain: '.example.com',
      path: '/',
      httpOnly: true,
      secure: true,
    }
  ],
});

// Or load from file
new NavTreeCrawler({
  cookiesFile: './cookies.json',
});

Cookies file format (same as browser export):

[
  {
    "name": "session",
    "value": "abc123",
    "domain": ".example.com",
    "path": "/"
  }
]

HTTP Headers

new NavTreeCrawler({
  extraHeaders: {
    'Authorization': 'Bearer your-jwt-token',
    'X-API-Key': 'your-api-key',
  },
});

localStorage / sessionStorage

Useful for SPAs that store auth tokens in browser storage:

new NavTreeCrawler({
  localStorage: {
    'auth_token': 'your-token',
    'user_id': '12345',
  },
  sessionStorage: {
    'temp_data': 'value',
  },
});
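One plausible way such storage injection could work (an assumption about the implementation, not the library's documented internals) is Playwright's `addInitScript`, which runs in every new document before any page script:

```javascript
// Sketch: seed localStorage/sessionStorage before the SPA boots, so its
// auth check sees the tokens on first load. `injectStorage` is an
// illustrative helper name, not part of the crawler's public API.
async function injectStorage(context, { localStorage = {}, sessionStorage = {} }) {
  await context.addInitScript(([local, session]) => {
    // Runs inside the page, before any of the site's own scripts
    for (const [key, value] of Object.entries(local)) window.localStorage.setItem(key, value);
    for (const [key, value] of Object.entries(session)) window.sessionStorage.setItem(key, value);
  }, [localStorage, sessionStorage]);
}
```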

🥷 Stealth Mode

Avoid bot detection via playwright-extra and its stealth plugin (puppeteer-extra-plugin-stealth).

Stealth mode is enabled by default and helps bypass:

  • Cloudflare bot protection
  • DataDome
  • PerimeterX
  • Basic headless browser detection

new NavTreeCrawler({
  stealth: true, // default
  
  // Custom user agent (optional)
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...',
});

To disable stealth mode:

new NavTreeCrawler({
  stealth: false,
});

⏱️ Rate Limiting

Be polite to servers and avoid getting blocked.

Fixed Delay

new NavTreeCrawler({
  requestDelay: 1000, // 1 second between each request
});

Randomized Delay (Recommended)

More human-like behavior, harder to detect as a bot:

new NavTreeCrawler({
  randomDelay: [1000, 3000], // Random 1-3 seconds between requests
});
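Internally, a randomized delay amounts to drawing a uniform value from the range and sleeping between requests; a minimal sketch (helper names are illustrative, not the crawler's actual internals):

```javascript
// Pick a uniform integer delay in [min, max] milliseconds
function randomDelayMs([min, max]) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Promise-based sleep, awaited between page visits
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage between requests:
//   await sleep(randomDelayMs([1000, 3000]));
```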

CLI

# Fixed delay
node src/cli.js https://example.com --delay 1000

# Random delay
node src/cli.js https://example.com --random-delay 1000-3000

🖱️ Click Tracking

Detect navigation from JavaScript button clicks, not just <a href="..."> links. Essential for SPAs (React, Vue, Angular).

How it Works

  1. Finds all clickable elements (buttons, [role="button"], [onclick], etc.)
  2. Clicks each element
  3. Detects if navigation occurred (URL change OR content change)
  4. Uses content hashing to detect SPA state changes even when URL doesn't change

new NavTreeCrawler({
  clickTracking: true, // default
  
  // How long to wait for navigation after click
  clickNavigationTimeout: 5000, // 5 seconds
  
  // Extra wait for SPA transitions
  additionalWait: 1500,
  
  // Custom clickable selectors
  clickableSelectors: [
    'button:not([disabled])',
    '[role="button"]',
    '[onclick]',
    '[data-href]',
    '.nav-item',
    '.menu-trigger',
  ],
});

CLI

# Disable click tracking (links only)
node src/cli.js https://example.com --no-click-tracking

# Custom click timeout
node src/cli.js https://spa.example.com --click-timeout 8000

Configuration Reference

Basic Options

| Option | Default | Description |
| --- | --- | --- |
| baseUrl | - | Starting URL for the crawl |
| outputDir | ./screenshots_<hostname> | Where to save screenshots and reports (auto-generated from URL) |
| maxDepth | 5 | Maximum navigation depth |
| maxPages | 100 | Maximum pages to crawl (safety limit) |
| viewport | {width: 1920, height: 1080} | Browser viewport size |
| waitUntil | 'networkidle' | Page load wait strategy |
| additionalWait | 500 | Extra wait time (ms) after page load |
| sameDomainOnly | true | Only crawl pages on the same domain |
| includeSubdomains | false | Include subdomains when sameDomainOnly is true |

Authentication Options

| Option | Default | Description |
| --- | --- | --- |
| cookies | [] | Array of cookie objects to inject |
| cookiesFile | null | Path to JSON file with cookies |
| extraHeaders | {} | HTTP headers to add to all requests |
| localStorage | {} | Items to set in localStorage |
| sessionStorage | {} | Items to set in sessionStorage |

Stealth Options

| Option | Default | Description |
| --- | --- | --- |
| stealth | true | Enable stealth mode |
| userAgent | null | Custom user agent string |

Rate Limiting Options

| Option | Default | Description |
| --- | --- | --- |
| requestDelay | 0 | Fixed delay between requests (ms) |
| randomDelay | null | Random delay range [min, max] in ms |

Click Tracking Options

| Option | Default | Description |
| --- | --- | --- |
| clickTracking | true | Enable JS click tracking |
| clickNavigationTimeout | 5000 | Timeout waiting for navigation after click (ms) |
| clickableSelectors | (see below) | Selectors for clickable elements |

Default Selectors

CTA Selectors:

[
  'nav a',
  'header a',
  '[role="navigation"] a',
  'a[href]:not([href^="mailto:"]):not([href^="tel:"]):not([href^="#"]):not([href^="javascript:"])',
  'button',
  '[role="button"]',
  '[onclick]',
  '.btn',
  '.button',
]

Clickable Selectors:

[
  'button:not([disabled])',
  '[role="button"]',
  '[onclick]',
  '[data-href]',
  '[data-link]',
  '[data-nav]',
  '.clickable',
  '[tabindex="0"]',
]

Exclude Selectors:

[
  'a[href*="login"]',
  'a[href*="logout"]',
  'a[href*="signin"]',
  'a[href*="signup"]',
  'a[rel="external"]',
]
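One way the include and exclude lists might be combined (a sketch, not the crawler's actual code): query all CTA candidates in one selector, then drop anything matching an exclude selector.

```javascript
// Collect CTA candidates and filter out excluded ones. Runs against a
// Document (e.g. inside page.evaluate in Playwright); shown here as a
// plain function for clarity.
function findCtas(doc, ctaSelectors, excludeSelectors) {
  const candidates = doc.querySelectorAll(ctaSelectors.join(', '));
  const excluded = excludeSelectors.join(', ');
  return [...candidates].filter((el) => !el.matches(excluded));
}
```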

Output

After crawling, you'll find screenshots in a directory named after the website (e.g., ./screenshots_example_com for https://example.com). The folder contains:

  • *.png - Full-page screenshots
  • tree.json - JSON tree structure
  • metadata.json - Crawl metadata and options
  • report.html - Visual HTML report

Tree Structure

{
  "url": "https://example.com",
  "title": "Example Domain",
  "screenshot": "d0_001_home_abc123.png",
  "depth": 0,
  "contentHash": "a1b2c3d4...",
  "children": [
    {
      "url": "https://example.com/about",
      "ctaText": "About Us",
      "ctaType": "link",
      "screenshot": "d1_002_about_def456.png",
      "depth": 1,
      "children": []
    },
    {
      "url": "https://example.com/products",
      "ctaText": "View Products",
      "ctaType": "click",
      "screenshot": "d1_003_products_ghi789.png",
      "depth": 1,
      "children": []
    }
  ]
}
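A small post-processing sketch built on this structure: walk the parsed tree.json and list every captured page with its depth (assumes only the schema shown above):

```javascript
// Recursively flatten the navigation tree into a page list
function listPages(node, acc = []) {
  acc.push({ url: node.url, depth: node.depth, screenshot: node.screenshot });
  for (const child of node.children || []) listPages(child, acc);
  return acc;
}

// Example usage:
//   const fs = require('fs');
//   const tree = JSON.parse(fs.readFileSync('./screenshots_example_com/tree.json', 'utf8'));
//   listPages(tree).forEach(({ url, depth }) => console.log('  '.repeat(depth) + url));
```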

Use Cases

  • QA Testing - Visual regression testing across your entire site
  • Site Audits - Understand your site's navigation structure
  • Documentation - Generate visual documentation of user flows
  • Competitive Analysis - Map out competitor site structures
  • Accessibility Review - Screenshot all pages for manual review
  • Archive/Backup - Visual snapshot of your entire site
  • SPA Testing - Verify all JS-navigated routes are working

Limitations

  1. CAPTCHAs - Can't solve CAPTCHAs (though stealth helps avoid them)
  2. 2FA - Can't handle two-factor authentication flows
  3. Infinite Scroll - Captures only the content present after initial load; scroll-triggered content is missed
  4. Complex SPAs - Some edge cases with JS navigation may be missed
  5. iframes - Doesn't crawl into iframes

License

MIT
