sitewalker

Crawl a website and create a structured map of its pages.

Installation

pipx install sitewalker

Usage

# Map all pages on a site (single-level crawl)
sitewalker example.com

# Recursive crawl of all internal pages
sitewalker example.com -r

# Collect external links
sitewalker example.com -e

# Collect external links and check their HTTP status
sitewalker example.com -e --check-external

# Recursive crawl with external link collection
sitewalker example.com -r -e

# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p

# Crawl an HTTP-only site (e.g., LAN staging server)
sitewalker http://staging.lan --allow-private

# Verbose output for debugging
sitewalker example.com -r -v

The target accepts a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS; if the HTTPS connection fails, sitewalker exits with a message asking you to provide the full URL.
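The target handling described above can be sketched roughly as follows. This is an illustration only, not sitewalker's actual code: the function name `normalize_target` is hypothetical, and the sketch covers only the "assume HTTPS for bare domains" step, not the connection attempt or the exit message.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Return a full URL, assuming HTTPS when no scheme is given (hypothetical helper)."""
    target = target.strip()
    if "://" not in target:
        # Bare domain such as "example.com": default to HTTPS
        target = "https://" + target
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported target: {target!r}")
    return target


print(normalize_target("example.com"))
print(normalize_target("http://staging.lan"))
```

A full URL is passed through unchanged, which is why an HTTP-only site has to be given with its explicit http:// scheme.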

Options

| Flag | Description | Default |
| --- | --- | --- |
| `-r`, `--recursive` | Recursively crawl internal links | Off |
| `-e`, `--external-links` | Collect external links | Off |
| `--check-external` | Check HTTP status of external links (requires `-e`) | Off |
| `-p`, `--pages` | Only crawl web pages (HTML, PHP, etc.) | Off |
| `-v`, `--verbose` | Enable verbose/debug output | Off |
| `-t`, `--timeout` | Request timeout in seconds | 30 |
| `--max-pages` | Maximum number of pages to crawl | 1000 |
| `--max-depth` | Maximum link distance from start URL (BFS) | 10 |
| `--delay` | Delay between requests in seconds (use 0 for local) | 1.0 |
| `--allow-private` | Allow crawling domains that resolve to private IPs | Off |
| `--ignore-robots` | Ignore robots.txt rules | Off |
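To make the interaction between `--max-pages` and `--max-depth` concrete, here is a minimal breadth-first sketch. It is an assumption about how such limits are typically applied, not a copy of sitewalker's crawler: `bounded_bfs` and `get_links` are hypothetical names, the link graph is a toy dictionary instead of real HTTP requests, and the `--delay` pause between requests is left out.

```python
from collections import deque


def bounded_bfs(start, get_links, max_pages=1000, max_depth=10):
    """Visit pages breadth-first, stopping at max_pages or max_depth (illustrative)."""
    seen = {start}
    queue = deque([(start, 0)])  # (url, link distance from the start URL)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real page fetches
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/d"], "/d": []}
print(bounded_bfs("/", lambda u: graph.get(u, []), max_depth=1))
```

Because the traversal is breadth-first, the depth of a page is its shortest link distance from the start URL, so pages close to the start are crawled before the page limit is reached.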

Output

Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:

URL — the page URL
Title — the page's <title> tag content
Status Code — HTTP response status

When using -e, external links are additionally saved to {domain}_{timestamp}_external_links.csv. The internal pages CSV is always generated. With --check-external, the external links CSV includes a Status Code column.
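As an example of consuming the output, the sketch below lists pages whose status is not 2xx. It assumes the CSV is comma-separated and has a header row using exactly the column names listed above; the function name `failing_pages` and the sample rows are made up for illustration.

```python
import csv
import io


def failing_pages(csv_file):
    """Return (url, status) pairs for rows whose status code is not 2xx."""
    reader = csv.DictReader(csv_file)
    return [
        (row["URL"], row["Status Code"])
        for row in reader
        if not row["Status Code"].startswith("2")
    ]


# In-memory sample with invented rows in place of a real output file
sample = io.StringIO(
    "URL,Title,Status Code\n"
    "https://example.com/,Home,200\n"
    "https://example.com/old,Gone,404\n"
)
print(failing_pages(sample))
```

For a real run, pass a file handle opened with `open(path, newline="")`, where `path` is the generated {domain}_{timestamp}.csv file.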

Security

SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
robots.txt: Respected by default. Use --ignore-robots to override.
CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
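Two of the protections above can be illustrated with standard-library code. This is a sketch of common techniques under stated assumptions, not sitewalker's implementation: `is_blocked_address` and `sanitize_cell` are hypothetical names, the address check works on an already-resolved IP string (a real check would first resolve the domain, for example with `socket.getaddrinfo`), and the quote-prefix rule is one widely used mitigation for formula injection.

```python
import ipaddress


def is_blocked_address(ip: str) -> bool:
    """True if the IP is private, loopback, link-local, or reserved."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
    )


def sanitize_cell(value: str) -> str:
    """Prefix values that a spreadsheet could interpret as a formula."""
    if value and value[0] in ("=", "+", "-", "@", "\t", "\r"):
        return "'" + value
    return value


print(is_blocked_address("192.168.1.10"))
print(is_blocked_address("8.8.8.8"))
print(sanitize_cell("=SUM(A1:A9)"))
```

Checking the resolved address rather than the hostname matters because a public-looking domain name can still point at an internal IP.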

Roadmap

--format json — JSON output format
--images --check-alt — image inventory with alt text auditing

License

MIT
