A powerful Rust CLI tool to scan GitHub repositories and websites for broken links.
Perfect for:
- CI/CD pipelines to catch broken documentation links
- Maintaining website link health
Features:
- Scan GitHub repositories (README.md)
- Scan websites with configurable crawl depth
- Detect broken links (404, timeouts, SSL errors, etc.)
- Detect redirects (301, 302)
- Human-readable table output
- JSON output for scripting/CI
- Proper exit codes for CI integration
- Blazing fast concurrent link checking (500 concurrent by default, configurable)
- Polite crawling with delays
You'll need Rust 1.70 or newer (install from rust-lang.org).
# Clone the repository (or navigate to the project folder)
cd link-guardian
# Build in release mode (optimized)
cargo build --release
# The binary will be at target/release/link-guardian

# Or install the binary globally (puts it on your PATH)
cargo install --path .
# Now you can run 'link-guardian' from anywhere

# Check links in a GitHub repo's README
link-guardian github https://github.com/rust-lang/rust
# With JSON output
link-guardian github https://github.com/rust-lang/rust --json
# With custom concurrency (default: 500)
link-guardian github https://github.com/rust-lang/rust --concurrency 1000

# Scan just the homepage
link-guardian site https://example.com
# Scan homepage + all linked pages (depth 2)
link-guardian site https://example.com --max-depth 2
# With JSON output
link-guardian site https://example.com --json
# With custom concurrency for faster checking
link-guardian site https://example.com --concurrency 1000 --max-depth 2

link-guardian --help
Commands:
github Scan a GitHub repository for broken links
site Scan a website for broken links
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
link-guardian github --help
Scan a GitHub repository for broken links in README and docs/
Usage: link-guardian github [OPTIONS] <REPO_URL>
Arguments:
<REPO_URL> GitHub repository URL (e.g., https://github.com/user/repo)
Options:
--json Output results in JSON format instead of a table
-h, --help Print help
link-guardian site --help
Scan a website for broken links
Usage: link-guardian site [OPTIONS] <WEBSITE_URL>
Arguments:
<WEBSITE_URL> Website URL to scan (e.g., https://example.com)
Options:
--json Output results in JSON format instead of a table
--max-depth <MAX_DEPTH> Maximum crawl depth (default: 1) [default: 1]
-h, --help Print help
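The subcommands and options above could be declared with clap's derive API along the following lines. This is a hedged sketch, not the project's actual cli.rs: the struct and field names are assumptions, and only the flags shown in this README are modeled.

```rust
use clap::{Parser, Subcommand};

/// Scan GitHub repositories and websites for broken links.
#[derive(Parser)]
#[command(version, about)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Scan a GitHub repository for broken links
    Github {
        /// GitHub repository URL (e.g., https://github.com/user/repo)
        repo_url: String,
        /// Output results in JSON format instead of a table
        #[arg(long)]
        json: bool,
        /// Maximum number of concurrent link checks
        #[arg(long, default_value_t = 500)]
        concurrency: usize,
    },
    /// Scan a website for broken links
    Site {
        /// Website URL to scan (e.g., https://example.com)
        website_url: String,
        /// Output results in JSON format instead of a table
        #[arg(long)]
        json: bool,
        /// Maximum crawl depth
        #[arg(long, default_value_t = 1)]
        max_depth: u32,
        /// Maximum number of concurrent link checks
        #[arg(long, default_value_t = 500)]
        concurrency: usize,
    },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Github { repo_url, .. } => println!("would scan repo {repo_url}"),
        Command::Site { website_url, .. } => println!("would scan site {website_url}"),
    }
}
```

With clap's derive macros, the doc comments become the help text, so `--help` output like the above falls out of the type definitions.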
Example output for a site scan:

🔍 Scanning website: https://example.com
🔍 Max crawl depth: 1
🔍 Crawled 1 page(s)
5 links found on https://example.com
🔍 Checking 5 unique link(s)...

URL                            STATUS        MESSAGE
=========================================================================================================
https://example.com/about      ✅ OK         HTTP 200
https://example.com/contact    ✅ OK         HTTP 200
https://example.com/old-page   🔀 REDIRECT   HTTP 301 -> /new-page
https://example.com/missing    ❌ BROKEN     HTTP 404
https://example.com/timeout    ⏱️ TIMEOUT    Request timed out

📊 Summary:
  ✅ OK: 2
  ❌ Broken: 3
  🔗 Total: 5
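Each row in that table, and each object in the `--json` output shown next, corresponds to one result value. A hedged sketch of what such a type could look like with serde; the struct and field names are inferred from the example output, not taken from the actual source:

```rust
use serde::Serialize;

/// Outcome of checking a single link; serializes to the JSON shape shown below.
#[derive(Serialize)]
struct LinkResult {
    url: String,
    /// "ok", "redirect", "broken", "timeout", ...
    status: String,
    /// Redirect target, only present for redirects.
    #[serde(skip_serializing_if = "Option::is_none")]
    redirect: Option<String>,
    message: String,
}

fn main() -> Result<(), serde_json::Error> {
    let results = vec![
        LinkResult {
            url: "https://example.com/about".into(),
            status: "ok".into(),
            redirect: None,
            message: "HTTP 200".into(),
        },
        LinkResult {
            url: "https://example.com/old-page".into(),
            status: "redirect".into(),
            redirect: Some("https://example.com/new-page".into()),
            message: "HTTP 301 -> https://example.com/new-page".into(),
        },
    ];
    // Pretty-print as a JSON array, like the `--json` output.
    println!("{}", serde_json::to_string_pretty(&results)?);
    Ok(())
}
```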
link-guardian site https://example.com --json

[
{
"url": "https://example.com/about",
"status": "ok",
"message": "HTTP 200"
},
{
"url": "https://example.com/old-page",
"status": "redirect",
"redirect": "https://example.com/new-page",
"message": "HTTP 301 -> https://example.com/new-page"
},
{
"url": "https://example.com/missing",
"status": "broken",
"message": "HTTP 404"
}
]

The exit codes make it perfect for CI/CD integration (a minimal Rust sketch follows this list):
- 0: All links are OK (success)
- 1: Broken links detected (failure)
- 2: Internal error or invalid usage
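On the Rust side, exit codes like these are typically produced with std::process::exit. A minimal sketch, with the scan result and variable names purely illustrative:

```rust
use std::process;

fn main() {
    // Imagine `scan_result` is Ok(number_of_broken_links) or Err(internal error).
    let scan_result: Result<usize, String> = Ok(2);

    match scan_result {
        Ok(0) => process::exit(0),       // all links OK
        Ok(_broken) => process::exit(1), // broken links detected
        Err(err) => {
            eprintln!("error: {err}");
            process::exit(2);            // internal error or invalid usage
        }
    }
}
```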
#!/bin/bash
# In your CI script
link-guardian github https://github.com/youruser/yourrepo
if [ $? -eq 1 ]; then
  echo "❌ Broken links detected!"
  exit 1
else
  echo "✅ All links are healthy!"
fi

Project structure:

link-guardian/
├── Cargo.toml           # Project metadata and dependencies
├── README.md            # This file
└── src/
    ├── main.rs          # Entry point, orchestrates everything
    ├── cli.rs           # Command-line parsing (clap)
    ├── checker/
    │   ├── mod.rs       # Checker module exports
    │   ├── http.rs      # HTTP link checking logic
    │   ├── markdown.rs  # Extract links from Markdown
    │   └── html.rs      # Extract links from HTML
    ├── github/
    │   ├── mod.rs       # GitHub module exports
    │   └── fetch.rs     # Fetch files from GitHub repos
    └── crawl/
        ├── mod.rs       # Crawl module exports
        └── queue.rs     # Website crawling with BFS
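That layout maps directly onto Rust modules. A tiny illustration of the idea; in the real project each module lives in its own file as listed above, while inline modules are used here only so the sketch compiles on its own:

```rust
// How the directory layout maps to Rust modules. In the real code, `mod checker;`
// in main.rs would pull in src/checker/mod.rs, which declares http, markdown, html.
mod checker {
    pub mod http {
        // Stand-in for the real checking function (e.g. an async check_link).
        pub fn placeholder() -> &'static str {
            "HTTP link checking logic lives here"
        }
    }
}

fn main() {
    // Items are addressed through their module path, mirroring the file tree.
    println!("{}", checker::http::placeholder());
}
```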
For a GitHub repository scan, link-guardian will:
- Parse the GitHub URL to extract owner/repo (a sketch of this step follows the list)
- Fetch README.md from raw.githubusercontent.com
- Parse the Markdown and extract all HTTP/HTTPS links
- Check each link concurrently (up to 500 at a time by default, configurable)
- Report results
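A hedged sketch of the first two steps: turning a repository URL into a raw.githubusercontent.com URL for README.md. The function names are illustrative, the default branch is assumed to be `main`, and error handling is simplified compared to whatever the real fetch.rs does:

```rust
use url::Url;

/// Extract (owner, repo) from a URL like https://github.com/rust-lang/rust.
fn parse_repo(repo_url: &str) -> Option<(String, String)> {
    let url = Url::parse(repo_url).ok()?;
    if url.host_str()? != "github.com" {
        return None;
    }
    let mut segments = url.path_segments()?.filter(|s| !s.is_empty());
    let owner = segments.next()?.to_string();
    let repo = segments.next()?.trim_end_matches(".git").to_string();
    Some((owner, repo))
}

/// Build the raw URL for README.md (assumes the default branch is `main`).
fn raw_readme_url(owner: &str, repo: &str) -> String {
    format!("https://raw.githubusercontent.com/{owner}/{repo}/main/README.md")
}

fn main() {
    if let Some((owner, repo)) = parse_repo("https://github.com/rust-lang/rust") {
        // -> https://raw.githubusercontent.com/rust-lang/rust/main/README.md
        println!("{}", raw_readme_url(&owner, &repo));
    }
}
```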
For a website scan, link-guardian will:
- Fetch the starting URL
- Extract all links from the HTML
- If max-depth > 1, crawl same-domain links breadth-first (sketched below)
- Collect all unique links found across all pages
- Check each link concurrently (up to 500 at a time by default, configurable)
- Report results
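A simplified, synchronous sketch of the breadth-first crawl. The `extract_links` helper is hypothetical and stands in for the real HTTP fetch plus HTML parsing, and the same-domain check is deliberately crude:

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical helper: fetch `url` and return the absolute links found on the page.
/// In the real tool this would be an HTTP GET plus HTML parsing (scraper crate).
fn extract_links(url: &str) -> Vec<String> {
    let _ = url;
    Vec::new()
}

/// Breadth-first crawl up to `max_depth`, collecting every unique link seen.
fn crawl(start: &str, max_depth: u32) -> HashSet<String> {
    let mut seen_pages = HashSet::new();
    let mut all_links = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back((start.to_string(), 1u32));
    seen_pages.insert(start.to_string());

    while let Some((page, depth)) = queue.pop_front() {
        for link in extract_links(&page) {
            all_links.insert(link.clone());
            // Only follow same-domain links (crudely checked here), and only
            // while we are still under the depth limit and haven't seen the page.
            let same_domain = link.starts_with(start);
            if depth < max_depth && same_domain && seen_pages.insert(link.clone()) {
                queue.push_back((link, depth + 1));
            }
        }
    }
    all_links
}

fn main() {
    let links = crawl("https://example.com", 2);
    println!("found {} unique link(s)", links.len());
}
```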
For each link (a sketch of this check follows the list):
- Make an HTTP HEAD request (lightweight, no body)
- Categorize the response:
  - 200-299: ✅ OK
  - 300-399: 🔀 Redirect
  - 404/410: ❌ Broken
  - Timeout: ⏱️ Timeout
  - SSL errors: 🔒 SSL Error
  - DNS errors: 🌐 DNS Error
  - Other: ⚠️ Error
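A hedged sketch of that check with reqwest and tokio, including the concurrent fan-out. The enum, status names, and the concurrency limit are illustrative; the one behavioral assumption worth noting is that automatic redirect following must be disabled so 301/302 responses can be observed at all:

```rust
use futures::stream::{self, StreamExt};
use reqwest::{redirect, Client, StatusCode};

#[derive(Debug)]
enum LinkStatus {
    Ok(StatusCode),
    Redirect(StatusCode),
    Broken(StatusCode),
    Timeout,
    Error(String),
}

async fn check_link(client: &Client, url: &str) -> LinkStatus {
    // HEAD is lightweight: headers only, no response body.
    match client.head(url).send().await {
        Ok(resp) => {
            let status = resp.status();
            if status.is_success() {
                LinkStatus::Ok(status)
            } else if status.is_redirection() {
                LinkStatus::Redirect(status)
            } else {
                LinkStatus::Broken(status)
            }
        }
        Err(err) if err.is_timeout() => LinkStatus::Timeout,
        Err(err) => LinkStatus::Error(err.to_string()),
    }
}

#[tokio::main]
async fn main() {
    // Don't follow redirects automatically, so 301/302 can be reported as such.
    let client = Client::builder()
        .redirect(redirect::Policy::none())
        .timeout(std::time::Duration::from_secs(10))
        .build()
        .expect("failed to build HTTP client");

    let urls = vec!["https://example.com", "https://example.com/missing"];

    // Check links concurrently, with at most 500 requests in flight at a time.
    let results: Vec<_> = stream::iter(urls)
        .map(|url| {
            let client = &client;
            async move { (url, check_link(client, url).await) }
        })
        .buffer_unordered(500)
        .collect()
        .await;

    for (url, status) in results {
        println!("{url}: {status:?}");
    }
}
```

One design note: some servers reject HEAD requests outright, so a real checker may fall back to a GET when HEAD fails in order to avoid false positives.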
# Run all tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run a specific test
cargo test test_check_valid_link
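The test name above is the project's own; what it actually checks is not shown here, so the following is only a hypothetical sketch of a unit test over an assumed pure helper:

```rust
/// Hypothetical helper: map an HTTP status code to a coarse category.
fn categorize(status: u16) -> &'static str {
    match status {
        200..=299 => "ok",
        300..=399 => "redirect",
        404 | 410 => "broken",
        _ => "error",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_check_valid_link() {
        // A 200 response should be reported as OK, and so on.
        assert_eq!(categorize(200), "ok");
        assert_eq!(categorize(301), "redirect");
        assert_eq!(categorize(404), "broken");
    }
}
```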
# Run without building a release binary
cargo run -- github https://github.com/rust-lang/rust
cargo run -- site https://example.com --max-depth 2

# Format code
cargo fmt
# Check for common mistakes
cargo clippy

The code is heavily commented to teach Rust concepts. Look for:
- Function-level comments: Explain what each function does
- Inline comments: Explain tricky Rust concepts
- "BEGINNER NOTES" sections: Deep dives into Rust concepts
Key Rust concepts used in this project:
- Modules: Organizing code into namespaces
- async/await: Concurrent programming for network I/O
- Result<T, E>: Type-safe error handling
- Option: Representing values that might not exist
- Ownership: Who owns data and when it's freed
- Borrowing: Temporary access to data without owning it
- Traits: Like interfaces in other languages
- Pattern matching: The `match` keyword for control flow (see the short example after this list)
- Iterators: Processing sequences of items efficiently
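For example, here is the kind of Result/Option/`match` interplay used throughout a checker like this one (illustrative only, not code from the project):

```rust
// Result models an operation that can fail; Option models a value that may be absent.
fn parse_port(input: &str) -> Result<u16, std::num::ParseIntError> {
    input.trim().parse::<u16>()
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on port {port}"),
        Err(err) => eprintln!("invalid port: {err}"),
    }

    // Option: the first URL in a possibly-empty list.
    let urls: Vec<&str> = vec![];
    if let Some(first) = urls.first() {
        println!("first URL: {first}");
    } else {
        println!("no URLs to check");
    }
}
```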
Troubleshooting: if a GitHub scan can't fetch the README:
- The repository might use `master` instead of `main` as its default branch
- The repository might not have a README.md
- Check that the URL is correct: https://github.com/owner/repo
- Some websites have invalid or expired SSL certificates
- This is reported as a "broken" link for safety
- The URL might have a redirect loop
- Default limit is 5 redirects
- GitHub's raw.githubusercontent.com has rate limits
- For heavy usage, consider using the GitHub API with authentication
- Websites might rate-limit or block rapid requests (a simple mitigation is sketched below)
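One common mitigation, and what "polite crawling with delays" generally means in practice, is to pause briefly between requests to the same host. A minimal sketch with tokio; the delay value is illustrative:

```rust
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    let pages = ["https://example.com/a", "https://example.com/b"];
    for page in pages {
        // Fetch the page here (e.g. with reqwest), then wait briefly before
        // hitting the same host again to stay polite.
        println!("fetching {page}");
        sleep(Duration::from_millis(250)).await;
    }
}
```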
Possible future improvements:
- Use the GitHub API (octocrab) for better repo access
- Colored terminal output
- Progress bars for long scans
- Configurable ignore patterns (skip certain URLs)
- Support for other platforms (GitLab, Bitbucket)
- Retry logic for transient failures
- HTML report generation
- Recursive docs/ folder scanning for GitHub repos
This is a learning project! Contributions welcome:
- Fork the repository
- Create a feature branch
- Make your changes (keep the teaching style!)
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details
Built with:
- clap - Command-line parsing
- tokio - Async runtime
- reqwest - HTTP client
- scraper - HTML parsing
- pulldown-cmark - Markdown parsing
- url - URL parsing
- serde - Serialization
- anyhow - Error handling
Made with ❤️ for Rust learners