Fast, lightweight Python library for crawling websites and converting HTML to Markdown. Perfect for documentation extraction, content migration, and offline reading.
docu-crawler is a production-ready Python web crawler that extracts, converts, and stores web content efficiently. Crawl documentation sites, migrate content, and create offline documentation with minimal dependencies.
Current Version: 1.1.0 - See CHANGELOG.md for latest updates and bug fixes.
docu-crawler is a specialized Python web crawler library designed for:
- Web Crawling: Systematically crawl websites while respecting robots.txt
- HTML to Markdown Conversion: Convert HTML pages to clean, readable Markdown format
- Content Extraction: Extract and preserve website structure and content
- Multi-Cloud Storage: Store crawled content locally or in cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage, SFTP)
- Minimal Dependencies: Only requires `requests` and `beautifulsoup4` for core functionality
- Easy to Use: Simple Python API and CLI interface
- Production Ready: Built-in retry logic, rate limiting, and comprehensive error handling
- Extensible: Plugin-based storage system, easy to add custom backends
- Cross-Platform: Works on Linux, Windows, and macOS
- Robust: Extensive bug fixes and improvements in v1.1.0 for stability and reliability
- Lightweight: Minimal dependencies (only `requests` and `beautifulsoup4` required)
- Multi-Cloud Storage: Support for local filesystem, AWS S3, Google Cloud Storage, Azure Blob Storage, and SFTP
- Flexible API: Use as a Python library or CLI tool
- HTML to Markdown: Intelligent conversion preserving structure and formatting
- Robots.txt Support: Respects robots.txt and crawl-delay directives
- Sitemap Support: Automatically detects and crawls pages from sitemap.xml
- Performance: Configurable rate limiting and retry logic
- Cross-Platform: Works on Linux, Windows, and macOS
- Optional Dependencies: Install only what you need
```bash
pip install docu-crawler
```

This installs only the core dependencies (`requests` and `beautifulsoup4`).

```bash
pip install docu-crawler[yaml] # YAML config file support
pip install docu-crawler[s3] # AWS S3 storage
pip install docu-crawler[gcs] # Google Cloud Storage
pip install docu-crawler[azure] # Azure Blob Storage
pip install docu-crawler[sftp] # SFTP storage
pip install docu-crawler[all] # Install everything
```

```python
from docu_crawler import crawl_to_local
result = crawl_to_local("https://docs.example.com", output_dir="my_docs")
print(f"Crawled {result['pages_crawled']} pages")docu-crawler https://docs.example.com --output my-docs --delay 2 --max-pages 100 --single-file --frontmatterfrom docu_crawler import DocuCrawler
crawler = DocuCrawler(
start_url="https://docs.example.com",
output_dir="downloaded_docs",
delay=1.0,
max_pages=100
)
crawler.crawl()
```

```python
from docu_crawler import crawl_to_s3, crawl_to_gcs, crawl_to_azure, crawl_to_sftp
# AWS S3
crawl_to_s3(url="https://docs.example.com", bucket="my-bucket", region="us-east-1")
# Google Cloud Storage
crawl_to_gcs(url="https://docs.example.com", bucket="my-bucket", project="my-project")
# Azure Blob Storage
crawl_to_azure(url="https://docs.example.com", container="my-container")
# SFTP
crawl_to_sftp(url="https://docs.example.com", host="sftp.example.com", user="username")
```

```python
from docu_crawler import crawl
def on_page_crawled(url, page_count):
    print(f"Page {page_count}: {url}")

def on_error(url, error):
    print(f"Error crawling {url}: {error}")
result = crawl(
url="https://docs.example.com",
output_dir="my_docs",
on_page_crawled=on_page_crawled,
on_error=on_error
)
```

Note: Callbacks are now passed directly to the crawler constructor (v1.1.0+), providing better error handling and reliability.
No additional dependencies required. Files are saved to the specified output directory.
```python
from docu_crawler import crawl_to_local
result = crawl_to_local("https://docs.example.com", output_dir="docs")
```

Requires: pip install docu-crawler[s3]

```python
from docu_crawler import crawl_to_s3
result = crawl_to_s3(
url="https://docs.example.com",
bucket="my-bucket",
region="us-east-1"
)
```

Credentials via: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), AWS credentials file, or IAM roles.

Requires: pip install docu-crawler[gcs]

```python
from docu_crawler import crawl_to_gcs
result = crawl_to_gcs(
url="https://docs.example.com",
bucket="my-bucket",
project="my-project"
)
```

Credentials via: GOOGLE_APPLICATION_CREDENTIALS environment variable or service account key file.

Requires: pip install docu-crawler[azure]

```python
from docu_crawler import crawl_to_azure
result = crawl_to_azure(
url="https://docs.example.com",
container="my-container"
)
```

Credentials via: AZURE_STORAGE_CONNECTION_STRING environment variable or connection string parameter.

Requires: pip install docu-crawler[sftp]

```python
from docu_crawler import crawl_to_sftp
result = crawl_to_sftp(
url="https://docs.example.com",
host="sftp.example.com",
user="username",
password="password" # or key_file="/path/to/key"
)
```

Main crawler class for programmatic use.

```python
from docu_crawler import DocuCrawler
crawler = DocuCrawler(
start_url: str, # Starting URL (required, must be HTTP/HTTPS)
output_dir: str = "downloaded_docs", # Output directory for local storage
delay: float = 1.0, # Delay between requests in seconds
max_pages: int = 0, # Maximum pages to crawl (0 = unlimited)
timeout: int = 10, # Request timeout in seconds
storage_config: Optional[Dict[str, Any]] = None, # Storage configuration dict
single_file: bool = False, # Combine all output into one file
on_page_crawled: Optional[Callable] = None, # Callback(url, page_count) on success
on_error: Optional[Callable] = None # Callback(url, error) on failure
)
crawler.crawl()
```

Parameters:
- `start_url` (str): Starting URL for crawling. Must be a valid HTTP or HTTPS URL.
- `output_dir` (str): Directory where files will be saved (default: "downloaded_docs")
- `delay` (float): Delay between requests in seconds (default: 1.0, must be >= 0)
- `max_pages` (int): Maximum number of pages to crawl. 0 means unlimited (default: 0, must be >= 0)
- `timeout` (int): Request timeout in seconds (default: 10, must be > 0)
- `storage_config` (dict, optional): Storage configuration dictionary. If None, uses local storage.
- `single_file` (bool): Whether to combine all crawled content into a single `documentation.md` file (default: False)
- `on_page_crawled` (callable, optional): Callback function called when a page is successfully crawled. Signature: `(url: str, page_count: int) -> None`
- `on_error` (callable, optional): Callback function called when an error occurs. Signature: `(url: str, error: Exception) -> None`
Raises:
- `InvalidURLError`: If `start_url` is invalid
- `ValueError`: If other parameters are invalid
Methods:
- `crawl()`: Start the crawling process. Saves converted Markdown files to the configured storage.
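Putting these together, here is a short sketch (using only the parameters documented above) that combines all output into one file and reports progress through the constructor callbacks:

```python
from docu_crawler import DocuCrawler

def log_page(url, page_count):
    # Invoked after each page is successfully crawled and stored
    print(f"[{page_count}] {url}")

def log_error(url, error):
    # Invoked when fetching or converting a page fails
    print(f"Failed: {url} ({error})")

crawler = DocuCrawler(
    start_url="https://docs.example.com",
    output_dir="combined_docs",
    delay=1.0,
    max_pages=50,
    single_file=True,  # everything ends up in a single documentation.md
    on_page_crawled=log_page,
    on_error=log_error,
)
crawler.crawl()
```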
All convenience functions return a dictionary with crawl results:
```python
{
'pages_crawled': int, # Number of pages successfully crawled
'pages_failed': int, # Number of pages that failed
'urls_visited': int, # Total number of URLs visited
'bytes_downloaded': int, # Total bytes downloaded
'elapsed_time': float # Total elapsed time in seconds
}
```

General-purpose crawl function with full control.

```python
from docu_crawler import crawl
result = crawl(
url: str, # Starting URL
output_dir: str = "downloaded_docs", # Output directory
delay: float = 1.0, # Delay between requests
max_pages: int = 0, # Max pages (0 = unlimited)
timeout: int = 10, # Request timeout
storage_config: Optional[Dict] = None, # Storage config
on_page_crawled: Optional[Callable] = None, # Callback(url, page_count)
on_error: Optional[Callable] = None # Callback(url, error)
)
```

Callbacks:
- `on_page_crawled(url: str, page_count: int)`: Called when a page is successfully crawled
- `on_error(url: str, error: Exception)`: Called when an error occurs
Note: In v1.1.0+, callbacks are passed directly to DocuCrawler constructor for improved reliability and error handling.
Crawl website and save to local filesystem.
```python
from docu_crawler import crawl_to_local
result = crawl_to_local(
url: str,
output_dir: str = "downloaded_docs",
**kwargs # Additional crawl() parameters
)
```

Crawl website and save to AWS S3.

```python
from docu_crawler import crawl_to_s3
result = crawl_to_s3(
url: str,
bucket: str, # S3 bucket name (required)
region: Optional[str] = None, # AWS region
**kwargs # Additional crawl() parameters
)
```

Note: Requires boto3. Install with pip install docu-crawler[s3]

Crawl website and save to Google Cloud Storage.

```python
from docu_crawler import crawl_to_gcs
result = crawl_to_gcs(
url: str,
bucket: str, # GCS bucket name (required)
project: Optional[str] = None, # GCP project ID
credentials: Optional[str] = None, # Path to credentials JSON
**kwargs # Additional crawl() parameters
)
```

Note: Requires google-cloud-storage. Install with pip install docu-crawler[gcs]

Crawl website and save to Azure Blob Storage.

```python
from docu_crawler import crawl_to_azure
result = crawl_to_azure(
url: str,
container: str, # Azure container name (required)
connection_string: Optional[str] = None, # Azure connection string
**kwargs # Additional crawl() parameters
)
```

Note: Requires azure-storage-blob. Install with pip install docu-crawler[azure]

Crawl website and save via SFTP.

```python
from docu_crawler import crawl_to_sftp
result = crawl_to_sftp(
url: str,
host: str, # SFTP server hostname (required)
user: str, # SFTP username (required)
password: Optional[str] = None, # SFTP password
port: int = 22, # SFTP port
key_file: Optional[str] = None, # Path to SSH key file
remote_path: str = '', # Base remote path
**kwargs # Additional crawl() parameters
)
```

Note: Requires paramiko. Install with pip install docu-crawler[sftp]
The library defines custom exceptions:
- `DocuCrawlerError`: Base exception for all docu-crawler errors
- `InvalidURLError`: Raised when an invalid URL is provided
- `ContentTooLargeError`: Raised when content exceeds the maximum size limit
- `ConfigurationError`: Raised when there's a configuration error
- `StorageError`: Raised when there's a storage operation error
- `CrawlerError`: Raised when there's a crawler operation error
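Since `DocuCrawlerError` is the base for all library errors, a broad handler can catch anything the crawler raises. A minimal sketch, assuming the exception classes are importable from the top-level `docu_crawler` package (adjust the import path if your installation exposes them in a submodule):

```python
# Minimal error-handling sketch.
# Assumption: exception classes are exported from the top-level package.
from docu_crawler import crawl_to_local, DocuCrawlerError, InvalidURLError

try:
    result = crawl_to_local("https://docs.example.com", output_dir="docs")
except InvalidURLError as exc:
    print(f"Bad start URL: {exc}")
except DocuCrawlerError as exc:
    # The base class also covers StorageError, CrawlerError, ConfigurationError, etc.
    print(f"Crawl failed: {exc}")
else:
    print(f"Crawled {result['pages_crawled']} pages in {result['elapsed_time']:.1f}s")
```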
- AWS S3: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`
- GCS: `GOOGLE_APPLICATION_CREDENTIALS`
- Azure: `AZURE_STORAGE_CONNECTION_STRING`
- SFTP: `SFTP_PASSWORD`, `SFTP_KEY_FILE`
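For example, cloud credentials can be supplied through the environment before starting a crawl. A minimal sketch for S3 (the values below are placeholders, not real credentials):

```python
import os

# Placeholders only: in practice, inject these from your shell or secret manager
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

from docu_crawler import crawl_to_s3

result = crawl_to_s3(url="https://docs.example.com", bucket="my-bucket")
```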
Create crawler_config.yaml:
```yaml
url: https://docs.example.com
output: downloaded_docs
delay: 1.0
max_pages: 0
timeout: 10
log_level: INFO
storage_type: s3
s3_bucket: my-bucket
s3_region: us-east-1
```

Config file locations (checked in order):
1. `./crawler_config.yaml`
2. `./config/crawler_config.yaml`
3. `~/.config/docu-crawler/config.yaml`
4. `/etc/docu-crawler/config.yaml`
docu-crawler automatically detects and uses sitemaps if:
- The start URL ends with `.xml` (e.g., https://example.com/sitemap.xml)
- The URL contains "sitemap"
This ensures comprehensive coverage of all documentation pages.
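For example, pointing the crawler directly at a sitemap URL (a small sketch using the `crawl_to_local` helper shown earlier):

```python
from docu_crawler import crawl_to_local

# A start URL ending in .xml (or containing "sitemap") triggers sitemap-based discovery
result = crawl_to_local(
    "https://docs.example.com/sitemap.xml",
    output_dir="docs_from_sitemap",
)
print(f"Crawled {result['pages_crawled']} pages discovered via the sitemap")
```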
Use the --frontmatter flag (CLI) or include_frontmatter=True (Python) to add metadata to each Markdown file:
```yaml
---
title: "Page Title"
source: "https://example.com/page"
date: 2023-10-27
---
```

This is essential for RAG (Retrieval-Augmented Generation) systems and LLM indexing.
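On the Python side, a sketch of the same option; `include_frontmatter` is named above as the Python flag, but exactly which call accepts it as a keyword argument is an assumption here:

```python
from docu_crawler import crawl_to_local

# Assumption: include_frontmatter is forwarded as a keyword argument to the crawl helpers
result = crawl_to_local(
    "https://docs.example.com",
    output_dir="docs_with_metadata",
    include_frontmatter=True,
)
```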
Automatically respects robots.txt files and crawl-delay directives.
Built-in rate limiting to respect server limits. The crawler automatically uses rate limiting based on the delay parameter and robots.txt crawl-delay directives. For custom rate limiting:
```python
from docu_crawler.utils.rate_limiter import SimpleRateLimiter
limiter = SimpleRateLimiter(delay=1.0) # 1 second delay between requests
limiter.wait_if_needed(domain="example.com")
```

Automatic retry with exponential backoff:

```python
from docu_crawler.utils.retry import retry_with_backoff
@retry_with_backoff(max_retries=3, initial_delay=1.0)
def fetch_url(url):
    pass
```

Crawl and convert documentation websites to Markdown for offline reading or migration.
Migrate content from one platform to another by converting HTML to Markdown.
Create offline versions of online documentation for local access.
Extract and analyze web content programmatically.
```bash
pip install docu-crawler
```

For all features: pip install docu-crawler[all]
Python 3.9 and above.
Yes! Local filesystem storage is the default and requires no additional dependencies.
Yes, docu-crawler automatically respects robots.txt files and crawl-delay directives.
Yes, the HTML processor can be extended and customized for your specific needs.
docu-crawler is designed for efficiency with configurable rate limiting and retry logic.
Import Error: Install docu-crawler: pip install docu-crawler
Storage Backend Error: Install the required backend: pip install docu-crawler[s3]
YAML Config Error: Install YAML support: pip install docu-crawler[yaml]
SSL Certificate Error: The crawler verifies SSL certificates by default; SSL errors usually indicate an invalid or expired certificate on the target site.
InvalidURLError: Ensure the URL starts with http:// or https://
ContentTooLargeError: The default maximum content size is 10MB. Large files are skipped to prevent memory issues.
Connection Errors: Check your network connection and ensure the target website is accessible. The crawler will retry failed requests up to 3 times.
Encoding Errors: v1.1.0+ includes improved encoding handling with automatic UTF-8/Latin-1 fallback for better compatibility with international sites.
Sitemap Parsing Issues: v1.1.0+ includes recursion protection and depth limits to prevent infinite loops when parsing nested sitemaps.
- JavaScript-rendered content: The crawler does not execute JavaScript, so content rendered dynamically by JavaScript will not be captured.
- Authentication: The crawler does not support authenticated sessions or login forms.
- Rate limiting: The crawler respects robots.txt and applies rate limiting, but some sites may still block aggressive crawling.
- Large files: Files larger than 10MB are skipped to prevent memory issues.
- Content types: Only HTML content is processed. Binary files, images, PDFs, etc. are skipped.
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Releases are automated via GitHub Actions. See RELEASE.md for details on how releases work.
Bug Fixes:
- Fixed 14+ critical bugs including NameError, IndexError, AttributeError, and KeyError issues
- Improved encoding handling for robots.txt and sitemap parsing
- Fixed type signature mismatches across storage backends
- Enhanced error handling and exception propagation
Improvements:
- Better callback mechanism (passed directly to constructor)
- Improved code comments and documentation
- Enhanced recursion protection for nested sitemap parsing
- Better URL validation and CLI argument handling
See CHANGELOG.md for complete version history.
MIT License - see LICENSE file for details.
- GitHub: https://github.com/dataiscool/docu-crawler
- Issues: https://github.com/dataiscool/docu-crawler/issues
- PyPI: https://pypi.org/project/docu-crawler/
- Scrapy - Full-featured web scraping framework
- Beautiful Soup - HTML parsing library
- Markdownify - HTML to Markdown converter