A simple and efficient web crawler for Python.
- Recursively crawl web pages and extract links, starting from a root URL
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
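
To illustrate what handling relative and absolute URLs involves, here is a minimal sketch built on the standard library's `urllib.parse`. It is not necessarily how tiny-web-crawler does it internally; the helper names `resolve_link` and `is_http_url` are hypothetical and chosen only for this example.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url: str, href: str) -> str:
    """Return an absolute URL for a link found on the page at base_url.

    Hypothetical helper: urljoin leaves absolute URLs unchanged and
    resolves relative ones against base_url.
    """
    return urljoin(base_url, href)


def is_http_url(url: str) -> bool:
    """Return True only for http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    print(resolve_link("https://github.com/solutions/", "ci-cd"))
    print(resolve_link("https://github.com/solutions/ci-cd", "/about"))
    print(is_http_url("mailto:someone@example.com"))
```

A crawler would typically apply something like `resolve_link` to every `href` it extracts, then filter with a check like `is_http_url` before queueing the result.
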
Install using pip:
pip install tiny-web-crawler

Usage:
from tiny_web_crawler.crawler import Spider
root_url = 'http://github.com'
max_links = 2
crawl = Spider(root_url, max_links)
crawl.start()
# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0
crawl = Spider(root_url='https://github.com', max_links=5, max_workers=5, delay=1, verbose=False)
crawl.start()

Crawled output sample for https://github.com:
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
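
Once a crawl result is available as JSON, it can be post-processed with ordinary Python. The sketch below assumes the result is a flat dictionary that maps each crawled page to an object with a `urls` list; the function name `unique_links` and the inline `sample` data are hypothetical and only serve the example.

```python
import json

# Hypothetical sample, shaped as a flat mapping of page URL -> {"urls": [...]}
sample = json.loads("""
{
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
}
""")


def unique_links(results: dict) -> set:
    """Collect every distinct link found across all crawled pages."""
    found = set()
    for page in results.values():
        # Pages without a "urls" key contribute nothing
        found.update(page.get("urls", []))
    return found


if __name__ == "__main__":
    for link in sorted(unique_links(sample)):
        print(link)
```
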