Tiny Web Crawler

A simple and efficient web crawler for Python.

Features

Recursively crawls web pages and extracts links, starting from a root URL
Supports concurrent workers and a configurable delay
Handles both relative and absolute URLs
Designed for simplicity, so it is easy to use and to extend for different crawling tasks
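
To illustrate what handling relative and absolute URLs involves, here is a minimal sketch that uses only the Python standard library. It is not taken from tiny-web-crawler's source; the helper name `resolve_link` is hypothetical, and the library's actual behavior may differ.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_link(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url; return None for non-HTTP(S) links.

    Hypothetical helper for illustration only, not part of tiny-web-crawler.
    """
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page they were found on.
    absolute = urljoin(base_url, href)
    # Skip schemes a crawler cannot fetch over HTTP, such as mailto:.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


if __name__ == "__main__":
    base = "https://github.com/solutions/"
    for href in ("ci-cd", "/about", "https://githubuniverse.com/", "mailto:a@b.c"):
        print(href, "->", resolve_link(base, href))
```

Resolving every link against the page it came from means relative paths such as `ci-cd` and root-relative paths such as `/about` both end up as full URLs that can be queued for crawling.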

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler.crawler import Spider

root_url = 'http://github.com'
max_links = 2

crawl = Spider(root_url, max_links)
crawl.start()


# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

crawl = Spider(root_url='https://github.com', max_links=5, max_workers=5, delay=1, verbose=False)
crawl.start()
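
As a rough picture of how `max_workers` and `delay` could interact, the sketch below runs a batch of stand-in fetches on a thread pool and pauses before each one. This is an assumption about the general technique, not tiny-web-crawler's implementation: `fetch_with_delay` and `crawl_batch` are hypothetical names, and no network requests are made.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def fetch_with_delay(url: str, delay: float) -> str:
    """Stand-in for a real page fetch: wait `delay` seconds, then return the URL."""
    time.sleep(delay)
    return url


def crawl_batch(urls: Iterable[str], max_workers: int = 5, delay: float = 0.5) -> List[str]:
    """Process URLs on a thread pool; each task sleeps `delay` seconds first.

    Hypothetical helper for illustration only, not part of tiny-web-crawler.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(lambda u: fetch_with_delay(u, delay), urls))


if __name__ == "__main__":
    print(crawl_batch(["https://github.com", "https://githubuniverse.com/"], delay=0))
```

With this design, more workers shorten the total crawl time, while a larger delay reduces the load placed on the target site.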

Output Format

Sample output from crawling https://github.com:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
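
The sketch below shows one way to work with a result of this kind, assuming the output is a flat JSON mapping from each crawled page to an object with a `urls` list. The sample string is abbreviated from the example above (the `"..."` placeholders are dropped), and `count_links` is a hypothetical helper, not part of tiny-web-crawler.

```python
import json

# Abbreviated sample in the assumed output shape.
sample = """
{
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
}
"""


def count_links(crawl_result: dict) -> dict:
    """Return the number of extracted links for each crawled page."""
    return {page: len(entry["urls"]) for page, entry in crawl_result.items()}


if __name__ == "__main__":
    for page, total in count_links(json.loads(sample)).items():
        print(f"{page}: {total} links")
```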
