new_frontera

Overview

new_frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which lets you build a large-scale online web crawler.

new_frontera takes care of the logic and policies to follow during the crawl. It stores and prioritizes links extracted by the crawler to decide which pages to visit next, and it can do so in a distributed manner.
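
To make the idea of a crawl frontier concrete, here is a minimal, self-contained sketch of one. It is not new_frontera's API: the `ToyFrontier` class, its method names, and the scores are hypothetical and only illustrate storing extracted links, skipping ones already seen, and handing back the highest-priority URLs in small batches.

```python
import heapq
from itertools import count


class ToyFrontier:
    """Hypothetical in-memory crawl frontier; not part of new_frontera."""

    def __init__(self):
        self._heap = []          # entries are (-score, insertion_order, url)
        self._seen = set()       # URLs that were already scheduled once
        self._order = count()    # tie-breaker that keeps insertion order stable

    def add_links(self, links):
        """Store (url, score) pairs; a higher score means 'visit sooner'."""
        for url, score in links:
            if url in self._seen:
                continue         # never schedule the same URL twice
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_batch(self, max_requests=2):
        """Return up to max_requests URLs, highest score first."""
        batch = []
        while self._heap and len(batch) < max_requests:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


frontier = ToyFrontier()
frontier.add_links([("https://example.com/", 1.0), ("https://example.com/about", 0.3)])
# The second call re-submits the home page; the frontier ignores the duplicate.
frontier.add_links([("https://example.com/news", 0.8), ("https://example.com/", 0.9)])

first_batch = frontier.next_batch(max_requests=2)
second_batch = frontier.next_batch(max_requests=2)
print(first_batch)
print(second_batch)
```

A real frontier also has to persist its state and share it between processes, which is where the backends and message buses listed below come in.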

Main features

Online operation: small request batches, with parsing done right after fetch.
Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
Two run modes: single process and distributed.
Built-in SQLAlchemy, Redis and HBase backends.
Built-in Apache Kafka and ZeroMQ message buses.
Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
Battle tested: our biggest deployment ran 60 spiders/strategy workers delivering 50-60M documents daily for 45 days without downtime.
Transparent data flow, making it easy to integrate custom components using Kafka.
Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
Optional use of Scrapy for fetching and parsing.
3-clause BSD license, which permits use in any commercial product.
Python 3 support.
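
The separation between backend and crawling strategy can be illustrated with a small sketch. Everything in it is hypothetical (`InMemoryBackend`, the scoring functions, and the toy link graph are not new_frontera classes): the backend only stores scored URLs, while the strategy is just a function that turns link depth into a score. Swapping that function changes the crawl order from breadth-first to a priority-based approximation of depth-first without touching the storage code.

```python
class InMemoryBackend:
    """Hypothetical storage layer: holds scored URLs, knows nothing about policy."""

    def __init__(self):
        self._queue = []

    def put(self, url, score):
        self._queue.append((score, url))

    def pop_best(self):
        """Remove and return the URL with the highest score, or None if empty."""
        if not self._queue:
            return None
        best = max(self._queue, key=lambda item: item[0])  # first one wins ties
        self._queue.remove(best)
        return best[1]


def breadth_first_score(depth):
    # Shallow pages get higher scores, so each level is finished before the next.
    return 1.0 / (depth + 1)


def depth_first_score(depth):
    # Deep pages get higher scores, so the crawl follows one branch downward first.
    return float(depth)


def crawl_order(link_graph, seed, score_fn):
    """Simulate a crawl over a dict of url -> outgoing links; return visit order."""
    backend = InMemoryBackend()
    backend.put(seed, score_fn(0))
    depth_of = {seed: 0}
    order = []
    while True:
        url = backend.pop_best()
        if url is None:
            break
        order.append(url)
        for child in link_graph.get(url, []):
            if child not in depth_of:          # schedule each page only once
                depth_of[child] = depth_of[url] + 1
                backend.put(child, score_fn(depth_of[child]))
    return order


# Toy site: a links to b and c, b links to d, c links to e.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
bfs = crawl_order(LINKS, "a", breadth_first_score)
dfs = crawl_order(LINKS, "a", depth_first_score)
print(bfs)
print(dfs)
```

Because the policy lives entirely in the scoring function, the same storage class could in principle be replaced by a database-backed one without changing either strategy.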

Installation

Development version:

$ pip install git+https://github.com/ZeroCool940711/new-frontera.git

or from PyPI:

$ pip install new-frontera

Documentation

Community

If you have any questions or want to contribute, feel free to open an issue or discussion on GitHub, or make a pull request.
