A Python framework for building web crawlers.
What we have at the moment:
- a system designed to run a large number of network threads (e.g. 100 or 500) on one CPU core (see the gevent sketch after this list)
- a built-in feature to collect items into chunks and then process each chunk at once, e.g. a MongoDB bulk write (see the chunking sketch after this list)
- asynchronous I/O is powered by gevent
- network requests are handled with urllib3
- urllib3 is monkey-patched to extract SSL certificate details
- urllib3 is monkey-patched to skip DNS resolution when the domain's IP address is provided explicitly
- built-in stat module for counting events, built-in logging to InfluxDB
- retrying on errors
- no tests
- no documentation
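
To picture the concurrency model described above, here is a minimal sketch (not the ioweb API, just the underlying technique): many lightweight network threads implemented as gevent greenlets share one CPU core, each issuing blocking urllib3 requests that gevent's monkey patch turns into cooperative I/O. The URL list, pool size, and timeout values are placeholder assumptions.

```python
# Sketch of "many network threads on one CPU core" with gevent + urllib3.
# Not ioweb code; URLs, pool size and timeouts are placeholders.
from gevent import monkey
monkey.patch_all()  # must run before urllib3/socket are used

import gevent
import urllib3

pool = urllib3.PoolManager(maxsize=100)
urls = ['https://example.com/page/%d' % i for i in range(500)]

def fetch(url):
    # Each greenlet does a "blocking" request; gevent switches to other
    # greenlets while this one waits on the network.
    try:
        resp = pool.request('GET', url, timeout=10.0, retries=2)
        return url, resp.status
    except Exception as exc:
        return url, exc

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
for job in jobs:
    print(job.value)
```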
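
The chunking feature can be illustrated with the following sketch (again a generic illustration of the pattern, not the ioweb API): scraped items are buffered in memory and flushed to MongoDB with a single bulk write once the chunk is full. The database/collection names and chunk size are placeholder assumptions.

```python
# Sketch of the "combine items into chunks, then bulk-write" pattern.
# Not ioweb code; collection name and chunk size are placeholders.
from pymongo import MongoClient, UpdateOne

CHUNK_SIZE = 100  # assumed chunk size

client = MongoClient()
coll = client['scraper']['pages']  # placeholder database/collection

chunk = []

def add_item(item):
    # Buffer one scraped item as an upsert operation; flush the whole
    # chunk with a single bulk write once it reaches CHUNK_SIZE.
    chunk.append(UpdateOne({'_id': item['_id']}, {'$set': item}, upsert=True))
    if len(chunk) >= CHUNK_SIZE:
        flush()

def flush():
    # One network round trip per chunk instead of one per item.
    if chunk:
        coll.bulk_write(chunk, ordered=False)
        chunk.clear()

# Usage: call add_item() for every scraped page, then flush() once at the end.
```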
I use ioweb for bulk web scraping, e.g. crawling 500M pages in a few days.
- t.me/grablab - English chat about web scraping
- t.me/grablab_ru - Russian chat about web scraping