Distributed crawling framework for documents and structured data.

Memorious

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

Funes the Memorious, Jorge Luis Borges

memorious is a distributed web scraping toolkit. It is a lightweight tool that schedules, monitors and supports scrapers that collect structured or unstructured data. It covers the following use cases:

  • Maintain an overview of a fleet of crawlers
  • Schedule crawler execution at regular intervals
  • Store execution information and error messages
  • Distribute scraping tasks across multiple machines
  • Make crawlers modular and simple tasks re-usable
  • Get out of your way as much as possible

docs/memorious-ui.png

Design

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be re-used across different crawlers.
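To make the idea of composable stages concrete, here is a minimal sketch in plain Python. It does not use the memorious API: the `run` helper, the `emit` callback, the stage names and the `example.org` URLs are all hypothetical, and the "download" is faked so the sketch runs without network access.

```python
from collections import deque

RECORDS = {}  # stands in for a database in this sketch


def paginate(emit, data):
    # Stage 1: walk a (pretend) index and emit one task per result.
    for number in range(1, data["pages"] + 1):
        emit({"url": f"https://example.org/item/{number}"})


def fetch(emit, data):
    # Stage 2: pretend to download the page; no real HTTP request is made.
    emit({"url": data["url"], "html": f"<h1>Item at {data['url']}</h1>"})


def store(emit, data):
    # Stage 3: insert or update a record keyed by URL.
    RECORDS[data["url"]] = data["html"]


# Each stage maps to (function, name of the next stage or None).
PIPELINE = {
    "paginate": (paginate, "fetch"),
    "fetch": (fetch, "store"),
    "store": (store, None),
}


def run(pipeline, start, data):
    # Hypothetical single-process runner: a queue of (stage, payload) tasks.
    queue = deque([(start, data)])
    while queue:
        name, payload = queue.popleft()
        func, next_stage = pipeline[name]

        def emit(item, _next=next_stage):
            # Hand the item to the following stage, if there is one.
            if _next is not None:
                queue.append((_next, item))

        func(emit, payload)


run(PIPELINE, "paginate", {"pages": 3})
print(sorted(RECORDS))
```

Because every stage only receives a payload and a way to emit new payloads, the same `fetch` or `store` function could be wired into a different pipeline without changes, which is the kind of re-use described above.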

The basic steps of writing a Memorious crawler:

  1. Create a YAML crawler configuration file
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat
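As a rough illustration of steps 1 and 2, the sketch below models a crawler configuration as a Python dictionary, mirroring what a YAML file might contain once loaded. The key names (`name`, `pipeline`, `method`, `handle`, `pass`) and the `example.stages:store` reference are assumptions made for this example and should be checked against the project documentation; `dangling_handles` is a hypothetical helper, not part of memorious.

```python
# Hypothetical configuration, shaped like a loaded YAML crawler file.
CRAWLER = {
    "name": "example_crawler",
    "description": "Illustrative crawler, not a real one",
    "pipeline": {
        "init": {"method": "seed", "handle": {"pass": "fetch"}},
        "fetch": {"method": "fetch", "handle": {"pass": "store"}},
        # Step 3: a custom stage would point at your own Python function.
        "store": {"method": "example.stages:store"},
    },
}


def dangling_handles(config):
    # Return (stage, rule, target) for every handler that names a missing stage.
    stages = config["pipeline"]
    problems = []
    for name, stage in stages.items():
        for rule, target in stage.get("handle", {}).items():
            if target not in stages:
                problems.append((name, rule, target))
    return problems


print(dangling_handles(CRAWLER))
```

A check like this is one way to approach step 4: it catches a stage that hands off to a name that was never defined before the crawler is run.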

Documentation

The documentation for Memorious is available at memorious.readthedocs.io. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To build the documentation, run make html inside the docs folder.

You'll find the resulting HTML files in /docs/_build/html.