Tools for accessing, "diff"-ing, and analyzing archived web pages


A component of the EDGI Web Monitoring Project.

Overview of this component's tasks

This component is intended to hold various backend tools serving different tasks:

  1. Query external sources of captured web pages (e.g. Internet Archive, Page Freezer, Sentry), and formulate a request for importing their version and page metadata into web-monitoring-db.
  2. Provide a web service that computes the "diff" between two versions of a page in response to a query from web-monitoring-db.
  3. Query web-monitoring-db for new Changes, analyze them in an automated pipeline to assign priority and/or filter out uninteresting ones, and submit this information back to web-monitoring-db.
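As a rough illustration of tasks 2 and 3, the sketch below computes a line-level diff between two versions of a page and a crude "how much changed" score that could feed a priority filter. It uses only Python's standard `difflib` module; the function names `text_diff` and `change_ratio` are hypothetical and are not part of this package, whose diffing service may work quite differently.

```python
import difflib


def text_diff(old_text, new_text):
    """Return a unified diff (list of lines) between two page versions.

    Hypothetical helper for illustration only; not this package's API.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="version_a",
            tofile="version_b",
            lineterm="",
        )
    )


def change_ratio(old_text, new_text):
    """Return a score in [0, 1]: 0.0 means identical, 1.0 means nothing in common."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    return 1.0 - matcher.ratio()


if __name__ == "__main__":
    version_a = "Climate change is real.\nContact us."
    version_b = "Weather varies.\nContact us."
    for line in text_diff(version_a, version_b):
        print(line)
    print("change ratio:", round(change_ratio(version_a, version_b), 2))
```

A pipeline along these lines might drop changes whose score falls below some threshold and flag the rest for human review; the threshold would need tuning against real data.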

Development status

Working and Under Active Development:

  • A Python API to PageFreezer's diffing service in web_monitoring.page_freezer
  • A Python API to the Internet Archive Wayback Machine's archived webpage snapshots in web_monitoring.internetarchive
  • A Python API to the web-monitoring-db Rails app in web_monitoring.db
  • Python functions and a command-line tool for importing snapshots from PF and IA into web-monitoring-db.
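To give a feel for what querying the Wayback Machine for a page's versions involves, here is a minimal sketch built on the public CDX search endpoint. It only constructs the query URL and parses the JSON response format (a list of rows whose first row names the fields). The helper names are hypothetical and this is not the interface of `web_monitoring.internetarchive`; see that module's docstrings for the real API.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, from_date=None, to_date=None):
    """Build a CDX query URL listing captures of `url` (hypothetical helper).

    Dates are timestamp prefixes such as "2017" or "20170203".
    """
    params = {"url": url, "output": "json"}
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the CDX JSON output (header row + data rows) into dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record):
    """Return the Wayback Machine playback URL for one capture record."""
    return "https://web.archive.org/web/{timestamp}/{original}".format(**record)


if __name__ == "__main__":
    print(build_cdx_query("epa.gov", from_date="2017"))
    sample = [
        ["urlkey", "timestamp", "original"],
        ["gov,epa)/", "20170203000000", "http://epa.gov/"],
    ]
    for capture in parse_cdx_rows(sample):
        print(snapshot_url(capture))
```

Fetching the query URL (for example with `urllib.request` or `requests`) and passing the decoded JSON to `parse_cdx_rows` would complete the round trip.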

Legacy projects that may be revisited:

Installation Instructions

  1. Get Python 3.6. This package makes use of modern Python features and requires Python 3.6+. If you don't have Python 3.6, we recommend using conda to install it. (You don't need admin privileges to install or use it, and it won't interfere with any other installations of Python already on your system.)

  2. Install the package.

    pip install -r requirements.txt
    python setup.py develop
  3. Copy the script .env.example to .env and supply any local configuration info you need. (Only some of the package's functionality requires this.) Apply the configuration:

    source .env
  4. See module comments and docstrings for more usage information. Also see the command line tool wm, which is installed with the package. For help, use

    wm --help
  5. To run the tests or build the documentation, first install the development requirements.

    pip install -r dev-requirements.txt
  6. To build the docs:

    cd docs
    make html
  7. To run the tests:


    Any additional arguments are passed through to py.test.


Docker

The Dockerfile runs wm-diffing-server on port 80 in the container. To build and run:

    docker build -t processing .
    docker run -p 4000:80 processing

Point your browser or curl at http://localhost:4000.
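If you would rather check from a script that the container is answering, something like the following sketch could work. It only tests whether anything responds over HTTP at the mapped port and assumes nothing about the server's routes. The function name `server_is_up` is hypothetical, and proxies are deliberately bypassed because the target is local.

```python
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:4000", timeout=5.0):
    """Return True if an HTTP server answers at `base_url` (hypothetical helper)."""
    # Skip any configured HTTP proxy; the container is reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("diffing server reachable:", server_is_up())
```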

Code of Conduct

This repository falls under EDGI's Code of Conduct.


Contributors

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for all their contributions! See our contributing guidelines to find out how you can help.

Contributions Name
💻 ⚠️ 🚇 📖 💬 👀 Dan Allan
💻 Vangelis Banos
💻 📖 Chaitanya Prakash Bapat
💻 ⚠️ 🚇 📖 💬 👀 Rob Brackett
💻 Stephen Buckley
💻 📖 📋 Ray Cha
💻 ⚠️ Janak Raj Chadha
💻 Autumn Coleman
💻 Luming Hao
💻 Stuart Lynn
💻 Allan Pichardo
📖 📋 Matt Price
📖 Susan Tan
💻 ⚠️ Fotis Tsalampounis
📖 📋 Dawn Walker

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

License & Copyright

Copyright (C) 2017-2018 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.