EDGI: Web Monitoring Project
- Loading, storing, and analyzing historical snapshots of web pages
- Providing an API for retrieving and updating data about those snapshots
- A website for visualizing and browsing changes between those snapshots
- Tools for managing the workflow of a team of human analysts who use the above tools to track and publicize information about meaningful changes to government websites.
EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.
This project and its associated efforts are already monitoring tens of thousands of government web pages. But we aspire for larger impact, eventually monitoring tens of millions or more. Currently, there is a lot of manual labor that goes into reviewing all changes, regardless of whether they are meaningful or not. Any system will need to emphasize usability of the UI and efficiency of computational resources.
- Project Structure
- Get Involved
- Project Overview
- Code of Conduct
- Contributors & Sponsors
- License & Copyright
The technical tooling for Web Monitoring is broken up into several repositories, each focused on one part of the system:
| Repo | Description | Languages/Tools |
|------|-------------|-----------------|
| `web-monitoring` | (This repo!) Project-wide documentation and issue tracking. | Markdown |
| `web-monitoring-db` | A database and API that stores metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| `web-monitoring-processing` | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| `web-monitoring-diff` | Algorithms for diffing web pages in a variety of ways, plus a web server that provides those diffs via an HTTP API. | Python, Tornado |
| `web-monitoring-versionista-scraper` | A set of Node.js scripts that extract data from Versionista and load it into web-monitoring-db. It also generates the CSV files that analysts currently use to manage their work on a weekly basis. | Node.js |
| `web-monitoring-ops` | Server configuration and other deployment information for managing EDGI's live instance of all these tools. | Kubernetes, Bash, AWS |
| `wayback` | A Python API to the Internet Archive's Wayback Machine, with tools to search for and load mementos (historical copies of web pages). | Python |
For more on how all these parts fit together, see ARCHITECTURE.md.
We’d love your help on improving this project! If you are interested in getting involved…
- Chat with us on Slack (https://archivers.slack.com)
- You can sign up for an account at https://archivers-slack.herokuapp.com/
- Join us in the project's Slack channel
- Please follow EDGI's Code of Conduct
This project has two parts! We rely both on open-source code contributors (who build these tools) and on volunteer analysts (who use them to identify and characterize changes to government websites).
Get involved as an analyst
- Read through the Project Overview and especially the section on "meaningful changes" to get a better idea of the work
- Contact us either over Slack or at firstname.lastname@example.org to ask for a quick training
Get involved as a programmer
- Be sure to check our contributor guidelines
- Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset
- Try to get the repo running on your machine (and if you have any challenges, please make issues about them!)
- Find an issue labeled `good-first-issue` and work to resolve it
The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. To do so, the system (a.k.a. Scanner) performs several major tasks:
- Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
- Imports those snapshots and other metadata from archival sources.
- Determines which snapshots represent a change from a previous version of the page.
- Processes changes to automatically assign a priority or sift out meaningful changes for deeper analysis by humans.
- Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists by writing reports.
- Journalists build narratives and amplify stories for the wider public.
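As a rough illustration of the change-detection step above, each new snapshot can be compared against the previous version of the same page. This is a minimal sketch using Python's standard `difflib`, not the project's actual diff algorithms (those live in `web-monitoring-diff`); the function and snapshot strings are hypothetical.

```python
import difflib
import hashlib


def snapshot_hash(body: str) -> str:
    """Fingerprint a snapshot so identical captures can be skipped cheaply."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def summarize_change(old_body: str, new_body: str) -> dict:
    """Report whether a snapshot differs from its predecessor, with a unified diff."""
    if snapshot_hash(old_body) == snapshot_hash(new_body):
        return {"changed": False, "diff": ""}
    diff = "\n".join(difflib.unified_diff(
        old_body.splitlines(), new_body.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
    return {"changed": True, "diff": diff}


old = "Climate change impacts coastal communities."
new = "Climate change affects coastal communities."
print(summarize_change(old, new)["changed"])  # True
```

In a real pipeline the cheap hash comparison would run first so that unchanged captures never reach the (much more expensive) diffing and human-review stages.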
Identifying "Meaningful Changes"
The majority of changes to web pages are not relevant and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes in an automated way is not easy, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not.
However, as we expand the number of web pages we monitor, we definitely need to develop tools to reduce the number of pages that analysts must look at.
Some examples of meaningless changes:
- It's not unusual for a page to have a view counter at the bottom. In this case, the page changes, by definition, every time you view it.
- many sites have "content sliders" or news feeds that update periodically. This change may be "meaningful", in that it's interesting to see news updates. But it's only interesting once, not (as is sometimes seen) 1000 or 10000 times.
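The view-counter example suggests one cheap automated filter: blank out fragments known to be dynamic before comparing snapshots. The sketch below is hypothetical, not the project's actual filtering logic; the regex patterns are assumptions about what counters and timestamps might look like.

```python
import hashlib
import re

# Hypothetical patterns for content that changes on every request.
DYNAMIC_PATTERNS = [
    re.compile(r"Views?:\s*\d+", re.IGNORECASE),  # view counters
    re.compile(r"\b\d{2}:\d{2}:\d{2}\b"),         # embedded clock timestamps
]


def normalized_hash(body: str) -> str:
    """Hash a snapshot after removing fragments known to be dynamic."""
    for pattern in DYNAMIC_PATTERNS:
        body = pattern.sub("", body)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def worth_review(old_body: str, new_body: str) -> bool:
    """Treat a change as worth human review only if it survives normalization."""
    return normalized_hash(old_body) != normalized_hash(new_body)


print(worth_review("Our mission. Views: 41", "Our mission. Views: 42"))  # False
print(worth_review("Our mission.", "Our new mission."))                  # True
```

A pattern list like this would need to grow per-site, which is exactly why fully automated filtering is hard and analysts stay in the loop.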
An example of a meaningful change:
- In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting because while "impact" and "effect" have similar meanings, "impact" is a stronger word. So, there is an effort being made to weaken the language on existing sites. Our question is in part: what tools would we need in order to have this change flagged by our tools and presented to the analyst as potentially interesting?
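A change like the "impact" → "effect" replacement could in principle be surfaced automatically by counting repeated word-level substitutions across a page diff. The sketch below is speculative: it uses `difflib.SequenceMatcher` opcodes over word lists, not any algorithm from `web-monitoring-diff`, and the threshold is an arbitrary assumption.

```python
import difflib
from collections import Counter


def substitution_counts(old_text: str, new_text: str) -> Counter:
    """Count word-for-word replacements between two versions of a page."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    subs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length 'replace' blocks are clean word-for-word swaps.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for old_w, new_w in zip(old_words[i1:i2], new_words[j1:j2]):
                subs[(old_w, new_w)] += 1
    return subs


def flag_systematic(old_text: str, new_text: str, threshold: int = 2) -> dict:
    """Flag substitutions that recur often enough to look deliberate."""
    return {pair: n for pair, n in substitution_counts(old_text, new_text).items()
            if n >= threshold}


old = "The impact of emissions. Measuring impact matters."
new = "The effect of emissions. Measuring effect matters."
print(flag_systematic(old, new))  # {('impact', 'effect'): 2}
```

A recurring substitution on one page is a weak signal; the same pair recurring across many pages of one agency's site is the kind of pattern an analyst would want flagged.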
The `example-data` folder contains examples of website changes to use for analysis.
Code of Conduct
This repository falls under EDGI's Code of Conduct.
This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!
(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)
Sponsors & Partners
Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:
- The David and Lucile Packard Foundation
- Doris Duke Charitable Foundation
- Amazon Web Services
- Google Cloud Platform
- Google Summer of Code
- The Internet Archive
License & Copyright
Copyright (C) 2017-2020 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation in this repository is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.
Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.