Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Latest commit 0c8f4d4 Oct 12, 2018


EDGI: Web Monitoring Project

Environmental Data & Governance Initiative (EDGI) is an international network of academics and non-profits addressing potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce them.

Website Monitoring is an EDGI project aspiring to build tools and community around monitoring changes to government websites, both environment-related and otherwise.

This repository is for project-wide documentation and issues-tracking.

This project and its associated efforts already monitor tens of thousands of government web pages, but we aspire to a larger impact, eventually monitoring tens of millions or more. Currently, reviewing changes takes a lot of manual labor, regardless of whether the changes are meaningful. Any system will need to emphasize usability of the UI and efficient use of computational resources.

You can track upcoming releases by exploring our milestones.

🔨 Technologies Used

To help newcomers understand where they might be able to contribute, these are the main tools that we use or are actively discussing within this project. The list includes software in the application as well as platforms that we rely on.

And don't worry -- you definitely don't need to know all of them!

  • HTML & CSS.
  • SASS.
  • Ruby on Rails. A web application development framework used for the API server.
  • JavaScript. Notable packages include:
    • Webpack. A static module bundler for modern JavaScript applications.
    • ReactJS. A JavaScript library for building user interfaces.
  • Python. Notable packages include:
    • Beautiful Soup. A library for pulling data out of HTML and XML files.
    • Tornado. A web server and application framework.
  • PostgreSQL. A powerful, open source object-relational database system.
  • Redis. An open source, in-memory data structure store, used as a database, cache and message broker.
  • Swagger. A spec and framework for API developer tools.
  • Heroku. A platform for easily deploying applications.
  • Ansible. An open source automation platform for software configuration.
  • Docker. Runs software in "containers" built from images, which makes running software simpler for developers.
  • Versionista. Enterprise tool for webpage change detection and alerts.
  • Internet Archive. A nonprofit-led digital library of Internet website history going back 20+ years.
  • Amazon Web Services (AWS). A hosted cloud services platform for servers, databases, file storage, etc.
  • Sentry. A hosted error-tracking service that happens to be open source.

Project Goals

The purpose of the system is to enable analysts to quickly review monitored government websites in order to report on meaningful changes. The Website Monitoring automated system, a.k.a. Scanner, aims to make these changes easy to track, review, and report on.

Broadly speaking:

  1. Scanner receives periodic scrapes of target websites from archival sources.
  2. (Not yet implemented) Scanner processes data to sift out meaningful changes for volunteer analysts.
  3. Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists by writing reports.
  4. Journalists build narratives and amplify stories for the wider public.

How to Help

The best way to get involved is to take a run through our onboarding process, for which we rely on Trello. It's designed to be self-directed, so you can run through it at your own pace. But don't worry -- along the way, it will introduce you to the humans of EDGI's Web Monitoring project! Yay humans!


We are currently revamping that process, so check back soon for a link. In the meantime, these developer onboarding videos, though long, will be useful.

Developer Orientation

Architecture Overview

Where we work

  • Say hi on our chat!
    • Create an account on our Slack team.
    • Join us in the #webmonitoring channel.
  • Attend a software development call.
    • Join our call, every Wednesday at 12pm ET.
    • A Zoom account is optional. See the link above for details.
    • We keep notes for all meetings.

Project Overview

Use Case

  1. Access captured data (starting with HTML, later encompassing more types) from multiple archival sources including Versionista and the Internet Archive.
  2. Compare versions of the same page over time, potentially using multiple different strategies.
  3. Automatically filter out "nonmeaningful" or repetitive changes: for example, an updated "Page Last Viewed" timestamp, or the same news article added to 100 pages on the same website.
  4. Prioritize the changes most likely to be "meaningful," meaning that some item of importance to fact-based governance was deleted or changed in a harmful way.
  5. Present changes to human analysts with useful visualizations and statistics to help them differentiate meaningful changes. Each user will have been assigned a "subdomain", a full or partial government domain that has been identified as relevant to fact-based governance.
  6. Collect annotations from the analysts. Use this to flag changes for special attention from EDGI administrators. Also, use it to feed back into the filtering and prioritization process.
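
As a rough illustration of step 2, the sketch below compares the visible text of two captured versions of a page. It is a minimal example under stated assumptions, not the project's actual diffing code: the names `visible_text_lines` and `changed_lines` are hypothetical, and it uses only the Python standard library (`html.parser` and `difflib`) rather than any particular strategy the project has adopted.

```python
import difflib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so layout-only edits do not show up.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text_lines(html):
    """Return the page's visible text as a list of normalized chunks."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def changed_lines(old_html, new_html):
    """Return (removed, added) text chunks between two versions of a page."""
    old = visible_text_lines(old_html)
    new = visible_text_lines(new_html)
    removed, added = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return removed, added
```

Comparing text chunks instead of raw markup is only one possible strategy. It ignores changes to links, attributes, and scripts, which a fuller comparison would need to handle separately.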

Architecture

See ARCHITECTURE.md

Identifying "Meaningful Changes"

The vast majority of changes to web pages are not relevant to analysts and we want to avoid presenting those irrelevant changes to analysts at all. It is, of course, not trivial to identify "meaningful" changes immediately, and we expect that analysts will always be involved in a decision about whether some changes are "important" or not. However, as we expand from 10^4 to 10^7 web pages, we need to drastically reduce the number of pages that analysts look at.

Some examples of meaningless changes:

  • It's not unusual for a page to have a view counter at the bottom. In this case, the page changes by definition every time you view it.
  • Many sites have "content sliders" or news feeds that update periodically. This change may be "meaningful", in that it's interesting to see news updates. But it's only interesting once, not (as is sometimes seen) 1000 or 10000 times.
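
The two cases above can be sketched as simple filters. This is an illustrative example under stated assumptions, not an implemented part of Scanner: the function names and the `(url, removed, added)` tuple shape are hypothetical, and treating any digits-only difference as a counter or timestamp is a deliberately crude heuristic that would also hide genuine numeric edits.

```python
import re
from collections import defaultdict

# Assumption: counters and timestamps differ only in their digit runs.
_DIGITS = re.compile(r"\d+")


def is_counter_only_change(removed, added):
    """True when two lines differ only in their digits (e.g. a view counter)."""
    if removed == added:
        return False
    return _DIGITS.sub("#", removed) == _DIGITS.sub("#", added)


def group_repeated_changes(changes):
    """Group identical changes so each one is reviewed once.

    `changes` is an iterable of (url, removed, added) tuples. Returns a dict
    mapping (removed, added) to the list of URLs where that change occurred,
    with counter-only changes dropped.
    """
    groups = defaultdict(list)
    for url, removed, added in changes:
        if is_counter_only_change(removed, added):
            continue
        groups[(removed, added)].append(url)
    return dict(groups)
```

With this grouping, a news item added to many pages appears as a single entry with a long list of URLs, so an analyst could review it once instead of once per page.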

An example of a meaningful change:

  • In February, we noticed a systematic replacement of the word "impact" with the word "effect" on one website. This change is very interesting because while "impact" and "effect" have similar meanings, "impact" is a stronger word. So, there is an effort being made to weaken the language on existing sites. Our question is in part: what tools would we need in order to have this change flagged by our tools and presented to the analyst as potentially interesting?
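
One way such a pattern might be surfaced is to count one-for-one word replacements across many pages and flag pairs that recur. The sketch below is a hypothetical illustration using Python's standard `difflib`, not an existing Scanner feature. The function names and the `min_pages` threshold are assumptions made for this example.

```python
import difflib
import re
from collections import Counter


def word_substitutions(old_text, new_text):
    """Yield (old_word, new_word) pairs for one-for-one word replacements."""
    old_words = re.findall(r"[A-Za-z']+", old_text.lower())
    new_words = re.findall(r"[A-Za-z']+", new_text.lower())
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only keep replacements that swap the same number of words.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            yield from zip(old_words[i1:i2], new_words[j1:j2])


def systematic_substitutions(page_versions, min_pages=2):
    """Return substitutions seen on at least `min_pages` pages.

    `page_versions` is an iterable of (old_text, new_text) pairs, one per
    page. The result maps (old_word, new_word) to the number of pages on
    which that substitution appeared.
    """
    pages_per_pair = Counter()
    for old_text, new_text in page_versions:
        # Use a set so a pair counts once per page, however often it recurs.
        pages_per_pair.update(set(word_substitutions(old_text, new_text)))
    return {pair: n for pair, n in pages_per_pair.items() if n >= min_pages}
```

A substitution that appears on a single page is probably an ordinary edit. The same pair recurring across a site is the kind of signal an analyst might want flagged, though deciding whether it matters would still be a human judgment.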

Sample Data

example-data contains examples of website changes:

  • falsepos-... files are cases of nonmeaningful changes that any filter should catch
  • truepos... files are cases of changes we care about

This is a small but illustrative sample. Many more samples will be made available as soon as possible.

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributing

Don't forget to check out the "How To Help" section above.

See our contributor guidelines.

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!

Contributions Name
🔢 Chris Amoss
🔢 📋 🤔 Maya Anjur-Dietrich
🔢 Marcy Beck
🔢 📋 🤔 Andrew Bergman
🔢 Madelaine Britt
🔢 Ed Byrne
🔢 Morgan Currie
🔢 Justin Derry
🔢 📋 🤔 Gretchen Gehrke
🔢 Jon Gobeil
🔢 Pamela Jao
🔢 Sara Johns
🔢 Abby Klionski
🔢 Katherine Kulik
🔢 Aaron Lamelin
🔢 📋 🤔 Rebecca Lave
🔢 Eric Nost
📖 Karna Patel
🔢 Lindsay Poirier
🔢 📋 🤔 Toly Rinberg
🔢 Justin Schell
🔢 Lauren Scott
🤔 🔍 Nick Shapiro
🔢 Miranda Sinnott-Armstrong
🔢 Julia Upfal
🔢 Tyler Wedrosky
🔢 Adam Wizon
🔢 Jacob Wylie

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

Sponsors & Partners

Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:

License & Copyright

Copyright (C) 2017 Environmental Data and Governance Initiative (EDGI)
Web Monitoring documentation is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.