# web-monitoring backend demo

1. Ingest a cache of captured HTML files, representing a **Page** as a series of **Versions** through time.
2. Two Versions of the same Page are a **Change**.
3. To examine a given Change, start by sending requests to PageFreezer. Store its responses (**Diffs**).
4. Assign a Priority to each Change.
5. Access prioritized Changes and store user-submitted **Annotations** (potentially multiple Annotations per Change).
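
The relationships in the list above can be sketched with plain namedtuples. This is an illustrative model only: the reduced field sets and the `consecutive_changes` helper are assumptions, not the actual `web_monitoring.db` schema, and pairing only *consecutive* Versions is one possible reading of "two Versions of the same Page."

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical, simplified records; the real tables carry more fields.
Page = namedtuple('Page', ['uuid', 'url'])
Version = namedtuple('Version', ['uuid', 'page_uuid', 'capture_time'])
Change = namedtuple('Change', ['page_uuid', 'from_version', 'to_version'])


def consecutive_changes(versions):
    """Pair each Version with the next-newest Version of the same Page."""
    by_page = {}
    for v in versions:
        by_page.setdefault(v.page_uuid, []).append(v)
    changes = []
    for page_uuid, page_versions in by_page.items():
        page_versions.sort(key=lambda v: v.capture_time)
        for older, newer in zip(page_versions, page_versions[1:]):
            changes.append(Change(page_uuid, older.uuid, newer.uuid))
    return changes


now = datetime(2017, 3, 8)
demo_versions = [
    Version('v2', 'p1', now),
    Version('v1', 'p1', now - timedelta(days=1)),
    Version('v3', 'p2', now),  # only one Version, so no Change for this Page
]
demo_changes = consecutive_changes(demo_versions)
```

Here page `p1` yields a single Change from `v1` to `v2`, while `p2` yields none because it has only one Version.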

In [1]:
from datetime import datetime, timedelta
import functools
import hashlib
import os

import sqlalchemy
from web_monitoring.db import (Pages, Versions, Changes, Diffs, Annotations, create,
                               compare, NoAncestor, diff_version, logger)

engine = sqlalchemy.create_engine(os.environ['WEB_MONITORING_SQL_DB_URI'])

In [2]:
create(engine)  # one time only: create tables

# Reflect SQL tables in Python.
versions = Versions(engine)
pages = Pages(engine)
changes = Changes(engine)
diffs = Diffs(engine)
annotations = Annotations(engine)

## Ingesting new HTML

Either manually or via some webhook, the backend is alerted that new captured HTML is available at some path.

In this example, we load the example files in the web-monitoring repo.

In [3]:
def load_examples():
    EXAMPLES = [
        'falsepos-footer',
        'falsepos-num-views',
        'falsepos-small-changes',
        'truepos-dataset-removal',
        'truepos-image-removal',
        'truepos-major-changes',
    ]
    archives_dir = os.path.join('archives')
    time1 = datetime.now()
    time0 = time1 - timedelta(days=1)
    for example in EXAMPLES:
        simulated_url = 'https://PLACEHOLDER.com/{}.html'.format(example)
        page_uuid = pages.insert(simulated_url, 'some page title', 'some agency', 'some site')
        for suffix, _time in (('-a.html', time0), ('-b.html', time1)):
            filename = example + suffix
            path = os.path.abspath(os.path.join(archives_dir, filename))
            with open(path) as f:
                # Hash the file's text content to fingerprint this Version.
                version_hash = hashlib.sha256(f.read().encode()).hexdigest()
            versions.insert(page_uuid, _time, path, version_hash, 'test', {})
            
load_examples()
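
`load_examples` reads each file as text and hashes the re-encoded string. A variant that hashes the raw bytes in fixed-size chunks avoids decoding and keeps memory use flat for large captures. This is only a sketch: `hash_file` and its `chunk_size` parameter are hypothetical and not part of `web_monitoring.db`, and a byte-level hash can differ from the text-based hash used above.

```python
import hashlib
import os
import tempfile


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in binary chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a captured HTML page.
with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as tmp:
    tmp.write(b'<html>hello</html>')
demo_hash = hash_file(tmp.name)
os.remove(tmp.name)
```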

Now we have a pile of unprocessed Versions. Some might be the first capture we have seen of a Page, while others might be just another Version of a Page we have seen before.

In [4]:
versions.unprocessed

deque(['43142531-6248-4e5c-92a7-8cedd5b237f2',
       'd1c9d1b6-758f-4c32-a7af-23f6712cddc6',
       'a71ab8fc-1692-4974-aadc-16aaaa61c461',
       '3ecf477c-c269-42c7-88f1-acaa1ed8ecb1',
       'a253491b-ead0-45b9-9039-162f72626485',
       'd5540874-94f8-4ef8-8365-c23267be635a',
       '892d8641-7b3d-46cd-8b54-ab6c1d2eef0a',
       'e9f1922b-1254-4217-9746-17409a4a7c60',
       '4a9133e6-ad8e-4363-90b5-a8886e64de23',
       'fda9c053-facf-4530-a5b4-c262ba95a152',
       'ce089f4e-9ddf-4d77-9c6d-36e98ad385ba',
       'ca1c4725-a437-4fbd-a884-61c30635c321'])
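
Whether an entry in this queue is the first capture of its Page or a later one can be decided by looking for an earlier capture of the same Page. The sketch below illustrates that idea on in-memory records; `find_ancestor` and `NoAncestorFound` are hypothetical stand-ins and are not the library's own `NoAncestor` logic.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Reduced, illustrative record; not the full schema.
Version = namedtuple('Version', ['uuid', 'page_uuid', 'capture_time'])


class NoAncestorFound(Exception):
    """Stand-in exception: the Page has no earlier capture."""


def find_ancestor(version, all_versions):
    """Return the most recent capture of the same Page taken before `version`."""
    earlier = [v for v in all_versions
               if v.page_uuid == version.page_uuid
               and v.capture_time < version.capture_time]
    if not earlier:
        raise NoAncestorFound(version.uuid)
    return max(earlier, key=lambda v: v.capture_time)


t1 = datetime(2017, 3, 8)
first = Version('a', 'page-1', t1 - timedelta(days=1))
second = Version('b', 'page-1', t1)
ancestor_of_second = find_ancestor(second, [first, second])

try:
    find_ancestor(first, [first, second])
    first_has_ancestor = True
except NoAncestorFound:
    first_has_ancestor = False
```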

The Python API provides uuid-based lookup and returns the data as a `namedtuple` (low memory footprint, convenient attribute access).

In [5]:
v = versions[versions.unprocessed[0]]
v

Version(uuid='43142531-6248-4e5c-92a7-8cedd5b237f2', page_uuid='0b702ae9-a4f8-4f5e-958c-fce2036896ff', capture_time=datetime.datetime(2017, 3, 8, 14, 25, 46, 745157), uri='/Users/dallan/Documents/Repos/web-monitoring-processing/archives/falsepos-footer-a.html', version_hash='41af79e31884c6745834961f435cf233de702065b6bba032a82ec68fc5fd03b7', source_time='test', source_metadata={})

In [6]:
pages[v.page_uuid]

Page(uuid='0b702ae9-a4f8-4f5e-958c-fce2036896ff', url='https://PLACEHOLDER.com/falsepos-footer.html', title='some page title', agency='some agency', site='some site')
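
To make the uuid-based lookup pattern concrete, here is a minimal in-memory sketch that returns `namedtuple` records. The `Page` fields match the output above, but `InMemoryPages` is a hypothetical class for illustration; the real `Pages` object is backed by SQL.

```python
from collections import namedtuple
import uuid

Page = namedtuple('Page', ['uuid', 'url', 'title', 'agency', 'site'])


class InMemoryPages:
    """Toy stand-in for a table wrapper with uuid-keyed lookup."""

    def __init__(self):
        self._rows = {}

    def insert(self, url, title, agency, site):
        page_uuid = str(uuid.uuid4())
        self._rows[page_uuid] = Page(page_uuid, url, title, agency, site)
        return page_uuid

    def __getitem__(self, page_uuid):
        return self._rows[page_uuid]


demo_pages = InMemoryPages()
demo_uuid = demo_pages.insert('https://PLACEHOLDER.com/x.html',
                              'some page title', 'some agency', 'some site')
demo_page = demo_pages[demo_uuid]
as_dict = demo_page._asdict()  # namedtuples convert to a dict on request
```

Attribute access (`demo_page.url`), `_fields`, and `_asdict()` are all standard `namedtuple` features, which is what makes the records convenient to inspect.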

## Computing Diffs between Versions

Iterate through the unprocessed Versions and request diffs from PageFreezer. Stash the JSON response (which is large) in a file on disk. Store the filepath, the two Versions' UUIDs, and other small summary info in the database.

In [7]:
# Set up standard Python logging.
import logging
logging.basicConfig(level='DEBUG')
# This logger will show progress with PageFreezer requests.
logger.setLevel('DEBUG')

def diff_new_versions():
    f = functools.partial(diff_version, versions=versions, changes=changes, diffs=diffs,
                          source_type='test', source_metadata={})
    while True:
        # Get the uuid of a Version to be processed.
        try:
            version_uuid = versions.unprocessed.popleft()
        except IndexError:
            # nothing left to process
            return
        try:
            f(version_uuid)
        except NoAncestor:
            # This is the oldest Version for this Page -- nothing to compare.
            continue

diff_new_versions()

Logger output:
```
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 6.507 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 9.260 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.576 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 13.063 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.529 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.448 seconds with status ok.
```
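
The stash-to-disk step described above can be sketched as follows: write the large JSON payload to a file and keep only a small summary record. `stash_response` and `load_content` are hypothetical helpers, the record is a plain dict rather than a database row, and the payload is a toy dictionary, not a real PageFreezer response.

```python
import hashlib
import json
import os
import tempfile
import uuid


def stash_response(response, directory):
    """Write a JSON-serializable response to disk; return a small summary."""
    payload = json.dumps(response, sort_keys=True).encode()
    diff_uuid = str(uuid.uuid4())
    path = os.path.join(directory, diff_uuid + '.json')
    with open(path, 'wb') as f:
        f.write(payload)
    return {'uuid': diff_uuid,
            'uri': path,
            'diffhash': hashlib.sha256(payload).hexdigest()}


def load_content(record):
    """Read the stashed JSON back from the path stored in the summary."""
    with open(record['uri'], 'rb') as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as stash_dir:
    toy_response = {'example_key': [1, 2, 3]}  # placeholder payload
    record = stash_response(toy_response, stash_dir)
    restored = load_content(record)
```

Keeping only the path and a hash in the summary means the database stays small while the full content remains recoverable on demand.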

Now we have Diffs that need to be prioritized.

In [7]:
from collections import deque
diffs.unprocessed = deque(['912bcfaf-d849-4c01-b76e-ed83e69749e8',
       'a1912e06-b1c8-4513-9a45-aff35765700c',
       '27e7bb72-1cad-4638-b126-df6182f25bd6',
       '76ba3709-cd6a-4781-be7e-73324487a05e',
       '016ed89e-c1cc-4aa5-b526-9cf446a90da7',
       '2440e14e-6fc6-4df0-ad97-410c624977c8'])

Accessing a Diff through the Python API reads that stashed JSON file and transparently fills it into the result. Since it's quite verbose, we'll just look at the *fields* here, not the values.

In [18]:
diffs[diffs.unprocessed[0]]._fields

('uuid',
 'change_uuid',
 'diffhash',
 'uri',
 'source_type',
 'source_metadata',
 'content')

## Prioritizing Diffs

Iterate through the unprocessed Diffs and assign a priority. This is where the clever text processing code would come in.

In [20]:
def assign_priorities(diff_uuids):
    priorities = {}
    for diff_uuid in diff_uuids:
        d = diffs[diff_uuid]
        priority = 0  # replace this with:  priority = clever_ML_routine(d)
        priorities[diff_uuid] = priority
    return priorities

assign_priorities(diffs.unprocessed)

{'016ed89e-c1cc-4aa5-b526-9cf446a90da7': 0,
 '2440e14e-6fc6-4df0-ad97-410c624977c8': 0,
 '27e7bb72-1cad-4638-b126-df6182f25bd6': 0,
 '76ba3709-cd6a-4781-be7e-73324487a05e': 0,
 '912bcfaf-d849-4c01-b76e-ed83e69749e8': 0,
 'a1912e06-b1c8-4513-9a45-aff35765700c': 0}
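
As a placeholder for the "clever text processing," one crude option is to score a change by how dissimilar the two texts are, using `difflib` from the standard library. This sketch operates on two plain strings rather than on the PageFreezer diff content, and `similarity_priority` is a hypothetical name; it is a starting point for experimentation, not a recommended ranking method.

```python
import difflib


def similarity_priority(old_text, new_text):
    """Return a score in [0, 1]; 0 means identical text, higher means more change."""
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return 1.0 - ratio


same = similarity_priority('<p>Data</p>', '<p>Data</p>')
changed = similarity_priority('<p>Dataset available</p>', '<p>Page removed</p>')
```

A character-level ratio like this cannot tell a footer tweak from a dataset removal of similar size, which is exactly the gap a more careful classifier would need to close.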