# web-monitoring backend demo

1. Ingest a cache of captured HTML files, representing a **Page** as a series of **Snapshots** through time.
2. Compare Snapshots of the same Page by sending requests to PageFreezer. Store the responses (**Diffs**).
3. Assign **Priorities** to the Diffs.
4. Access prioritized diffs through a **Work Queue** that assigns diffs to users in priority order.
5. Store user-submitted **Annotations** of a Diff (potentially multiple per Diff).

In [1]:
import functools
from datetime import datetime, timedelta
import os

import sqlalchemy
from web_monitoring.db import (Pages, Snapshots, Diffs, Priorities, WorkQueue, Annotations, create,
                               compare, NoAncestor, diff_snapshot, logger)

engine = sqlalchemy.create_engine(os.environ['WEB_MONITORING_SQL_DB_URI'])

In [2]:
create(engine)  # one time only: create tables

# Reflect SQL tables in Python.
snapshots = Snapshots(engine)
pages = Pages(engine)
diffs = Diffs(engine)
annotations = Annotations(engine)
priorities = Priorities(engine)

## Ingesting new HTML

Either manually or via some webhook, the backend is alerted that newly captured HTML is available at some path.

In this example, we load the example files in the web-monitoring repo.

In [3]:
def load_examples():
    EXAMPLES = [
        'falsepos-footer',
        'falsepos-num-views',
        'falsepos-small-changes',
        'truepos-dataset-removal',
        'truepos-image-removal',
        'truepos-major-changes',
    ]
    archives_dir = 'archives'  # relative to the current working directory
    time1 = datetime.now()
    time0 = time1 - timedelta(days=1)
    for example in EXAMPLES:
        simulated_url = 'https://PLACEHOLDER.com/{}.html'.format(example)
        page_uuid = pages.insert(simulated_url)
        for suffix, _time in (('-a.html', time0), ('-b.html', time1)):
            filename = example + suffix
            path = os.path.abspath(os.path.join(archives_dir, filename))
            snapshots.insert(page_uuid, _time, path)
            
load_examples()

Now we have a pile of unprocessed Snapshots. Some might be the first capture of a Page we have never seen, while others might be just another Snapshot of a Page we have seen before.

In [4]:
snapshots.unprocessed

deque(['8adba7b6-a3c8-400a-9bed-029e462951a9',
       '6bc1de9f-7ed0-43c1-89d5-0c12b30baa7d',
       '8981eb5c-0abd-4adb-a52d-7fb23e3b3398',
       '18959ab8-5c62-406a-9a26-f90cdc5d4b49',
       'beb4c10d-ba6c-4f4c-821b-0eec103ede0c',
       '9564fd0b-568e-420b-a3ec-10a434b92e2f',
       '2ae058c5-41a1-4211-ab17-110cc9a83c7b',
       'a7656b01-a852-4277-bcee-f6092be95e62',
       '807ad480-6693-4ad9-8973-8b968c0c9224',
       'd196b61b-21ae-4524-88a6-ec904084765f',
       '2e49867d-cda1-4a21-8f69-4b5d7b90cc09',
       '09b3bbd7-4c44-4ff9-8922-ce786ad0a0ee'])

The Python API provides uuid-based lookup and returns the data as a `namedtuple` (low memory footprint, convenient attribute access).

In [5]:
s = snapshots[snapshots.unprocessed[0]]
s

Snapshot(uuid='8adba7b6-a3c8-400a-9bed-029e462951a9', page_uuid='2fd62ccf-678e-4c3e-babe-041a3e41ba22', capture_time=datetime.datetime(2017, 2, 27, 16, 57, 1, 575553), path='/Users/dallan/Documents/Repos/web-monitoring/archives/falsepos-footer-a.html')

In [6]:
pages[s.page_uuid]

Page(uuid='2fd62ccf-678e-4c3e-babe-041a3e41ba22', url='https://PLACEHOLDER.com/falsepos-footer.html', title='', agency='', site='')
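To make the uuid-based lookup concrete, here is a minimal in-memory sketch of the same idea. `SnapshotRecord` and `InMemorySnapshots` are hypothetical stand-ins invented for this illustration; the real `Snapshots` class is backed by SQL and may differ in its details.

```python
import uuid
from collections import namedtuple
from datetime import datetime

# Hypothetical record type mirroring the fields shown in the output above.
SnapshotRecord = namedtuple(
    'SnapshotRecord', ['uuid', 'page_uuid', 'capture_time', 'path'])


class InMemorySnapshots:
    """Toy uuid-keyed store (an assumption, not the real SQL-backed class)."""

    def __init__(self):
        self._rows = {}

    def insert(self, page_uuid, capture_time, path):
        # Generate a fresh uuid, store the record, and hand the key back.
        snapshot_uuid = str(uuid.uuid4())
        self._rows[snapshot_uuid] = SnapshotRecord(
            snapshot_uuid, page_uuid, capture_time, path)
        return snapshot_uuid

    def __getitem__(self, snapshot_uuid):
        # Square-bracket lookup by uuid returns the namedtuple.
        return self._rows[snapshot_uuid]


store = InMemorySnapshots()
key = store.insert('page-1', datetime(2017, 2, 27), '/tmp/example-a.html')
record = store[key]
print(record.path)     # attribute access
print(record._fields)  # field names, available on any namedtuple
```

Because a namedtuple is immutable, a record handed out by the store cannot be modified by accident; changes have to go through the store itself.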

## Computing Diffs between Snapshots

Iterate through the unprocessed Snapshots and request diffs from PageFreezer. Stash the JSON response (which is large) in a file on disk. Store the filepath, the two Snapshots' UUIDs, and other small summary info in the database.

In [7]:
# Set up standard Python logging.
import logging
logging.basicConfig(level='DEBUG')
# This logger will show progress with PageFreezer requests.
logger.setLevel('DEBUG')

def diff_new_snapshots():
    f = functools.partial(diff_snapshot, snapshots=snapshots, diffs=diffs)
    while True:
        # Get the uuid of a Snapshot to be processed.
        try:
            snapshot_uuid = snapshots.unprocessed.popleft()
        except IndexError:
            # nothing left to process
            return
        try:
            f(snapshot_uuid)
        except NoAncestor:
            # This is the oldest Snapshot for this Page -- nothing to compare.
            continue

diff_new_snapshots()

Logger output:
```
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 6.507 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 9.260 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.576 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 13.063 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.529 seconds with status ok.
DEBUG:web_monitoring.db:Sending PageFreezer request...
DEBUG:web_monitoring.db:Response received in 2.448 seconds with status ok.
```
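Six requests for twelve Snapshots is what we expect: the oldest Snapshot of each Page has nothing to be compared against. The sketch below illustrates that pairing logic without touching the database or PageFreezer. `Snap`, `NoAncestorFound`, and `find_ancestor` are hypothetical names made up for this example; how the real `diff_snapshot` locates an ancestor is not shown in this notebook.

```python
from collections import deque, namedtuple
from datetime import datetime

# Hypothetical record type and exception for illustration only.
Snap = namedtuple('Snap', ['uuid', 'page_uuid', 'capture_time'])


class NoAncestorFound(LookupError):
    """Raised when a snapshot is the oldest one known for its page."""


def find_ancestor(snapshot, all_snapshots):
    """Return the latest snapshot of the same page captured before `snapshot`."""
    earlier = [s for s in all_snapshots
               if s.page_uuid == snapshot.page_uuid
               and s.capture_time < snapshot.capture_time]
    if not earlier:
        raise NoAncestorFound(snapshot.uuid)
    return max(earlier, key=lambda s: s.capture_time)


all_snapshots = [
    Snap('a1', 'page-a', datetime(2017, 2, 26)),
    Snap('a2', 'page-a', datetime(2017, 2, 27)),
    Snap('b1', 'page-b', datetime(2017, 2, 27)),
]
unprocessed = deque(all_snapshots)
pairs = []  # (older uuid, newer uuid) pairs that would be sent for diffing
while unprocessed:
    snapshot = unprocessed.popleft()
    try:
        ancestor = find_ancestor(snapshot, all_snapshots)
    except NoAncestorFound:
        continue  # first capture of this page: nothing to compare against
    pairs.append((ancestor.uuid, snapshot.uuid))

print(pairs)
```

Only `page-a` has two captures, so a single pair comes out; `a1` and `b1` are skipped as first captures.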

Now we have Diffs that need to be prioritized.

In [8]:
diffs.unprocessed

deque(['219ba583-bf0b-4610-9794-7a5a50461c55',
       '78b02f15-5ea2-4741-ba2c-265bff5091b3',
       '4ae1b697-8bd5-4a69-b47f-1013d0456ed3',
       'cc81208d-329e-4e4e-91df-179e0e8328ad',
       'ad162c4a-b6b0-4000-83c2-8c0aa028f745',
       '9a8373e5-5855-4e57-bc9f-0761eed83a9b'])

Accessing a Diff through the Python API reads the stashed JSON file and transparently fills it into the result, so the output is quite verbose. We'll just look at the field names here.

In [9]:
diffs[diffs.unprocessed[0]]._fields

('uuid', 'diffhash', 'uuid1', 'uuid2', 'result', 'annotation')
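The stash-on-disk pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the backend's implementation: `stash_result` and `load_result` are hypothetical helpers, the response body is made up, and using SHA-256 of the serialized payload for `diffhash` is a guess, since the notebook does not say how that field is computed.

```python
import hashlib
import json
import os
import tempfile
import uuid


def stash_result(result, directory):
    """Write a JSON-serializable result to disk; return (path, hex digest)."""
    payload = json.dumps(result, sort_keys=True).encode('utf-8')
    diffhash = hashlib.sha256(payload).hexdigest()  # assumed hashing scheme
    path = os.path.join(directory, '{}.json'.format(uuid.uuid4()))
    with open(path, 'wb') as f:
        f.write(payload)
    return path, diffhash


def load_result(path):
    """Read a stashed result back from disk."""
    with open(path, 'rb') as f:
        return json.loads(f.read().decode('utf-8'))


with tempfile.TemporaryDirectory() as directory:
    # Made-up response body standing in for the large PageFreezer JSON.
    fake_response = {'status': 'ok', 'changes': ['<p>old</p>', '<p>new</p>']}
    path, diffhash = stash_result(fake_response, directory)
    # Only small summary data would go into the database row.
    row = {'uuid1': 'snapshot-a', 'uuid2': 'snapshot-b',
           'diffhash': diffhash, 'path': path}
    # On lookup, the file contents are filled back into the result.
    restored = load_result(row['path'])

print(restored == fake_response)
```

Keeping the bulky JSON out of the database keeps rows small, at the cost of having to keep the files and the stored paths in sync.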

## Prioritizing Diffs

Iterate through the unprocessed Diffs and assign a priority. This is where the clever text processing code would come in.

In [10]:
def assign_priorities():
    while True:
        try:
            diff_uuid = diffs.unprocessed.popleft()
        except IndexError:
            # Nothing left to do.
            return
            
        # d = diffs[diff_uuid]
        # priority = clever_ML_routine(d)
        priority = 1  # simple case for now
        priorities.insert(diff_uuid, priority)

assign_priorities()
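As a placeholder for what a less trivial priority might look like, the sketch below scores a change by how dissimilar the two texts are, using `difflib` from the standard library. This is purely illustrative and is not the project's planned approach. `toy_priority` is a hypothetical function, and whether `priorities.insert` accepts a float score is an assumption the notebook does not confirm.

```python
import difflib


def toy_priority(old_text, new_text):
    """Score in [0, 1]: 0 for identical text, near 1 for little in common."""
    similarity = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return 1.0 - similarity


# Made-up text pairs loosely modeled on the example categories above.
footer_change = toy_priority('Data portal. Updated 2016.',
                             'Data portal. Updated 2017.')
removal = toy_priority('Download the climate dataset here.',
                       'Page not found.')
unchanged = toy_priority('same text', 'same text')

print(footer_change < removal)
```

A one-character footer edit scores far lower than a wholesale content replacement, which is the ordering we would want between the `falsepos-*` and `truepos-*` examples. A real routine would need to look at what changed, not only how much.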

## Annotating Diffs

**This stuff will probably get replaced by something on the Rails side.**

The WorkQueue interfaces with Priorities and Diffs and keeps track of what is currently being evaluated by a user.

In [11]:
work_queue = WorkQueue(priorities, diffs)

In [12]:
user_id = 1  # could potentially be associated with domain of expertise/interest
diff = work_queue.checkout_next(user_id)  # get the next-highest-priority diff

In [13]:
a = {'interesting': True}  # for example ... this would really have ~20 keys
annotations.insert(diff.uuid, a)

'07ed0635-ee83-4373-b9a3-2b7ceb1bdd88'

In [17]:
work_queue.checkin(user_id)  # release the lock on the last diff
diff = work_queue.checkout_next(user_id)  # get the next-highest-priority diff
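The checkout/checkin cycle can be sketched with a heap. `ToyWorkQueue` below is a hypothetical, in-memory illustration; the real `WorkQueue` is built on the Priorities and Diffs tables, and details such as allowing one checkout per user at a time are assumptions made for this example.

```python
import heapq


class ToyWorkQueue:
    """Hand out diff uuids in descending priority, one per user at a time."""

    def __init__(self, prioritized):
        # heapq is a min-heap, so negate the priority to pop the highest
        # first; the insertion order breaks ties first-in, first-out.
        self._heap = [(-priority, order, diff_uuid)
                      for order, (diff_uuid, priority) in enumerate(prioritized)]
        heapq.heapify(self._heap)
        self._checked_out = {}  # user_id -> diff uuid currently held

    def checkout_next(self, user_id):
        if user_id in self._checked_out:
            raise RuntimeError('user already holds a diff; check it in first')
        _, _, diff_uuid = heapq.heappop(self._heap)
        self._checked_out[user_id] = diff_uuid
        return diff_uuid

    def checkin(self, user_id):
        # Release the user's hold and return the uuid that was held.
        return self._checked_out.pop(user_id)


queue = ToyWorkQueue([('diff-low', 1), ('diff-high', 5), ('diff-mid', 3)])
first = queue.checkout_next(user_id=1)
released = queue.checkin(user_id=1)
second = queue.checkout_next(user_id=1)
print(first, second)
```

Tracking who holds which diff is what prevents two users from annotating the same diff at once; a production version would also need a timeout for diffs that are checked out and never checked in.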