
Build PageFreezer-Outputter that fits into current Versionista workflow #17

Closed · ambergman opened this issue Feb 10, 2017 · 16 comments
@ambergman

To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.

I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like, for clarity.
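A rough sketch of the outputter's contract as described above (the column set and field names are placeholders until the Versionista column list is posted):

```js
// Hypothetical shape of the outputter module: given a diff (or null) for a
// URL, append a CSV row only when something actually changed.
const fs = require('fs');

function writeDiffRow(csvPath, url, diff) {
  if (diff == null) return;                          // unchanged page: no row
  const row = [url, diff.checkedAt, diff.viewerUrl]  // placeholder columns
    .map((field) => `"${String(field).replace(/"/g, '""')}"`)
    .join(',');
  fs.appendFileSync(csvPath, row + '\n');
}
```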

@ambergman added this to the "Replicate versionista-based workflow and scale with PageFreezer" milestone on Feb 10, 2017
@allanpichardo (Collaborator)

From what I've observed with the PageFreezer API, taking a diff of 2 pages takes an average of 5 seconds. If there are ~30,000 pages to monitor, then PageFreezer is probably not the most appropriate diff service for this task. I think the main bottleneck in PageFreezer is that they transcode the diff information into HTML for every request. I have run similar diffs on my machine using git diff and it usually takes a second or less.

Here's what I suggest:

  1. We make a command line tool that creates git diffs of 2 pages and saves them to a file (or a local database). This command line tool can take optional filters so we can remove unimportant parts of the page over time. The CLI tool could be set up as a cron task on a server and run daily to diff the 30,000 pages in the background. (A minimal sketch of this tool follows this list.)

  2. If a significant difference is found, then the CLI creates the CSV entry for the analysts as per your description.

  3. Since the last diff has been stored as a file (or database row), the git visualizer can pull that and do the parsing at that particular time. Thus we incur the cost of transcoding to HTML only when it's necessary.
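A minimal sketch of what that CLI could look like in Node (file layout, CSV columns, and the `changes.csv` name are all placeholders, not a settled design):

```js
#!/usr/bin/env node
// Sketch of the proposed CLI: diff two saved snapshots of a page with
// `git diff --no-index` and, if anything changed, append a row to the CSV.
const { spawnSync } = require('child_process');
const fs = require('fs');

const [oldFile, newFile, pageUrl] = process.argv.slice(2);

// `git diff --no-index` exits with 1 when the files differ, 0 when they don't.
const result = spawnSync('git', ['diff', '--no-index', '--', oldFile, newFile], {
  encoding: 'utf8',
});

if (result.status === 1) {
  const diffPath = `${newFile}.diff`;
  fs.writeFileSync(diffPath, result.stdout);   // keep the raw diff for the visualizer
  const row = `"${pageUrl}","${new Date().toISOString()}","${diffPath}"\n`;
  fs.appendFileSync('changes.csv', row);       // one CSV row per changed page
}
// exit status 0 means no change, so no row is written
```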

@titaniumbones (Contributor) commented Feb 10, 2017 via email

@allanpichardo (Collaborator) commented Feb 10, 2017

@titaniumbones Yeah, I suspect that it will. If we do this with Node.js, we have the option of using jQuery to parse the HTML archives. We can decide that certain HTML nodes are insignificant, such as <meta> tags in <head> or <link rel="stylesheet"> elements, etc. Then, when an archive is loaded from disk, the CLI uses jQuery to delete those nodes from the text blob before executing the diff.
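For the Node route, here is a sketch of that stripping step using cheerio (a jQuery-like API for Node); the selector list is only illustrative:

```js
const cheerio = require('cheerio');

// Remove nodes that never carry meaningful content changes, plus any
// selectors fed back from the visualizer's ignore list.
function stripInsignificantNodes(html, ignoreSelectors = []) {
  const $ = cheerio.load(html);
  ['script', 'style', 'link[rel="stylesheet"]', 'head meta'].forEach((sel) => $(sel).remove());
  ignoreSelectors.forEach((sel) => $(sel).remove());
  return $.html();   // cleaned blob, ready to be written out and diffed
}
```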

The diff would look like a regular git diff, but the visualizer would have some logic to convert the +/- syntax into <ins>/<del> markup, and we'd output that on screen.
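A rough sketch of that conversion step in the visualizer (escaping and styling are left out; this only shows the mapping):

```js
// Map unified-diff lines onto <ins>/<del> markup for on-screen display.
function unifiedDiffToHtml(diffText) {
  return diffText
    .split('\n')
    .map((line) => {
      if (line.startsWith('+++') || line.startsWith('---') || line.startsWith('@@')) {
        return '';                                    // drop file and hunk headers
      }
      if (line.startsWith('+')) return `<ins>${line.slice(1)}</ins>`;
      if (line.startsWith('-')) return `<del>${line.slice(1)}</del>`;
      return `<span>${line}</span>`;                  // unchanged context line
    })
    .join('\n');
}
```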

When viewing the page in the visualizer, an analyst would have the option of selecting a DOM element and saying that it's insignificant, thus adding it to an ongoing list that would be fed back to the CLI on the next cycle.
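The click-to-ignore idea could be prototyped with a small listener in the visualizer page; the selector-building heuristic below is only an illustration of the feedback loop, not a settled design:

```js
// On click, build a rough CSS selector for the chosen element so it can be
// added to the ignore list that gets fed back to the CLI on the next cycle.
document.addEventListener('click', (event) => {
  const parts = [];
  let el = event.target;
  while (el && el.nodeType === Node.ELEMENT_NODE && el.tagName !== 'HTML') {
    let part = el.tagName.toLowerCase();
    if (el.id) {
      parts.unshift(`${part}#${el.id}`);
      break;                                          // an id is specific enough to stop
    }
    const sameTagSiblings = Array.from(el.parentNode.children)
      .filter((sibling) => sibling.tagName === el.tagName);
    if (sameTagSiblings.length > 1) {
      part += `:nth-of-type(${sameTagSiblings.indexOf(el) + 1})`;
    }
    parts.unshift(part);
    el = el.parentNode;
  }
  console.log('Would add to ignore list:', parts.join(' > '));
});
```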

@ambergman (Author)

@allanpichardo Really disappointing to hear the PageFreezer API moves so slowly but, as you've described, it seems like we have plenty of other options (and, of course, we always knew we had git as a backup). Your 1-3 above sound really great, and I think it makes perfect sense, as you said in point 3, to only parse the diff in the visualizer when it's called back up by an analyst.

Regarding your last comment - I think it sounds great to have the option in the visualizer, maybe in some simple dropdown form, of marking a diff as insignificant. That'll mean everything lives just in that simple visualizer, and everything else will run in the CLI.

The only other thing to add, then, would be a couple of different visualization options: perhaps a "side-by-side" or "in-line" page view for changes, but also a "changes only" view (very useful for huge pages with only a few changes). I'll write something about that in issue #19, the visualization issue, as well.

@titaniumbones (Contributor) commented Feb 10, 2017 via email

@allanpichardo (Collaborator) commented Feb 10, 2017

@ambergman Yes, here's what I see overall, at a high level:

I understand that the 30,000 URLs are kept in the spreadsheet. Those are live URLs. Where are the previous versions archived? Are they held remotely on another server?

I ask because it would help if we could have a structure such that:

  1. There is a directory which holds the last known HTML of each page; each filename, for simplicity, could be the URL itself (plaintext).

  2. A Node script opens the spreadsheet and goes line by line, comparing the stored HTML in the directory with the live version of the page, and updates the spreadsheet accordingly for each page (see the sketch after this list). (I say Node because JavaScript is so good for DOM traversal, but if there's a better idea for this, I'm all ears.)

  3. (this part needs to be worked out) By some heuristic, determine if a diff is significant enough to keep, and store that diff as a text file in another directory.

  4. If the diff text file is readable from the web, that would be OK (preferable), otherwise, we would have to insert the path and some kind of ID into a database table. The visualization link can be something with the ID of the diff text.

  5. Insert that URL in the spreadsheet

  6. Have the visualizer recognize the IDs and pull the corresponding diff file and display it.
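A minimal sketch of steps 1-3 in Node (directory names, the filename scheme, and the lack of redirect handling are all simplifications for illustration):

```js
const fs = require('fs');
const path = require('path');
const https = require('https');
const { spawnSync } = require('child_process');

['last-known', 'tmp', 'diffs'].forEach((dir) => fs.mkdirSync(dir, { recursive: true }));

// Fetch the live HTML of a page (https only, no redirect handling).
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Compare the stored copy of a URL with the live version; keep any diff.
async function processUrl(url) {
  const fileName = encodeURIComponent(url);        // URL doubles as the filename
  const storedPath = path.join('last-known', fileName);
  const livePath = path.join('tmp', fileName);
  fs.writeFileSync(livePath, await fetchPage(url));

  const diff = spawnSync('git', ['diff', '--no-index', '--', storedPath, livePath], {
    encoding: 'utf8',
  });
  if (diff.status === 1) {
    fs.writeFileSync(path.join('diffs', fileName + '.diff'), diff.stdout);
    // ...significance heuristic and spreadsheet update (steps 3-5) would go here
  }
}
```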

@titaniumbones If this architecture is something we can work with, then maybe the SF group can set up the directory structure, put some test files in it, and start a CLI utility that would read the files, use git to diff them with the live URLs, and save the diffs into another directory.

@titaniumbones (Contributor) commented Feb 10, 2017 via email

@allanpichardo (Collaborator)

@titaniumbones The directory structure from the zip files will work because the URL is preserved in the file structure: the directories per domain mirror the exact remote directory structure, so it's clear what compares to what. The only issue is that the archives come with a lot of other files we don't need, but that's OK; for this purpose we can just traverse the tree, take the HTML files, and ignore the others.

So I suppose, wherever this service runs, it could download an archive zip, extract it, and run through it creating the diffs and updating the spreadsheet. When the process is done, it can delete the downloaded archive. Then rinse and repeat.
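Picking the HTML files out of an extracted archive could be as simple as this (extensions and layout assumed from the description above):

```js
const fs = require('fs');
const path = require('path');

// Walk the extracted archive and collect only .html/.htm files, ignoring the
// stylesheets, images, and other assets that come with the zip.
function collectHtmlFiles(dir, found = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      collectHtmlFiles(full, found);
    } else if (/\.html?$/i.test(entry.name)) {
      found.push(full);
    }
  }
  return found;
}
```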

@lh00000000 (Collaborator)

concerns

  1. how will this address sites with dynamically loaded (e.g. react js) content?
  2. i suspect pagefreezer isn't really "slow". my guess is that the main causes of latency are the network (nontrivial payloads sent to them), and that they might be running the pages in a headless browser and allowing a few seconds for clientside stuff to happen (in my experience, a lot of sites have low-priority / below-the-fold stuff that won't finish loading until 3 or more seconds later), and that they're equipped to handle a bit of (polite) multithreading. the issue of the 10K daily limit remains though.
  3. i've had bad experiences using jquery selectors to clean up html at scale. things like throwing out <script> nodes are straightforward, but a lot of dynamic content (even if it's somewhat consistent serverside-rendered stuff and not clientside-run js) can have branching logic that is unique to that site. personally, i've come to see handcrafted jquery/dom/css selectors as a brittle last resort, but i'm definitely no jquery master
  4. re: ui that allows click-to-ignore-dom-node. this would be awesome! i'm worried it would be really hard though. i've tried click-to-get-css-selector tools before (chrome extensions) but they never seem to work right. layers of nested divs without strictly correlated visual embedding seemed to be the tough part.

i've been kind of playing around with the idea of an architecture that relies on s3 and aws lambdas (to avoid running-server costs). i wanted to get more details on the current situation to avoid proposing solutions to problems already solved, but maybe it's better to just spit it out:

two pieces:
A. archiving / diff emitter system
B. the diff alerting/pre-filtering/exploration system

A. archiving / diff emitter system

  1. using a cron-ish timer (e.g. cloudwatch cron events, or something like airflow), an aws lambda function fills a sqs-toPullRaw SQS queue with all the urls to be scraped. this allows the list of watched urls to be scraped and diffed every day to be editable in one place. the order of tasks in the queue can be shifted around to avoid hitting the same sites too aggressively. using an SQS queue also provides a nice interface to watch the jobs complete in realtime and track failed jobs.
  2. sqs-toPullRaw is consumed by a lambda that hits the urls and gets the RAW html (i.e. dynamic JS not loaded). raw html is dumped to s3 bucket s3-raw-snapshots. (name of the object is date + url e.g. 30-01-2017-poop.gov)
  3. there are three lambdas that are triggered by new-object events in the s3-raw-snapshots bucket:
  4. a lambda to diff raw to raw. given the name of the new snapshot (e.g. 30-01-2017-poop.gov), look for the snapshot of the url from the previous day (29-01-2017-poop.gov) and run it through the diffing service (pagefreezer or in-house). diffs will be written to another s3 bucket s3-raw-diffs for posterity (a rough handler sketch follows this list)
  5. another lambda just like the above, but it runs the raw html through a text extraction preprocessor (gets all the visible text WITHOUT dynamic content). diffs put into s3-plaintext-diffs
  6. a third lambda just like the above, but it runs the page through a headless browser to evaluate dynamically loaded content (javascript). both the raw post-evaluated html is diffed (and put into s3-loaded-raw-diffs), and a text-extraction preprocessed version is diffed (and put into s3-loaded-plaintext-diffs). a similar pattern would be used for other ideas for processing the raw html, e.g. using an article-text extraction method such as the python package newspaper3k
  7. for each of the s3 diff buckets above, lambdas are triggered to load each of the diff documents (i.e. one json object per diff of a single page) into the B. diff alerting/pre-filtering/exploration system
  8. to cut down on s3 costs, an automated process could move older snapshots from s3 to s3 cold storage or glacier (see https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/)
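a very rough sketch of the raw-to-raw diffing lambda (item 4): bucket names, the key format, and the diff-service call are all assumptions taken from the list above, not a working deployment:

```js
// Triggered by new-object events on s3-raw-snapshots; diffs the new snapshot
// against the previous day's and writes the result to s3-raw-diffs.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Placeholder for whichever diff service is used (PageFreezer or in-house).
async function runDiffService(oldHtml, newHtml) {
  return { old_length: oldHtml.length, new_length: newHtml.length };
}

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;                  // s3-raw-snapshots
    const key = decodeURIComponent(record.s3.object.key);  // e.g. 30-01-2017-poop.gov
    const [day, month, year, ...urlParts] = key.split('-');
    const previousDay = String(Number(day) - 1).padStart(2, '0'); // naive; use a date library in practice
    const previousKey = [previousDay, month, year, ...urlParts].join('-');

    const [current, previous] = await Promise.all([
      s3.getObject({ Bucket: bucket, Key: key }).promise(),
      s3.getObject({ Bucket: bucket, Key: previousKey }).promise(),
    ]);

    const diff = await runDiffService(previous.Body.toString(), current.Body.toString());
    await s3.putObject({ Bucket: 's3-raw-diffs', Key: key, Body: JSON.stringify(diff) }).promise();
  }
};
```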

B. the diff alerting/pre-filtering/exploration system
my suggestion is that this system (which could include another webapp but also entails the generation of the daily csv people are already using) would rely on elasticsearch. reasons why:

  • hardcoded policies for filtering out junk could be implemented and changed easily (as filter queries), but all the diffs would still be there if needed. figuring out that filtering logic could be done interactively and the daily csv could be regenerated rapidly. (a sketch of such a query follows this list)
  • the opposite (i.e. watching for the deletion of the phrase "climate change") could be implemented using the same DSL and take advantage of common elasticsearch alert techniques (the elasticsearch percolator API) to send out email alerts in high-priority situations. we could also have the email create a versionista view for the particular diff.
  • elasticsearch already has an ecosystem of exploratory tools (kibana / sense) that could aid in exploring and finding larger trends
  • similar to the s3 archiving strategy above, a retiring strategy could be used to save on cost (see https://www.elastic.co/guide/en/elasticsearch/guide/current/retiring-data.html)
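a sketch of the kind of filter query that could drive the daily csv, using the elasticsearch JS client (index name and field names are made up for illustration):

```js
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Pull one day's diffs, dropping known-junk pages and trivial edits,
// to regenerate the daily CSV.
async function significantDiffs(date) {
  const response = await client.search({
    index: 'page-diffs',
    body: {
      query: {
        bool: {
          filter: [{ term: { date } }],
          must_not: [
            { match: { url: 'sitemap' } },             // example junk filter
            { range: { changed_chars: { lt: 10 } } },  // ignore tiny edits
          ],
        },
      },
    },
  });
  return response.hits.hits;
}
```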

(if you're not familiar with aws lambda, it's an aws service that allows you to upload code for single node or python functions, which are invoked according to "triggers" you define (such as a new object being put in s3 or an http request to some static url). they're kind of a pain in the butt but the payoff is that you only pay $0.000000208 - $0.000002501 per call and aws will take care of handling scale-outs (for unexpected bursts of load))

@leepro commented Feb 13, 2017

FYI, I am from PageFreezer. To clarify the slowness of our Diff API, I would like to give some context on it. As @allanpichardo mentioned, one API call takes around 5 sec on average. Actually, it is due to the network latency of AWS Lambda / API Gateway. For the purpose of this project, we took the diff service from our production system and made an AWS Lambda version.

Our internal benchmark is as follows:

  1. Native/two local files: 0.3 sec.
  2. API/two URLs: 5 sec.
  3. API/two files uploaded (url1, url2): 4.6 sec

So, to use the diff API with a large number of files, I recommend using multi-threaded (or event-driven) client code to make the API calls; AWS Lambda will serve them at scale. Comparing 30k files with 100 threads would simply take about 41 minutes. The multithreaded client doesn't do anything except make a connection to the API and wait for its result.
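A sketch of that kind of event-driven client in Node (the diff API call itself is passed in as a placeholder):

```js
// Run the diff API calls with a fixed concurrency instead of one at a time.
const CONCURRENCY = 100;

async function diffAll(urlPairs, callDiffApi) {
  const queue = [...urlPairs];
  const results = [];
  // CONCURRENCY "workers" each pull the next pair off the shared queue.
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length > 0) {
      const [url1, url2] = queue.shift();
      results.push(await callDiffApi(url1, url2));  // each call waits ~5 s on the API
    }
  });
  await Promise.all(workers);
  return results;
}
```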

As a note, whatever tool or algorithm you use to make the diff, the time it takes will be roughly proportional to the source's HTML structure (number of DOM nodes and text size).

EOF

@mriedijk commented Feb 14, 2017

In addition: consider bulk-uploading the HTML pages to AWS first, before you use the PageFreezer Diff service; it may decrease the network latency.

@mekarpeles

Have you folks talked to @markjohngraham about how some of these html diff problems are (or could be) addressed in the Wayback Machine -- web.archive.org?

@mekarpeles

Also, in terms of long-term storage, has Internet Archive's S3 API been considered? https://github.com/vmbrasseur/IAS3API

@titaniumbones (Contributor)

@mekarpeles We have not talked to markjohngraham, but we've been talking to Jefferson a little about IA as the end-game for this effort. We haven't got into the nitty-gritty, but clearly IA seems like the best home for this effort in the long run!

sorry to have missed you in SF! was looking forward to meeting you but missed the connection somehow.

@mekarpeles

As long as everything gets backed up and the community is able to find a way to produce a registry so other institutions can cross-check and participate, I'm a super happy camper! Very thankful for your + team's efforts. Sorry to have missed you in SF as well! @flyingzumwalt had great things to say about you :)

@dcwalk (Collaborator) commented Mar 9, 2017

This issue was moved to edgi-govdata-archiving/web-monitoring#9
