Build PageFreezer-Outputter that fits into current Versionista workflow #17
Comments
From what I've observed with the PageFreezer API, taking a diff of 2 pages takes an average of 5 seconds. If there are ~30,000 pages to monitor, then PageFreezer is probably not the most appropriate diff service for this task. I think the main bottleneck in PageFreezer is that they transcode the diff information into HTML for every request. I have run similar diffs on my machine using git diff and it usually takes one second or less. Here's what I suggest:
1. We make a command-line tool that creates git diffs of 2 pages and saves them to file (or a local database). This command-line tool can take optional filters so we can remove unimportant parts of the page over time. The CLI tool could be set as a cron task on a server and run daily to diff the 30,000 pages in the background.
2. If a significant difference is found, then the CLI creates the CSV entry for the analysts as per your description.
3. Since the last diff has been stored as a file (or database row), the git visualizer can pull that and do the parsing at that particular time. Thus we incur the cost of transcoding to HTML only when it's necessary.
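A minimal sketch (not from the thread) of what step 1 might look like in Node: shell out to `git diff --no-index` to compare a stored snapshot against a newer copy of the page. The paths and the surrounding cron plumbing are assumptions.

```js
// Illustrative sketch only: compare two saved HTML files with git and keep
// the diff when they differ. `git diff --no-index` works on files outside a
// repository and exits with code 1 when the inputs differ.
const { execFile } = require('child_process');
const fs = require('fs');

function diffSnapshots(previousPath, currentPath, outputPath, done) {
  execFile('git', ['diff', '--no-index', '--', previousPath, currentPath],
    (err, stdout) => {
      if (err && err.code !== 1) return done(err);          // a real git failure
      if (!stdout) return done(null, false);                 // pages identical, no diff kept
      fs.writeFile(outputPath, stdout, e => done(e, true));  // store the diff for later
    });
}
```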
I love the idea. Question: what if the HTML (DOM) context is what tells us whether a diff is significant?
…On February 10, 2017 10:07:57 AM EST, Allan Pichardo ***@***.***> wrote:
From what I've observed with the Pagefreezer API, taking a diff of 2
pages takes an average of 5 seconds. If there are ~ 30,000 pages to
monitor, then pagefreezer is probably not the most appropriate diff
service for this task. I think the main bottleneck in Pagefreezer is
that they transcode the diff information into HTML for every request. I
have run similar diffs on my machine using Git diff and it usually
takes one second or less.
Here's what I suggest:
1. We make a command line tool that creates git diffs of 2 pages and
saves them to file (or a local database). This command line tool can
take optional filters so we can remove unimportant parts of the page
over time. The cli tool could be set as a cron task on a server and run
daily to diff the 30,000 pages in the background.
2. If a significant difference is found, then the CLI creates the CSV
entry for the analysts as per your description.
3. Since the last diff has been stored as a file (or database row) the
git visualizer can pull that and do the parsing at that particular
time. Thus we incur the cost of transcoding to HTML only when it's
necessary.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#17 (comment)
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
@titaniumbones Yeah, I suspect that it will. If we do this with Node.js, then we have the option of using jQuery to parse the HTML archives. We can determine that certain HTML nodes are insignificant, such as `<head meta>` for example or `<rel type="stylesheet">` etc etc... Therefore, at the time that an archive is loaded from disk, the CLI uses jQuery to delete such nodes from the text blob and _then_ execute the diff. The diff would look like a regular git diff, but then the visualizer would have some logic that could convert the ++++ ----- syntax into `<ins><del>` syntax and we'll output that on screen. When viewing the page in the visualizer, an analyst would have the option of selecting a DOM element and saying that it's insignificant, thus adding it to an ongoing list that would be fed back to the CLI on the next cycle.
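As a rough illustration of that filtering step, here is a sketch using cheerio (a jQuery-style HTML parser for Node) rather than jQuery proper; the selector list is only an example of the kind of ignore list an analyst might build up, not anything decided in this thread.

```js
// Sketch: strip nodes the analysts have marked insignificant before diffing.
const cheerio = require('cheerio');

function stripInsignificant(html, ignoredSelectors) {
  const $ = cheerio.load(html);
  ignoredSelectors.forEach(sel => $(sel).remove()); // drop each ignored node
  return $.html();                                  // serialized, filtered HTML
}

// Example ignore list, fed back from the visualizer on each cycle.
const ignored = ['head meta', 'link[rel="stylesheet"]', 'script'];
```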
@allanpichardo Really disappointing to hear the PageFreezer API moves so slow but, as you've described, it seems like we have plenty of other options (and, of course, we always knew we had git as a backup). Your 1-3 above sound really great, and I think it makes perfect sense, as you said in point 3, to only parse the diff in the visualizer when it's called back up by an analyst. Regarding your last comment - I think it sounds great to have the option in the visualizer, maybe in some simple dropdown form, of marking a diff as insignificant. That'll mean everything lives just in that simple visualizer, and everything else will run in the CLI. The only other thing to add, then, would be to have a couple of different visualization options: perhaps a "side-by-side" or "in-line" page view for changes, but then also a "changes only" view (very useful for huge pages with only a few changes). I'll write something about that in issue #19, the visualization issue, as well.
I think this is great. In your opinion, are there pieces of this I should ask folks to work on in SF tomorrow?
…On February 10, 2017 10:29:33 AM EST, Allan Pichardo ***@***.***> wrote:
@titaniumbones Yeah, I suspect that it will. If we do this with
Node.js, then we have the option of using jQuery to parse the HTML
archives. We can determine that certain HTML nodes are insignificant,
such as `<head meta> ` for example or `<rel type="stylesheet">` etc
etc... Therefore, at the time that an archive is loaded from disk, then
the CLI uses jQuery to delete such nodes from the text blob and _then_
execute the diff.
The diff would look like a regular git diff, but then the visualizer
would have some logic that could convert the ++++ ----- syntax into
<ins><del> syntax and we'll output that on screen.
When viewing the page in the visualizer, an analyst would have the
option of selecting a DOM element and saying that it's insignificant,
thus adding it to an ongoing list that would be fed back to the CLI on
the next cycle.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#17 (comment)
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
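A toy version of the `++++ -----` to `<ins>/<del>` conversion described above might look like the following; a real implementation would also need HTML escaping and hunk-header handling.

```js
// Illustrative only: map unified-diff added/removed lines to <ins>/<del>.
function diffToMarkup(unifiedDiff) {
  return unifiedDiff.split('\n').map(line => {
    if (line.startsWith('+') && !line.startsWith('+++')) return `<ins>${line.slice(1)}</ins>`;
    if (line.startsWith('-') && !line.startsWith('---')) return `<del>${line.slice(1)}</del>`;
    return line; // context lines and hunk headers pass through unchanged
  }).join('\n');
}
```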
@ambergman Yes, here's what I see overall at a high level. I understand that the 30,000 URLs are kept in the spreadsheet. Those are live URLs. Where are the previous versions archived? Are they held remotely on another server? I ask because I'd like to know if it's possible to have a structure such that:
1. There is a directory which holds the last known HTML of each page, and each filename, for simplicity, could be the URL itself (plaintext).
2. A Node script can open the spreadsheet and go line by line, comparing the stored HTML in the directory with the live version of the page, and update the spreadsheet accordingly for each page. (I say Node because JavaScript is so good for DOM traversal, but if there's a better idea for this, I'm all ears.)
3. (This part needs to be worked out.) By some heuristic, determine if a diff is significant enough to keep, and store that diff as a text file in another directory.
4. If the diff text file is readable from the web, that would be OK (preferable); otherwise, we would have to insert the path and some kind of ID into a database table. The visualization link can be something with the ID of the diff text.
5. Insert that URL in the spreadsheet.
6. Have the visualizer recognize the IDs and pull the corresponding diff file and display it.
@titaniumbones If this architecture is something we can work with, then maybe the SF group can set up the directory structure, put some test files in it, and start a CLI utility that would read the files, use git to diff them with the live URLs, and save the diffs into another directory.
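For point 2 of the structure above, a skeleton of the comparison loop could look like the following; the on-disk layout (URL-encoded filenames) and the plain string comparison are placeholders for whatever the group settles on.

```js
// Sketch: walk a list of URLs, compare the stored copy with the live page,
// and collect the URLs that changed. Significance filtering would come later.
const fs = require('fs');
const path = require('path');
const https = require('https');

function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

async function findChangedUrls(urls, storageDir) {
  const changed = [];
  for (const url of urls) {
    const storedFile = path.join(storageDir, encodeURIComponent(url));
    const stored = fs.readFileSync(storedFile, 'utf8');
    const live = await fetchPage(url);
    if (stored !== live) changed.push(url); // a real check would diff, not compare bytes
  }
  return changed;
}
```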
Allan Pichardo <notifications@github.com> writes:
@ambergman Yes, here's what I see overall at a high level,
I understand that the 30,000 URLs are kept in the spreadsheet. Those
are live URLs. Where are the previous versions archived? Is it held
remotely on another server?
We don't know where the long-term storage will be. We have talked to a
bunch of different sponsors about this and have not been able to get a
firm commitment from anyone yet. For now they are stored on a cluster
that's not entirely easy to access from the web. I've exposed the
zipfile for the 30,000 (actually for some reason there are 2 zipfiles,
one large and one small, from the same day) & a couple of the domains here, over http:
http://edgistorage.hackinghistory.ca/
You can download it yourself there, but the large zipfile is about 7 GB.
I've also unzipped the zipfile in
http://edgistorage.hackinghistory.ca/storage, and you can see the funky
directory structure there.
I think there's something about this in the docs in `pagefreezer-cli`,
but I'm on a low-bandwidth connection in the airport and browsing is a little
hard.
I ask because if it's possible to have a structure such that:
1. There is a directory which holds the last known HTML of each page and each filename, for simplicity, could be the URL itself. (plaintext)
take a look at the zipfile structure. Probably we could do that but it
might be a little frustrating.
2. Start node script that can open the spreadsheet and go line by line, comparing the stored HTML in the directory with the live version of the page, and update the spreadsheet accordingly for each HTML. (I say Node because JavaScript is so good for DOM traversal, but if there's a better idea for this, I'm all ears)
yeah, sounds great. I think Ruby & Python also have DOM-aware diff
programs; again, see docs.
3. (this part needs to be worked out)
^^ yup!!
By some heuristic, determine if a diff is significant enough to keep, and store that diff as a text file in another directory.
4. If the diff text file is readable from the web, that would be OK (preferable), otherwise, we would have to insert the path and some kind of ID into a database table. The visualization link can be something with the ID of the diff text.
Whatever solution we come up with, we will need to make all this stuff
accessible to the web. So, let's take that as a given when we're
building a test case.
5. Insert that URL in the spreadsheet
6. Have the visualizer recognize the IDs and pull the corresponding diff file and display it.
yup, sounds great.
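Point 3 is left open in the thread; one purely illustrative heuristic (not proposed by anyone here) would be to count non-whitespace added or removed lines and keep the diff once it crosses some threshold.

```js
// Illustrative heuristic only: how many substantive lines does the diff touch?
function isSignificant(unifiedDiff, threshold = 3) {
  const changedLines = unifiedDiff.split('\n').filter(line =>
    (line.startsWith('+') || line.startsWith('-')) &&
    !line.startsWith('+++') && !line.startsWith('---') &&
    line.slice(1).trim().length > 0); // ignore whitespace-only changes
  return changedLines.length >= threshold;
}
```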
@titaniumbones The directory structure from the zip files will work because the URL is preserved in the file structure. It seems that the directories per domain mirror the exact remote directory structure, so this is good for knowing what compares to what. The only issue is that the archives come with a lot of other files that we don't need, but that's OK; for this purpose, we can just traverse the tree, take the HTML files, and ignore the others. So I suppose, wherever this service runs, it could download an archive zip, extract it, run through it creating the diffs and updating the spreadsheet, and when the process is done, delete the downloaded archive. Then rinse and repeat.
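The traverse-and-take-only-HTML step might be as simple as the following recursive walk (a sketch assuming Node 10+ for `withFileTypes`):

```js
// Sketch: collect every .html file under an extracted archive directory.
const fs = require('fs');
const path = require('path');

function collectHtmlFiles(dir, found = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) collectHtmlFiles(full, found);  // recurse into subdirectories
    else if (entry.name.endsWith('.html')) found.push(full); // keep only HTML files
  }
  return found;
}
```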
Concerns:
I've been playing around with the idea of an architecture that relies on S3 and AWS Lambdas (to avoid running-server costs). I wanted to get more details on the current situation to avoid proposing solutions to problems already solved, but maybe it's better to just spit it out. Two pieces:
A. the archiving / diff-emitter system
B. the diff alerting/pre-filtering/exploration system
(If you're not familiar with AWS Lambda, it's an AWS service that allows you to upload code for single Node or Python functions, which are invoked according to "triggers" you define, such as a new object being put in S3 or an HTTP request to some static URL. They're kind of a pain in the butt, but the payoff is that you only pay $0.000000208 - $0.000002501 per call and AWS takes care of handling scale-outs for unexpected bursts of load.)
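For a sense of what piece A could look like under that model, here is a very rough Node Lambda handler triggered by a new snapshot landing in S3; the bucket layout and the diff step itself are placeholders, not anything agreed in this thread.

```js
// Illustrative S3-triggered Lambda handler: react to each new snapshot object
// and hand it off to a diff step (omitted here).
exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    console.log(`new snapshot: s3://${bucket}/${key}`);
    // ...fetch the previous version, run the diff, emit a CSV row if significant
  }
};
```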
FYI, I am from PageFreezer. To clarify the slowness of our Diff API, I would like to give some context. As @allanpichardo mentioned, one API call takes around 5 seconds on average. Actually, this is due to the network latency of AWS Lambda / API Gateway. For the purpose of this project, we took the diff service from our production system and made an AWS Lambda version of it. Our internal benchmark is as follows:
So, to use the diff API with a large number of files, I recommend using multi-threaded (or event-driven) client code to make the API calls. AWS Lambda will serve them at scale. To compare 30k files with 100 threads, it should take about 41 minutes. The multithreaded client doesn't do anything except make a connection to the API and wait for its result. As a note, whatever tool or algorithm you use to produce a diff, the time it takes will be roughly proportional to the source's HTML structure (number of DOM nodes and text size). EOF
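A sketch of the kind of multi-threaded/event-driven client being recommended, written as a Node worker pool with a fixed concurrency limit; the endpoint URL, headers, and payload shape are placeholders, not the real PageFreezer API.

```js
// Illustrative concurrent client: POST diff jobs with up to `limit` requests
// in flight at once, letting the service scale out behind API Gateway.
const https = require('https');

function postDiff(payload) {
  return new Promise((resolve, reject) => {
    const req = https.request('https://diff-service.example.com/diff',
      { method: 'POST', headers: { 'Content-Type': 'application/json' } },
      res => {
        let body = '';
        res.on('data', chunk => { body += chunk; });
        res.on('end', () => resolve(body));
      });
    req.on('error', reject);
    req.end(JSON.stringify(payload));
  });
}

async function runWithConcurrency(jobs, limit = 100) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const job = jobs[next++]; // safe: no await between the check and the increment
      results.push(await postDiff(job));
    }
  }
  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}
```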
In addition: consider bulk-uploading the HTML pages to AWS first before you use the PageFreezer Diff service; it may decrease the network latency.
Have you folks talked to @markjohngraham about how some of these HTML diff problems are (or could be) addressed in the Wayback Machine -- web.archive.org?
Also, in terms of long-term storage, has Internet Archive's S3 API been considered? https://github.com/vmbrasseur/IAS3API
@mekarpeles We have not talked to markjohngraham, but we've been talking to Jefferson a little about IA as the end-game for this effort. We haven't got into the nitty-gritty, but clearly IA seems like the best home for this effort in the long run! Sorry to have missed you in SF! I was looking forward to meeting you but missed the connection somehow.
As long as everything gets backed up and the community is able to find a way to produce a registry so other institutions can cross-check and participate, I'm a super happy camper! Very thankful for your + team's efforts. Sorry to have missed you in SF as well! @flyingzumwalt had great things to say about you :)
This issue was moved to edgi-govdata-archiving/web-monitoring#9
To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.
I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like for clarity.
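A minimal sketch of that outputter contract, with placeholder columns until the real versionista-outputter column list is posted: given a diff string (or null for an unchanged page), append at most one CSV row.

```js
// Illustrative only: a null diff means no row; otherwise append one CSV row.
const fs = require('fs');

function writeDiffRow(csvPath, url, diff) {
  if (!diff) return false;                                    // unchanged page: no row
  const row = [new Date().toISOString(), url, diff.length]    // placeholder columns
    .map(value => `"${String(value).replace(/"/g, '""')}"`)   // naive CSV escaping
    .join(',') + '\n';
  fs.appendFileSync(csvPath, row);
  return true;
}
```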