Only request diffs for types we can diff #200

Mr0grog · 2018-02-05T08:10:05Z

This got much more complex than anticipated! There are a few major bits here:

We now have a MediaType type for modeling and comparing different media types. It includes a concept of canonical types, so parseMediaType('text/html').equals(parseMediaType('application/xhtml')) === true. If we can reliably clean this up in the DB, we can get rid of that, but it's useful for now.
Similarly, we can also get media types for file extensions (it only supports a minimal list of extensions we know we have in the DB). Once the DB has content types for all versions, we can get rid of that, too.
The diffTypes module now has a diffTypesFor(mediaType) method, allowing us to get the diffs that are appropriate to present for a given type of content. We've also added three "raw" diff types which indicate we should simply show the content in an <iframe> because we don't know how to diff it (or if the two versions in question have differing content types, which often happens when a PDF or image responded with an HTML error page [like a 404] for a given version).
The SelectDiffType form control now takes a list of diff types (from the above diffTypesFor()) to show.
There's a little extra complexity in retrieving an actual diff where we check to see if the diff type has an actual diff service associated with it ("raw" types don't, because they represent viewing the actual content, not a diff). We request the actual raw content there because we need to sniff it for HTML (when we present HTML, we need to set a <base> tag so associated images, styles, and scripts work). This part is definitely a little janky, but I don't see a clear improvement without a lot more other changes. It slightly exacerbates the existing race condition issue. The sniffing HTML situation should/could probably be fixed alongside Solve rendering problem when iframe loads content from https or touches storage/cookies web-monitoring#92, where we basically need some kind of proxy or pre-cleaned-up version to render.
Finally, this makes a slight tweak to the “no changes” message; it shows a message indicating two versions are exactly the same if their hashes are identical, but shows the old message (with slightly more detail) if the hashes are different but change_count is 0.

Fixes #179.

A page where two versions have differing content types (HTML vs. PDF)

Viewing only the newer version of a page and getting raw content because it is a PDF

Message about identical version content

This got much more complex than anticipated! There are a few major bits here: - We now have a `MediaType` type for modeling and comparing different media types. It includes a concept of canonical types, so `parseMediaType('text/html').equals(parseMediaType('application/xhtml')) === true`. If we can reliably clean this up in the DB, we can get rid of that, but it's useful for now. Similarly, we can also get media types for file extensions (it only supports a minimal list of extensions we know we have in the DB). Once the DB has content types for all versions, we can get rid of that, too. - The `diffTypes` module now has a `diffTypesFor(mediaType)` method, allowing us to get the diffs that are appropriate to present for a given type of content. We've also added three "raw" diff types which indicate we should simply show the content in an `<iframe>` because we don't know how to diff it (or if the two versions in question have differing content types, which often happens when a PDF or image responded with an HTML error page [like a 404] for a given version). - The `SelectDiffType` form control now takes a list of diff types (from the above `diffTypesFor()`) to show. - There's a little extra complexity in retrieving an actual diff where we check to see if the diff type has an actual diff service associated with it ("raw" types don't, because they represent viewing the actual content, not a diff). We request the actual raw content there because we need to sniff it for HTML (when we present HTML, we need to set a `<base>` tag so associated images, styles, and scripts work). This part is definitely a little janky, but I don't see a clear improvement without a lot more other changes. It slightly exacerbates the existing race condition issue. The sniffing HTML situation should/could probably be fixed alongside edgi-govdata-archiving/web-monitoring#92, where we basically need some kind of proxy or pre-cleaned-up version to render. - Finally, this makes a slight tweak to the 'no changes' message; it shows a message indicating two versions are exactly the same if their hashes are identical, but shows the old message (with slightly more detail) if the hashes are different but `change_count` is 0. Fixes #179.

Mr0grog · 2018-02-05T20:55:06Z

@danielballan A thing I didn’t attempt to do here but might be a good follow-on would be using the “raw” side-by-side rendering as a fallback when the differ responds with an error. On the flip side, are there some errors where we should simply stop and not fall back to viewing raw versions? (Maybe hash validation?) If yes, we need a way to distinguish. Curious about your thoughts on this.

Mr0grog · 2018-02-07T20:24:52Z

src/components/change-view.jsx

+  }
+
+  if (version.uri) {
+    const extension = version.uri.match(/^([^:]+:\/\/)?.*\/[^/]*(\.[^/]+)$/);


This should probably fall back to the page’s URL if there was no extension in the version URI. It looks like we have quite a few .doc/.docx files where the raw version winds up with no extension, but the original URL had one.

Mr0grog · 2018-02-12T06:04:25Z

If no other feedback/review, I’m going to merge this on Monday (Feb 12).

When checking the file extension on a version URI, also check the page URL (if the version URI doesn't have an extension).

Mr0grog · 2018-02-18T05:31:01Z

Updated the fallback as noted above and added some tests for the MediaType module. Actually going to merge this in the morning this time :P

Mr0grog added the in progress label Feb 5, 2018

Fix typo.

7ca7bc7

Mr0grog requested a review from danielballan February 5, 2018 08:13

Clean up timing/coordination amont state/props

f1f9632

Mr0grog mentioned this pull request Feb 5, 2018

html_token: Stop early with an error for non-HTML content edgi-govdata-archiving/web-monitoring-processing#155

Merged

Mr0grog commented Feb 7, 2018

View reviewed changes

Mr0grog added 3 commits February 17, 2018 21:03

Use extension in page URL as fallback to version

1788bb6

When checking the file extension on a version URI, also check the page URL (if the version URI doesn't have an extension).

Fix copy-paste error in .docx media type

118bce9

Add tests for MediaType

2b2101f

Mr0grog merged commit 474453c into master Feb 18, 2018

Mr0grog deleted the 179-different-diffs-for-different-content branch February 18, 2018 17:42

Mr0grog removed the in progress label Feb 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only request diffs for types we can diff #200

Only request diffs for types we can diff #200

Mr0grog commented Feb 5, 2018 •

edited

Mr0grog commented Feb 5, 2018

Mr0grog Feb 7, 2018

Mr0grog commented Feb 12, 2018

Mr0grog commented Feb 18, 2018

Only request diffs for types we *can* diff #200

Only request diffs for types we *can* diff #200

Conversation

Mr0grog commented Feb 5, 2018 • edited

Mr0grog commented Feb 5, 2018

Mr0grog Feb 7, 2018

Choose a reason for hiding this comment

Mr0grog commented Feb 12, 2018

Mr0grog commented Feb 18, 2018

Only request diffs for types we can diff #200

Only request diffs for types we can diff #200

Mr0grog commented Feb 5, 2018 •

edited