Don’t request diffs for PDFs #179

Mr0grog · 2018-01-15T17:46:59Z

We know we don’t yet have a differ that supports PDFs, so we should not request diffs for them (because they’ll always appear broken). Instead, we should:

Show each PDF side-by-side for “Side-by-Side Rendered”
Show the new PDF for “Highlighted Rendered”
Disable all other diff types

Does that seem right? Or should we just show alternate options altogether for types that we know we can’t diff? (e.g. “Side by Side,” “Before,” and “After”)

We could also definitely do better on surfacing mime types in source_metadata. Or maybe make it an actual, top-level field on version objects (any thoughts there, @dallan?).

The text was updated successfully, but these errors were encountered:

Mr0grog · 2018-01-15T17:49:33Z

This also goes for situations where the before/after versions are different mime types. This change, for example, is HTML (an error page) before and PDF after: https://monitoring.envirodatagov.org/page/4ca4bc6f-6dc2-4b01-8e7a-94d50baf276e/e053c830-029a-4925-88c2-114b370c4fcc..8c5583c7-02f2-4ee2-a495-748865dc1330

Mr0grog · 2018-01-15T19:11:02Z

Quick survey of file extensions (our only reliable proxy for mime types right now):

Extension	Count
html	449437
pdf	3892
???	2377
wsdl	1689
xml	1506
ksh	306
ics	257
txt	203
rss	118
jpg	106
obj	87
doc	85
zip	35
atom	33
xlb	19
pwz	15
gif	11
rtf	8
csv	8
xls	7
xlsx	6
png	5
docx	4
jpeg	4
mp3	2
ai	1

lightandluck · 2018-01-15T19:32:45Z

How would we find, on staging or production, versions that we could use as tests for those use cases outlined?

Mr0grog · 2018-01-15T19:49:18Z

I would just do what I just did above, use a quick script to roll through the API (keep paging through /api/v0/versions until you get to the end or as far as you want to go) to find a PDF, for example.

You could probably also just search for *.pdf (or *.whatever) in the UI, though you’d only find ones where the URL has the file extension, and not ones where it was only in the Content-Type header (hitting the API as above would fix that). You’d probably still get plenty of results to test with that way, though.

This got much more complex than anticipated! There are a few major bits here: - We now have a `MediaType` type for modeling and comparing different media types. It includes a concept of canonical types, so `parseMediaType('text/html').equals(parseMediaType('application/xhtml')) === true`. If we can reliably clean this up in the DB, we can get rid of that, but it's useful for now. Similarly, we can also get media types for file extensions (it only supports a minimal list of extensions we know we have in the DB). Once the DB has content types for all versions, we can get rid of that, too. - The `diffTypes` module now has a `diffTypesFor(mediaType)` method, allowing us to get the diffs that are appropriate to present for a given type of content. We've also added three "raw" diff types which indicate we should simply show the content in an `<iframe>` because we don't know how to diff it (or if the two versions in question have differing content types, which often happens when a PDF or image responded with an HTML error page [like a 404] for a given version). - The `SelectDiffType` form control now takes a list of diff types (from the above `diffTypesFor()`) to show. - There's a little extra complexity in retrieving an actual diff where we check to see if the diff type has an actual diff service associated with it ("raw" types don't, because they represent viewing the actual content, not a diff). We request the actual raw content there because we need to sniff it for HTML (when we present HTML, we need to set a `<base>` tag so associated images, styles, and scripts work). This part is definitely a little janky, but I don't see a clear improvement without a lot more other changes. It slightly exacerbates the existing race condition issue. The sniffing HTML situation should/could probably be fixed alongside edgi-govdata-archiving/web-monitoring#92, where we basically need some kind of proxy or pre-cleaned-up version to render. - Finally, this makes a slight tweak to the 'no changes' message; it shows a message indicating two versions are exactly the same if their hashes are identical, but shows the old message (with slightly more detail) if the hashes are different but `change_count` is 0. Fixes #179.

Mr0grog added this to the v0.0.2 Stop Versionista Imports milestone Jan 15, 2018

This was referenced Jan 19, 2018

Render word docs using Google or Microsoft embedded renderers #186

Open

Update bad hashes (from hashing UTF-8 encoded version of non-UTF content) edgi-govdata-archiving/web-monitoring-db#205

Closed

Mr0grog self-assigned this Feb 5, 2018

Mr0grog added the in progress label Feb 5, 2018

Mr0grog mentioned this issue Feb 5, 2018

Only request diffs for types we *can* diff #200

Merged

Mr0grog closed this as completed in #200 Feb 18, 2018

Mr0grog removed the in progress label Feb 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don’t request diffs for PDFs #179

Don’t request diffs for PDFs #179

Mr0grog commented Jan 15, 2018

Mr0grog commented Jan 15, 2018 •

edited

Mr0grog commented Jan 15, 2018 •

edited

lightandluck commented Jan 15, 2018

Mr0grog commented Jan 15, 2018

Don’t request diffs for PDFs #179

Don’t request diffs for PDFs #179

Comments

Mr0grog commented Jan 15, 2018

Mr0grog commented Jan 15, 2018 • edited

Mr0grog commented Jan 15, 2018 • edited

lightandluck commented Jan 15, 2018

Mr0grog commented Jan 15, 2018

Mr0grog commented Jan 15, 2018 •

edited

Mr0grog commented Jan 15, 2018 •

edited