Handle charset encoding declaration #29

Treora · 2018-07-23T18:17:54Z

The document may have a <meta charset="..."> tag in the <head>, but it becomes obsolete because we work with the parsed document and later stringify it again. I suppose we could/should delete it from the DOM when capturing the document.

Vice versa, we may want to add the appropriate <meta charset="..."> tag to the snapshot; but this seems a task for the application invoking freeze-dry, as we do not know in which encoding the application will store the string.

We could thus:

1. Leave the snapshot without a charset declaration and tell callers to add it themselves. But they won't have the parsed DOM, making this a hassle.
2. Let the application pass the desired encoding as an option to freezeDry(...), which is easier for callers.
3. HTML-encode all non-ASCII characters so that our string contains only plain ASCII, which I presume (rightly or wrongly?) removes the need for declaring the charset.
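
To make the last option concrete, here is a minimal sketch of what escaping to plain ASCII could look like. The function name escapeNonAscii is hypothetical and not part of freeze-dry; it works on the serialized string and replaces each non-ASCII character with a numeric character reference.

```javascript
// Hypothetical helper, not part of freeze-dry: replace every non-ASCII
// character in a serialized HTML string with a numeric character reference.
function escapeNonAscii(html) {
  // The `u` flag makes the regex match whole code points, so an astral
  // character (e.g. an emoji) becomes one reference, not two surrogate halves.
  return html.replace(
    /[^\x00-\x7F]/gu,
    ch => `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`
  )
}

console.log(escapeNonAscii('<p>café €</p>'))
// → '<p>caf&#xE9; &#x20AC;</p>'
```

One caveat with this sketch: character references are not decoded inside <script> and <style> elements, so blindly escaping the whole string would change the meaning of any non-ASCII text in those elements.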


Treora · 2019-06-02T09:44:15Z

Resolved in commit cefd79c, which adds an encoding declaration as specified by the user (the second option above), defaulting, perhaps presumptuously, to utf-8. My reasoning, as put in the commit message:

Since we return a string, how the user will encode that string should
ideally not matter to us. However, as HTML has the remarkable approach
of declaring the encoding somewhere inside the string, the user would
need to parse part of the DOM again to insert the declaration at the
right spot. If the user already knows how it will encode the string
afterward, I suppose we can help by inserting the declaration already.

In any case, we should remove any encoding declarations that the page
originally had, because the file is always reencoded.

Regarding the default action, an intuitive behaviour would be to not add
any meta tag. But because utf-8 is the most widespread and officially
recommended encoding for web documents, and also because many javascript
APIs use it as the default (or only) encoding (e.g. the Blob
constructor), it feels like a helpful default.

I suppose that snapshots have so often worked fine so far simply because many web pages have a utf-8 declaration which we did not remove, while applications (at least the WebMemex browser extension) also use utf-8 encoding.
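
A small sketch of what goes wrong when the declaration and the actual encoding disagree. The sample markup is made up for illustration; TextEncoder always produces utf-8, and decoding those bytes as windows-1252 stands in for a reader that trusts a stale declaration.

```javascript
// Made-up sample: the page declares windows-1252, but we store it as utf-8.
const html = '<meta charset="windows-1252"><p>café</p>'

// TextEncoder only supports utf-8, so this is what ends up in storage.
const bytes = new TextEncoder().encode(html)

// A reader that trusts the stale declaration decodes the utf-8 bytes of "é"
// (0xC3 0xA9) as two separate windows-1252 characters.
const misread = new TextDecoder('windows-1252').decode(bytes)
const correct = new TextDecoder('utf-8').decode(bytes)

console.log(misread.includes('cafÃ©')) // → true
console.log(correct === html)          // → true
```

When the original declaration already says utf-8, both decodings agree, which would explain why the leftover tag rarely caused visible problems.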

Treora added the "snapshot quality" label (Improving fidelity/size/durability/etc of the output) Apr 6, 2019

Treora mentioned this issue Jun 1, 2019

Handle encoding of subresources #46

Open

Treora closed this as completed Jun 2, 2019

Treora mentioned this issue Mar 11, 2020

Fix charset encoding of framed documents #51

Open
