Handle charset encoding declaration #29

Closed
Treora opened this issue Jul 23, 2018 · 1 comment
Labels
snapshot quality Improving fidelity/size/durability/etc of the output

Comments


Treora commented Jul 23, 2018

The document may have a <meta charset="..."> tag in the <head>, but that declaration becomes obsolete, because we work with the parsed document and later stringify it again. I suppose we could/should delete it from the DOM when capturing the page.

Conversely, we may want to add the appropriate <meta charset="..."> tag to the snapshot; but this seems a task for the application invoking freeze-dry, as we do not know in which encoding the application will store the string.

We could thus:

  • Leave the snapshot without a charset declaration and tell callers to add it themselves. But they won't have the parsed DOM, making this a hassle.
  • Let the application pass the desired encoding declaration as an option to freezeDry(...), which is easier for callers.
  • Alternatively, HTML-encode all non-ASCII characters so our string only contains plain ASCII, which I presume (rightly or wrongly?) removes the need for declaring the charset.
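
To make the third option concrete, here is a minimal sketch of escaping every non-ASCII character as a numeric character reference. The function name `escapeNonAscii` is made up for illustration and is not part of freeze-dry. Note the limits of the presumption: an ASCII-only string decodes identically under UTF-8 and other ASCII-compatible encodings, but not under e.g. UTF-16, and character references are not interpreted inside <script> or <style> contents, so this could only be applied to text and attribute values.

```javascript
// Hypothetical helper (not freeze-dry's API): replace every non-ASCII code
// point with a hexadecimal numeric character reference.
function escapeNonAscii(html) {
  // The `u` flag makes the character class match whole code points, so an
  // emoji becomes one reference instead of two surrogate halves.
  return html.replace(
    /[^\x00-\x7F]/gu,
    (char) => `&#x${char.codePointAt(0).toString(16).toUpperCase()};`
  );
}

const escaped = escapeNonAscii('<p>Café \u{1F600}</p>');
console.log(escaped); // <p>Caf&#xE9; &#x1F600;</p>
```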

Treora added the snapshot quality label Apr 6, 2019


Treora commented Jun 2, 2019

Resolved in commit cefd79c, which adds an encoding declaration as requested by the user (the second option above), while presumptively defaulting to utf-8. My reasoning, as put in the commit message:

Since we return a string, how the user will encode that string should
ideally not matter to us. However, as HTML has the remarkable approach
of declaring the encoding somewhere inside the string, the user would
need to parse part of the DOM again to insert the declaration at the
right spot. If the user already knows how it will encode the string
afterward, I suppose we can help by inserting the declaration already.

In any case, we should remove any encoding declarations that the page
originally had, because the file is always reencoded.

Regarding the default action, an intuitive behaviour would be to not add
any meta tag. But because utf-8 is the most widespread and officially
recommended encoding for web documents, and also because many javascript
APIs use it as the default (or only) encoding (e.g. the Blob
constructor), it feels like a helpful default.
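
The behaviour described in the commit message could be sketched roughly as follows. This is an illustration only, not the code from commit cefd79c: the function name `setCharsetDeclaration` and its parameters are assumptions, and it presumes a browser-like DOM where `doc` is the parsed (cloned) document.

```javascript
// Hypothetical helper, not freeze-dry's actual code. `doc` is a parsed
// Document; `charset` is the encoding the caller intends to use when storing
// the resulting string, or null/'' to add no declaration at all.
function setCharsetDeclaration(doc, charset = 'utf-8') {
  // Remove any declarations the page originally had, in both common forms:
  //   <meta charset="...">
  //   <meta http-equiv="Content-Type" content="text/html; charset=...">
  const selector = 'meta[charset], meta[http-equiv="content-type" i]';
  doc.querySelectorAll(selector).forEach((element) => element.remove());

  if (charset) {
    const meta = doc.createElement('meta');
    meta.setAttribute('charset', charset);
    // Insert it as the first child of <head>, since the HTML spec requires
    // the declaration to fall within the first 1024 bytes of the file.
    doc.head.prepend(meta);
  }
}
```

In a browser, this could be tried with `setCharsetDeclaration(document.cloneNode(true))`, after which serialising the clone yields a string that starts its <head> with the utf-8 declaration.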

I suppose that snapshots have so often worked fine so far simply because many web pages have a utf-8 declaration, which we did not remove, while applications (at least the WebMemex browser extension) also use utf-8 encoding.
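
For comparison, a small sketch of what goes wrong when the declared and the actual encoding do not match. It encodes a string as UTF-8 (which is also what the Blob constructor does with strings) and then reads the bytes back once as UTF-8 and once as Latin-1; the Latin-1 reading is simulated by mapping each byte to one character, as an assumption standing in for a page that declares a single-byte encoding.

```javascript
const text = 'café';

// Encode the string as UTF-8; "é" takes two bytes (0xC3 0xA9).
const utf8Bytes = new TextEncoder().encode(text);

// Matching declaration: decoding as UTF-8 restores the original text.
const asUtf8 = new TextDecoder('utf-8').decode(utf8Bytes);

// Mismatched declaration: reading one character per byte (Latin-1 style)
// turns the two bytes of "é" into two separate characters.
const asLatin1 = String.fromCharCode(...utf8Bytes);

console.log(utf8Bytes.length); // 5
console.log(asUtf8);           // café
console.log(asLatin1);         // cafÃ©
```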
