Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop HTML encoding unicode characters when serializing XForm #685

Merged
merged 2 commits into from
May 31, 2022

Conversation

seadowg
Copy link
Member

@seadowg seadowg commented May 24, 2022

Closes getodk/collect#4554

What has been done to verify that this works as intended?

New test and verified it fixed the issue using Collect.

Why is this the best possible solution? Were any other approaches considered?

We could attempt to only stop HTML encoding for formid and version, but that would actually be more difficult as we'd have to change the way we serialize - KXmlSerializer only allows us to switch Unicode escaping on or off for the whole document (based on the desired output encoding). This solution also removes weirdness with Emoji (and other high digit Unciode characters) being encoded to two HTML characters which causes problems downstream for Central.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

People doing analysis on submission XML directly will be affected as they might expect HTML encoded files (although this wasn't the case for Enketo submissions anyway). There could be other unforeseen consequences, so I'd advocate for us getting this into an early Collect beta and then announcing the change on the forum.

Do we need any specific form for testing your changes? If so, please attach one.

Any form that has Unicode characters anywhere and also submissions that include Unicode characters (like an Emoji in a text question)

@seadowg seadowg marked this pull request as ready for review May 25, 2022 12:35
@seadowg seadowg requested a review from lognaturel May 25, 2022 12:35
@lognaturel
Copy link
Member

To evaluate risk and potential downstream effects, @seadowg will look into:

  • Whether instances written by Collect are currently UTF-8 encoded
  • Whether instances written by either Enketo or Collect have anything in the encoding XML root attribute

@lognaturel
Copy link
Member

lognaturel commented May 25, 2022

Strangely, someone just reported running into a related issue. Enketo can't parse an XML submission with complex emojis. That led me to poke around some. No encoding on the XML root.

I believe then the resulting files are ASCII which is a subset of UTF-8. Currently all characters in submissions are guaranteed to only include the 128 ASCII chars (which are also part of UTF-8). Actually UTF-8 encoding unicode characters instead of writing their HTML codes would be a change for sure. We would probably want to add the encoding attribute as well. We know Central will be fine with this but it's hard to reason about possible implications.

@lognaturel
Copy link
Member

lognaturel commented May 25, 2022

A submission that fails has �� (UTF-16 surrogate pairs). But when I try to add the 😔 emoji my XML has 😔 in it.

Similarly, the issue you saw was with 👍, right? When I add that in Collect it is 👍 and works end-to-end. You got ��. Seems like it's device-related which encoding is used? This explains why we haven't heard of this issue. I was pretty sure I'd tried emoji before and seen people use them.

I can't create a submission that fails from my device no matter what. 👍🏿👨‍👨‍👧‍👦🐼 -- all fine. See https://test.getodk.cloud/#/projects/149/forms/all-widgets/submissions and feel free to make additional submissions.

@seadowg
Copy link
Member Author

seadowg commented May 26, 2022

Whether instances written by Collect are currently UTF-8 encoded

Ok so after doing a bunch of (probably too frantic) research, it seems that the only way to flag that a file uses UTF-8 or ASCII (or something else) is to add a BOM (byte-order marker) to the beginning of the file. It seems like that's not actually that common and most text editors etc just assume UTF-8. Collect's instance/submission files don't have a BOM as far as I can see (using hexdump -n 3 -C <file> and then comparing to https://www.garykessler.net/library/file_sigs.html) and it seems that's pretty normal from seeing Stack Overflows where people are having a hard time reading files WITH a BOM. My impression is that the best way of flagging "this file is UTF-8" is to use the encoding attribute in XML...

Whether instances written by either Enketo or Collect have anything in the encoding XML root attribute

...but it doesn't seem like Collect (or Enketo) include that.

I'd definitely be happy for someone to double-check all my thinking here, but from all this I think we should do a few things:

  1. Merge this PR (which as far as we can see creates UTF-8 files)
  2. Add the encoding attribute to instance/submission files in Collect and Enketo
  3. Announce on the forum that Collect submissions files will be UTF-8 and no longer have escaped unicode

Escaping feels more scary to me as it creates more uncertainty downstream around what kind of HTML escaping as been used. If we're able to flag UTF-8 using standard XML means, we're making it a lot cleaner to any consumers what to expect when reading the file. For the golden path of Collect/Enketo -> Central this shouldn't change anything (other than fixing problems with complex emoji) as Central as people will most likely be using CSV exports/OData APIs and these were already decoding the HTML encoded characters.

@lognaturel
Copy link
Member

I agree with your reasoning. Let's make sure this is in our next beta.

Happy to write the forum post unless you'd rather take this one all the way now that you've done a deep dive!

@lognaturel lognaturel merged commit adde512 into getodk:master May 31, 2022
@seadowg seadowg mentioned this pull request Jun 1, 2022
3 tasks
@seadowg seadowg deleted the utf branch June 2, 2022 09:27
@seadowg
Copy link
Member Author

seadowg commented Jun 2, 2022

Happy to write the forum post unless you'd rather take this one all the way now that you've done a deep dive!

Can do! I'll write it up as part of the release notes for the v2022.3 Beta so that it can go in Github Releases and the Forum (for maximum 👀).

lognaturel pushed a commit to lognaturel/javarosa that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't HTML-encode formid or version
2 participants