Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/non ascii in xml (rebased) #796

Closed
wants to merge 5 commits into from

Conversation

shino
Copy link
Contributor

@shino shino commented Feb 6, 2014

This PR addresses two issues #787 and #628 (rebased version of #794 ).

The plan to fix them and possible other XML-with-multibyte(non-ASCII)-character
issues are as follows:

  • riak_cs_xml:format_value(Val) treats conversion between lists/binaries and
    Unicode strings
    (xmerl accepts only Unicode strings but not binary or latin-1-ish strings).
  • In the function clause when is_list(Val), use list_to_binary and call
    format_value again with converted binary.
  • In the function clause when is_binary(Val) use unicode:characters_to_list
    to produce Unicode strings.
  • Remove all other unicode:characters_to_list around XML output creation.

Some misc notes:

  • When we take into account Unicode, lists created by binary_to_list should not
    be considered as "strings". This conversion is NOT Unicode-aware, so lists and
    binaries have byte-wise correspondence.
  • On the other hand, unicode:characters_to_list is Unicode-aware, so converted
    lists are not necessarily [0..255]. Each element of a converted list is
    correspond to Unicode codepoint (not so precise but it's almost a character).
  • The talk [1] is good reference for Latin-1 and Unicode in Erlang.
    Codepoints, encodings, string literals, binaries and more.
> A = <<16#E3, 16#81, 16#82>>. % A binary of Japanese Hiragana "A" in UTF-8
<<227,129,130>>
> binary_to_list(A).
[227,129,130]                  % Just byte-wise
> unicode:characters_to_list(A).
[12354]                        % A list of single element (single character)

[1] http://www.erlang-factory.com/conference/ErlangUserConference2013/speakers/PatrikNyblom

@shino shino mentioned this pull request Feb 6, 2014
@shino shino added the Bug label Feb 6, 2014
@shino shino added this to the 1.4.5 milestone Feb 6, 2014
There are some codes where characters are represented by lists
those are generated from user-input binary (e.g. key names)
by binary_to_list/1.
It works for ASCII (or Latin-1) but not for unicode in general.

Handling of characters (binaries or lists of integers) of this fix are as follows:

- For binaries, we assume that they are encoded by UTF-8.
- For lists, we assume they are converted from UTF-8 encoded binaries
  by applying binary_to_list/1.
  This is tricky point but binary_to_list/1 treats binaries as if it is
  Latin-1 encoded, so we should turn them back by list_to_binary first.
  Then we can apply unicode:characters_to_binary safely.
This is the case of #628.
Fix has already been made by the commit b244470
@shino
Copy link
Contributor Author

shino commented Feb 12, 2014

Rebased (again) and force pushed in order to clear cluttered diff.

@reiddraper
Copy link
Contributor

Should this be closed in favor of #807?

@shino
Copy link
Contributor Author

shino commented Feb 14, 2014

Thank you for review! Bug fixed by #807. Close this PR without merge.

@shino shino closed this Feb 14, 2014
@shino shino deleted the bugfix/non-ascii-in-xml-rebased branch February 14, 2014 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants