Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/non ascii in xml #794

Closed
wants to merge 5 commits into from
Closed

Bugfix/non ascii in xml #794

wants to merge 5 commits into from

Conversation

shino
Copy link
Contributor

@shino shino commented Feb 5, 2014

This PR addresses two issues #787 and #628.

The plan to fix them and possible other XML-with-multibyte(non-ASCII)-character
issues are as follows:

  • riak_cs_xml:format_value(Val) treats conversion between lists/binaries and
    Unicode strings
    (xmerl accepts only Unicode strings but not binary or latin-1-ish strings).
  • In the function clause when is_list(Val), use list_to_binary and call
    format_value again with converted binary.
  • In the function clause when is_binary(Val) use unicode:characters_to_list
    to produce Unicode strings.
  • Remove all other unicode:characters_to_list around XML output creation.

Some misc notes:

  • When we take into account Unicode, lists created by binary_to_list should not
    be considered as "strings". This conversion is NOT Unicode-aware, so lists and
    binaries have byte-wise correspondence.
  • On the other hand, unicode:characters_to_list is Unicode-aware, so converted
    lists are not necessarily [0..255]. Each element of a converted list is
    correspond to Unicode codepoint (not so precise but it's almost a character).
  • The talk [1] is good reference for Latin-1 and Unicode in Erlang.
    Codepoints, encodings, string literals, binaries and more.
> A = <<16#E3, 16#81, 16#82>>. % A binary of Japanese Hiragana "A" in UTF-8
<<227,129,130>>
> binary_to_list(A).
[227,129,130]                  % Just byte-wise
> unicode:characters_to_list(A).
[12354]                        % A list of single element (single character)

[1] http://www.erlang-factory.com/conference/ErlangUserConference2013/speakers/PatrikNyblom

There are some codes where characters are represented by lists
those are generated from user-input binary (e.g. key names)
by binary_to_list/1.
It works for ASCII (or Latin-1) but not for unicode in general.

Handling of characters (binaries or lists of integers) of this fix are as follows:

- For binaries, we assume that they are encoded by UTF-8.
- For lists, we assume they are converted from UTF-8 encoded binaries
  by applying binary_to_list/1.
  This is tricky point but binary_to_list/1 treats binaries as if it is
  Latin-1 encoded, so we should turn them back by list_to_binary first.
  Then we can apply unicode:characters_to_binary safely.
This is the case of #628.
Fix has already been made by the commit d742add
@shino
Copy link
Contributor Author

shino commented Feb 5, 2014

memo: This PR has conflict with #791. (almost trivial to merge)

@shino shino added the Bug label Feb 5, 2014
@shino shino added this to the 1.4.5 milestone Feb 5, 2014
@reiddraper
Copy link
Contributor

All tests passing. Reviewing the code now.

@reiddraper
Copy link
Contributor

Double-checking my Erlang and unicode: is this true? For-all utf-8 encoded binaries, B:

B =:= list_to_binary(binary_to_list(B)).

end, ListOfNodeStats),
rtcs:json_get(<<"Samples">>, NodeStats);
node_samples_from_content(xml, Node, Content) ->
{Usage, _Rest} = xmerl_scan:string(unicode:characters_to_list(Content, utf8)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a lot of the other code, we simply binary_to_list/1 UTF-8 encoded binaries into lists. Does the xmerl API demand we give it a list of character points instead?

@reiddraper
Copy link
Contributor

I can't think of 'a better way', but one thing that is confusing (and already exists in the codebase) is the mixed use of binaries (presumably UTF-8 encoded), lists (constructed from binary_to_list/1) and lists of character points. Is there any way to be more consistent, or are we doing the best we can already?

%% These five numbers represents "あいうえお" (A-I-U-E-O in Japanese).
%% unicode:characters_to_binary([12354,12356,12358,12360,12362]).
Chars = [12354,12356,12358,12360,12362],
binary_to_list(unicode:characters_to_binary(Chars)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal note: this is equal to <<"あいうえお"/utf8>>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minus the binary_to_list part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must use it! Will fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, as your comment mentions, your code doesn't depend on UTF-8 encoding in the actual file, which might be desirable. That being said, I assume most (all?) of us are using text-editors that use UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, editor's default encoding is UTF-8 :)
Another file encoding issue is about compiler. Erlang compiler thinks files are Latin-1 encoded (until R15, also in R16 but can be overwritten magic comments on files).

"あ" is three-byte in utf-8 encoded file, compiler sees it as if latin-1 and convert to three latin-1 characters. Then applies /utf8 conversion so it results a binary of six bytes (oops).

@shino
Copy link
Contributor Author

shino commented Feb 6, 2014

Double-checking my Erlang and unicode: is this true? For-all utf-8 encoded binaries, B:

True (or I believe it's true).

@shino
Copy link
Contributor Author

shino commented Feb 6, 2014

Does the xmerl API demand we give it a list of character points instead?

xmerl's document says it accepts binary as element of deep list (direct binary is not accepted).
Sample code of list of code points and binary > https://gist.github.com/shino/3c5ab84980dde4e7ab24
It seems xmerl converts a binary to byte-wise list (same as binary_to_list).

@shino
Copy link
Contributor Author

shino commented Feb 6, 2014

One trial branch : release/1.4...bugfix/non-ascii-in-xml-2
Difference from bugfix/non-ascii-in-xml is only this commit: 6a13a83

No Unicode code points appears in generating XML.
The trick behind is that xmerl:export_simple convert list of binaries to b2l'd lists.
(I wish it accepted direct binary()).

Anyway, XML generation is now encapsulated in riak_cs_xml, so switching
these two ways (or another?) is rather easy.

@shino
Copy link
Contributor Author

shino commented Feb 6, 2014

Because of conflict with #789, rebased and created new PR #796 .

@shino
Copy link
Contributor Author

shino commented Feb 6, 2014

(reply to myself)

Avoiding unicode module at bugfix/non-ascii-in-xml-2 seems not good for me...
unicode can be avoided in generating XML by xmerl:export_simple but parsing XML
by xmerl_scan:string returns text value of Unicode code points. So they are not symmetric. 🙈

@shino
Copy link
Contributor Author

shino commented Feb 12, 2014

Continue in #796 or #807

@shino shino closed this Feb 12, 2014
@shino shino deleted the bugfix/non-ascii-in-xml branch February 14, 2014 00:38
@shino
Copy link
Contributor Author

shino commented Feb 14, 2014

memo: #807 is finally merged to release branch release/1.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants