Bugfix/non ascii in xml #794

shino · 2014-02-05T14:00:31Z

This PR addresses two issues #787 and #628.

The plan to fix them and possible other XML-with-multibyte(non-ASCII)-character
issues are as follows:

riak_cs_xml:format_value(Val) treats conversion between lists/binaries and
Unicode strings
(xmerl accepts only Unicode strings but not binary or latin-1-ish strings).
In the function clause when is_list(Val), use list_to_binary and call
format_value again with converted binary.
In the function clause when is_binary(Val) use unicode:characters_to_list
to produce Unicode strings.
Remove all other unicode:characters_to_list around XML output creation.

Some misc notes:

When we take into account Unicode, lists created by binary_to_list should not
be considered as "strings". This conversion is NOT Unicode-aware, so lists and
binaries have byte-wise correspondence.
On the other hand, unicode:characters_to_list is Unicode-aware, so converted
lists are not necessarily [0..255]. Each element of a converted list is
correspond to Unicode codepoint (not so precise but it's almost a character).
The talk [1] is good reference for Latin-1 and Unicode in Erlang.
Codepoints, encodings, string literals, binaries and more.

> A = <<16#E3, 16#81, 16#82>>. % A binary of Japanese Hiragana "A" in UTF-8
<<227,129,130>>
> binary_to_list(A).
[227,129,130]                  % Just byte-wise
> unicode:characters_to_list(A).
[12354]                        % A list of single element (single character)

[1] http://www.erlang-factory.com/conference/ErlangUserConference2013/speakers/PatrikNyblom

There are some codes where characters are represented by lists those are generated from user-input binary (e.g. key names) by binary_to_list/1. It works for ASCII (or Latin-1) but not for unicode in general. Handling of characters (binaries or lists of integers) of this fix are as follows: - For binaries, we assume that they are encoded by UTF-8. - For lists, we assume they are converted from UTF-8 encoded binaries by applying binary_to_list/1. This is tricky point but binary_to_list/1 treats binaries as if it is Latin-1 encoded, so we should turn them back by list_to_binary first. Then we can apply unicode:characters_to_binary safely.

This is the case of #628. Fix has already been made by the commit d742add

shino · 2014-02-05T14:02:43Z

memo: This PR has conflict with #791. (almost trivial to merge)

reiddraper · 2014-02-05T20:21:15Z

All tests passing. Reviewing the code now.

reiddraper · 2014-02-05T20:49:22Z

Double-checking my Erlang and unicode: is this true? For-all utf-8 encoded binaries, B:

B =:= list_to_binary(binary_to_list(B)).

reiddraper · 2014-02-05T21:49:40Z

riak_test/tests/access_stats_test.erl

+                        end, ListOfNodeStats),
+    rtcs:json_get(<<"Samples">>, NodeStats);
+node_samples_from_content(xml, Node, Content) ->
+    {Usage, _Rest} = xmerl_scan:string(unicode:characters_to_list(Content, utf8)),


In a lot of the other code, we simply binary_to_list/1 UTF-8 encoded binaries into lists. Does the xmerl API demand we give it a list of character points instead?

reiddraper · 2014-02-05T21:51:01Z

I can't think of 'a better way', but one thing that is confusing (and already exists in the codebase) is the mixed use of binaries (presumably UTF-8 encoded), lists (constructed from binary_to_list/1) and lists of character points. Is there any way to be more consistent, or are we doing the best we can already?

reiddraper · 2014-02-05T22:22:16Z

riak_test/tests/user_test.erl

+    %% These five numbers represents "あいうえお" (A-I-U-E-O in Japanese).
+    %% unicode:characters_to_binary([12354,12356,12358,12360,12362]).
+    Chars = [12354,12356,12358,12360,12362],
+    binary_to_list(unicode:characters_to_binary(Chars)).


Personal note: this is equal to <<"あいうえお"/utf8>>.

Minus the binary_to_list part.

I must use it! Will fix.

Well, as your comment mentions, your code doesn't depend on UTF-8 encoding in the actual file, which might be desirable. That being said, I assume most (all?) of us are using text-editors that use UTF-8.

Yeah, editor's default encoding is UTF-8 :)
Another file encoding issue is about compiler. Erlang compiler thinks files are Latin-1 encoded (until R15, also in R16 but can be overwritten magic comments on files).

"あ" is three-byte in utf-8 encoded file, compiler sees it as if latin-1 and convert to three latin-1 characters. Then applies /utf8 conversion so it results a binary of six bytes (oops).

shino · 2014-02-06T00:03:05Z

Double-checking my Erlang and unicode: is this true? For-all utf-8 encoded binaries, B:

True (or I believe it's true).

shino · 2014-02-06T01:16:46Z

Does the xmerl API demand we give it a list of character points instead?

xmerl's document says it accepts binary as element of deep list (direct binary is not accepted).
Sample code of list of code points and binary > https://gist.github.com/shino/3c5ab84980dde4e7ab24
It seems xmerl converts a binary to byte-wise list (same as binary_to_list).

shino · 2014-02-06T04:45:19Z

One trial branch : release/1.4...bugfix/non-ascii-in-xml-2
Difference from bugfix/non-ascii-in-xml is only this commit: 6a13a83

No Unicode code points appears in generating XML.
The trick behind is that xmerl:export_simple convert list of binaries to b2l'd lists.
(I wish it accepted direct binary()).

Anyway, XML generation is now encapsulated in riak_cs_xml, so switching
these two ways (or another?) is rather easy.

shino · 2014-02-06T06:44:15Z

Because of conflict with #789, rebased and created new PR #796 .

shino · 2014-02-06T06:47:35Z

(reply to myself)

Avoiding unicode module at bugfix/non-ascii-in-xml-2 seems not good for me...
unicode can be avoided in generating XML by xmerl:export_simple but parsing XML
by xmerl_scan:string returns text value of Unicode code points. So they are not symmetric. 🙈

shino · 2014-02-12T09:25:20Z

Continue in #796 or #807

shino · 2014-02-14T00:39:58Z

memo: #807 is finally merged to release branch release/1.4.

shino added 5 commits February 4, 2014 14:57

Change riak_cs_xml:export_xml to unicode-aware riak_cs_xml:to_xml

5349bcc

Fix bug of accept content type

7cdef50

Add test case of listing users which has non-ASCII UTF-8 characters

bad0589

This is the case of #628. Fix has already been made by the commit d742add

Add test case of access stats with xml format

1160fa4

shino mentioned this pull request Feb 5, 2014

Support multibyte username on xml response. #628

Closed

shino added the Bug label Feb 5, 2014

shino added this to the 1.4.5 milestone Feb 5, 2014

reiddraper reviewed Feb 5, 2014
View reviewed changes

shino mentioned this pull request Feb 6, 2014

Bugfix/non ascii in xml (rebased) #796

Closed

shino closed this Feb 12, 2014

shino deleted the bugfix/non-ascii-in-xml branch February 14, 2014 00:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix/non ascii in xml #794

Bugfix/non ascii in xml #794

shino commented Feb 5, 2014

shino commented Feb 5, 2014

reiddraper commented Feb 5, 2014

reiddraper commented Feb 5, 2014

reiddraper Feb 5, 2014

reiddraper commented Feb 5, 2014

reiddraper Feb 5, 2014

reiddraper Feb 5, 2014

shino Feb 6, 2014

reiddraper Feb 6, 2014

shino Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 12, 2014

shino commented Feb 14, 2014

Bugfix/non ascii in xml #794

Bugfix/non ascii in xml #794

Conversation

shino commented Feb 5, 2014

shino commented Feb 5, 2014

reiddraper commented Feb 5, 2014

reiddraper commented Feb 5, 2014

reiddraper Feb 5, 2014

Choose a reason for hiding this comment

reiddraper commented Feb 5, 2014

reiddraper Feb 5, 2014

Choose a reason for hiding this comment

reiddraper Feb 5, 2014

Choose a reason for hiding this comment

shino Feb 6, 2014

Choose a reason for hiding this comment

reiddraper Feb 6, 2014

Choose a reason for hiding this comment

shino Feb 6, 2014

Choose a reason for hiding this comment

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 6, 2014

shino commented Feb 12, 2014

shino commented Feb 14, 2014