Bugfix/non ascii in xml (rebased) #796

shino · 2014-02-06T06:43:29Z

This PR addresses two issues, #787 and #628 (rebased version of #794).

The plan to fix them, and possibly other issues with multibyte (non-ASCII)
characters in XML, is as follows:

- riak_cs_xml:format_value(Val) handles the conversion from lists/binaries to
  Unicode strings
  (xmerl accepts only Unicode strings, not binaries or Latin-1-ish strings).
  - In the function clause for is_list(Val), use list_to_binary and call
    format_value again with the converted binary.
  - In the function clause for is_binary(Val), use unicode:characters_to_list
    to produce a Unicode string.
- Remove all other calls to unicode:characters_to_list around XML output creation.
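
The two clauses described above can be sketched as follows. This is a minimal illustration, not the actual riak_cs_xml code: the module name format_value_sketch is hypothetical, and the real format_value may have further clauses (for example for integers or atoms).

```erlang
-module(format_value_sketch).
-export([format_value/1]).

%% Hypothetical stand-in for riak_cs_xml:format_value/1.

%% Lists are assumed to be byte lists produced by binary_to_list/1,
%% so turn them back into the original binary first.
%% Note: list_to_binary/1 raises badarg if an element is greater than 255.
format_value(Val) when is_list(Val) ->
    format_value(list_to_binary(Val));
%% Binaries are assumed to be UTF-8 encoded. Decode them into a list of
%% Unicode codepoints, the string form that xmerl accepts.
%% Note: for invalid UTF-8 this returns an {error, ...} or {incomplete, ...}
%% tuple, which this sketch does not handle.
format_value(Val) when is_binary(Val) ->
    unicode:characters_to_list(Val).
```

With this sketch, both format_value_sketch:format_value(<<227,129,130>>) and format_value_sketch:format_value([227,129,130]) return [12354].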

Some misc notes:

When we take Unicode into account, lists created by binary_to_list should not
be considered "strings". This conversion is NOT Unicode-aware, so the lists and
binaries correspond byte for byte.
On the other hand, unicode:characters_to_list is Unicode-aware, so the elements
of a converted list are not necessarily in the range 0..255. Each element of a
converted list corresponds to a Unicode codepoint (not exactly, but almost, a character).
The talk [1] is a good reference for Latin-1 and Unicode in Erlang.
It covers codepoints, encodings, string literals, binaries and more.

> A = <<16#E3, 16#81, 16#82>>. % A binary of Japanese Hiragana "A" in UTF-8
<<227,129,130>>
> binary_to_list(A).
[227,129,130]                  % Just byte-wise
> unicode:characters_to_list(A).
[12354]                        % A list of single element (single character)

[1] http://www.erlang-factory.com/conference/ErlangUserConference2013/speakers/PatrikNyblom

Some code represents characters as lists generated from user-input binaries (e.g. key names) by binary_to_list/1. This works for ASCII (or Latin-1) but not for Unicode in general. This fix handles character data (binaries or lists of integers) as follows:

- For binaries, we assume that they are UTF-8 encoded.
- For lists, we assume that they were converted from UTF-8 encoded binaries by applying binary_to_list/1. This is the tricky point: binary_to_list/1 treats binaries as if they were Latin-1 encoded, so we should first turn the lists back into binaries with list_to_binary. Then we can apply unicode:characters_to_binary safely.
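
To illustrate why the byte list has to go back through list_to_binary first, here is a small sketch. The module name byte_list_roundtrip and the function demo/0 are hypothetical, and the example uses unicode:characters_to_list so that the result can be compared with the shell session above.

```erlang
-module(byte_list_roundtrip).
-export([demo/0]).

%% Hypothetical demo module, not part of riak_cs.
demo() ->
    Utf8 = <<16#E3, 16#81, 16#82>>,   % Hiragana "A" encoded in UTF-8
    Bytes = binary_to_list(Utf8),     % byte-wise list: [227,129,130]
    %% Applied directly to the byte list, every integer is taken as a
    %% codepoint of its own, so the three bytes stay three "characters".
    Wrong = unicode:characters_to_list(Bytes),
    %% Restoring the binary first lets the UTF-8 sequence be decoded
    %% into a single codepoint.
    Right = unicode:characters_to_list(list_to_binary(Bytes)),
    {Wrong, Right}.
```

Calling byte_list_roundtrip:demo() returns {[227,129,130],[12354]}.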

This is the case of #628. The fix has already been made in commit b244470.

shino · 2014-02-12T03:24:32Z

Rebased (again) and force-pushed in order to clean up the cluttered diff.

reiddraper · 2014-02-13T18:17:57Z

Should this be closed in favor of #807?

shino · 2014-02-14T00:38:41Z

Thank you for the review! The bug was fixed by #807, so I am closing this PR without merging.

shino mentioned this pull request Feb 6, 2014

Bugfix/non ascii in xml #794

Closed

shino added the Bug label Feb 6, 2014

shino added this to the 1.4.5 milestone Feb 6, 2014

shino added 5 commits February 12, 2014 12:23

Change riak_cs_xml:export_xml to unicode-aware riak_cs_xml:to_xml

f078556

Fix bug of accept content type

e3cd379

Add test case of listing users which has non-ASCII UTF-8 characters

d09ecad

Add test case of access stats with xml format

8029153

shino mentioned this pull request Feb 12, 2014

Bugfix/non ascii in xml (reorganized) #807

Merged

shino closed this Feb 14, 2014

shino deleted the bugfix/non-ascii-in-xml-rebased branch February 14, 2014 00:38
