Bugfix/non ascii in xml #794
Some code represents characters as lists generated from user-input binaries (e.g. key names) by binary_to_list/1. That works for ASCII (or Latin-1) but not for Unicode in general. This fix handles characters (binaries or lists of integers) as follows:
- For binaries, we assume that they are UTF-8 encoded.
- For lists, we assume they were converted from UTF-8 encoded binaries by binary_to_list/1. This is the tricky point: binary_to_list/1 treats a binary as if it were Latin-1 encoded, so we should turn such lists back into binaries with list_to_binary first. Then we can apply unicode:characters_to_binary safely.
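A minimal sketch of the two cases described above. The module name utf8_sketch and the helper to_utf8_binary/1 are hypothetical and are not taken from this PR; only the standard unicode module and BIFs are assumed. Code points are written as integers so the example does not depend on the source file encoding.

```erlang
-module(utf8_sketch).
-export([to_utf8_binary/1, demo/0]).

%% Hypothetical helper, not the PR's actual code.
%% Binaries are assumed to be UTF-8 encoded already; the call
%% returns the binary unchanged when it is valid UTF-8.
to_utf8_binary(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, utf8, utf8);
%% Lists are assumed to be byte lists produced by binary_to_list/1,
%% so restore the original bytes first.
to_utf8_binary(Bytes) when is_list(Bytes) ->
    to_utf8_binary(list_to_binary(Bytes)).

demo() ->
    %% Five hiragana code points (a-i-u-e-o).
    Utf8 = unicode:characters_to_binary([12354,12356,12358,12360,12362]),
    Bytes = binary_to_list(Utf8),
    %% Both representations normalize to the same UTF-8 binary.
    Utf8 = to_utf8_binary(Utf8),
    Utf8 = to_utf8_binary(Bytes),
    ok.
```

Calling utf8_sketch:demo() should return ok if both pattern matches succeed.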
memo: This PR conflicts with #791 (almost trivial to merge).
All tests passing. Reviewing the code now.
Double-checking my Erlang and Unicode: is this true? For all UTF-8 encoded binaries B, B =:= list_to_binary(binary_to_list(B)).
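One way to check the round-trip question is a small sketch like the following. The module and function names are hypothetical. It relies only on the fact that binary_to_list/1 and list_to_binary/1 map bytes to integers in 0..255 and back, without interpreting any encoding.

```erlang
-module(roundtrip_sketch).
-export([roundtrips/1, demo/0]).

%% binary_to_list/1 turns each byte into an integer 0..255 and
%% list_to_binary/1 turns each integer back into a byte.
roundtrips(Bin) when is_binary(Bin) ->
    Bin =:= list_to_binary(binary_to_list(Bin)).

demo() ->
    %% Five hiragana code points, three UTF-8 bytes each.
    Utf8 = unicode:characters_to_binary([12354,12356,12358,12360,12362]),
    15 = byte_size(Utf8),
    %% The list from binary_to_list/1 has one element per byte ...
    15 = length(binary_to_list(Utf8)),
    true = roundtrips(Utf8),
    %% ... while the Unicode-aware conversion yields one per code point.
    5 = length(unicode:characters_to_list(Utf8, utf8)),
    ok.
```

The round trip is byte-wise, so it holds for any binary, not only UTF-8 encoded ones.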
        end, ListOfNodeStats),
    rtcs:json_get(<<"Samples">>, NodeStats);
node_samples_from_content(xml, Node, Content) ->
    {Usage, _Rest} = xmerl_scan:string(unicode:characters_to_list(Content, utf8)),
In a lot of the other code, we simply binary_to_list/1 UTF-8 encoded binaries into lists. Does the xmerl API demand we give it a list of code points instead?
I can't think of 'a better way', but one thing that is confusing (and already exists in the codebase) is the mixed use of binaries (presumably UTF-8 encoded), lists (constructed from
    %% These five numbers represent "あいうえお" (A-I-U-E-O in Japanese).
    %% unicode:characters_to_binary([12354,12356,12358,12360,12362]).
    Chars = [12354,12356,12358,12360,12362],
    binary_to_list(unicode:characters_to_binary(Chars)).
Personal note: this is equal to <<"あいうえお"/utf8>>.
Minus the binary_to_list part.
I must use it! Will fix.
Well, as your comment mentions, your code doesn't depend on UTF-8 encoding in the actual file, which might be desirable. That being said, I assume most (all?) of us are using text-editors that use UTF-8.
Yeah, editor's default encoding is UTF-8 :)
Another file-encoding issue concerns the compiler. The Erlang compiler assumes source files are Latin-1 encoded (up to R15; R16 does too by default, but that can be overridden with a magic comment in the file). "あ" is three bytes in a UTF-8 encoded file, but the compiler reads it as Latin-1 and converts it to three Latin-1 characters. It then applies the /utf8 conversion, which results in a binary of six bytes (oops).
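The six-byte result can be reproduced at run time without relying on any particular compiler version. The sketch below simulates the misreading by treating each UTF-8 byte as a separate Latin-1 character; the module name is hypothetical.

```erlang
-module(latin1_misread_sketch).
-export([demo/0]).

demo() ->
    %% U+3042 (hiragana "a") is three bytes in UTF-8.
    Utf8 = unicode:characters_to_binary([16#3042]),
    <<227,129,130>> = Utf8,
    %% Reading the file as Latin-1 yields three characters, one per byte.
    MisreadChars = binary_to_list(Utf8),
    3 = length(MisreadChars),
    %% Encoding those three characters as UTF-8 (the effect of /utf8)
    %% takes two bytes each, because all three are above 127.
    Doubled = unicode:characters_to_binary(MisreadChars, unicode, utf8),
    6 = byte_size(Doubled),
    Doubled.
```

Building the test data from integer code points, as the test code in this PR does, sidesteps the problem because no non-ASCII character appears in the source file.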
True (or I believe it's true).
xmerl's documentation says it accepts a binary as an element of a deep list (a bare binary is not accepted).
One trial branch: release/1.4...bugfix/non-ascii-in-xml-2. No Unicode code points appear when generating XML. Anyway, XML generation is now encapsulated in
(reply to myself) Avoiding
memo: #807 is finally merged to release branch
This PR addresses two issues, #787 and #628.
The plan to fix them, and possibly other XML-with-multibyte (non-ASCII) character issues, is as follows:
- riak_cs_xml:format_value(Val) handles conversion between lists/binaries and Unicode strings (xmerl accepts only Unicode strings, not binaries or Latin-1-ish strings).
  - when is_list(Val), use list_to_binary and call format_value again with the converted binary.
  - when is_binary(Val), use unicode:characters_to_list to produce a Unicode string.
- unicode:characters_to_list around XML output creation.

Some misc notes:
- The result of binary_to_list should not be considered a "string". This conversion is NOT Unicode-aware, so lists and binaries have byte-wise correspondence.
- unicode:characters_to_list is Unicode-aware, so the elements of converted lists are not necessarily in [0..255]. Each element of a converted list corresponds to a Unicode codepoint (not so precise, but it's almost a character).
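The two format_value clauses in the plan can be sketched roughly as follows. This is an illustration under assumptions, not the actual riak_cs_xml implementation: the module name format_value_sketch is hypothetical, and error handling for invalid UTF-8 is left out.

```erlang
-module(format_value_sketch).
-export([format_value/1, demo/0]).

%% Sketch only; the real riak_cs_xml:format_value/1 may differ.
%% Lists are assumed to be byte lists from binary_to_list/1,
%% so convert them back to a binary and retry.
format_value(Val) when is_list(Val) ->
    format_value(list_to_binary(Val));
%% Binaries are assumed to be UTF-8; produce a list of code points.
format_value(Val) when is_binary(Val) ->
    unicode:characters_to_list(Val, utf8).

demo() ->
    CodePoints = [12354,12356,12358,12360,12362],
    Utf8 = unicode:characters_to_binary(CodePoints),
    %% Both input forms produce the same list of code points.
    CodePoints = format_value(Utf8),
    CodePoints = format_value(binary_to_list(Utf8)),
    %% Elements can exceed 255, unlike the output of binary_to_list/1.
    true = lists:any(fun(C) -> C > 255 end, format_value(Utf8)),
    ok.
```

Note that the list clause only makes sense for byte lists. A list that already holds code points above 255 would make list_to_binary fail with badarg, which is one reason the mixed use of list representations discussed in this thread is confusing.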
Codepoints, encodings, string literals, binaries and more.
[1] http://www.erlang-factory.com/conference/ErlangUserConference2013/speakers/PatrikNyblom