json_spirit: use utf8 intenally when parsing \uHHHH #4527

tserong · 2015-05-01T16:28:27Z

When the python CLI is given non-ASCII characters, it converts them to
\uHHHH escapes in JSON. json_spirit parses these internally into 16-bit
characters, which could only work if json_spirit were built to use
std::wstring, which it isn't; it's using std::string, so the high byte
ends up being zeroed, leaving only the low byte, which is effectively
garbage.

This hack^H^H^H^H change makes json_spirit convert to utf8 internally
instead, which can be stored just fine inside a std::string.

Note that this implementation still assumes \uHHHH escapes are four hex
digits, so it'll only cope with characters in the Basic Multilingual
Plane. Still, that's rather a lot more characters than it could cope
with before ;)

(For characters outside the BMP, Python seems to generate escapes in the
form \uHHHHHHHH, i.e. 8 hex digits, which the current implementation
doesn't expect to see)
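
For illustration only, here is a minimal sketch of the BMP-only conversion described above. It is not the code from this patch: the helper names hex_value, bmp_to_utf8 and u_escape_to_utf8 are hypothetical, and the sketch assumes the four hex digits have already been split out of the \uHHHH escape.

```cpp
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: encode a code point in U+0000..U+FFFF as UTF-8.
// Surrogate values (U+D800..U+DFFF) are not rejected here, so escapes
// that rely on surrogate pairs are not handled correctly by this sketch.
std::string bmp_to_utf8(unsigned int cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Decode exactly four hex digits (the HHHH of \uHHHH) into UTF-8 bytes.
// Returns an empty string if the input is not four hex digits.
std::string u_escape_to_utf8(const std::string& hhhh)
{
    if (hhhh.size() != 4) return std::string();
    unsigned int cp = 0;
    for (std::string::size_type i = 0; i < 4; ++i) {
        const int v = hex_value(hhhh[i]);
        if (v < 0) return std::string();
        cp = (cp << 4) | static_cast<unsigned int>(v);
    }
    return bmp_to_utf8(cp);
}
```

With this sketch, u_escape_to_utf8("20ac") yields the three bytes E2 82 AC, all of which fit in a std::string, whereas squeezing 0x20AC into a single char would keep only the low byte.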

Fixes: #7387

Signed-off-by: Tim Serong <tserong@suse.com>

loic-bot · 2015-05-01T18:07:22Z

SUCCESS: http://jenkins.ceph.dachary.org/job/ceph/4905/

tchaikov · 2015-05-04T09:07:24Z

@tserong looks good, but can we have a test case which reproduces the funny test in http://pastebin.com/KzWab33X ?

tserong · 2015-05-04T09:13:16Z

@tchaikov, sure, will write a test case and see about the cleanup you suggested above.

tserong · 2015-05-06T02:11:54Z

@tchaikov done. Assuming the above is OK, are you happy with it as three commits, or would you prefer I squash it back to one?

loic-bot · 2015-05-06T02:38:18Z

SUCCESS: http://jenkins.ceph.dachary.org/job/ceph/4989/

tchaikov · 2015-05-06T08:47:04Z

thanks @tserong =)

yeah, i'd prefer we have a single commit for this change, could you do that?

tserong · 2015-05-06T10:23:00Z

No problem, squashed.

loic-bot · 2015-05-06T13:35:34Z

SUCCESS: http://jenkins.ceph.dachary.org/job/ceph/5004/

json_spirit: use utf8 intenally when parsing \uHHHH Reviewed-by: Kefu Chai <kchai@redhat.com>

gregsfortytwo · 2015-05-08T03:46:50Z

Looks like this busted things up a bit? http://tracker.ceph.com/issues/11574

tchaikov · 2015-05-08T03:48:22Z

@gregsfortytwo ack.

tserong · 2015-05-08T04:38:25Z

This might fix it:

diff --git a/src/json_spirit/json_spirit_reader_template.h b/src/json_spirit/json_spirit_reader_template.h
index 2eaf743..c50f885 100644
--- a/src/json_spirit/json_spirit_reader_template.h
+++ b/src/json_spirit/json_spirit_reader_template.h
@@ -79,7 +79,7 @@ namespace json_spirit
     template<>
     std::string unicode_str_to_utf8( std::string::const_iterator & begin )
     {
-        typedef typename std::string::value_type Char_type;
+        typedef std::string::value_type Char_type;

         const Char_type c1( *( ++begin ) );
         const Char_type c2( *( ++begin ) );

Although I can't say for sure, as I never had that error ("using 'typename' outside of template") in my test builds -- for me it builds fine with or without typename, actually.
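
My best guess at why the compilers disagree, sketched below with made-up names (first_char is hypothetical and not part of json_spirit): inside a real template the nested type depends on a template parameter, so typename is required, but an explicit specialization has no parameters left, so nothing is dependent. As far as I know, C++03 disallows typename outside of templates while C++11 relaxed that rule, which would explain an older compiler rejecting it. Leaving typename out of the specialization should be fine under both.

```cpp
#include <string>

// Hypothetical primary template, standing in for the real one.
template <class String_type>
typename String_type::value_type first_char(const String_type& s)
{
    // String_type::value_type depends on a template parameter,
    // so 'typename' is required here.
    typedef typename String_type::value_type Char_type;
    return s.empty() ? Char_type() : s[0];
}

// Explicit specialization: no template parameters remain.
template <>
char first_char<std::string>(const std::string& s)
{
    // std::string::value_type is not a dependent name, so a plain
    // typedef is enough and is accepted by both C++03 and C++11 compilers.
    typedef std::string::value_type Char_type;
    return s.empty() ? Char_type('?') : s[0];
}
```

The specialization returns '?' for an empty string purely so that it can be told apart from the primary template in a quick test.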

tchaikov · 2015-05-08T04:42:54Z

@tserong yes. i just pushed the same patch to wip-11574-fix-FTBFS, seems the build on centos 6.5 is happy, see http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-rhel6-5-amd64-basic/#origin/wip-11574-fix-FTBFS .

tchaikov · 2015-05-08T04:49:24Z

#4614 is posted to address this FTBFS.

tchaikov added bug-fix common labels May 4, 2015

tchaikov self-assigned this May 6, 2015

tserong force-pushed the wip-hack-utf8-into-json-parser branch from 9aa7ecb to 8add15b Compare May 6, 2015 10:19

tchaikov added a commit that referenced this pull request May 6, 2015

Merge pull request #4527 from SUSE/wip-hack-utf8-into-json-parser

b65e93b

json_spirit: use utf8 intenally when parsing \uHHHH Reviewed-by: Kefu Chai <kchai@redhat.com>

tchaikov merged commit b65e93b into ceph:master May 6, 2015

tserong deleted the wip-hack-utf8-into-json-parser branch May 6, 2015 23:06
