Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json_spirit: use utf8 intenally when parsing \uHHHH #4527

Merged
merged 1 commit into from May 6, 2015

Conversation

tserong
Copy link
Contributor

@tserong tserong commented May 1, 2015

When the python CLI is given non-ASCII characters, it converts them to
\uHHHH escapes in JSON. json_spirit parses these internally into 16 bit
characters, which could only work if json_spirit were built to use
std::wstring, which it isn't; it's using std::string, so the high byte
ends up being zero'd, leaving the low byte which is effectively garbage.

This hack^H^H^H^H change makes json_spirit convert to utf8 internally
instead, which can be stored just fine inside a std::string.

Note that this implementation still assumes \uHHHH escapes are four hex
digits, so it'll only cope with characters in the Basic Multilingual
Plane. Still, that's rather a lot more characters than it could cope
with before ;)

(For characters outside the BMP, Python seems to generate escapes in the
form \uHHHHHHHH, i.e. 8 hex digits, which the current implementation
doesn't expect to see)

Fixes: #7387

Signed-off-by: Tim Serong tserong@suse.com

@tchaikov
Copy link
Contributor

tchaikov commented May 4, 2015

@tserong looks good, but can we have a test case which reproduces the funny test in http://pastebin.com/KzWab33X ?

@tserong
Copy link
Contributor Author

tserong commented May 4, 2015

@tchaikov, sure, will write a test case and see about the cleanup you suggested above.

@tserong
Copy link
Contributor Author

tserong commented May 6, 2015

@tchaikov done. Assuming the above is OK, are you happy with it as three commits, or would you prefer I squash it back to one?

@tchaikov tchaikov self-assigned this May 6, 2015
@tchaikov
Copy link
Contributor

tchaikov commented May 6, 2015

thanks @tserong =)

yeah, i'd prefer we have a single commit for this change, could you do that?

When the python CLI is given non-ASCII characters, it converts them to
\uHHHH escapes in JSON.  json_spirit parses these internally into 16 bit
characters, which could only work if json_spirit were built to use
std::wstring, which it isn't; it's using std::string, so the high byte
ends up being zero'd, leaving the low byte which is effectively garbage.

This hack^H^H^H^H change makes json_spirit convert to utf8 internally
instead, which can be stored just fine inside a std::string.

Note that this implementation still assumes \uHHHH escapes are four hex
digits, so it'll only cope with characters in the Basic Multilingual
Plane.  Still, that's rather a lot more characters than it could cope
with before ;)

(For characters outside the BMP, Python seems to generate escapes in the
form \uHHHHHHHH, i.e. 8 hex digits, which the current implementation
doesn't expect to see)

Fixes: ceph#7387

Signed-off-by: Tim Serong <tserong@suse.com>
@tserong tserong force-pushed the wip-hack-utf8-into-json-parser branch from 9aa7ecb to 8add15b Compare May 6, 2015 10:19
@tserong
Copy link
Contributor Author

tserong commented May 6, 2015

No problem, squashed.

tchaikov added a commit that referenced this pull request May 6, 2015
json_spirit: use utf8 intenally when parsing \uHHHH

Reviewed-by: Kefu Chai <kchai@redhat.com>
@tchaikov tchaikov merged commit b65e93b into ceph:master May 6, 2015
@tserong tserong deleted the wip-hack-utf8-into-json-parser branch May 6, 2015 23:06
@gregsfortytwo
Copy link
Member

Looks like this busted things up a bit? http://tracker.ceph.com/issues/11574

@tchaikov
Copy link
Contributor

tchaikov commented May 8, 2015

@gregsfortytwo ack.

@tserong
Copy link
Contributor Author

tserong commented May 8, 2015

This might fix it:

diff --git a/src/json_spirit/json_spirit_reader_template.h b/src/json_spirit/json_spirit_reader_template.h
index 2eaf743..c50f885 100644
--- a/src/json_spirit/json_spirit_reader_template.h
+++ b/src/json_spirit/json_spirit_reader_template.h
@@ -79,7 +79,7 @@ namespace json_spirit
     template<>
     std::string unicode_str_to_utf8( std::string::const_iterator & begin )
     {
-        typedef typename std::string::value_type Char_type;
+        typedef std::string::value_type Char_type;

         const Char_type c1( *( ++begin ) );
         const Char_type c2( *( ++begin ) );

Although I can't say for sure, as I never had that error ("using 'typename' outside of template") in my test builds -- for me it builds fine with or without typename, actually.

@tchaikov
Copy link
Contributor

tchaikov commented May 8, 2015

@tserong yes. i just pushed the same patch to wip-11574-fix-FTBFS, seems the build on centos 6.5 is happy, see http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-rhel6-5-amd64-basic/#origin/wip-11574-fix-FTBFS .

@tchaikov
Copy link
Contributor

tchaikov commented May 8, 2015

#4614 is posted to address this FTBFS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants