This repository has been archived by the owner on Mar 26, 2020. It is now read-only.

UTF8 string input not supported #67

Closed
szatmary opened this issue Jul 17, 2016 · 2 comments

Comments

@szatmary

json.org specifies that "A string is a sequence of zero or more Unicode characters". I have always taken that to mean UTF-8. parse_string() does not seem to handle multi-byte UTF-8. I would be happy to do the work and provide a pull request if it's agreed that UTF-8 support is desirable.

@j4cbo
Contributor

j4cbo commented Jul 17, 2016

Is there a particular bug you're seeing? json11 is intended to handle UTF-8 properly throughout. This is usually transparent, except in a few cases:

  • This is not specified by the JSON standard, but some browsers require \u2028 and \u2029 to be escaped:

    json11/json11.cpp

    Lines 90 to 97 in 8452587

    } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
               && static_cast<uint8_t>(value[i+2]) == 0xa8) {
        out += "\\u2028";
        i += 2;
    } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
               && static_cast<uint8_t>(value[i+2]) == 0xa9) {
        out += "\\u2029";
        i += 2;
  • JSON only provides \uXXXX escapes, which can't encode characters outside the BMP. The recommendation is to use surrogate pairs (UTF-16), so we need to pay special attention during decode in order to produce valid UTF-8 instead of CESU-8:

    json11/json11.cpp

    Lines 520 to 534 in 8452587

    // JSON specifies that characters outside the BMP shall be encoded as a pair
    // of 4-hex-digit \u escapes encoding their surrogate pair components. Check
    // whether we're in the middle of such a beast: the previous codepoint was an
    // escaped lead (high) surrogate, and this is a trail (low) surrogate.
    if (in_range(last_escaped_codepoint, 0xD800, 0xDBFF)
            && in_range(codepoint, 0xDC00, 0xDFFF)) {
        // Reassemble the two surrogate pairs into one astral-plane character, per
        // the UTF-16 algorithm.
        encode_utf8((((last_escaped_codepoint - 0xD800) << 10)
                     | (codepoint - 0xDC00)) + 0x10000, out);
        last_escaped_codepoint = -1;
    } else {
        encode_utf8(last_escaped_codepoint, out);
        last_escaped_codepoint = codepoint;
    }
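
For illustration, the surrogate reassembly in the excerpt above can be tried in isolation with a small standalone sketch. This is not json11's code: the helper names `append_utf8`, `combine_surrogates`, and `decode_pair_to_utf8` are hypothetical, and the sketch assumes the caller has already checked that the two inputs are a valid lead/trail surrogate pair.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not json11's API): append the UTF-8 encoding of a
// Unicode code point to `out`. Assumes cp <= 0x10FFFF.
static void append_utf8(uint32_t cp, std::string &out) {
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: combine a lead (0xD800..0xDBFF) and trail
// (0xDC00..0xDFFF) surrogate into one code point, per the UTF-16 algorithm.
// The caller is assumed to have validated both ranges.
static uint32_t combine_surrogates(uint32_t lead, uint32_t trail) {
    return (((lead - 0xD800) << 10) | (trail - 0xDC00)) + 0x10000;
}

// Hypothetical helper: decode one escaped surrogate pair to UTF-8.
static std::string decode_pair_to_utf8(uint32_t lead, uint32_t trail) {
    std::string out;
    append_utf8(combine_surrogates(lead, trail), out);
    return out;
}

// Usage example: the escape sequence \uD83D\uDE00 denotes U+1F600, which is
// the four bytes F0 9F 98 80 in UTF-8.
static void check_example() {
    assert(combine_surrogates(0xD83D, 0xDE00) == 0x1F600);
    assert(decode_pair_to_utf8(0xD83D, 0xDE00) == "\xF0\x9F\x98\x80");
}
```

Encoding each surrogate separately with the same three-byte branch would instead yield the six bytes ED A0 BD ED B8 80, which is the CESU-8 form that the combining step is meant to avoid.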

@szatmary
Author

Yeah, I may have jumped the gun there and imprudently blamed json11. Still investigating, but I will close this in the meantime.
