UTF8 string input not supported #67

szatmary · 2016-07-17T16:43:06Z

json.org specifies that "A string is a sequence of zero or more Unicode characters", which I have always taken to mean UTF-8. parse_string() does not seem to handle multi-byte UTF-8. I would be happy to do the work and provide a pull request if it's agreed that UTF-8 support is desirable.

j4cbo · 2016-07-17T17:22:04Z

Is there a particular bug you're seeing? json11 is intended to handle UTF8 properly throughout, which is usually transparent, but not in a few cases:

This is not specified by the JSON standard, but some browsers require \u2028 and \u2029 to be escaped:

json11/json11.cpp

Lines 90 to 97 in 8452587

    } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
               && static_cast<uint8_t>(value[i+2]) == 0xa8) {
        out += "\\u2028";
        i += 2;
    } else if (static_cast<uint8_t>(ch) == 0xe2 && static_cast<uint8_t>(value[i+1]) == 0x80
               && static_cast<uint8_t>(value[i+2]) == 0xa9) {
        out += "\\u2029";
        i += 2;
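To see that branch in isolation, here is a minimal standalone sketch of the same idea. The function name `escape_line_separators` is hypothetical and is not part of json11; it only escapes U+2028 and U+2029 and passes every other byte through unchanged, which is why ordinary multi-byte UTF-8 survives serialization untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not json11's API): copies `in`, replacing the UTF-8
// byte sequences for U+2028 (E2 80 A8) and U+2029 (E2 80 A9) with \u escapes.
std::string escape_line_separators(const std::string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); i++) {
        // Only look ahead when two more bytes are actually available.
        const bool has_three = i + 2 < in.size();
        if (has_three
                && static_cast<std::uint8_t>(in[i]) == 0xe2
                && static_cast<std::uint8_t>(in[i + 1]) == 0x80
                && (static_cast<std::uint8_t>(in[i + 2]) == 0xa8
                    || static_cast<std::uint8_t>(in[i + 2]) == 0xa9)) {
            out += (static_cast<std::uint8_t>(in[i + 2]) == 0xa8) ? "\\u2028" : "\\u2029";
            i += 2;  // skip the two continuation bytes just consumed
        } else {
            out += in[i];  // any other byte, including other UTF-8, is copied as-is
        }
    }
    // Escaping only ever grows the string (3 bytes become 6).
    assert(out.size() >= in.size());
    return out;
}
```

Unlike the excerpt, this sketch checks the remaining length explicitly before reading the two following bytes.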

JSON only provides \uXXXX escapes, which can't encode characters outside the BMP. The recommendation is to use surrogate pairs (UTF-16), so we need to pay special attention during decode in order to produce valid UTF-8 instead of CESU-8:

json11/json11.cpp

Lines 520 to 534 in 8452587

    // JSON specifies that characters outside the BMP shall be encoded as a pair
    // of 4-hex-digit \u escapes encoding their surrogate pair components. Check
    // whether we're in the middle of such a beast: the previous codepoint was an
    // escaped lead (high) surrogate, and this is a trail (low) surrogate.
    if (in_range(last_escaped_codepoint, 0xD800, 0xDBFF)
            && in_range(codepoint, 0xDC00, 0xDFFF)) {
        // Reassemble the two surrogate pairs into one astral-plane character, per
        // the UTF-16 algorithm.
        encode_utf8((((last_escaped_codepoint - 0xD800) << 10)
                     | (codepoint - 0xDC00)) + 0x10000, out);
        last_escaped_codepoint = -1;
    } else {
        encode_utf8(last_escaped_codepoint, out);
        last_escaped_codepoint = codepoint;
    }
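As a worked illustration of the reassembly step, the sketch below combines a lead and trail surrogate into one code point and encodes it as UTF-8. The names `combine_surrogates` and `append_utf8` are hypothetical stand-ins, not json11's own functions, and the encoder is a plain textbook version written for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point `cp` to `out`.
void append_utf8(long cp, std::string &out) {
    assert(cp >= 0 && cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: applies the UTF-16 formula from the excerpt above to
// turn a lead/trail surrogate pair into a single astral-plane code point.
long combine_surrogates(long lead, long trail) {
    assert(lead >= 0xD800 && lead <= 0xDBFF);
    assert(trail >= 0xDC00 && trail <= 0xDFFF);
    return (((lead - 0xD800) << 10) | (trail - 0xDC00)) + 0x10000;
}
```

For the escaped input `\uD83D\uDE00`, `combine_surrogates(0xD83D, 0xDE00)` gives 0x1F600, and `append_utf8` then emits the four bytes F0 9F 98 80. Encoding each surrogate separately would instead produce the six-byte CESU-8 form ED A0 BD ED B8 80, which is what the check in the excerpt avoids.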

szatmary · 2016-07-17T19:47:46Z

Yeah, I may have jumped the gun there and imprudently blamed json11. I'm still investigating, but I will close this in the meantime.

szatmary closed this as completed Jul 17, 2016
