Interpreting JSON encoded strings

codehero edited this page Sep 14, 2010 · 2 revisions

Conceptually, a string is a sequence of code points as defined by Unicode. Unicode defines code points in the range U+0000 through U+10FFFF.
The problem of extracting these code point strings from a JSON file divides into two domains:

The input domain (how the strings are encoded)
There are three basic “paths” to code points in JSON string encoding:

  • A1) Direct code points (unescaped character values <= 127)
  • A2) Simple escape sequences (such as \n)
  • B1) Escaped single UTF-16 code units (\uXXXX)
  • B2) Surrogate pairs of escaped UTF-16 code units (\uD8XX\uDCXX). Note the first unit lies in [0xD800, 0xDBFF] and the second in [0xDC00, 0xDFFF].
  • C1) UTF-8 (code points requiring multiple bytes, depending on value)
It is BeneJSON’s responsibility to verify the validity of these encodings and map each sequence of characters into the correct code point.

The output domain (how the strings will be stored)
Unlike most JSON libraries, BeneJSON does not “eagerly” translate the JSON character input into UTF-8.
BeneJSON users will want strings in one of these formats:

  • In Linux, strings are stored in UTF-8 or as wchar_t (UTF-32) arrays.
  • In Windows, strings are stored in UTF-8 or UTF-16. From what I know, Windows uses surrogate pairs for extended code points.
  • The possible formats are UTF-8, UTF-16, or UTF-32. Note that only UTF-32 is guaranteed to store every code point in a fixed-size unit; UTF-8 and UTF-16 are variable-length encodings.

The interface
Since BeneJSON does not push strings to the user, the user must “pull” strings from BeneJSON. Before pulling a string, the user must first know how long the string will be in the desired format.

  • UTF-8: bnj_strlen8() calculates the length of the string encoded in UTF-8
  • UTF-16: bnj_strlen16() calculates the length of the string encoded in UTF-16
  • UTF-32: bnj_strlen32() calculates the length of the string encoded in UTF-32
    For length calculation, it is immaterial HOW the JSON string was presented (A, B, or C); what matters is which code points the encoding contained. Because the parser counts how many code points fall into each range, the calls above operate in constant time.

Given the length in the desired format, the user can prepare a buffer large enough to hold the entire string, then call one of the following functions:

  • UTF-8: bnj_stpcpy() copies the raw JSON encoded bytes into UTF-8 format
  • UTF-16 and UTF-32: the corresponding copy functions are not yet written

For those of you who are not familiar with stpcpy():

  • The return value of strcpy() points to the first byte of the destination.
  • The return value of stpcpy() points to the NUL terminator byte. This is more sensible: it simultaneously simplifies string concatenation and lets the caller know how many bytes were actually copied. Why strcpy() is still favored over stpcpy() is beyond me…