Conceptually, a string is a sequence of code points as defined by Unicode. Unicode defines code points between U+0000 and U+10FFFF.
The problem of extracting these code point strings from a JSON file is divided between two domains:
The input domain (how the strings are encoded)
There are 3 basic “paths” (A, B, and C) to code points in JSON string encoding:
A1) direct code points (unescaped characters with values <= 127)
A2) simply escaped character values (such as \n)
B1) escaped single UTF-16 code units (\uXXXX)
B2) surrogate pairs of escaped UTF-16 code units (\uD8XX\uDCXX)
(Note: the first unit must be in [0xD800, 0xDBFF] and the second in [0xDC00, 0xDFFF])
C1) UTF-8 (code points encoded as one or more bytes, the count depending on value)
It is BeneJSON’s responsibility to verify the validity of these encodings and map each sequence of characters into the correct code point.
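To make the escape paths concrete, here is a minimal sketch of decoding one code point via paths A1, A2, B1, and B2. The names hex4() and decode_one() are hypothetical and are not BeneJSON's API. The sketch assumes NUL-terminated input holding the text between the quotes, and it leaves out path C1 (UTF-8 validation).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse exactly four hex digits into *out.
   Returns 0 on success, -1 on any non-hex character (including NUL). */
static int hex4(const char *s, uint32_t *out)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        char c = s[i];
        uint32_t d;
        if (c >= '0' && c <= '9')      d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = (uint32_t)(c - 'A' + 10);
        else return -1;
        v = (v << 4) | d;
    }
    *out = v;
    return 0;
}

/* Hypothetical decoder: stores one code point in *cp and returns the
   number of input bytes consumed, or 0 if the input is invalid. */
static size_t decode_one(const char *s, uint32_t *cp)
{
    unsigned char c = (unsigned char)s[0];
    uint32_t hi, lo;

    if (c != '\\') {
        /* A1: direct code point. Control characters must be escaped in
           JSON; bytes above 127 belong to path C1, not handled here. */
        if (c < 0x20 || c > 127) return 0;
        *cp = c;
        return 1;
    }
    switch (s[1]) {                      /* A2: simple escapes */
    case '"':  *cp = '"';  return 2;
    case '\\': *cp = '\\'; return 2;
    case '/':  *cp = '/';  return 2;
    case 'b':  *cp = '\b'; return 2;
    case 'f':  *cp = '\f'; return 2;
    case 'n':  *cp = '\n'; return 2;
    case 'r':  *cp = '\r'; return 2;
    case 't':  *cp = '\t'; return 2;
    case 'u':  break;                    /* B1 or B2, handled below */
    default:   return 0;
    }
    if (hex4(s + 2, &hi) != 0) return 0;
    if (hi >= 0xDC00 && hi <= 0xDFFF) return 0;   /* lone low surrogate */
    if (hi < 0xD800 || hi > 0xDBFF) {             /* B1: single unit */
        *cp = hi;
        return 6;
    }
    /* B2: a high surrogate must be followed by an escaped low surrogate. */
    if (s[6] != '\\' || s[7] != 'u') return 0;
    if (hex4(s + 8, &lo) != 0) return 0;
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 12;
}
```

For instance, the escaped pair \uD83D\uDE00 combines to U+1F600, while a low surrogate on its own is rejected.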
The output domain (how the strings will be stored)
Unlike most JSON libraries, BeneJSON does not “eagerly” translate the JSON character input into UTF-8; this keeps the library simple.
BeneJSON users will desire these formats:
Since BeneJSON does not push strings to the user, the user must “pull” strings from BeneJSON. Before pulling a string, the user must first know how long it will be in their desired format.
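As an illustration of the measuring step, the sketch below computes how many bytes a sequence of code points would occupy, assuming the desired format is UTF-8. The names utf8_len() and utf8_total_len() are hypothetical and are not BeneJSON's API; a real measuring function would walk the JSON input rather than an array of already-decoded code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one code point as UTF-8.
   Returns 0 for surrogates and for values above U+10FFFF. */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Hypothetical measuring pass: total UTF-8 bytes for n code points,
   not counting a terminating NUL. Returns (size_t)-1 if any code
   point cannot be encoded. */
static size_t utf8_total_len(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = utf8_len(cps[i]);
        if (len == 0) return (size_t)-1;
        total += len;
    }
    return total;
}
```

A caller that wants a C string would allocate one byte more than the returned total to leave room for the terminator.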
Given the length in their desired format, the user can now prepare a buffer for storing the entire string. They then call the following functions
For those of you who are not familiar with stpcpy():
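stpcpy() is a POSIX function that copies a string like strcpy() does, but returns a pointer to the terminating NUL it wrote in the destination instead of a pointer to the start. That makes it easy to append piece after piece without rescanning the buffer. The sketch below uses a portable stand-in, sketch_stpcpy(), with the same contract; both it and join3() are illustrative names, not part of BeneJSON.

```c
#include <string.h>

/* Stand-in with the same contract as POSIX stpcpy(): copy src,
   including its NUL, into dst and return a pointer to that NUL. */
static char *sketch_stpcpy(char *dst, const char *src)
{
    size_t n = strlen(src);
    memcpy(dst, src, n + 1);
    return dst + n;
}

/* Concatenate three pieces into buf and return the resulting length.
   The caller must make sure buf is large enough for all three. */
static size_t join3(char *buf, const char *a, const char *b, const char *c)
{
    char *end = buf;
    end = sketch_stpcpy(end, a);   /* end now points at the NUL after a */
    end = sketch_stpcpy(end, b);   /* overwrite that NUL, append b */
    end = sketch_stpcpy(end, c);
    return (size_t)(end - buf);
}
```

Because each call returns where the next write should start, the total length falls out of simple pointer subtraction at the end.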