Motivation
During the stdlib audit, std.base64 and std.base64Decode turned out to handle non-ASCII strings differently in sjsonnet and in official C++ Jsonnet v0.22.0.
We are not changing this in the current stdlib correctness PR because this behavior may be visible to users and jrsonnet currently makes the same UTF-8 choice as sjsonnet. This issue tracks the discrepancy separately.
Evidence
Official jsonnet v0.22.0:
std.base64("é") // "6Q=="
std.base64Decode("w6k=") // "é"
std.base64Decode("6Q==") // "é"
std.base64DecodeBytes(std.base64("é")) // [233]
std.base64("Ā") // runtime error: Can only base64 encode strings / arrays of single bytes.
Current sjsonnet:
std.base64("é") // "w6k="
std.base64Decode("w6k=") // "é"
std.base64Decode("6Q==") // "�"
std.base64DecodeBytes(std.base64("é")) // [195, 169]
Local jrsonnet check:
jrsonnet 0.5.0-pre98
commit 80cd36abd868507312e2cc2c78cb0f55a684c620
jrsonnet matches sjsonnet's UTF-8-byte behavior here:
std.base64("é") // "w6k="
std.base64Decode("w6k=") // "é"
std.base64Decode("6Q==") // runtime error: bad utf8
std.base64DecodeBytes(std.base64("é")) // [195, 169]
Root Cause
Official C++ Jsonnet's stdlib treats each character of a string as a single byte: encoding maps characters through std.codepoint, and decoding maps bytes back through std.char:
local bytes =
  if std.isString(input) then
    std.map(std.codepoint, input)
  else
    input;

base64Decode(str)::
  local bytes = std.base64DecodeBytes(str);
  std.join('', std.map(std.char, bytes)),
sjsonnet currently encodes string input as UTF-8 bytes and decodes bytes as UTF-8 strings.
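The difference can be modeled outside Jsonnet. The following is a minimal Python sketch, not code from any of the three implementations: the helper names are hypothetical, and Latin-1 is used only because it maps codepoints 0..255 to identical byte values, which mirrors the std.codepoint path.

```python
import base64


def encode_codepoints(s: str) -> str:
    # One byte per codepoint, like std.map(std.codepoint, input).
    # Latin-1 maps codepoints 0..255 to identical byte values and raises
    # UnicodeEncodeError for anything higher.
    return base64.b64encode(s.encode("latin-1")).decode("ascii")


def encode_utf8(s: str) -> str:
    # UTF-8 bytes of the string, the choice sjsonnet and jrsonnet make today.
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def decoded_bytes(b64: str) -> list:
    # Counterpart of std.base64DecodeBytes: the raw byte values.
    return list(base64.b64decode(b64))


# "é" is U+00E9 (codepoint 233).
print(encode_codepoints("é"), decoded_bytes(encode_codepoints("é")))  # 6Q== [233]
print(encode_utf8("é"), decoded_bytes(encode_utf8("é")))              # w6k= [195, 169]
```

The two printed lines reproduce the std.base64 and std.base64DecodeBytes results listed under Evidence for official Jsonnet and for sjsonnet respectively.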
Proposed Direction
If sjsonnet decides to align strictly with official Jsonnet:
- Keep byte-array std.base64 and std.base64DecodeBytes behavior unchanged.
- Change string std.base64 to encode each character as a single byte, rejecting codepoints above 255.
- Change std.base64Decode to map decoded bytes directly to std.char(byte) semantics instead of UTF-8 decoding.
- Add directional tests for "é", "w6k=", "6Q==", and a high-codepoint rejection case such as "Ā".
This would intentionally diverge from jrsonnet's current UTF-8 behavior but match official C++ Jsonnet v0.22.0.
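As a rough reference model for the directional tests, the proposed semantics could be sketched in Python as below. This is an illustration under assumptions, not sjsonnet code: base64_strict, base64_decode_strict, and rejects are hypothetical names, and the error text is copied from the official message quoted under Evidence. Under a pure byte-per-codepoint model, "w6k=" decodes to the two characters "Ã©" (bytes 195 and 169), so the expected value for that case is worth confirming against the official binary before it is pinned in a test.

```python
import base64


def base64_strict(s: str) -> str:
    # Hypothetical helper: encode a string one byte per character,
    # rejecting codepoints above 255.
    data = bytearray()
    for ch in s:
        if ord(ch) > 255:
            raise ValueError("Can only base64 encode strings / arrays of single bytes.")
        data.append(ord(ch))
    return base64.b64encode(bytes(data)).decode("ascii")


def base64_decode_strict(b64: str) -> str:
    # Hypothetical helper: std.char(byte) semantics, no UTF-8 decoding.
    return "".join(chr(b) for b in base64.b64decode(b64))


def rejects(s: str) -> bool:
    # True when base64_strict refuses the input.
    try:
        base64_strict(s)
    except ValueError:
        return True
    return False


# Directional cases from the list above.
assert base64_strict("é") == "6Q=="          # U+00E9 -> byte 233
assert base64_decode_strict("6Q==") == "é"   # byte 233 -> U+00E9
assert base64_decode_strict("w6k=") == "Ã©"  # bytes 195, 169 -> U+00C3, U+00A9
assert rejects("Ā")                          # U+0100 does not fit in one byte
```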