Track std.base64 non-ASCII string semantics vs official Jsonnet #793

@He-Pin

Motivation

During the stdlib audit, std.base64 and std.base64Decode turned out to handle non-ASCII strings differently in sjsonnet than in official C++ Jsonnet v0.22.0.

We are not changing this in the current stdlib correctness PR because this behavior may be visible to users and jrsonnet currently makes the same UTF-8 choice as sjsonnet. This issue tracks the discrepancy separately.

Evidence

Official jsonnet v0.22.0:

std.base64("é")                         // "6Q=="
std.base64Decode("w6k=")                 // "é"
std.base64Decode("6Q==")                 // "é"
std.base64DecodeBytes(std.base64("é"))   // [233]
std.base64("Ā")                          // runtime error: Can only base64 encode strings / arrays of single bytes.

Current sjsonnet:

std.base64("é")                         // "w6k="
std.base64Decode("w6k=")                 // "é"
std.base64Decode("6Q==")                 // "�"
std.base64DecodeBytes(std.base64("é"))   // [195, 169]

Local jrsonnet check:

jrsonnet 0.5.0-pre98
commit 80cd36abd868507312e2cc2c78cb0f55a684c620

jrsonnet matches sjsonnet's UTF-8-byte behavior here:

std.base64("é")                         // "w6k="
std.base64Decode("w6k=")                 // "é"
std.base64Decode("6Q==")                 // runtime error: bad utf8
std.base64DecodeBytes(std.base64("é"))   // [195, 169]

Root Cause

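Official C++ Jsonnet's stdlib treats each character of a string as a single byte: std.base64 converts characters to bytes with std.codepoint, and std.base64Decode converts bytes back with std.char: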

local bytes =
  if std.isString(input) then
    std.map(std.codepoint, input)
  else
    input;

base64Decode(str)::
  local bytes = std.base64DecodeBytes(str);
  std.join('', std.map(std.char, bytes)),

sjsonnet currently encodes string input as UTF-8 bytes and decodes bytes as UTF-8 strings.
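
The two interpretations can be reproduced outside any Jsonnet implementation. The Python sketch below is illustrative only: the helper names are hypothetical, and it approximates the codepoint-per-byte and UTF-8 behaviors with Python's standard library rather than quoting either implementation.

```python
import base64

E_ACUTE = "\u00e9"  # "é" as a single precomposed codepoint (233)


def encode_codepoint_bytes(s: str) -> str:
    # One byte per codepoint, mirroring std.map(std.codepoint, input).
    # bytes() raises ValueError for any codepoint above 255.
    return base64.b64encode(bytes(ord(c) for c in s)).decode("ascii")


def encode_utf8_bytes(s: str) -> str:
    # UTF-8 interpretation: "é" becomes the two bytes 195, 169.
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


print(encode_codepoint_bytes(E_ACUTE))  # 6Q==
print(encode_utf8_bytes(E_ACUTE))       # w6k=

raw = base64.b64decode("6Q==")          # a single byte, 233
print(list(raw))                        # [233]

# Byte-to-char mapping recovers "é"; a lenient UTF-8 decode yields U+FFFD.
print("".join(chr(b) for b in raw) == E_ACUTE)            # True
print(raw.decode("utf-8", errors="replace") == "\ufffd")  # True
```

A strict UTF-8 decode of the same byte (`raw.decode("utf-8")`) raises an error instead of substituting U+FFFD, which is analogous to the "bad utf8" failure reported for jrsonnet above.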

Proposed Direction

If sjsonnet decides to align strictly with official Jsonnet:

  • Keep byte-array std.base64 and std.base64DecodeBytes behavior unchanged.
  • Change string std.base64 to encode each character as a single byte, rejecting codepoints above 255.
  • Change std.base64Decode to map decoded bytes directly to std.char(byte) semantics instead of UTF-8 decoding.
  • Add directional tests for "é", "w6k=", "6Q==", and a high-codepoint rejection case such as "Ā".

This would intentionally diverge from jrsonnet's current UTF-8 behavior but match official C++ Jsonnet v0.22.0.
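
As a rough reference for the string-related bullets above, the following Python sketch models the strictly aligned behavior. It is not sjsonnet code: the function names are hypothetical, only string input is covered, and the error text is copied from the official v0.22.0 output quoted in the Evidence section.

```python
import base64


def strict_base64(s: str) -> str:
    # Encode each character as one byte and reject codepoints above 255.
    for ch in s:
        if ord(ch) > 255:
            raise ValueError(
                "Can only base64 encode strings / arrays of single bytes."
            )
    return base64.b64encode(s.encode("latin-1")).decode("ascii")


def strict_base64_decode(s: str) -> str:
    # Map every decoded byte to the character with the same codepoint,
    # i.e. std.char(byte) semantics, with no UTF-8 decoding step.
    return "".join(chr(b) for b in base64.b64decode(s))


print(strict_base64("\u00e9"))                     # 6Q==
print(strict_base64_decode("6Q==") == "\u00e9")    # True

try:
    strict_base64("\u0100")  # "Ā" has codepoint 256
except ValueError as err:
    print(err)  # Can only base64 encode strings / arrays of single bytes.
```

The same inputs could serve as the starting point for the tests listed above, with the expected values taken from the official implementation rather than from this sketch.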
