Optimize encode by reusing original JSON slices for unchanged materialized data #128

@membphis

Background

qjson.encode(qjson.decode(src)) already has a fast path for unmodified lazy proxies: it can return the original JSON byte span from the retained input buffer. However, once callers run qjson.materialize(qjson.decode(src)), the result is a plain Lua table. The plain table no longer carries the original Doc, byte spans, dirty state, or value provenance, so qjson.encode() must walk and re-encode the whole table.

This leaves a major performance opportunity for workloads that materialize JSON, modify only a small part, and then encode it again. The common case is a large unchanged string/object/array subtree plus a few small edits.

Recent local benchmark on an Apple M4, using a 600,000-byte JSON payload dominated by one large ASCII string:

qjson.encode(materialized table)      ~896 ops/s
qjson lazy unchanged reference      ~24763 ops/s

The lazy fast path does not produce a directly comparable API result, but the roughly 27x gap shows how much work is avoidable when the output is mostly unchanged.

Goal

Add a provenance-aware materialized encode path that can reuse original JSON slices for unchanged data, especially large unchanged subtrees or strings, while preserving the chosen qjson.encode semantics for opt-in materialized values.

The optimization should target materialized or semi-materialized workflows, not only the existing lazy proxy encode path.

Proposed API contract

This should be opt-in. Default qjson.materialize(value) behavior must remain simple and plain-table compatible.

Recommended MVP API:

local t = qjson.materialize(qjson.decode(src), { keep_origin = true })
local out = qjson.encode(t)

Possible explicit alternative if the implementation cannot keep qjson.encode() dispatch simple enough:

local t = qjson.materialize_with_origin(qjson.decode(src))
local out = qjson.encode_reuse(t)

Before implementation, choose one public API shape and make tests target only that shape. The rest of this issue assumes the recommended keep_origin form.

For keep_origin = true:

  • qjson.materialize(value) with no options must keep existing behavior.
  • qjson.materialize(value, { keep_origin = true }) may attach provenance metadata to returned containers.
  • Values without provenance, or values whose provenance can no longer be proven safe, must fall back to the existing encoder.
  • If the returned value differs from a normal Lua table in any observable way, the API contract must explicitly document the differences. Examples: metatable identity, pairs, ipairs, #, rawget, rawset, setmetatable, and whether overwriting existing keys is tracked by proxy logic or by encode-time validation.
  • No raw pointers, byte ranges, or provenance internals should be exposed through the public Lua API.
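A minimal sketch of the opt-in contract above, assuming the keep_origin form. Everything here is hypothetical: plain_materialize stands in for the real materializer, and origin_of is an illustrative weak-keyed sidecar, not a qjson internal. It only shows argument validation and that the default path attaches nothing.

```lua
-- Hypothetical sketch: none of these names are qjson internals.
-- Sidecar store keyed weakly by container, so metadata never pins a table.
local origin_of = setmetatable({}, { __mode = "k" })

-- Stand-in for the real materializer; here it just deep-copies plain tables.
local function plain_materialize(value)
  if type(value) ~= "table" then return value end
  local out = {}
  for k, v in pairs(value) do out[k] = plain_materialize(v) end
  return out
end

local function materialize(value, opts)
  if opts ~= nil and type(opts) ~= "table" then
    error("materialize: opts must be a table or nil", 2)
  end
  local keep = opts and opts.keep_origin
  if keep ~= nil and type(keep) ~= "boolean" then
    error("materialize: keep_origin must be a boolean", 2)
  end
  local out = plain_materialize(value)
  if keep and type(out) == "table" then
    -- Real provenance (Doc reference, byte spans) would be recorded here.
    origin_of[out] = { tracked = true }
  end
  return out
end
```

In this sketch the returned value has no metatable in either mode, so the provenance stays invisible to pairs, #, and rawget. Whether the real implementation can keep that property is one of the open decisions.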

Semantic contract

The opt-in provenance path may preserve original lexical JSON bytes for unchanged values. That means output is not necessarily byte-for-byte identical to encoding an equivalent hand-built Lua table.

The correctness contract should be:

  • The encoded output must be valid JSON.
  • Decoding the encoded output should produce the same JSON/Lua value as the current materialized table, subject to existing qjson/lua-cjson compatibility rules.
  • Existing encode error behavior must be preserved for unsupported value types, sparse array policy, max depth, and cycles.
  • Raw slice reuse must not reintroduce semantics that materialization intentionally lost. In particular, materialized object duplicate-key collapse must be respected.
  • Existing lazy proxy fast-path behavior should not change except for shared internal helpers.

Likely metadata model

The implementation will likely need sidecar metadata, probably weak tables, keyed by materialized containers.

Metadata should be able to answer:

  • Which retained Doc/input buffer keeps the original bytes alive.
  • Original byte start/end for each reusable value.
  • Original JSON type for each value.
  • Container kind: object or array.
  • Child key/index provenance for original children.
  • Whether an object/subtree had duplicate keys.
  • Whether a container or descendant is known changed.
  • Enough original decoded scalar state to decide whether a current Lua value still matches the original value.

Important scalar detail: Lua strings cannot safely carry per-field provenance by themselves. For unchanged string reuse, provenance should be recorded on the parent container's child metadata. Encoding a changed/walked parent can then splice the original string token when the current child value still matches the original decoded string.
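One possible shape for that sidecar, as a rough sketch. All field names (doc, first, last, key_order, has_dup_keys, children) are assumptions made for illustration, and the byte spans are located with string.find only because this example has no parser; a real implementation would take them from the decoder.

```lua
-- Hypothetical metadata layout; field names are illustrative only.
local origin_of = setmetatable({}, { __mode = "k" }) -- weak keys

local src = '{"name":"alice","tags":["x","y"]}'
local doc = { buf = src } -- stand-in for the retained Doc / input buffer

-- Locate a token in src. Spans are 1-based and inclusive, like string.sub.
local function span(token)
  local first, last = src:find(token, 1, true) -- plain find, no patterns
  return first, last
end

-- Build one provenance record. Scalars also keep their decoded value so the
-- encoder can later check whether the current Lua value still matches.
local function record(type_name, token, value)
  local first, last = span(token)
  return { doc = doc, type = type_name, first = first, last = last, value = value }
end

-- Materialized containers, as a caller would see them.
local tags = { "x", "y" }
local root = { name = "alice", tags = tags }

local tags_meta = record("array", '["x","y"]')
tags_meta.has_dup_keys = false
tags_meta.children = { record("string", '"x"', "x"), record("string", '"y"', "y") }
origin_of[tags] = tags_meta

-- String provenance lives on the parent's child records, not on the string.
local root_meta = record("object", src)
root_meta.has_dup_keys = false
root_meta.key_order = { "name", "tags" }
root_meta.children = { name = record("string", '"alice"', "alice"), tags = tags_meta }
origin_of[root] = root_meta

-- Recover the original bytes for any record.
local function raw_slice(meta)
  return meta.doc.buf:sub(meta.first, meta.last)
end
```

Because each record holds a strong reference to doc while origin_of holds its keys weakly, the input buffer stays alive exactly as long as some provenance-aware container does.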

Slice reuse safety policy

Raw JSON slices may be reused only when reuse is semantically safe. A conservative MVP policy:

  • A reused value must keep the original input buffer alive.
  • A reused container must have no added, deleted, replaced, reordered, or changed children.
  • A changed parent container may still walk its children and splice safe unchanged child strings or child subtrees.
  • A raw object or array subtree must not be reused if doing so would output duplicate object keys that were collapsed by materialization. This likely requires a duplicate-key/subtree-safe flag.
  • A raw slice must not bypass cycle detection. If a provenance table is aliased after materialization, qjson.encode must still detect recursive cycles and respect the existing max-depth guard before deciding to splice.
  • Plain Lua tables do not trigger __newindex when overwriting existing keys. If the API returns plain tables, dirty detection needs snapshot comparison or encode-time validation. If the API returns proxies, proxy semantics must be documented and tested.
  • Numeric scalar policy must be explicit before implementation:
    • Conservative option: when a container is encoded by walking, emit numbers through the existing encoder and do not splice standalone number tokens.
    • If whole-subtree reuse preserves numeric spellings such as 1.0, 1e3, or -0, tests must define that this is acceptable for unchanged provenance subtrees.
    • If byte-for-byte compatibility with normal materialized-table encode is required, containers containing non-canonical numeric spellings must not be raw-spliced.
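The encode-time validation option in the policy above could look roughly like this. It is a deliberately narrow sketch, not qjson's encoder: it handles only flat objects with string values, the metadata fields are hypothetical, and new keys are sorted purely to keep the example deterministic.

```lua
-- Hypothetical sketch. Fallback string encoder for the changed values.
local function encode_string(s)
  local body = s:gsub('[%c"\\]', function(c)
    if c == '"' then return '\\"' end
    if c == "\\" then return "\\\\" end
    return string.format("\\u%04x", c:byte())
  end)
  return '"' .. body .. '"'
end

-- meta = { buf, first, last, has_dup_keys, key_order,
--          children = { [key] = { first, last, value } } }
local function encode_object(t, meta)
  -- A plain table cannot report overwrites, so compare every current child
  -- with the decoded value captured at materialize time.
  local unchanged = not meta.has_dup_keys
  local count, added = 0, {}
  for k, v in pairs(t) do
    count = count + 1
    local child = meta.children[k]
    if child == nil then
      added[#added + 1] = k
      unchanged = false
    elseif child.value ~= v then
      unchanged = false
    end
  end
  if count ~= #meta.key_order then unchanged = false end -- a key was deleted
  if unchanged then
    return meta.buf:sub(meta.first, meta.last) -- reuse the whole container
  end

  table.sort(added) -- deterministic order for this sketch only
  local parts = {}
  local function emit(k, token)
    parts[#parts + 1] = encode_string(k) .. ":" .. token
  end
  for _, k in ipairs(meta.key_order) do
    local v, child = t[k], meta.children[k]
    if v ~= nil then
      if child.value == v then
        emit(k, meta.buf:sub(child.first, child.last)) -- splice original token
      else
        emit(k, encode_string(v)) -- changed child: re-encode
      end
    end
  end
  for _, k in ipairs(added) do emit(k, encode_string(t[k])) end
  return "{" .. table.concat(parts, ",") .. "}"
end

-- Usage: the source spells "abc" with an escape, which slice reuse preserves.
local src = '{"blob":"a\\u0062c","id":"1"}'
local function child(token, value)
  local first, last = src:find(token, 1, true)
  return { first = first, last = last, value = value }
end
local meta = {
  buf = src, first = 1, last = #src, has_dup_keys = false,
  key_order = { "blob", "id" },
  children = { blob = child('"a\\u0062c"', "abc"), id = child('"1"', "1") },
}
local t = { blob = "abc", id = "1" }
```

The sketch still compares each string against its original decoded value, so the saving is in skipping the escape scan and output copy rather than in avoiding all per-child work. How cheap that comparison is depends on the Lua implementation.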

Object ordering policy

The implementation must choose and test a deterministic policy for provenance-aware materialized containers.

Recommended policy for the MVP:

  • Original object keys keep source order after duplicate-key collapse.
  • Deleted keys are omitted.
  • Replaced original keys keep their original position.
  • New keys are emitted after retained original keys. If the implementation cannot track insertion order for plain tables, new-key ordering may follow the existing plain-table pairs behavior, but tests should avoid relying on a stronger guarantee.
  • Arrays keep numeric index order and existing sparse-array behavior.
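The recommended ordering can be expressed as a small pure function, sketched here under the assumption that the metadata keeps the collapsed source key order as a list. emit_order is a hypothetical name. New keys are sorted only to make the sketch deterministic, which is a stronger guarantee than the policy above asks for.

```lua
-- Hypothetical sketch of the MVP ordering policy for objects.
-- current:        the materialized table at encode time
-- original_order: source key order after duplicate-key collapse
local function emit_order(current, original_order)
  local order, is_original = {}, {}
  -- Retained original keys keep their source position; deleted keys drop out.
  for _, k in ipairs(original_order) do
    is_original[k] = true
    if current[k] ~= nil then order[#order + 1] = k end
  end
  -- New keys follow the retained original keys.
  local added = {}
  for k in pairs(current) do
    if not is_original[k] then added[#added + 1] = k end
  end
  table.sort(added, function(a, b) return tostring(a) < tostring(b) end)
  for _, k in ipairs(added) do order[#order + 1] = k end
  return order
end
```

One consequence worth testing: with plain tables and encode-time validation, a key that is deleted and then re-added cannot be told apart from a replaced key, so in this sketch it returns to its original position.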

Requirements

  • Preserve existing qjson.materialize() behavior unless the opt-in API is used.
  • Preserve existing qjson.encode() behavior for values without provenance.
  • Reuse original JSON slices only when the safety policy says it is safe.
  • Keep the original input buffer alive as long as provenance-aware materialized tables can reference it.
  • Support large unchanged string values without rescanning and escaping them during encode.
  • Support large unchanged object/array subtrees when the subtree is known unchanged and safe to splice.
  • Fall back to normal encode for changed values, values without provenance, or provenance that cannot be validated.
  • Avoid adding hidden overhead to default materialization unless it is measured and negligible.

Acceptance criteria

Functional tests should cover:

  • qjson.materialize(qjson.decode(src)) remains unchanged and returns the same kind of plain values as today.
  • The chosen opt-in API validates its arguments and does not affect callers that do not opt in.
  • Fully unchanged provenance-aware materialized data can encode successfully and can reuse safe original slices.
  • An unchanged string value of roughly 600,000 bytes can be encoded from provenance without rescanning/escaping the string.
  • A small top-level field modification with a large unchanged string/object/array sibling reuses the safe unchanged sibling.
  • A nested modification with a large unchanged sibling subtree reuses the safe sibling and encodes the changed branch normally.
  • A structure-dense payload still produces correct JSON and does not regress badly versus the normal materialized encoder.
  • Duplicate object keys are not reintroduced after materialization. Example: materializing {"a":1,"a":2} with origin and then encoding must respect the last-wins materialized semantics, so the output contains a single "a" key regardless of which slices are reused.
  • Numeric lexical forms are covered by tests matching the chosen numeric policy: 1.0, 1e3, and -0.
  • Object key ordering follows the chosen ordering policy for replace, delete, delete-and-readd, and add-new-key cases.
  • Cycles and post-materialization aliasing still produce existing bounded encode errors or valid repeated output as appropriate.
  • Provenance metadata keeps the original input buffer alive for as long as reused slices may be emitted.
  • Values whose provenance is missing, ambiguous, or invalidated fall back to the existing encoder.
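For the cycle and aliasing criterion above, the distinction between a shared subtree (valid repeated output) and a true cycle (error) can be checked with an on-path set, as in this sketch. check_acyclic and its error strings are hypothetical and are not the existing encoder's messages.

```lua
-- Hypothetical sketch: walk containers only, before any slice is spliced.
-- Returns true, or nil plus a reason.
local function check_acyclic(value, max_depth, depth, on_path)
  depth, on_path = depth or 1, on_path or {}
  if type(value) ~= "table" then return true end
  if depth > max_depth then return nil, "exceeded max depth" end
  if on_path[value] then return nil, "cycle detected" end
  on_path[value] = true -- value is an ancestor of everything visited below
  for _, child in pairs(value) do
    local ok, err = check_acyclic(child, max_depth, depth + 1, on_path)
    if not ok then
      on_path[value] = nil
      return nil, err
    end
  end
  on_path[value] = nil -- leaving this subtree; aliases elsewhere are fine
  return true
end
```

A walk like this visits containers but never scans string contents, so it should stay cheap relative to escaping a large string. Whether the real guard runs as a separate pass or inside the encoder walk is an implementation choice.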

Benchmark coverage should include:

  • unchanged materialized 600k string payload
  • one small top-level field modification with large unchanged blob
  • one nested modification with large unchanged sibling subtree
  • structure-dense payload where slice reuse may provide less benefit
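A small harness along these lines could produce the ops/s figures for each case. It is a sketch only: ops_per_sec and big_string_payload are hypothetical helpers, os.clock measures CPU time rather than wall time, and the qjson call is left as a comment because the opt-in API is not final.

```lua
-- Hypothetical benchmark helper: run fn in small batches until at least
-- min_seconds of CPU time has elapsed, then report calls per second.
local function ops_per_sec(fn, min_seconds)
  local clock = os.clock
  local iterations, elapsed = 0, 0
  local start = clock()
  repeat
    for _ = 1, 16 do fn() end
    iterations = iterations + 16
    elapsed = clock() - start
  until elapsed >= min_seconds
  return iterations / elapsed
end

-- Payload dominated by one large ASCII string, similar in spirit to the
-- 600,000-byte case described in this issue.
local function big_string_payload(n)
  return { id = 1, blob = string.rep("a", n) }
end

-- Example use once an encoder is available:
--   local t = big_string_payload(600000)
--   print(ops_per_sec(function() qjson.encode(t) end, 1.0))
```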

Expected impact

For payloads dominated by unchanged large strings/subtrees, encode-only throughput may improve from the current materialized-table path by several multiples. On the 600k single-large-string benchmark, a realistic target is 5x-15x for small edits and potentially higher for fully unchanged opt-in provenance tables. Structure-dense payloads should be measured separately and may see smaller gains.

Non-goals

  • Do not change the existing lazy proxy fast path unless needed for shared internals.
  • Do not make default plain Lua tables retain hidden provenance unless the overhead is proven negligible and semantics are clear.
  • Do not expose raw pointers or byte ranges through the public Lua API.
  • Do not require callers to manage source-buffer lifetime manually.

Open decisions before implementation

These should be settled before the first implementation PR:

  • Final API shape: recommended qjson.materialize(value, { keep_origin = true }) versus a separate materialize_with_origin/encode_reuse pair.
  • Whether keep_origin must return observably plain Lua tables, or whether a documented proxy/metatable-backed table is acceptable.
  • Numeric lexical policy for unchanged raw subtree reuse, especially 1.0, 1e3, and -0.
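To make the numeric decision concrete, one way to tell whether a number token would survive re-encoding byte-for-byte is to compare it with the fallback encoder's own formatting. This sketch assumes a "%.14g" format, borrowed from lua-cjson's default precision of 14 significant digits. Both the assumption and the helper name is_canonical_number_token are illustrative.

```lua
-- Hypothetical check: would re-encoding this token reproduce the same bytes?
-- Assumes the fallback encoder formats numbers with "%.14g".
local function is_canonical_number_token(token)
  local n = tonumber(token)
  if n == nil then return false end
  return string.format("%.14g", n) == token
end
```

Under that assumption 42 and 1.5 are canonical while 1.0 and 1e3 are not, so a subtree containing the latter could only be raw-spliced if lexical preservation is declared acceptable. The result for -0 depends on how the Lua version parses it, which is a reason to pin it down in tests.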

Labels: enhancement (New feature or request)