Optimize keep_origin provenance with threshold-based lightweight records #130

@membphis

Background

qjson.materialize(v, { keep_origin = true }) currently records full provenance for every container and child record in the materialized tree. This preserves original bytes aggressively during later qjson.encode, but it also creates many small Lua tables and increases GC pressure on complex objects/arrays.

The main goal is to reduce memory and GC overhead in the materialize path. It is acceptable for small nodes to fall back to normal encoding after a parent mutation.

Proposed design

Change the default keep_origin=true behavior from full-tree provenance to lightweight, threshold-based provenance.

Add an origin completeness concept:

  • origin.complete = true: all children needed to verify byte-for-byte subtree reuse have provenance records, so the container may safely reuse its original JSON slice if it still fully matches.
  • origin.complete = false: only partial provenance is present. The encoder must not reuse the whole container slice, because plain materialized Lua tables have no dirty tracking and mutations inside the table cannot be intercepted.

Thresholds (hardcoded constants)

Thresholds are fixed constants in lua/qjson/table.lua; they are not configurable via materialize opts in this change (a follow-up issue may add opts if needed). Proposed defaults:

  • ORIGIN_STRING_MIN_RAW = 24: record a string child only when its raw token length (including the surrounding quotes) exceeds 24 bytes.
  • ORIGIN_TABLE_MIN_RAW = 64: record a complete=true table child only when its raw subtree byte span exceeds 64 bytes. "Raw subtree byte span" means the byte length of the child's original slice, i.e. origin.be - origin.bs.
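
A minimal sketch of the two threshold checks, to make the strict "exceeds" comparison explicit. The constant names and values come from this issue; the helper names string_exceeds_threshold and table_exceeds_threshold are hypothetical and only illustrate the comparison, not the actual code in lua/qjson/table.lua.

```lua
-- Constants as proposed in this issue.
local ORIGIN_STRING_MIN_RAW = 24
local ORIGIN_TABLE_MIN_RAW = 64

-- Hypothetical helper: raw_len is the raw string token length in bytes,
-- including the surrounding quotes.
local function string_exceeds_threshold(raw_len)
  return raw_len > ORIGIN_STRING_MIN_RAW
end

-- Hypothetical helper: origin.bs / origin.be are the slice bounds named in
-- the issue; the raw subtree byte span is origin.be - origin.bs.
local function table_exceeds_threshold(origin)
  return (origin.be - origin.bs) > ORIGIN_TABLE_MIN_RAW
end
```

Because the comparison is strict, a 24-byte string token and a 64-byte subtree are both left unrecorded.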

Default recording policy

  • Do not record any small scalar child. This includes all numbers, booleans, null, and strings whose raw token does not exceed ORIGIN_STRING_MIN_RAW. (Numbers are never recorded at all — this subsumes the previous "numbers inside changed parents use the normal number encoder" behavior; encode_origin_child already encodes numbers normally.)
  • Record string child provenance only when the raw token length exceeds ORIGIN_STRING_MIN_RAW.
  • Record table child provenance in the parent only when the child origin is complete=true and the child raw subtree byte span exceeds ORIGIN_TABLE_MIN_RAW.
  • Mark a container complete=false when any child needed for full matching is omitted (i.e. any skipped small scalar, any skipped small string, or any child table that is itself complete=false or does not exceed the table threshold).
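
The policy above could be sketched roughly as follows. This is an illustration under assumptions, not the proposed implementation: the child descriptor shape ({ kind, raw_len, origin }) and the function names should_record and plan_container are hypothetical. Only the constants, origin.bs, origin.be and origin.complete are taken from this issue.

```lua
local ORIGIN_STRING_MIN_RAW = 24
local ORIGIN_TABLE_MIN_RAW = 64

-- Decide whether one child gets a provenance record.
-- `child` is a hypothetical descriptor: { kind = ..., raw_len = ..., origin = ... }.
local function should_record(child)
  if child.kind == "string" then
    return child.raw_len > ORIGIN_STRING_MIN_RAW
  elseif child.kind == "table" then
    local o = child.origin
    return o ~= nil and o.complete == true
      and (o.be - o.bs) > ORIGIN_TABLE_MIN_RAW
  end
  return false -- numbers, booleans and null are never recorded
end

-- Returns the indexes of recorded children and the container's complete flag.
-- Any skipped child makes the container complete=false.
local function plan_container(children)
  local recorded, complete = {}, true
  for i, child in ipairs(children) do
    if should_record(child) then
      recorded[#recorded + 1] = i
    else
      complete = false
    end
  end
  return recorded, complete
end

-- Usage: one number (skipped) and one 40-byte string token (recorded).
local recorded, complete = plan_container({
  { kind = "number", raw_len = 1 },
  { kind = "string", raw_len = 40 },
})
```

In this sketch an empty container comes out as complete=true, since no child was omitted. The issue does not say how that case should be handled, so treat it as an open detail.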

Encoding rules

  • Only complete=true origins may call the full-match path and return origin_table_slice(origin).
  • complete=false object origins may still preserve original key order for existing keys and reuse recorded large child tokens.
  • complete=false array origins should fall back to normal array/object classification encoding instead of trying to preserve partial formatting.
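
The three rules above amount to a small dispatch on origin.complete and the container kind. The sketch below only names the path that would be taken; choose_encode_path and the returned path labels are hypothetical, and the real encoder would call its own internals (such as the full-match check and origin_table_slice) at each branch.

```lua
-- Hypothetical dispatcher: returns a label for the encoding path.
-- `origin` is the container's provenance record (or nil if there is none);
-- `is_array` says whether the container is being encoded as an array.
local function choose_encode_path(origin, is_array)
  if origin == nil then
    return "normal"
  end
  if origin.complete then
    -- Only this path may return origin_table_slice(origin), and only
    -- if the container still fully matches its original content.
    return "full_match"
  end
  if is_array then
    -- Partial array origins fall back to normal classification encoding.
    return "normal"
  end
  -- Partial object origins keep the original key order for existing keys
  -- and reuse recorded large child tokens.
  return "ordered_object"
end
```

Note that "full_match" is a permission rather than a guarantee: a complete=true container that no longer matches would still have to be re-encoded.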

Compatibility notes

qjson.materialize(..., { keep_origin = true }) should still return ordinary Lua tables. Do not add dirty-tracking metatables to materialized output.

This relaxes the old behavior in two intentional ways:

  1. After mutating a parent, small escaped strings may encode normally instead of preserving their original escape spelling.
  2. An unmodified container that contains any small scalar (number/boolean/null/short string) is now complete=false, so even when it is encoded without any mutation it is re-emitted field-by-field rather than returned as its original byte slice. Output stays JSON-correct (key order is still preserved for objects), but it is no longer guaranteed byte-identical to the original input. In practice this means whole-slice reuse only fires for containers whose children are all large strings or large complete subtrees. This is the accepted trade-off of the "fewer records" goal.

If strict old behavior is needed later, add an explicit full mode such as:

qjson.materialize(v, { keep_origin = "full" })
-- or
qjson.materialize(v, { keep_origin = true, origin_mode = "full" })

Do not keep full provenance as the default.
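
If such a full mode is added later, option handling could look something like the sketch below. It accepts both spellings suggested above; resolve_origin_mode and the mode labels "none", "threshold" and "full" are hypothetical names used for illustration only.

```lua
-- Hypothetical normalizer for materialize opts.
-- Returns "none", "threshold" (the proposed default) or "full".
local function resolve_origin_mode(opts)
  if opts == nil or not opts.keep_origin then
    return "none"
  end
  if opts.keep_origin == "full" or opts.origin_mode == "full" then
    return "full"
  end
  return "threshold"
end
```

With this shape, keep_origin = true keeps meaning the lightweight threshold mode, and full provenance has to be requested explicitly.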

Test coverage

Add or update Lua tests for:

  • Small escaped strings are not guaranteed to reuse raw tokens after a parent mutation.
  • Large escaped strings above ORIGIN_STRING_MIN_RAW still reuse their raw token after a parent mutation.
  • An unmodified container holding small scalars is re-emitted field-by-field (JSON-equivalent, key order preserved) rather than byte-identical, and does not return its original slice.
  • A container whose children are all large strings / large complete subtrees still returns its original slice when unmodified.
  • Partial origins never hide nested table mutations behind a parent raw slice.
  • Duplicate keys are not reintroduced after materialization.
  • Cycle and max-depth errors still surface correctly with origin-backed tables.
  • Large complete child subtrees can still be reused when the parent is modified.

Acceptance criteria

  • Materializing complex payloads with keep_origin=true allocates significantly fewer provenance records than the current full-record implementation.
  • Encoding remains JSON-correct and does not mask mutations.
  • Existing ordinary materialize behavior without keep_origin is unchanged.
  • The new keep_origin threshold semantics — including the relaxed byte-identical guarantee — are documented (README/docs do not currently describe keep_origin, so this means adding that documentation).

Labels: enhancement
