Background
qjson.materialize(v, { keep_origin = true }) currently records full provenance for every container and child record in the materialized tree. This preserves original bytes aggressively during later qjson.encode, but it also creates many small Lua tables and increases GC pressure on complex objects/arrays.
The main goal is to reduce memory and GC overhead in the materialize path. It is acceptable for small nodes to fall back to normal encoding after a parent mutation.
Proposed design
Change the default keep_origin=true behavior from full-tree provenance to lightweight, threshold-based provenance.
Add an origin completeness concept:
- origin.complete = true: all children needed to verify byte-for-byte subtree reuse have provenance records, so the container may safely reuse its original JSON slice if it still fully matches.
- origin.complete = false: only partial provenance is present. The encoder must not reuse the whole container slice, because plain materialized Lua tables have no dirty tracking and mutations inside the table cannot be intercepted.
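A minimal sketch of why the complete flag has to gate slice reuse. The origin shape used here (a records map keyed by child key) and both function names are assumptions for illustration, not the actual qjson internals. The check only compares recorded children, so a real full-match path would also need to detect added or removed keys.

```lua
-- Hypothetical origin shape: records[key] holds the value seen at materialize time.
local function recorded_children_match(tbl, origin)
  for key, recorded in pairs(origin.records) do
    if tbl[key] ~= recorded then
      return false
    end
  end
  return true
end

-- Slice reuse is only safe when every child was recorded (complete) and
-- every recorded child still matches.
local function may_reuse_slice(tbl, origin)
  return origin.complete == true and recorded_children_match(tbl, origin)
end

local big = string.rep("x", 40)
local t = { a = 1, big = big }

-- Partial origin: the small number child "a" was never recorded.
local partial = { complete = false, records = { big = big } }

t.a = 2 -- plain table write; nothing intercepts it

-- The recorded children still match even though t.a changed, so only the
-- complete flag stops the stale slice from being returned.
local stale_match = recorded_children_match(t, partial)
local reuse_allowed = may_reuse_slice(t, partial)
```

In this sketch stale_match is true while reuse_allowed is false, which is the situation the complete=false rule is meant to guard against.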
Thresholds (hardcoded constants)
Thresholds are fixed constants in lua/qjson/table.lua; they are not configurable via materialize opts in this change (a follow-up issue may add opts if needed). Proposed defaults:
- ORIGIN_STRING_MIN_RAW = 24: record a string child only when its raw token length (including the surrounding quotes) exceeds 24 bytes.
- ORIGIN_TABLE_MIN_RAW = 64: record a complete=true table child only when its raw subtree byte span exceeds 64 bytes. "Raw subtree byte span" means the byte length of the child's original slice, i.e. origin.be - origin.bs.
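The two thresholds could be expressed as predicates along these lines. Only the constant names and values come from this proposal; the function names and the origin table shape ({ bs, be, complete }) are assumptions for illustration. Both comparisons are strict, matching the "exceeds" wording above.

```lua
-- Constants as proposed; they would live in lua/qjson/table.lua.
local ORIGIN_STRING_MIN_RAW = 24
local ORIGIN_TABLE_MIN_RAW = 64

-- raw_len: byte length of the raw string token, including both quotes.
-- (Hypothetical helper name.)
local function should_record_string(raw_len)
  return raw_len > ORIGIN_STRING_MIN_RAW
end

-- origin: { bs = start offset, be = end offset, complete = boolean }
-- (Hypothetical helper name; field names follow origin.be - origin.bs above.)
local function should_record_table(origin)
  return origin.complete == true
    and (origin.be - origin.bs) > ORIGIN_TABLE_MIN_RAW
end
```

A token of exactly 24 bytes, or a subtree of exactly 64 bytes, falls on the "do not record" side under this reading.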
Default recording policy
- Do not record any small scalar child. This includes all numbers, booleans, null, and strings whose raw token does not exceed ORIGIN_STRING_MIN_RAW. (Numbers are never recorded at all; this subsumes the previous "numbers inside changed parents use the normal number encoder" behavior; encode_origin_child already encodes numbers normally.)
- Record string child provenance only when the raw token length exceeds ORIGIN_STRING_MIN_RAW.
- Record table child provenance in the parent only when the child origin is complete=true and the child raw subtree byte span exceeds ORIGIN_TABLE_MIN_RAW.
- Mark a container complete=false when any child needed for full matching is omitted (i.e. any skipped small scalar, any skipped small string, or any child table that is itself complete=false or below the table threshold).
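The recording policy above can be sketched as a single pass over a container's children. The child descriptor shape (kind, raw_len, origin), the records field, and both function names are assumptions made for this example and are not taken from the existing code.

```lua
local ORIGIN_STRING_MIN_RAW = 24
local ORIGIN_TABLE_MIN_RAW = 64

-- child: { kind = "number" | "boolean" | "null" | "string" | "table",
--          raw_len = <bytes, strings only>,
--          origin  = { bs, be, complete } (tables only) }
local function child_is_recorded(child)
  if child.kind == "string" then
    return child.raw_len > ORIGIN_STRING_MIN_RAW
  elseif child.kind == "table" then
    local o = child.origin
    return o ~= nil
      and o.complete == true
      and (o.be - o.bs) > ORIGIN_TABLE_MIN_RAW
  end
  -- Numbers, booleans, and null are never recorded.
  return false
end

-- Builds a container origin; any omitted child flips complete to false.
local function build_container_origin(bs, be, children)
  local origin = { bs = bs, be = be, complete = true, records = {} }
  for i, child in ipairs(children) do
    if child_is_recorded(child) then
      origin.records[i] = child
    else
      origin.complete = false
    end
  end
  return origin
end
```

Because a child table is only recorded when its own origin is complete, incompleteness propagates upward: one short string deep in the tree makes every ancestor complete=false in this sketch.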
Encoding rules
- Only complete=true origins may call the full-match path and return origin_table_slice(origin).
- complete=false object origins may still preserve original key order for existing keys and reuse recorded large child tokens.
- complete=false array origins should fall back to normal array/object classification encoding instead of trying to preserve partial formatting.
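The three rules above amount to a small dispatch on the origin. In the sketch below, the is_array field, the function name, and the returned path labels are assumptions for illustration; origin_table_slice is only referenced in a comment because its signature is not shown here.

```lua
-- origin: nil for tables without provenance, otherwise
--         { complete = boolean, is_array = boolean } (hypothetical shape).
local function choose_encode_path(origin)
  if origin == nil then
    return "normal"
  elseif origin.complete then
    -- Full-match path: may end in origin_table_slice(origin) if every
    -- recorded child still matches.
    return "full_match"
  elseif origin.is_array then
    -- Partial array: no attempt to preserve partial formatting.
    return "normal"
  else
    -- Partial object: keep original key order, reuse recorded large tokens.
    return "ordered_object"
  end
end
```

Keeping the decision in one place makes it harder for a complete=false origin to reach the slice-returning path by accident.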
Compatibility notes
qjson.materialize(..., { keep_origin = true }) should still return ordinary Lua tables. Do not add dirty-tracking metatables to materialized output.
This relaxes the old behavior in two intentional ways:
- After mutating a parent, small escaped strings may encode normally instead of preserving their original escape spelling.
- An unmodified container that contains any small scalar (number/boolean/null/short string) is now complete=false, so even when it is encoded without any mutation it is re-emitted field-by-field rather than returned as its original byte slice. Output stays JSON-correct (key order is still preserved for objects), but it is no longer guaranteed byte-identical to the original input. In practice this means whole-slice reuse only fires for containers whose children are all large strings or large complete subtrees. This is the accepted trade-off of the "fewer records" goal.
If strict old behavior is needed later, add an explicit full mode such as:
qjson.materialize(v, { keep_origin = "full" })
-- or
qjson.materialize(v, { keep_origin = true, origin_mode = "full" })
Do not keep full provenance as the default.
Test coverage
Add or update Lua tests for:
- Small escaped strings are not guaranteed to reuse raw tokens after a parent mutation.
- Large escaped strings above ORIGIN_STRING_MIN_RAW still reuse their raw token after a parent mutation.
- An unmodified container holding small scalars is re-emitted field-by-field (JSON-equivalent, key order preserved) rather than byte-identical, and does not return its original slice.
- A container whose children are all large strings / large complete subtrees still returns its original slice when unmodified.
- Partial origins never hide nested table mutations behind a parent raw slice.
- Duplicate keys are not reintroduced after materialization.
- Cycle and max-depth errors still surface correctly with origin-backed tables.
- Large complete child subtrees can still be reused when the parent is modified.
Acceptance criteria
- Materializing complex payloads with keep_origin=true allocates significantly fewer provenance records than the current full-record implementation.
- Encoding remains JSON-correct and does not mask mutations.
- Existing ordinary materialize behavior without keep_origin is unchanged.
- The new keep_origin threshold semantics, including the relaxed byte-identical guarantee, are documented (README/docs do not currently describe keep_origin, so this means adding that documentation).