perf: Lua encode + cache + scanner + validation stack optimizations by membphis · Pull Request #54 · api7/lua-qjson

membphis · 2026-05-23T12:44:45Z

Summary

Performance optimization across the full decode → access → modify → encode pipeline.

Lua layer (Phase 2)

encode_string bulk-copy: Replaces per-byte loop with segment-based copying. Escape-free strings return via ' .. s .. ' with zero table allocations. Escape lookup via table instead of if-elseif chain.
_dirty flag (O(1)): Replaces recursive is_dirty tree walk with a _dirty boolean propagated up the parent chain on __newindex.
is_array pre-scan elimination: Tags materialized tables with __qjson_type marker during materialization, eliminating duplicate key scan in encode.

Rust layer (Phase 3)

SkipCache Rc<[u32]>: Vec::clone on every cache hit replaced with O(1) Rc::clone, eliminating O(n) heap allocation per repeated field access.
AVX2 PSHUFB-LUT: structural_mask_chunk 7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).
Validation stack array: validate_eager_values heap allocation replaced with [CtxKind; 64] fixed-size stack (falls back to Vec for depth > 64).
memchr3: parse_i64 float-detection changed from byte-by-byte .any() to SIMD-accelerated memchr::memchr3.

Benchmarks

Added 3 modified-encode scenarios (modify top / add field / modify nested) across all payload sizes.

Performance (aarch64, small/medium multimodal payloads)

modify top + encode: +35–50%
add field + encode: +30–59%
modify nested + encode: +36–99%

Summary by CodeRabbit

Refactor
- Improved performance and memory via shared buffers/caches, reduced allocations, and a fixed-capacity validation stack.
New Features
- Benchmarks: scenario filtering, fresh-process runs, and new decode→modify→encode workloads covering top/add/nested mutations; make target updated to run scenarios.
Bug Fixes
- More reliable encoding with deterministic object/array handling and correct propagation of lazy mutations; faster numeric parsing checks.
Documentation
- README and benchmarks doc updated with methodology and new results.
Tests
- Added tests for nested mutations, additions, and special-character string round-trips.

- Replace per-byte encode_string with bulk-copy segments and fast escape-free path (zero table allocations for clean strings) - Replace recursive is_dirty with O(1) _dirty flag propagated via parent chain on __newindex - Eliminate is_array pre-scan by tagging materialized tables with __qjson_type marker during materialization - Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1) cache-hit access instead of O(n) vector clone - Replace 7-cmpeq loop in AVX2 structural_mask_chunk with PSHUFB-based nibble LUT classification - Replace heap-allocated validation state stack with fixed-size [CtxKind; 64] array for typical JSON depths - Replace byte-by-byte float-detection in parse_i64 with memchr::memchr3 - Add modified-encode benchmark scenarios (modify top / add field / modify nested) to bench harness - Add correctness tests for modified-encode round-trip

coderabbitai · 2026-05-23T12:44:56Z

Warning

Review limit reached

@membphis, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 50 minutes and 3 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 65fbb50a-b09c-4a71-84f4-1f693a8a8739

📥 Commits

Reviewing files that changed from the base of the PR and between 9c49135 and 5a178d7.

📒 Files selected for processing (3)

README.md
benches/lua_bench.lua
docs/benchmarks.md

📝 Walkthrough

Walkthrough

Refactors lazy JSON mutation tracking to explicit _dirty flags, rewrites string encoding and encoder dispatch, replaces several vectors with shared Rc slices, adds a fixed-stack validator, optimizes number scanning and updates an AVX2 comment, and extends tests and benchmarks for modify→encode flows.

Changes

Integrated Performance Optimization Suite

Layer / File(s)	Summary
Lazy mutation tracking via dirty flags `lua/qjson/table.lua`	Adds `_parent` links and `_dirty` flags on lazy child views; `__newindex` marks ancestors dirty, materialization clears lazy internals and tags realized container kind; root decode initializes `_dirty = false`.
String encoding with ESCAPES lookup and segmented build `lua/qjson/table.lua`	Replaces `encode_string` with ESCAPES-table-based escaper using a bulk/segmented build strategy and `\u00xx` control-byte escapes.
Encoder fast path and plain-table dispatch `lua/qjson/table.lua`	Removes recursive `is_dirty`; `encode_proxy` checks `t._dirty`; adds `encode_plain_table` to deterministically choose array vs object via `empty_array_mt` and `TABLE_TYPE_HINT`; `encode()` delegates plain tables.
SkipCache: shared Rc empty slices `src/skip_cache.rs`	Converts `SkipSlot.child_starts`/`child_ends` to `Rc<[u32]>` and initializes empty slots from a single shared `Rc<[u32]>`.
Cursor walk_children Rc cache-hit fast path `src/cursor.rs`	Cache-hit path now uses `Rc::clone` instead of cloning vectors; cache storage updated to use `into()` conversions for Rc-backed representation.
Validation eager checker fixed-size stack and macros `src/validate/mod.rs`	Replaces heap `Vec<CtxKind>` with fixed-capacity `stack_buf` and macros (`push!`, `pop!`, `last_mut!`, `stack_len!`, `stack_is_empty!`); lazy fallback Vec on overflow; updates state transitions and tests.
Number parsing guards and test lint `src/decode/number.rs`	`parse_i64` uses `memchr::memchr3` to detect `.`/`e`/`E`; `parse_f64` combines leading-byte validation; adds clippy allow in tests.
AVX2 structural_mask_chunk comment update `src/scan/avx2.rs`	Updates the doc comment to reflect the 7-parallel-byte-equality-compare implementation and microarchitectural notes; implementation unchanged.
Lua tests for lazy table encoding `tests/lua/lazy_table_spec.lua`	Adds tests validating encode reflects top-level and nested mutations, new-field insertion, unmodified fast-path encoding, and correct escaping round-trip for special characters.
Bench updates: decode→modify→encode scenarios `benches/lua_bench.lua`, `Makefile`	Adds default multimodal and scenario-specific modify callbacks; wires `github-100k` helpers; scenario filtering; adds per-scenario and interleaved decode→modify→encode benchmark cases with minimum-length validation; `Makefile` iterates `BENCH_SCENARIOS`.
Docs and README benchmark updates `docs/benchmarks.md`, `README.md`	Updates methodology (warmup, fresh-process isolation), adds modify+encode workload descriptions and throughput table, and replaces README benchmark text with modify-then-encode notes.

Sequence Diagram(s)

No sequence diagram generated for this PR.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly Related PRs

api7/lua-qjson#51: Modifies src/validate/mod.rs eager validation logic; both PRs refactor validation/stack handling and related checks.

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
E2e Test Quality Review	⚠️ Warning	Review comment on benches/lua_bench.lua requesting error handling for invalid filter remains unaddressed; missing matched flag to error when filter doesn't match any scenario.	Add matched variable to detect when arg[1] filter matches no scenarios, then error with "unknown benchmark scenario filter" message as suggested in review.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main focus: performance optimizations across Lua encoding, caching, scanner, and validation stack components, matching the substantial changes across those areas.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Check	✅ Passed	No security vulnerabilities found. PR contains performance optimizations for a JSON encoder/decoder with proper escape handling, validation bounds checking, and no credential/secret handling.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 001-json-perf-optimize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

membphis · 2026-05-23T12:48:51Z

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	98,965	148,364	+50%
medium (60KB)	52,854	71,480	+35%
100k	42,105	54,705	+30%
500k	10,070	13,541	+34%
1m	5,111	6,699	+31%

add field + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	99,548	158,013	+59%
medium (60KB)	60,234	78,040	+30%
100k	46,253	61,502	+33%
500k	12,158	14,018	+15%
1m	6,209	7,349	+18%

modify nested + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	59,056	117,409	+99%
medium (60KB)	121,212	165,344	+36%
100k	93,371	107,759	+15%
500k	34,423	35,881	+4%
1m	18,773	19,025	+1%

unmodified encode (memcpy fast path)

Payload	Before (ops/s)	After (ops/s)	Change
100k	159,744	220,000	+38%
1m	19,841	21,941	+11%

decode + access content

Payload	Before (ops/s)	After (ops/s)	Change
100k	155,039	158,028	+2%
1m	20,718	21,556	+4%

Before = baseline prior to any optimization. After = Phase 2 (Lua layer) + Phase 3 (Rust layer). Benchmarks run via make bench on same hardware throughout.

Copilot

Pull request overview

Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).

Changes:

Lua: faster string encoding, O(1) mutation tracking via _dirty propagation, and table type tagging to avoid repeated array/object scans during encode.
Rust: SkipCache avoids per-hit Vec cloning via Rc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, and parse_i64 uses memchr3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/lua/lazy_table_spec.lua	Adds encode/mutation test coverage for new fast paths and dirty propagation.
lua/qjson/table.lua	Implements Lua-layer encode + mutation tracking optimizations (string escaper, `_dirty`, type tagging).
src/skip_cache.rs	Switches skip-cache child index storage to `Rc<[u32]>` for O(1) cache-hit cloning.
src/cursor.rs	Uses `Rc::clone` on cache hits and stores populated vectors into `Rc<[u32]>`.
src/scan/avx2.rs	Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals.
src/validate/mod.rs	Reworks eager validation stack to avoid heap allocation for typical depths.
src/decode/number.rs	Uses `memchr::memchr3` for faster integer/float discrimination.
benches/lua_bench.lua	Adds modified-encode benchmark scenarios (modify top/add field/modify nested).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/skip_cache.rs (1)

25-27: 💤 Low value

Consider a single shared empty slice constant.

Both new() and get_or_insert() create new empty Rc<[u32]> instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.

♻️ Optional refactor using lazy_static or thread_local

+use std::cell::RefCell;
+
+thread_local! {
+    static EMPTY_SLICE: Rc<[u32]> = Rc::from([]);
+}
+
 impl SkipCache {
     pub(crate) fn new() -> Self {
-        let empty: Rc<[u32]> = Rc::from([]);
         Self {
-            slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }],
+            slots: vec![SkipSlot { 
+                child_starts: EMPTY_SLICE.with(Rc::clone), 
+                child_ends: EMPTY_SLICE.with(Rc::clone) 
+            }],
             by_opener: FxHashMap::default(),
         }
     }
 
     pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) {
         if let Some(&slot) = self.by_opener.get(&opener_idx) {
             return (slot, true);
         }
         let new = self.slots.len() as u32;
-        let empty: Rc<[u32]> = Rc::from([]);
-        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
+        self.slots.push(SkipSlot { 
+            child_starts: EMPTY_SLICE.with(Rc::clone), 
+            child_ends: EMPTY_SLICE.with(Rc::clone) 
+        });
         self.by_opener.insert(opener_idx, new);
         (new, false)
     }

Also applies to: 39-40

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty
Rc<[u32]> and use it from both new() and get_or_insert() instead of calling
Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or
lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot
construction and any place in get_or_insert() with
Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

tests/lua/lazy_table_spec.lua (1)

394-442: ⚡ Quick win

Add a regression test for __qjson_type payload-key collision.

Please add a case where decoded data includes a user key named __qjson_type and verify encode still preserves object semantics after mutation.

Example test to add

+    it("does not let user __qjson_type key alter table dispatch", function()
+        local cjson = require("cjson")
+        local t = qjson.decode('{"__qjson_type":"array","a":1}')
+        t.a = 2
+        local out = qjson.encode(t)
+        local parsed = cjson.decode(out)
+        assert.are.equal("array", parsed.__qjson_type)
+        assert.are.equal(2, parsed.a)
+    end)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test
in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied
"__qjson_type" key via qjson.decode, mutates that field (and optionally other
fields or nested entries), re-encodes with qjson.encode, then cjson.decode the
output and assert the resulting table preserves object semantics and the mutated
"__qjson_type" value; locate the test near the existing it("...") cases and
follow their pattern using qjson.decode, qjson.encode, and cjson.decode to
validate both the "__qjson_type" field and other fields remain correct after
encode/decode.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.

In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.

---

Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0

📥 Commits

Reviewing files that changed from the base of the PR and between 8e49d18 and c7574e3.

📒 Files selected for processing (8)

benches/lua_bench.lua
lua/qjson/table.lua
src/cursor.rs
src/decode/number.rs
src/scan/avx2.rs
src/skip_cache.rs
src/validate/mod.rs
tests/lua/lazy_table_spec.lua

- Replace __qjson_type string key with local weak side-table (TABLE_TYPE_HINT) to prevent collision with user payload keys - Fix github_table_modify_add to mutate first element instead of array root so the added field appears in encoded output

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

membphis · 2026-05-23T15:34:49Z

    // After the walk, the stack must hold exactly one frame: the root
    // context, which must be `TopDone` (root value consumed).
-    if stack.len() != 1 || stack[0] != CtxKind::TopDone {
+    if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone {


Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.

membphis · 2026-05-23T15:34:50Z

+    macro_rules! push {
+        ($kind:expr) => {
+            if sp < STACK_CAP {
+                stack_buf[sp] = $kind;
+                sp += 1;
+            } else {
+                let fb = fallback.get_or_insert_with(|| {
+                    let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16);
+                    v.extend_from_slice(&stack_buf[..sp]);
+                    v
+                });
+                sp = STACK_CAP.wrapping_add(fb.len() + 1);
+                fb.push($kind);
+            }
+        };
+    }
+    macro_rules! pop {
+        () => {{
+            if sp <= STACK_CAP {
+                if sp == 0 { None }
+                else { sp -= 1; Some(stack_buf[sp]) }
+            } else {
+                let fb = fallback.as_mut().unwrap();
+                let val = fb.pop();
+                if fb.is_empty() { sp = STACK_CAP; }
+                val
+            }


The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.

membphis · 2026-05-23T15:34:51Z

 -- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON
 -- keys with these names would collide (minor, deferred). Centralized here so
 -- the dirty check and __newindex can share the list.
 local INTERNAL_KEYS = {
    _doc = true, _cur_box = true, _cur = true, _bs = true, _be = true,
+    _parent = true, _dirty = true,
 }


Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.

membphis · 2026-05-23T15:34:52Z

+    for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
+        rawset(t, f, nil)
+    end
    setmetatable(t, nil)
+    TABLE_TYPE_HINT[t] = "object"


Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.

membphis · 2026-05-23T15:34:54Z

+    for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
+        rawset(t, f, nil)
+    end


Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.

membphis · 2026-05-23T13:20:32Z

Performance Comparison: main vs PR (aarch64, Apple M-series)

Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code.

Key metrics (median ops/s)

Scenario	Op	main	PR	Change
100k	modify top + encode	51,203	48,473	-5%
100k	add field + encode	48,591	48,193	-1%
100k	modify nested + encode	95,969	107,991	+13%
200k	modify nested + encode	66,401	74,405	+12%
1m	modify top + encode	6,098	5,682	-7%
1m	modify nested + encode	19,036	19,455	+2%
5m	modify top + encode	585	1,043	+78%
5m	add field + encode	579	1,039	+79%
10m	modify top + encode	275	457	+66%

Existing paths (parse, access, encode unmod)

Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches.

Modified-encode scenarios (small/medium multimodal payloads)

Scenario	Op	ops/s
small (2KB)	modify top + encode	145,641
small (2KB)	add field + encode	148,920
small (2KB)	modify nested + encode	81,171
medium (60KB)	modify top + encode	69,185
medium (60KB)	add field + encode	80,632
medium (60KB)	modify nested + encode	174,095

Both branches tested with identical bench harness (modified-encode scenarios present in both). Main = original table.lua + original Rust. PR = optimized table.lua + optimized Rust (Rc cache, stack validation, memchr3, AVX2 LUT).

Scanner (src/scan/avx2.rs): - Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant measured -45% parse on small payloads (-7% on 1m where the in-string fast-probe dominates anyway): VPSHUFB ymm is split into two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq-> xor->movemask chain has a longer critical path, and VPMOVMSKB (lat 4) plus the LUT-load constants pressure the FP ports. The parallel cmpeq design lets independent compare chains dispatch across multiple ports. NEON path is unaffected (separate file). Lua layer (lua/qjson/table.lua) -- review feedback: - Update INTERNAL_KEYS comment to reference its actual consumers (__newindex cache snapshotting and encode_lazy_object_walking), not the removed recursive is_dirty walk. - Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex and LazyArray.__newindex with seven inline rawset(t, ..., nil) calls. Eliminates the per-call 7-elem table-literal allocation on the materialization path. Uses rawset (not multi-assignment to nil) because root views from _M.decode lack _parent, so 't._parent = nil' could fire __newindex recursively. Bench (benches/lua_bench.lua): - Force LuaJIT to evaluate qjson.encode results: replace 'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t); if #_enc < 2 then error(...) end' in all 8 mutation/encode cases. Without this, LuaJIT could partially DCE the encode call on some trace shapes and produce misleading speedups. - Raise the warmup floor from max(3, iters/5) to max(50, iters/5). LuaJIT's default hotloop is 56; with the original floor of 3, the 1m payload (iters=15) measured pre-JIT interpreter mode for most of the run. - Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall time exceeds os.clock granularity (~1ms) by a comfortable margin.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

@@ -34,7 +36,8 @@ impl SkipCache {
            return (slot, true);
        }
        let new = self.slots.len() as u32;
-        self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() });
+        let empty: Rc<[u32]> = Rc::from([]);
+        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
        self.by_opener.insert(opener_idx, new);


+    -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
+    -- the bench measuring interpreter mode for the large-payload scenarios
+    -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
+    local warmup = math.max(50, math.floor(iters / 5))


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582

📥 Commits

Reviewing files that changed from the base of the PR and between c7574e3 and 65da500.

📒 Files selected for processing (3)

benches/lua_bench.lua
lua/qjson/table.lua
src/scan/avx2.rs

✅ Files skipped from review due to trivial changes (1)

src/scan/avx2.rs

…pty Rc - Collapse nested if in parse_f64 skip_validation path (clippy::collapsible_if) - Allow clippy::approx_constant on intentional test value 3.14 - Replace push loops with vec![b'['; N] in depth tests (clippy::same-item-push) - Update warmup comment in lua_bench.lua to reference current iters values - Use shared empty_rc field in SkipCache to avoid per-slot Rc::from([]) allocation

…code results Changes: - benches/lua_bench.lua: accept arg[1] as scenario name filter; when set, only that single scenario runs (backward-compatible — no arg runs all as before). - Makefile: bench target loops BENCH_SCENARIOS, launching a fresh resty process for each. Eliminates accumulated GC/JIT trace-cache interference between payload sizes. - docs/benchmarks.md: document fresh-process methodology, add modify+encode workload descriptions and throughput table. - README.md: reference modify+encode scenarios added in PR #54. Benchmark summary (x86_64 Linux, 3-round fresh-process, PR vs main): modify + encode path (median ops/s, PR main): small modify top: +41%..+75% (33k→59k) small modify nested: +44%..+80% (29k→53k) medium modify nested: +58%..+106% (92k→190k) 100k modify nested: +31%..+45% (47k→68k) interleaved mod top: +2%..+30% (5.8k→7.2k) 30/33 data points PR > main (91%). encode(unmodified) broadly flat or slightly up (+2%..+26%), consistent with JIT trace slot competition from the additional modify+encode scenarios.

Update README.md summary table and docs/benchmarks.md (throughput, speedup, memory, modify+encode) with results from a clean make bench run on the current branch.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 299-307: The filter check using local variable "filter" can
silently skip all scenarios when arg[1] is mistyped; modify the loop over
"scenarios" that uses "if filter and s.name ~= filter then goto
continue_scenario end" to set a boolean flag (e.g., "matched = true") when any
scenario is executed, and after the loop check that flag and call error/print
and os.exit(1) if no scenario matched; apply the same pattern to the second loop
that uses the same filter logic (the block using the "continue_scenario" label)
so CI fails fast when an invalid filter was supplied.

In `@README.md`:
- Line 116: Replace the non-hyphenated phrase "full size ladder" with the
hyphenated compound adjective "full-size ladder" in the README text (search for
the exact string "full size ladder" near the benchmarks link and update it to
"full-size ladder") so the phrase uses the correct hyphenation for a compound
adjective.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c7a1775e-967d-4c22-8a44-b581e7e5c3b7

📥 Commits

Reviewing files that changed from the base of the PR and between 65da500 and 9c49135.

📒 Files selected for processing (7)

Makefile
README.md
benches/lua_bench.lua
docs/benchmarks.md
src/decode/number.rs
src/skip_cache.rs
src/validate/mod.rs

✅ Files skipped from review due to trivial changes (1)

docs/benchmarks.md

coderabbitai · 2026-05-23T22:15:05Z

+local filter = arg[1]
+
 if not simdjson then
    print("lua-resty-simdjson unavailable; skipping simdjson rows: "
        .. tostring(simdjson_or_err))
 end

 for _, s in ipairs(scenarios) do
+    if filter and s.name ~= filter then goto continue_scenario end


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when arg[1] does not match any scenario.

A typo in the filter currently exits with no benchmark rows, which is easy to misread as success in CI/manual runs. Please track whether any scenario matched and error if none did.

Suggested patch

local filter = arg[1] +local matched = false @@ for _, s in ipairs(scenarios) do if filter and s.name ~= filter then goto continue_scenario end + matched = true print(string.format("=== %s (%d bytes) ===", s.name, `#s.payload`)) @@ end + +if filter and filter ~= "interleaved" and not matched then + error("unknown benchmark scenario filter: " .. tostring(filter)) +end

Also applies to: 378-379

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benches/lua_bench.lua` around lines 299 - 307, The filter check using local variable "filter" can silently skip all scenarios when arg[1] is mistyped; modify the loop over "scenarios" that uses "if filter and s.name ~= filter then goto continue_scenario end" to set a boolean flag (e.g., "matched = true") when any scenario is executed, and after the loop check that flag and call error/print and os.exit(1) if no scenario matched; apply the same pattern to the second loop that uses the same filter logic (the block using the "continue_scenario" label) so CI fails fast when an invalid filter was supplied.

coderabbitai · 2026-05-23T22:15:05Z

-OpenResty environment; otherwise it skips the simdjson rows.
+Modify-then-encode scenarios (PR #54) add decode → mutate field → re-encode
+workloads; small payload modify+encode reaches 48k–60k ops/s. See
+[`docs/benchmarks.md`](docs/benchmarks.md) for the full size ladder,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use hyphenated compound adjective in README text.

“full size ladder” should be “full-size ladder” for standard technical prose.

🧰 Tools

🪛 LanguageTool

[grammar] ~116-~116: Use a hyphen to join words.
Context: ...ks.md`](docs/benchmarks.md) for the full size ladder, modify+encode results, memo...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 116, Replace the non-hyphenated phrase "full size ladder" with the hyphenated compound adjective "full-size ladder" in the README text (search for the exact string "full size ladder" near the benchmarks link and update it to "full-size ladder") so the phrase uses the correct hyphenation for a compound adjective.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

+
+bench: build vendor/lua-cjson/cjson.so ## Run each scenario in a fresh LuaJIT process
+	@for s in $(BENCH_SCENARIOS); do \
+		$(LUA_ENV) $(RESTY) benches/lua_bench.lua $$s; \


+    // 7 parallel byte-equality compares. On AMD Zen2 these dispatch across
+    // multiple FP ports and beat a PSHUFB-LUT nibble classifier (PSHUFB ymm
+    // is split into 2 micro-ops per lane, the LUT chain lengthens the
+    // critical path, and VPMOVMSKB has lat=4 — the 14-movemask total is
+    // still cheaper than the LUT path's serial dependency). PR #54 tried
+    // PSHUFB-LUT but measured -45% parse on small payloads on Zen2; this
+    // form is what shipped through #51.


+        let mut buf = vec![b'['; 1024];
+        buf.extend_from_slice(&vec![b']'; 1024]);
        assert!(


…son in README Adds cjson.decode + modify top/add field/modify nested + cjson.encode benchmarks so the modify+encode path has an eager baseline. README encode table now shows cjson/qjson side-by-side for modify workloads: qjson is 10-43x faster at 60 KB+.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

 for _, s in ipairs(scenarios) do
+    if filter and s.name ~= filter then goto continue_scenario end
    print(string.format("=== %s (%d bytes) ===", s.name, #s.payload))


Copilot AI review requested due to automatic review settings May 23, 2026 12:44

Copilot started reviewing on behalf of membphis May 23, 2026 12:44 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread lua/qjson/table.lua Outdated

Comment thread lua/qjson/table.lua

Comment thread lua/qjson/table.lua

fix: prevent __qjson_type and _dirty leakage into encode output

913f6a8

- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread benches/lua_bench.lua

Comment thread lua/qjson/table.lua Outdated

Copilot AI review requested due to automatic review settings May 23, 2026 12:55

Copilot started reviewing on behalf of membphis May 23, 2026 12:55 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

perf: extract encode_plain_table to keep lazy-proxy fast path narrow

c8630b7

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.

Copilot AI review requested due to automatic review settings May 23, 2026 15:38

Copilot started reviewing on behalf of membphis May 23, 2026 15:38 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread benches/lua_bench.lua

membphis added 3 commits May 23, 2026 21:35

docs: refresh all benchmark tables with fresh-process build data

923b91b

Update README.md summary table and docs/benchmarks.md (throughput, speedup, memory, modify+encode) with results from a clean make bench run on the current branch.

Copilot AI review requested due to automatic review settings May 23, 2026 22:13

Copilot started reviewing on behalf of membphis May 23, 2026 22:14 View session

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

docs: add modify+encode columns to README benchmark summary

090d134

Copilot AI reviewed May 23, 2026

View reviewed changes

membphis added 2 commits May 23, 2026 22:18

docs: add speedup column to README encode table

5a178d7

Copilot AI review requested due to automatic review settings May 23, 2026 22:20

Copilot started reviewing on behalf of membphis May 23, 2026 22:20 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread benches/lua_bench.lua

Comment on lines 306 to 308

for _, s in ipairs(scenarios) do

if filter and s.name ~= filter then goto continue_scenario end

print(string.format("=== %s (%d bytes) ===", s.name, #s.payload))

membphis merged commit 5901f7a into main May 23, 2026
8 checks passed

Conversation

membphis commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Lua layer (Phase 2)

Rust layer (Phase 3)

Benchmarks

Performance (aarch64, small/medium multimodal payloads)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Possibly Related PRs

❌ Failed checks (1 warning)

Uh oh!

membphis commented May 23, 2026

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

add field + encode

modify nested + encode

unmodified encode (memcpy fast path)

decode + access content

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis commented May 23, 2026

Performance Comparison: main vs PR (aarch64, Apple M-series)

Key metrics (median ops/s)

Existing paths (parse, access, encode unmod)

Modified-encode scenarios (small/medium multimodal payloads)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading