Skip to content

perf: Lua encode + cache + scanner + validation stack optimizations#54

Merged
membphis merged 11 commits into
mainfrom
001-json-perf-optimize
May 23, 2026
Merged

perf: Lua encode + cache + scanner + validation stack optimizations#54
membphis merged 11 commits into
mainfrom
001-json-perf-optimize

Conversation

@membphis
Copy link
Copy Markdown
Collaborator

@membphis membphis commented May 23, 2026

Summary

Performance optimization across the full decode → access → modify → encode pipeline.

Lua layer (Phase 2)

  • encode_string bulk-copy: Replaces per-byte loop with segment-based copying. Escape-free strings return via ' .. s .. ' with zero table allocations. Escape lookup via table instead of if-elseif chain.
  • _dirty flag (O(1)): Replaces recursive is_dirty tree walk with a _dirty boolean propagated up the parent chain on __newindex.
  • is_array pre-scan elimination: Tags materialized tables with __qjson_type marker during materialization, eliminating duplicate key scan in encode.

Rust layer (Phase 3)

  • SkipCache Rc<[u32]>: Vec::clone on every cache hit replaced with O(1) Rc::clone, eliminating O(n) heap allocation per repeated field access.
  • AVX2 PSHUFB-LUT: structural_mask_chunk 7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).
  • Validation stack array: validate_eager_values heap allocation replaced with [CtxKind; 64] fixed-size stack (falls back to Vec for depth > 64).
  • memchr3: parse_i64 float-detection changed from byte-by-byte .any() to SIMD-accelerated memchr::memchr3.

Benchmarks

  • Added 3 modified-encode scenarios (modify top / add field / modify nested) across all payload sizes.

Performance (aarch64, small/medium multimodal payloads)

  • modify top + encode: +35–50%
  • add field + encode: +30–59%
  • modify nested + encode: +36–99%

Summary by CodeRabbit

  • Refactor
    • Improved performance and memory via shared buffers/caches, reduced allocations, and a fixed-capacity validation stack.
  • New Features
    • Benchmarks: scenario filtering, fresh-process runs, and new decode→modify→encode workloads covering top/add/nested mutations; make target updated to run scenarios.
  • Bug Fixes
    • More reliable encoding with deterministic object/array handling and correct propagation of lazy mutations; faster numeric parsing checks.
  • Documentation
    • README and benchmarks doc updated with methodology and new results.
  • Tests
    • Added tests for nested mutations, additions, and special-character string round-trips.

Review Change Stack

- Replace per-byte encode_string with bulk-copy segments and fast
  escape-free path (zero table allocations for clean strings)
- Replace recursive is_dirty with O(1) _dirty flag propagated via
  parent chain on __newindex
- Eliminate is_array pre-scan by tagging materialized tables with
  __qjson_type marker during materialization
- Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1)
  cache-hit access instead of O(n) vector clone
- Replace 7-cmpeq loop in AVX2 structural_mask_chunk with
  PSHUFB-based nibble LUT classification
- Replace heap-allocated validation state stack with fixed-size
  [CtxKind; 64] array for typical JSON depths
- Replace byte-by-byte float-detection in parse_i64 with
  memchr::memchr3
- Add modified-encode benchmark scenarios (modify top / add field /
  modify nested) to bench harness
- Add correctness tests for modified-encode round-trip
Copilot AI review requested due to automatic review settings May 23, 2026 12:44
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 23, 2026

Warning

Review limit reached

@membphis, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 50 minutes and 3 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 65fbb50a-b09c-4a71-84f4-1f693a8a8739

📥 Commits

Reviewing files that changed from the base of the PR and between 9c49135 and 5a178d7.

📒 Files selected for processing (3)
  • README.md
  • benches/lua_bench.lua
  • docs/benchmarks.md
📝 Walkthrough

Walkthrough

Refactors lazy JSON mutation tracking to explicit _dirty flags, rewrites string encoding and encoder dispatch, replaces several vectors with shared Rc slices, adds a fixed-stack validator, optimizes number scanning and updates an AVX2 comment, and extends tests and benchmarks for modify→encode flows.

Changes

Integrated Performance Optimization Suite

Layer / File(s) Summary
Lazy mutation tracking via dirty flags
lua/qjson/table.lua
Adds _parent links and _dirty flags on lazy child views; __newindex marks ancestors dirty, materialization clears lazy internals and tags realized container kind; root decode initializes _dirty = false.
String encoding with ESCAPES lookup and segmented build
lua/qjson/table.lua
Replaces encode_string with ESCAPES-table-based escaper using a bulk/segmented build strategy and \u00xx control-byte escapes.
Encoder fast path and plain-table dispatch
lua/qjson/table.lua
Removes recursive is_dirty; encode_proxy checks t._dirty; adds encode_plain_table to deterministically choose array vs object via empty_array_mt and TABLE_TYPE_HINT; encode() delegates plain tables.
SkipCache: shared Rc empty slices
src/skip_cache.rs
Converts SkipSlot.child_starts/child_ends to Rc<[u32]> and initializes empty slots from a single shared Rc<[u32]>.
Cursor walk_children Rc cache-hit fast path
src/cursor.rs
Cache-hit path now uses Rc::clone instead of cloning vectors; cache storage updated to use into() conversions for Rc-backed representation.
Validation eager checker fixed-size stack and macros
src/validate/mod.rs
Replaces heap Vec<CtxKind> with fixed-capacity stack_buf and macros (push!, pop!, last_mut!, stack_len!, stack_is_empty!); lazy fallback Vec on overflow; updates state transitions and tests.
Number parsing guards and test lint
src/decode/number.rs
parse_i64 uses memchr::memchr3 to detect ./e/E; parse_f64 combines leading-byte validation; adds clippy allow in tests.
AVX2 structural_mask_chunk comment update
src/scan/avx2.rs
Updates the doc comment to reflect the 7-parallel-byte-equality-compare implementation and microarchitectural notes; implementation unchanged.
Lua tests for lazy table encoding
tests/lua/lazy_table_spec.lua
Adds tests validating encode reflects top-level and nested mutations, new-field insertion, unmodified fast-path encoding, and correct escaping round-trip for special characters.
Bench updates: decode→modify→encode scenarios
benches/lua_bench.lua, Makefile
Adds default multimodal and scenario-specific modify callbacks; wires github-100k helpers; scenario filtering; adds per-scenario and interleaved decode→modify→encode benchmark cases with minimum-length validation; Makefile iterates BENCH_SCENARIOS.
Docs and README benchmark updates
docs/benchmarks.md, README.md
Updates methodology (warmup, fresh-process isolation), adds modify+encode workload descriptions and throughput table, and replaces README benchmark text with modify-then-encode notes.

Sequence Diagram(s)

No sequence diagram generated for this PR.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly Related PRs

  • api7/lua-qjson#51: Modifies src/validate/mod.rs eager validation logic; both PRs refactor validation/stack handling and related checks.
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
E2e Test Quality Review ⚠️ Warning Review comment on benches/lua_bench.lua requesting error handling for invalid filter remains unaddressed; missing matched flag to error when filter doesn't match any scenario. Add matched variable to detect when arg[1] filter matches no scenarios, then error with "unknown benchmark scenario filter" message as suggested in review.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main focus: performance optimizations across Lua encoding, caching, scanner, and validation stack components, matching the substantial changes across those areas.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Check ✅ Passed No security vulnerabilities found. PR contains performance optimizations for a JSON encoder/decoder with proper escape handling, validation bounds checking, and no credential/secret handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 001-json-perf-optimize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@membphis
Copy link
Copy Markdown
Collaborator Author

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 98,965 148,364 +50%
medium (60KB) 52,854 71,480 +35%
100k 42,105 54,705 +30%
500k 10,070 13,541 +34%
1m 5,111 6,699 +31%

add field + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 99,548 158,013 +59%
medium (60KB) 60,234 78,040 +30%
100k 46,253 61,502 +33%
500k 12,158 14,018 +15%
1m 6,209 7,349 +18%

modify nested + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 59,056 117,409 +99%
medium (60KB) 121,212 165,344 +36%
100k 93,371 107,759 +15%
500k 34,423 35,881 +4%
1m 18,773 19,025 +1%

unmodified encode (memcpy fast path)

Payload Before (ops/s) After (ops/s) Change
100k 159,744 220,000 +38%
1m 19,841 21,941 +11%

decode + access content

Payload Before (ops/s) After (ops/s) Change
100k 155,039 158,028 +2%
1m 20,718 21,556 +4%

Before = baseline prior to any optimization. After = Phase 2 (Lua layer) + Phase 3 (Rust layer). Benchmarks run via make bench on same hardware throughout.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).

Changes:

  • Lua: faster string encoding, O(1) mutation tracking via _dirty propagation, and table type tagging to avoid repeated array/object scans during encode.
  • Rust: SkipCache avoids per-hit Vec cloning via Rc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, and parse_i64 uses memchr3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/lua/lazy_table_spec.lua Adds encode/mutation test coverage for new fast paths and dirty propagation.
lua/qjson/table.lua Implements Lua-layer encode + mutation tracking optimizations (string escaper, _dirty, type tagging).
src/skip_cache.rs Switches skip-cache child index storage to Rc<[u32]> for O(1) cache-hit cloning.
src/cursor.rs Uses Rc::clone on cache hits and stores populated vectors into Rc<[u32]>.
src/scan/avx2.rs Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals.
src/validate/mod.rs Reworks eager validation stack to avoid heap allocation for typical depths.
src/decode/number.rs Uses memchr::memchr3 for faster integer/float discrimination.
benches/lua_bench.lua Adds modified-encode benchmark scenarios (modify top/add field/modify nested).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lua/qjson/table.lua Outdated
Comment thread lua/qjson/table.lua
Comment thread lua/qjson/table.lua
- Skip __qjson_type internal marker in encode_object() pairs iteration
- Guard dirty propagation to stop at non-lazy ancestors,
  preventing _dirty rawset on already-materialized tables
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/skip_cache.rs (1)

25-27: 💤 Low value

Consider a single shared empty slice constant.

Both new() and get_or_insert() create new empty Rc<[u32]> instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.

♻️ Optional refactor using lazy_static or thread_local
+use std::cell::RefCell;
+
+thread_local! {
+    static EMPTY_SLICE: Rc<[u32]> = Rc::from([]);
+}
+
 impl SkipCache {
     pub(crate) fn new() -> Self {
-        let empty: Rc<[u32]> = Rc::from([]);
         Self {
-            slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }],
+            slots: vec![SkipSlot { 
+                child_starts: EMPTY_SLICE.with(Rc::clone), 
+                child_ends: EMPTY_SLICE.with(Rc::clone) 
+            }],
             by_opener: FxHashMap::default(),
         }
     }
 
     pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) {
         if let Some(&slot) = self.by_opener.get(&opener_idx) {
             return (slot, true);
         }
         let new = self.slots.len() as u32;
-        let empty: Rc<[u32]> = Rc::from([]);
-        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
+        self.slots.push(SkipSlot { 
+            child_starts: EMPTY_SLICE.with(Rc::clone), 
+            child_ends: EMPTY_SLICE.with(Rc::clone) 
+        });
         self.by_opener.insert(opener_idx, new);
         (new, false)
     }

Also applies to: 39-40

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty
Rc<[u32]> and use it from both new() and get_or_insert() instead of calling
Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or
lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot
construction and any place in get_or_insert() with
Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.
tests/lua/lazy_table_spec.lua (1)

394-442: ⚡ Quick win

Add a regression test for __qjson_type payload-key collision.

Please add a case where decoded data includes a user key named __qjson_type and verify encode still preserves object semantics after mutation.

Example test to add
+    it("does not let user __qjson_type key alter table dispatch", function()
+        local cjson = require("cjson")
+        local t = qjson.decode('{"__qjson_type":"array","a":1}')
+        t.a = 2
+        local out = qjson.encode(t)
+        local parsed = cjson.decode(out)
+        assert.are.equal("array", parsed.__qjson_type)
+        assert.are.equal(2, parsed.a)
+    end)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test
in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied
"__qjson_type" key via qjson.decode, mutates that field (and optionally other
fields or nested entries), re-encodes with qjson.encode, then cjson.decode the
output and assert the resulting table preserves object semantics and the mutated
"__qjson_type" value; locate the test near the existing it("...") cases and
follow their pattern using qjson.decode, qjson.encode, and cjson.decode to
validate both the "__qjson_type" field and other fields remain correct after
encode/decode.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.

In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.

---

Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0

📥 Commits

Reviewing files that changed from the base of the PR and between 8e49d18 and c7574e3.

📒 Files selected for processing (8)
  • benches/lua_bench.lua
  • lua/qjson/table.lua
  • src/cursor.rs
  • src/decode/number.rs
  • src/scan/avx2.rs
  • src/skip_cache.rs
  • src/validate/mod.rs
  • tests/lua/lazy_table_spec.lua

Comment thread benches/lua_bench.lua
Comment thread lua/qjson/table.lua Outdated
- Replace __qjson_type string key with local weak side-table
  (TABLE_TYPE_HINT) to prevent collision with user payload keys
- Fix github_table_modify_add to mutate first element instead of
  array root so the added field appears in encoded output
Copilot AI review requested due to automatic review settings May 23, 2026 12:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread src/validate/mod.rs
// After the walk, the stack must hold exactly one frame: the root
// context, which must be `TopDone` (root value consumed).
if stack.len() != 1 || stack[0] != CtxKind::TopDone {
if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.

Comment thread src/validate/mod.rs
Comment on lines +154 to +180
macro_rules! push {
($kind:expr) => {
if sp < STACK_CAP {
stack_buf[sp] = $kind;
sp += 1;
} else {
let fb = fallback.get_or_insert_with(|| {
let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16);
v.extend_from_slice(&stack_buf[..sp]);
v
});
sp = STACK_CAP.wrapping_add(fb.len() + 1);
fb.push($kind);
}
};
}
macro_rules! pop {
() => {{
if sp <= STACK_CAP {
if sp == 0 { None }
else { sp -= 1; Some(stack_buf[sp]) }
} else {
let fb = fallback.as_mut().unwrap();
let val = fb.pop();
if fb.is_empty() { sp = STACK_CAP; }
val
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.

Comment thread lua/qjson/table.lua
Comment on lines 256 to 262
-- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON
-- keys with these names would collide (minor, deferred). Centralized here so
-- the dirty check and __newindex can share the list.
local INTERNAL_KEYS = {
_doc = true, _cur_box = true, _cur = true, _bs = true, _be = true,
_parent = true, _dirty = true,
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.

Comment thread lua/qjson/table.lua Outdated
Comment on lines +288 to +292
for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
rawset(t, f, nil)
end
setmetatable(t, nil)
TABLE_TYPE_HINT[t] = "object"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.

Comment thread lua/qjson/table.lua Outdated
Comment on lines +321 to +323
for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
rawset(t, f, nil)
end
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a
separate function so the hot encode path (lazy proxy → memcpy)
stays minimal for LuaJIT trace compilation.
@membphis
Copy link
Copy Markdown
Collaborator Author

Performance Comparison: main vs PR (aarch64, Apple M-series)

Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code.

Key metrics (median ops/s)

Scenario Op main PR Change
100k modify top + encode 51,203 48,473 -5%
100k add field + encode 48,591 48,193 -1%
100k modify nested + encode 95,969 107,991 +13%
200k modify nested + encode 66,401 74,405 +12%
1m modify top + encode 6,098 5,682 -7%
1m modify nested + encode 19,036 19,455 +2%
5m modify top + encode 585 1,043 +78%
5m add field + encode 579 1,039 +79%
10m modify top + encode 275 457 +66%

Existing paths (parse, access, encode unmod)

Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches.

Modified-encode scenarios (small/medium multimodal payloads)

Scenario Op ops/s
small (2KB) modify top + encode 145,641
small (2KB) add field + encode 148,920
small (2KB) modify nested + encode 81,171
medium (60KB) modify top + encode 69,185
medium (60KB) add field + encode 80,632
medium (60KB) modify nested + encode 174,095

Both branches tested with identical bench harness (modified-encode scenarios present in both). Main = original table.lua + original Rust. PR = optimized table.lua + optimized Rust (Rc cache, stack validation, memchr3, AVX2 LUT).

Scanner (src/scan/avx2.rs):
- Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier
  back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant
  measured -45% parse on small payloads (-7% on 1m where the
  in-string fast-probe dominates anyway): VPSHUFB ymm is split into
  two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq->
  xor->movemask chain has a longer critical path, and VPMOVMSKB
  (lat 4) plus the LUT-load constants pressure the FP ports. The
  parallel cmpeq design lets independent compare chains dispatch
  across multiple ports. NEON path is unaffected (separate file).

Lua layer (lua/qjson/table.lua) -- review feedback:
- Update INTERNAL_KEYS comment to reference its actual consumers
  (__newindex cache snapshotting and encode_lazy_object_walking),
  not the removed recursive is_dirty walk.
- Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex
  and LazyArray.__newindex with seven inline rawset(t, ..., nil)
  calls. Eliminates the per-call 7-elem table-literal allocation
  on the materialization path. Uses rawset (not multi-assignment
  to nil) because root views from _M.decode lack _parent, so
  't._parent = nil' could fire __newindex recursively.

Bench (benches/lua_bench.lua):
- Force LuaJIT to evaluate qjson.encode results: replace
  'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t);
  if #_enc < 2 then error(...) end' in all 8 mutation/encode cases.
  Without this, LuaJIT could partially DCE the encode call on some
  trace shapes and produce misleading speedups.
- Raise the warmup floor from max(3, iters/5) to max(50, iters/5).
  LuaJIT's default hotloop is 56; with the original floor of 3, the
  1m payload (iters=15) measured pre-JIT interpreter mode for most
  of the run.
- Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall
  time exceeds os.clock granularity (~1ms) by a comfortable margin.
Copilot AI review requested due to automatic review settings May 23, 2026 15:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread src/skip_cache.rs
Comment on lines 24 to 41
@@ -34,7 +36,8 @@ impl SkipCache {
return (slot, true);
}
let new = self.slots.len() as u32;
self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() });
let empty: Rc<[u32]> = Rc::from([]);
self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
self.by_opener.insert(opener_idx, new);
Comment thread benches/lua_bench.lua
Comment on lines +148 to +151
-- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
-- the bench measuring interpreter mode for the large-payload scenarios
-- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
local warmup = math.max(50, math.floor(iters / 5))
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582

📥 Commits

Reviewing files that changed from the base of the PR and between c7574e3 and 65da500.

📒 Files selected for processing (3)
  • benches/lua_bench.lua
  • lua/qjson/table.lua
  • src/scan/avx2.rs
✅ Files skipped from review due to trivial changes (1)
  • src/scan/avx2.rs

Comment thread benches/lua_bench.lua
membphis added 3 commits May 23, 2026 21:35
…pty Rc

- Collapse nested if in parse_f64 skip_validation path (clippy::collapsible_if)
- Allow clippy::approx_constant on intentional test value 3.14
- Replace push loops with vec![b'['; N] in depth tests (clippy::same-item-push)
- Update warmup comment in lua_bench.lua to reference current iters values
- Use shared empty_rc field in SkipCache to avoid per-slot Rc::from([]) allocation
…code results

Changes:
- benches/lua_bench.lua: accept arg[1] as scenario name filter;
  when set, only that single scenario runs (backward-compatible —
  no arg runs all as before).
- Makefile: bench target loops BENCH_SCENARIOS, launching a fresh
  resty process for each. Eliminates accumulated GC/JIT trace-cache
  interference between payload sizes.
- docs/benchmarks.md: document fresh-process methodology, add
  modify+encode workload descriptions and throughput table.
- README.md: reference modify+encode scenarios added in PR #54.

Benchmark summary (x86_64 Linux, 3-round fresh-process, PR vs main):

modify + encode path (median ops/s, PR main):
  small  modify top:     +41%..+75%   (33k→59k)
  small  modify nested:  +44%..+80%   (29k→53k)
  medium modify nested:  +58%..+106%  (92k→190k)
  100k   modify nested:  +31%..+45%   (47k→68k)
  interleaved mod top:   +2%..+30%   (5.8k→7.2k)

30/33 data points PR > main (91%). encode(unmodified) broadly flat
or slightly up (+2%..+26%), consistent with JIT trace slot competition
from the additional modify+encode scenarios.
Update README.md summary table and docs/benchmarks.md
(throughput, speedup, memory, modify+encode) with results from
a clean make bench run on the current branch.
Copilot AI review requested due to automatic review settings May 23, 2026 22:13
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 299-307: The filter check using local variable "filter" can
silently skip all scenarios when arg[1] is mistyped; modify the loop over
"scenarios" that uses "if filter and s.name ~= filter then goto
continue_scenario end" to set a boolean flag (e.g., "matched = true") when any
scenario is executed, and after the loop check that flag and call error/print
and os.exit(1) if no scenario matched; apply the same pattern to the second loop
that uses the same filter logic (the block using the "continue_scenario" label)
so CI fails fast when an invalid filter was supplied.

In `@README.md`:
- Line 116: Replace the non-hyphenated phrase "full size ladder" with the
hyphenated compound adjective "full-size ladder" in the README text (search for
the exact string "full size ladder" near the benchmarks link and update it to
"full-size ladder") so the phrase uses the correct hyphenation for a compound
adjective.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c7a1775e-967d-4c22-8a44-b581e7e5c3b7

📥 Commits

Reviewing files that changed from the base of the PR and between 65da500 and 9c49135.

📒 Files selected for processing (7)
  • Makefile
  • README.md
  • benches/lua_bench.lua
  • docs/benchmarks.md
  • src/decode/number.rs
  • src/skip_cache.rs
  • src/validate/mod.rs
✅ Files skipped from review due to trivial changes (1)
  • docs/benchmarks.md

Comment thread benches/lua_bench.lua
Comment on lines +299 to +307
local filter = arg[1]

if not simdjson then
print("lua-resty-simdjson unavailable; skipping simdjson rows: "
.. tostring(simdjson_or_err))
end

for _, s in ipairs(scenarios) do
if filter and s.name ~= filter then goto continue_scenario end
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when arg[1] does not match any scenario.

A typo in the filter currently exits with no benchmark rows, which is easy to misread as success in CI/manual runs. Please track whether any scenario matched and error if none did.

Suggested patch
 local filter = arg[1]
+local matched = false
@@
 for _, s in ipairs(scenarios) do
     if filter and s.name ~= filter then goto continue_scenario end
+    matched = true
     print(string.format("=== %s (%d bytes) ===", s.name, `#s.payload`))
@@
 end
+
+if filter and filter ~= "interleaved" and not matched then
+    error("unknown benchmark scenario filter: " .. tostring(filter))
+end

Also applies to: 378-379

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/lua_bench.lua` around lines 299 - 307, The filter check using local
variable "filter" can silently skip all scenarios when arg[1] is mistyped;
modify the loop over "scenarios" that uses "if filter and s.name ~= filter then
goto continue_scenario end" to set a boolean flag (e.g., "matched = true") when
any scenario is executed, and after the loop check that flag and call
error/print and os.exit(1) if no scenario matched; apply the same pattern to the
second loop that uses the same filter logic (the block using the
"continue_scenario" label) so CI fails fast when an invalid filter was supplied.

Comment thread README.md Outdated
OpenResty environment; otherwise it skips the simdjson rows.
Modify-then-encode scenarios (PR #54) add decode → mutate field → re-encode
workloads; small payload modify+encode reaches 48k–60k ops/s. See
[`docs/benchmarks.md`](docs/benchmarks.md) for the full size ladder,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use hyphenated compound adjective in README text.

“full size ladder” should be “full-size ladder” for standard technical prose.

🧰 Tools
🪛 LanguageTool

[grammar] ~116-~116: Use a hyphen to join words.
Context: ...ks.md`](docs/benchmarks.md) for the full size ladder, modify+encode results, memo...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 116, Replace the non-hyphenated phrase "full size ladder"
with the hyphenated compound adjective "full-size ladder" in the README text
(search for the exact string "full size ladder" near the benchmarks link and
update it to "full-size ladder") so the phrase uses the correct hyphenation for
a compound adjective.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment thread Makefile

bench: build vendor/lua-cjson/cjson.so ## Run each scenario in a fresh LuaJIT process
@for s in $(BENCH_SCENARIOS); do \
$(LUA_ENV) $(RESTY) benches/lua_bench.lua $$s; \
Comment thread src/scan/avx2.rs
Comment on lines +110 to +116
// 7 parallel byte-equality compares. On AMD Zen2 these dispatch across
// multiple FP ports and beat a PSHUFB-LUT nibble classifier (PSHUFB ymm
// is split into 2 micro-ops per lane, the LUT chain lengthens the
// critical path, and VPMOVMSKB has lat=4 — the 14-movemask total is
// still cheaper than the LUT path's serial dependency). PR #54 tried
// PSHUFB-LUT but measured -45% parse on small payloads on Zen2; this
// form is what shipped through #51.
Comment thread src/validate/mod.rs
Comment on lines +544 to 546
let mut buf = vec![b'['; 1024];
buf.extend_from_slice(&vec![b']'; 1024]);
assert!(
membphis added 2 commits May 23, 2026 22:18
…son in README

Adds cjson.decode + modify top/add field/modify nested + cjson.encode
benchmarks so the modify+encode path has an eager baseline. README
encode table now shows cjson/qjson side-by-side for modify workloads:
qjson is 10-43x faster at 60 KB+.
Copilot AI review requested due to automatic review settings May 23, 2026 22:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment thread benches/lua_bench.lua
Comment on lines 306 to 308
for _, s in ipairs(scenarios) do
if filter and s.name ~= filter then goto continue_scenario end
print(string.format("=== %s (%d bytes) ===", s.name, #s.payload))
@membphis membphis merged commit 5901f7a into main May 23, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants