Context
Three small scanner tweaks that fit together; all have been on the README's Deferred list since PR #3. Bundled because each is low-impact on its own and they touch the same area.
B1 — shuffle-based structural set check
structural_mask_chunk (src/scan/avx2.rs) currently does 7 × _mm256_cmpeq_epi8 + 7 × _mm256_movemask_epi8 per chunk half, one pair per character in the set {}[]:,". A single _mm256_shuffle_epi8 against a 16-byte LUT plus one cmpeq can do the same set-membership test in 2–3 ops per half.
This only affects non-fast-path chunks. For string-heavy workloads ~5% of chunks hit this path; for object-heavy workloads up to 100%.
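A minimal sketch of one way the shuffle-based test could look, assuming the low-nibble LUT plus OR-with-0x20 fold known from simdjson-style scanners; the names STRUCTURAL_LUT, structural_mask_model, structural_mask_avx2 and structural_mask are hypothetical and not taken from src/scan/avx2.rs. It handles one 32-byte half and includes a scalar model of the per-lane behaviour so the logic can be checked without AVX2. One caveat of this particular fold: the control bytes 0x02, 0x0C and 0x1A also match, so it is only a drop-in replacement if those bytes are rejected or masked elsewhere.

```rust
// Sketch only: all names below are hypothetical, not the crate's actual API.

/// 16-byte LUT indexed by the low nibble of each input byte.
/// Slots 0x2, 0xA, 0xB, 0xC, 0xD hold `"`, `:`, `{`, `,`, `}`; the rest are 0.
const STRUCTURAL_LUT: [u8; 16] = [
    0, 0, b'"', 0, 0, 0, 0, 0, 0, 0, b':', b'{', b',', b'}', 0, 0,
];

/// Scalar model of what each SIMD lane computes; bit i of the result
/// corresponds to byte i of the chunk.
pub fn structural_mask_model(chunk: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        // pshufb semantics: a set high bit zeroes the lane, otherwise the
        // low nibble selects a LUT entry.
        let looked_up = if b & 0x80 != 0 {
            0
        } else {
            STRUCTURAL_LUT[(b & 0x0F) as usize]
        };
        // OR-ing 0x20 folds `[` and `]` onto `{` and `}`, so one LUT slot
        // serves both members of each bracket pair.
        if looked_up == (b | 0x20) {
            mask |= 1 << i;
        }
    }
    mask
}

/// AVX2 version: one shuffle, one OR, one cmpeq, one movemask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn structural_mask_avx2(chunk: &[u8; 32]) -> u32 {
    use std::arch::x86_64::*;
    // The shuffle works per 128-bit lane, so the LUT is broadcast to both lanes.
    let lut128 = _mm_loadu_si128(STRUCTURAL_LUT.as_ptr() as *const __m128i);
    let lut = _mm256_broadcastsi128_si256(lut128);
    let input = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    let looked_up = _mm256_shuffle_epi8(lut, input);
    let folded = _mm256_or_si256(input, _mm256_set1_epi8(0x20));
    let eq = _mm256_cmpeq_epi8(looked_up, folded);
    _mm256_movemask_epi8(eq) as u32
}

/// Uses AVX2 when the CPU supports it, otherwise the scalar model.
pub fn structural_mask(chunk: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked at runtime.
            return unsafe { structural_mask_avx2(chunk) };
        }
    }
    structural_mask_model(chunk)
}
```

If the three control-byte matches are not acceptable, an exact test is possible with separate low-nibble and high-nibble LUTs AND-ed together, at the cost of a second shuffle.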
C1 — adaptive out.reserve
out.reserve(buf.len() / 6) is calibrated for object-heavy JSON. On string-heavy multimodal payloads the actual emit rate is <1 structural per KB, so we over-reserve by 100×+. This is mainly a memory-hygiene concern (untouched mmap'd pages are only faulted in lazily), but a smaller reserve also reduces alloc cost on smaller buffers.
Proposal: start at max(64, buf.len() / 128) and let Vec grow naturally. Standard amortized doubling handles the rare growth case.
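To make the arithmetic concrete, here is a small sketch comparing the two heuristics. The helper names current_reserve, proposed_reserve and collect_structurals are hypothetical, and the collector is a deliberately naive scalar loop that ignores string state; it only illustrates that Vec growth covers the dense case. For a 1 MiB buffer the current rule reserves 174,762 slots and the proposed one 8,192.

```rust
// Sketch only: helper names are hypothetical, not the crate's actual API.

/// Current heuristic: one slot per 6 input bytes.
pub fn current_reserve(buf_len: usize) -> usize {
    buf_len / 6
}

/// Proposed heuristic: one slot per 128 input bytes, with a floor of 64.
pub fn proposed_reserve(buf_len: usize) -> usize {
    (buf_len / 128).max(64)
}

/// Naive scalar stand-in for the scanner (ignores in-string state),
/// used here only to exercise the reserve-then-grow behaviour.
pub fn collect_structurals(buf: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity(proposed_reserve(buf.len()));
    for (i, &b) in buf.iter().enumerate() {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"') {
            // Vec doubles its capacity when full, so a dense input only
            // pays a handful of reallocations.
            out.push(i as u32);
        }
    }
    out
}
```

On an input made entirely of structurals, the initial capacity is exceeded and the Vec grows by doubling, which is the rare growth case the proposal relies on.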
C2 — SmallVec for documents < 4 KB
For tiny payloads the indices Vec heap allocation is a meaningful fraction of total parse time. Switch indices to SmallVec<[u32; N]> for some inline N (e.g. 64). Heap alloc only triggers for documents with more than N structurals.
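The inline-then-spill behaviour can be sketched with std only, as below. This is not the smallvec crate's implementation; InlineIndices is a hypothetical stand-in that shows when the heap allocation would happen, and the real change would presumably use SmallVec directly.

```rust
// Sketch only: `InlineIndices` is a hypothetical std-only stand-in for
// SmallVec<[u32; N]>, written to show when the heap allocation happens.

pub struct InlineIndices<const N: usize> {
    inline: [u32; N],       // storage that lives inside the struct itself
    len: usize,             // number of valid entries in `inline`
    heap: Option<Vec<u32>>, // becomes Some once more than N entries are pushed
}

impl<const N: usize> InlineIndices<N> {
    pub fn new() -> Self {
        InlineIndices { inline: [0; N], len: 0, heap: None }
    }

    pub fn push(&mut self, idx: u32) {
        if let Some(heap) = self.heap.as_mut() {
            // Already spilled: behave like a normal Vec.
            heap.push(idx);
        } else if self.len < N {
            // Common case for small documents: no allocation at all.
            self.inline[self.len] = idx;
            self.len += 1;
        } else {
            // Entry N + 1: copy the inline entries to the heap once.
            let mut heap = Vec::with_capacity(2 * N);
            heap.extend_from_slice(&self.inline);
            heap.push(idx);
            self.heap = Some(heap);
        }
    }

    /// True once the indices have moved to the heap.
    pub fn spilled(&self) -> bool {
        self.heap.is_some()
    }

    pub fn as_slice(&self) -> &[u32] {
        match &self.heap {
            Some(heap) => heap.as_slice(),
            None => &self.inline[..self.len],
        }
    }
}
```

With N = 64 the inline buffer adds 256 bytes to the struct, so the choice of N trades stack footprint against how many small documents avoid the allocation.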
Estimated impact
| Change | est. speedup |
|---|---|
| B1 (string-heavy) | ~2–5% |
| B1 (object-heavy) | ~15–25% |
| C1 | <2% across the board |
| C2 (small docs only) | ~10–20% |
Validation plan
- make bench median before/after, per item separately (so each one's contribution is attributable)
- Check Vec regrowth on object-heavy inputs
Notes
- Validation semantics unchanged
- Implement in one PR but with separate commits per item so individual revert is possible