
perf: Add bulkGet64WithBaseline and 8-byte fast path for FixedBitWidthEncoding (#641)

Closed
xiaoxmeng wants to merge 1 commit into facebookincubator:main from xiaoxmeng:export-D99154749


@xiaoxmeng xiaoxmeng commented Apr 5, 2026

Summary:

Adapted from the MRS AusList decode optimization in D98819389 (AusLongListForBitpackEncoder). Ports the key branchless byte-aligned load technique to Nimble's FixedBitWidthEncoding for general use.

Add bulk decode optimizations for 64-bit types in FixedBitWidthEncoding, targeting the selective reader and serializer/deserializer materialize() paths.

Changes:

FixedBitArray: Add bulkGet64WithBaseline() for 64-bit output with arbitrary bitWidth. Three code paths by bit width:

  • bitWidth <= 32: delegates to the optimized template-unrolled 32-bit path (bulkGetWithBaseline32Into64).
  • bitWidth 33-57: branchless byte-aligned loads. Each value is read with one 64-bit load that starts at the byte containing its first bit. Since the sub-byte bit offset (remainder) is at most 7, bitWidth + remainder <= 57 + 7 = 64, so each value fits in that single load with no cross-word boundary branch. This eliminates the branch in the hot loop and enables better instruction-level parallelism.
  • bitWidth > 57: falls back to per-element get() for cross-word handling.
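
For reviewers who want to see the 33-57 path concretely, here is a minimal sketch of the byte-aligned load idea. It is not the code in this PR: `bulkGet64Sketch` and `packSketch` are hypothetical names, and the sketch assumes a little-endian host, LSB-first bit packing, and at least 8 readable bytes from the start byte of the last value (the packer below pads the buffer for that).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical reference packer (illustration only): writes each value as
// `bitWidth` bits, LSB-first, and pads the buffer with 8 zero bytes so that an
// 8-byte load at the last value's start byte stays in bounds.
inline std::vector<uint8_t> packSketch(
    const std::vector<uint64_t>& values, uint32_t bitWidth) {
  std::vector<uint8_t> buf((values.size() * bitWidth + 7) / 8 + 8, 0);
  uint64_t bitPos = 0;
  for (uint64_t v : values) {
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      if ((v >> b) & 1) {
        buf[bitPos >> 3] |= static_cast<uint8_t>(1u << (bitPos & 7));
      }
    }
  }
  return buf;
}

// Hypothetical decoder (illustration only): one unaligned 64-bit load per
// value, then shift by the sub-byte remainder and mask. With bitWidth <= 57
// and remainder <= 7, bitWidth + remainder <= 64, so no second load or
// boundary branch is needed. Assumes a little-endian host.
inline void bulkGet64Sketch(
    const uint8_t* data,
    size_t count,
    uint32_t bitWidth,
    uint64_t baseline,
    uint64_t* out) {
  assert(bitWidth >= 1 && bitWidth <= 57);
  const uint64_t mask = (uint64_t{1} << bitWidth) - 1;
  uint64_t bitPos = 0;
  for (size_t i = 0; i < count; ++i, bitPos += bitWidth) {
    uint64_t word;
    std::memcpy(&word, data + (bitPos >> 3), sizeof(word));
    out[i] = ((word >> (bitPos & 7)) & mask) + baseline;
  }
}
```

The loop body has no data-dependent branch, which is the property the description relies on. Widths above 57 cannot use this shape because a value starting at bit offset 7 would extend past the 64 loaded bits.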

FixedBitWidthEncoding: Extend the selective reader fast path (bulkScan + readWithVisitorFast) from 4-byte-only to also support 8-byte integral types (int64/uint64). Previously, 64-bit columns always used the slow per-element path.
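
As a rough illustration of what widening the fast-path gate from 4-byte-only to 4-byte and 8-byte integral types could look like at compile time, here is a small sketch. The trait name `kBulkScanEligibleSketch` is hypothetical, and the exclusion of `bool` is an assumption of this sketch; the actual condition in the PR may be written differently.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait (illustration only): true for integral types whose size
// is 4 or 8 bytes, i.e. the types the description says can take the bulk path.
template <typename T>
inline constexpr bool kBulkScanEligibleSketch =
    std::is_integral_v<T> && !std::is_same_v<T, bool> &&
    (sizeof(T) == 4 || sizeof(T) == 8);
```

A reader would then select the path with something like `if constexpr (kBulkScanEligibleSketch<T>)`, leaving narrower and non-integral types on the existing per-element path.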

Legacy FixedBitWidthEncoding: Update materialize() to use bulkGet64WithBaseline for 8-byte types.

Differential Revision: D99154749

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 5, 2026

meta-codesync Bot commented Apr 5, 2026

@xiaoxmeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99154749.

@meta-codesync meta-codesync Bot changed the title perf: Add bulkGet64WithBaseline and 8-byte fast path for FixedBitWidthEncoding perf: Add bulkGet64WithBaseline and 8-byte fast path for FixedBitWidthEncoding (#641) Apr 5, 2026
xiaoxmeng added a commit to xiaoxmeng/nimble that referenced this pull request Apr 5, 2026
…hEncoding (facebookincubator#641)


meta-codesync Bot commented Apr 6, 2026

This pull request has been merged in a7f0acd.


Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported Merged meta-exported
