perf: Add bulkGet64WithBaseline and 8-byte fast path for FixedBitWidthEncoding (#641)
Closed
xiaoxmeng wants to merge 1 commit into facebookincubator:main
@xiaoxmeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99154749.
xiaoxmeng added a commit to xiaoxmeng/nimble that referenced this pull request on Apr 5, 2026
…hEncoding (facebookincubator#641)

Summary:
Referenced from MRS AusList decode optimization D98819389 (AusLongListForBitpackEncoder). Ports the key branchless byte-aligned load technique to Nimble's FixedBitWidthEncoding for general use.
Add bulk decode optimizations for 64-bit types in FixedBitWidthEncoding, targeting the selective reader and serializer/deserializer materialize() paths.

Changes:
FixedBitArray: Add bulkGet64WithBaseline() for 64-bit output with arbitrary bitWidth. Three code paths by bit width:
- bitWidth <= 32: delegates to the optimized template-unrolled 32-bit path (bulkGetWithBaseline32Into64).
- bitWidth 33-57: branchless byte-aligned loads. Since the sub-byte offset is at most 7, bitWidth + remainder <= 57 + 7 = 64, so each value fits in a single 64-bit load with no cross-word boundary branch. This eliminates the branch in the hot loop and enables better instruction-level parallelism.
- bitWidth > 57: falls back to per-element get() for cross-word handling.
FixedBitWidthEncoding: Extend the selective reader fast path (bulkScan + readWithVisitorFast) from 4-byte-only to also support 8-byte integral types (int64/uint64). Previously, 64-bit columns always used the slow per-element path.
Legacy FixedBitWidthEncoding: Updated materialize() to use bulkGet64WithBaseline for 8-byte types.

Differential Revision: D99154749
136b8c8 to 33752c3
33752c3 to 34c2b18
This pull request has been merged in a7f0acd.
Summary:
Referenced from MRS AusList decode optimization D98819389 (AusLongListForBitpackEncoder). Ports the key branchless byte-aligned load technique to Nimble's FixedBitWidthEncoding for general use.
Add bulk decode optimizations for 64-bit types in FixedBitWidthEncoding, targeting the selective reader and serializer/deserializer materialize() paths.
Changes:
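FixedBitArray: Add bulkGet64WithBaseline() for 64-bit output with arbitrary bitWidth. Three code paths by bit width:
- bitWidth <= 32: delegates to the optimized template-unrolled 32-bit path (bulkGetWithBaseline32Into64).
- bitWidth 33-57: branchless byte-aligned loads. Since the sub-byte offset is at most 7, bitWidth + remainder <= 57 + 7 = 64, so each value fits in a single 64-bit load with no cross-word boundary branch. This eliminates the branch in the hot loop and enables better instruction-level parallelism.
- bitWidth > 57: falls back to per-element get() for cross-word handling.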
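For reviewers unfamiliar with the byte-aligned load trick used for bit widths 33-57, here is a minimal sketch of the idea. It is not the code in this diff: the names bulkGet64Sketch and packSketch are hypothetical, and the sketch assumes LSB-first packing, a little-endian host, and at least 8 readable bytes past the start of every value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical reference packer (assumption, for illustration only):
// writes each value LSB-first, bit by bit, and appends 8 bytes of padding
// so that an 8-byte load from any value's first byte stays in bounds.
inline std::vector<uint8_t> packSketch(
    const std::vector<uint64_t>& values, int bitWidth) {
  std::vector<uint8_t> buf((values.size() * bitWidth + 7) / 8 + 8, 0);
  uint64_t bit = 0;
  for (uint64_t v : values) {
    for (int b = 0; b < bitWidth; ++b, ++bit) {
      if ((v >> b) & 1) {
        buf[bit >> 3] |= static_cast<uint8_t>(1u << (bit & 7));
      }
    }
  }
  return buf;
}

// Hypothetical sketch of the 33-57 bit path. Each value starts at most
// 7 bits into its first byte, so offset + bitWidth <= 7 + 57 = 64 and the
// whole value is covered by one unaligned 64-bit load from that byte.
inline void bulkGet64Sketch(
    const uint8_t* data,
    uint64_t startIndex,
    size_t count,
    int bitWidth,
    uint64_t baseline,
    uint64_t* out) {
  assert(bitWidth >= 33 && bitWidth <= 57);
  const uint64_t mask = (uint64_t{1} << bitWidth) - 1;
  uint64_t bitOffset = startIndex * static_cast<uint64_t>(bitWidth);
  for (size_t i = 0; i < count; ++i, bitOffset += bitWidth) {
    uint64_t word;
    // memcpy expresses an unaligned load without undefined behavior.
    std::memcpy(&word, data + (bitOffset >> 3), sizeof(word));
    // Shift out the sub-byte offset (0..7), mask to bitWidth, add baseline.
    out[i] = ((word >> (bitOffset & 7)) & mask) + baseline;
  }
}
```

The loop body has no data-dependent branch, which is the property the 33-57 path relies on. Above 57 bits the shifted value could spill past bit 63 of the loaded word, which is why that range needs separate cross-word handling.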
FixedBitWidthEncoding: Extend the selective reader fast path (bulkScan + readWithVisitorFast) from 4-byte-only to also support 8-byte integral types (int64/uint64). Previously, 64-bit columns always used the slow per-element path.
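The widened eligibility check can be pictured as a compile-time predicate over the value type. The sketch below is an assumption about the shape of such a check, not the condition used in this diff; the name kBulkScanEligibleSketch is hypothetical.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait (assumption): the fast path applies to integral types
// that are 4 or 8 bytes wide. Before this change only the 4-byte case
// would have qualified.
template <typename T>
inline constexpr bool kBulkScanEligibleSketch =
    std::is_integral_v<T> && (sizeof(T) == 4 || sizeof(T) == 8);

static_assert(kBulkScanEligibleSketch<int32_t>);
static_assert(kBulkScanEligibleSketch<uint64_t>);
static_assert(!kBulkScanEligibleSketch<int16_t>);
static_assert(!kBulkScanEligibleSketch<double>);
```

Evaluating the predicate at compile time lets a reader select the fast or per-element path with if constexpr, so ineligible types pay no runtime cost for the check.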
Legacy FixedBitWidthEncoding: Update materialize() to use bulkGet64WithBaseline() for 8-byte types.
Differential Revision: D99154749