fix(parquet/variant): correct is_large bit position in valueSize for arrays#840
Merged
Merged
Conversation
…arrays The valueSize() function checked (typeInfo >> 4) for the is_large flag in both the Object and Array cases. While correct for Objects (where is_large occupies bit 4 of the value_header), this is incorrect for Arrays where is_large occupies bit 2 of the value_header. This caused valueSize() to return incorrect sizes for arrays with >255 elements (is_large=true), leading to potential data corruption when FinishObject() compacts duplicate keys whose values are large arrays. The fix changes the Array case from (typeInfo >> 4) to (typeInfo >> 2), matching the correct bit position used in variant.go's Value.Value() and the arrayHeader() constructor. Added regression tests for both large arrays and large objects.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a correctness bug in the Parquet Variant encoder/utility logic where valueSize() decoded the is_large flag for arrays from the wrong bit position, which could lead to incorrect size calculations and downstream silent data corruption when compacting duplicate keys containing large arrays.
Changes:
- Fix
valueSize()forBasicArrayby reading theis_largeflag from bit 2 (array header layout), while leaving object handling unchanged. - Add regression tests covering large arrays (>255 elements) and large objects (>255 fields).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parquet/variant/utils.go | Corrects is_large bit decoding for arrays in valueSize() to align with the Variant encoding header layout. |
| parquet/variant/valuesize_test.go | Adds regression tests validating valueSize() behavior for large arrays and (intended) large objects. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…on test Verify exact size equality for valueSize() output and validate the is_large flag bit position (bit 4) in the object header. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
zeroshade
approved these changes
Jun 8, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Closes #839
The
valueSize()function inparquet/variant/utils.gouses(typeInfo >> 4) & 0x1to check theis_largeflag for both objects and arrays. This is correct for objects (whereis_largeis at bit 4 of the value_header) but incorrect for arrays (whereis_largeis at bit 2 per the Variant Encoding Spec).This causes
valueSize()to return an incorrect size for arrays with >255 elements, which can lead to silent data corruption whenFinishObject()compacts duplicate keys whose values are large arrays.What changes are included in this PR?
parquet/variant/utils.go: Changed(typeInfo >> 4)to(typeInfo >> 2)in theBasicArraycase ofvalueSize(). The object case remains unchanged (it was already correct).parquet/variant/valuesize_test.go(new): Added regression tests:TestValueSizeLargeArray: Builds a 300-element array and verifiesvalueSize()returns the correct byte count.TestValueSizeLargeObject: Verifies that large objects (>255 fields) still compute correctly after the fix.Are these changes tested?
Yes. Two new regression tests are included. The full existing test suite passes with no regressions.
Are there any user-facing changes?
No API changes. This is a correctness fix for an internal utility function. Users who previously triggered the bug (allowed duplicate keys in objects where a field value is a large array) will now get correct behavior instead of silent data corruption.