fix(parquet/variant): correct is_large bit position in valueSize for arrays #840

Merged: zeroshade merged 2 commits into apache:main from qzyu999:fix-valuesize-array-islarge on Jun 8, 2026
@qzyu999 (Contributor) commented on Jun 5, 2026

Rationale for this change

Closes #839

The valueSize() function in parquet/variant/utils.go uses (typeInfo >> 4) & 0x1 to check the is_large flag for both objects and arrays. This is correct for objects (where is_large is at bit 4 of the value_header) but incorrect for arrays (where is_large is at bit 2 per the Variant Encoding Spec).
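
As a rough illustration of the two header layouts described above, the following sketch decodes is_large from a full value_metadata byte (basic_type in the low 2 bits, value_header in the upper 6). The function name `isLarge` and the constants are hypothetical and are not taken from parquet/variant/utils.go.

```go
package main

import "fmt"

// Basic type IDs from the Variant Encoding Spec (low 2 bits of value_metadata).
const (
	basicTypeObject = 2
	basicTypeArray  = 3
)

// isLarge is a hypothetical helper: it reads the is_large flag from a
// value_metadata byte, using bit 4 of value_header for objects and
// bit 2 of value_header for arrays.
func isLarge(valueMetadata byte) bool {
	basicType := valueMetadata & 0x3
	valueHeader := valueMetadata >> 2
	switch basicType {
	case basicTypeObject:
		return (valueHeader>>4)&0x1 == 1
	case basicTypeArray:
		return (valueHeader>>2)&0x1 == 1
	}
	return false // primitives and short strings have no is_large flag
}

func main() {
	// Array header: is_large set (bit 2), field_offset_size_minus_one = 1.
	arr := byte((0b000101 << 2) | basicTypeArray)
	// Object header: is_large set (bit 4), all size fields zero.
	obj := byte((0b010000 << 2) | basicTypeObject)
	fmt.Println(isLarge(arr), isLarge(obj))
}
```

Checking bit 4 on the array header above would report is_large as unset, which is the mismatch this PR addresses.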

This causes valueSize() to return an incorrect size for arrays with >255 elements, which can lead to silent data corruption when FinishObject() compacts duplicate keys whose values are large arrays.
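
To see why the wrong bit position changes the computed size, here is a minimal, self-contained sketch that hand-encodes a 300-element array per the Variant Encoding Spec and measures it twice, once reading is_large from bit 2 and once from bit 4. The helpers `buildLargeInt8Array`, `arraySize`, and `readLE` are assumptions made for illustration; they are not the functions in parquet/variant.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildLargeInt8Array hand-encodes n int8 values as a Variant array with
// is_large set (4-byte num_elements) and 2-byte offsets. Hypothetical helper.
func buildLargeInt8Array(n int) []byte {
	const offsetSize = 2
	buf := make([]byte, 1+4+(n+1)*offsetSize+2*n)
	// value_header = is_large (bit 2) | field_offset_size_minus_one (bits 0-1);
	// basic_type 3 (array) sits in the low 2 bits of the byte.
	buf[0] = byte(((1<<2)|(offsetSize-1))<<2 | 3)
	binary.LittleEndian.PutUint32(buf[1:], uint32(n))
	for i := 0; i <= n; i++ {
		binary.LittleEndian.PutUint16(buf[5+i*offsetSize:], uint16(2*i))
	}
	data := buf[5+(n+1)*offsetSize:]
	for i := 0; i < n; i++ {
		data[2*i] = 3 << 2 // primitive header for int8 (type ID 3)
		data[2*i+1] = byte(i)
	}
	return buf
}

// readLE reads a little-endian unsigned integer of the given byte width.
func readLE(b []byte, width int) uint32 {
	var v uint32
	for i := 0; i < width; i++ {
		v |= uint32(b[i]) << (8 * i)
	}
	return v
}

// arraySize computes the encoded length of the array at b[0], reading
// is_large from the given bit of value_header (2 is correct, 4 is the bug).
func arraySize(b []byte, isLargeBit uint) int {
	valueHeader := b[0] >> 2
	offsetSize := int(valueHeader&0x3) + 1
	numElemSize := 1
	if (valueHeader>>isLargeBit)&0x1 == 1 {
		numElemSize = 4
	}
	n := int(readLE(b[1:], numElemSize))
	offsetsStart := 1 + numElemSize
	// The final offset entry holds the total length of the value data.
	dataLen := int(readLE(b[offsetsStart+n*offsetSize:], offsetSize))
	return offsetsStart + (n+1)*offsetSize + dataLen
}

func main() {
	arr := buildLargeInt8Array(300)
	fmt.Println("actual length:    ", len(arr))
	fmt.Println("is_large at bit 2:", arraySize(arr, 2))
	fmt.Println("is_large at bit 4:", arraySize(arr, 4))
}
```

With bit 4, the 4-byte element count is read as a single byte, so both the count and the position of the final offset are wrong. Any caller that copies or skips a value by its computed size would then work with the wrong byte range.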

What changes are included in this PR?

  • parquet/variant/utils.go: Changed (typeInfo >> 4) to (typeInfo >> 2) in the BasicArray case of valueSize(). The object case remains unchanged (it was already correct).

  • parquet/variant/valuesize_test.go (new): Added regression tests:

    • TestValueSizeLargeArray: Builds a 300-element array and verifies valueSize() returns the correct byte count.
    • TestValueSizeLargeObject: Verifies that large objects (>255 fields) still compute correctly after the fix.
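
For the object side of the regression coverage, the check can be sketched as follows: hand-encode an object with 300 fields, confirm that is_large sits at bit 4 of value_header, and compare the computed size with the buffer length. This is not the test code from the PR; `buildLargeObject`, `objectSize`, and `readLE` are hypothetical helpers, and the field IDs are placeholders with no accompanying metadata dictionary.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildLargeObject hand-encodes an object with n int8-valued fields, using
// 2-byte field IDs, 2-byte offsets, and is_large set. Hypothetical helper.
func buildLargeObject(n int) []byte {
	const idSize, offsetSize = 2, 2
	// value_header: is_large (bit 4) | field_id_size_minus_one (bits 2-3)
	// | field_offset_size_minus_one (bits 0-1).
	valueHeader := 1<<4 | (idSize-1)<<2 | (offsetSize - 1)
	buf := make([]byte, 1+4+n*idSize+(n+1)*offsetSize+2*n)
	buf[0] = byte(valueHeader<<2 | 2) // basic_type 2 = object
	binary.LittleEndian.PutUint32(buf[1:], uint32(n))
	ids := buf[5:]
	offs := buf[5+n*idSize:]
	data := buf[5+n*idSize+(n+1)*offsetSize:]
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint16(ids[i*idSize:], uint16(i))
		binary.LittleEndian.PutUint16(offs[i*offsetSize:], uint16(2*i))
		data[2*i] = 3 << 2 // primitive header for int8 (type ID 3)
		data[2*i+1] = byte(i)
	}
	// The final offset entry holds the total length of the value data.
	binary.LittleEndian.PutUint16(offs[n*offsetSize:], uint16(2*n))
	return buf
}

// readLE reads a little-endian unsigned integer of the given byte width.
func readLE(b []byte, width int) uint32 {
	var v uint32
	for i := 0; i < width; i++ {
		v |= uint32(b[i]) << (8 * i)
	}
	return v
}

// objectSize computes the encoded length of the object at b[0], reading
// is_large from bit 4 of value_header.
func objectSize(b []byte) int {
	valueHeader := b[0] >> 2
	offsetSize := int(valueHeader&0x3) + 1
	idSize := int((valueHeader>>2)&0x3) + 1
	numElemSize := 1
	if (valueHeader>>4)&0x1 == 1 {
		numElemSize = 4
	}
	n := int(readLE(b[1:], numElemSize))
	offsetsStart := 1 + numElemSize + n*idSize
	dataLen := int(readLE(b[offsetsStart+n*offsetSize:], offsetSize))
	return offsetsStart + (n+1)*offsetSize + dataLen
}

func main() {
	obj := buildLargeObject(300)
	fmt.Println("is_large at bit 4:", ((obj[0]>>2)>>4)&0x1 == 1)
	fmt.Println("sizes match:", objectSize(obj) == len(obj))
}
```

Asserting exact equality against the buffer length, rather than a lower bound, catches both over- and under-counting.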

Are these changes tested?

Yes. Two new regression tests are included. The full existing test suite passes with no regressions.

Are there any user-facing changes?

No API changes. This is a correctness fix for an internal utility function. Users who previously triggered the bug (by allowing duplicate keys in an object where a duplicated field's value is a large array) will now get correct behavior instead of silent data corruption.

…arrays

The valueSize() function checked (typeInfo >> 4) for the is_large flag
in both the Object and Array cases. While correct for Objects (where
is_large occupies bit 4 of the value_header), this is incorrect for
Arrays where is_large occupies bit 2 of the value_header.

This caused valueSize() to return incorrect sizes for arrays with >255
elements (is_large=true), leading to potential data corruption when
FinishObject() compacts duplicate keys whose values are large arrays.

The fix changes the Array case from (typeInfo >> 4) to (typeInfo >> 2),
matching the correct bit position used in variant.go's Value.Value()
and the arrayHeader() constructor.

Added regression tests for both large arrays and large objects.

Copilot AI left a comment

Pull request overview

This PR fixes a correctness bug in the Parquet Variant encoder/utility logic where valueSize() decoded the is_large flag for arrays from the wrong bit position, which could lead to incorrect size calculations and downstream silent data corruption when compacting duplicate keys containing large arrays.

Changes:

  • Fix valueSize() for BasicArray by reading the is_large flag from bit 2 (array header layout), while leaving object handling unchanged.
  • Add regression tests covering large arrays (>255 elements) and large objects (>255 fields).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File descriptions:

  • parquet/variant/utils.go: Corrects is_large bit decoding for arrays in valueSize() to align with the Variant encoding header layout.
  • parquet/variant/valuesize_test.go: Adds regression tests validating valueSize() behavior for large arrays and (intended) large objects.

Comment thread parquet/variant/valuesize_test.go Outdated
…on test

Verify exact size equality for valueSize() output and validate the is_large flag bit position (bit 4) in the object header.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@zeroshade zeroshade merged commit 8293e4a into apache:main Jun 8, 2026
23 checks passed
@qzyu999 qzyu999 deleted the fix-valuesize-array-islarge branch June 8, 2026 18:26
Linked issue: [Go][Parquet] valuesize() uses incorrect bit position for basicarray large flag

3 participants