[Go][Parquet] valuesize() uses incorrect bit position for basicarray large flag #839

@qzyu999

Describe the bug, including details regarding any error messages, version, and platform.

Summary:

The valuesize() function in parquet/variant/utils.go checks (typeinfo >> 4) & 0x1 to determine the is_large flag for both objects and arrays. While this is correct for objects, it is incorrect for arrays.

According to the Parquet Variant spec, the array layout shifts the is_large flag to bit position 2 of the value header, rather than bit 4.

Root Cause Analysis

The specification defines different header layouts to optimize space for objects vs. arrays:

Object value_header (6 bits):

Bit Position:  [ 5 ]    [ 4 ]      [ 3   2 ]     [ 1   0 ]
Data Stored:   Unused   is_large   field_id_sz   offset_sz
                          ▲
                 (Correctly checks Bit 4)

Array value_header (6 bits):

Bit Position:  [ 5   4   3 ]    [ 2 ]    [ 1   0 ]
Data Stored:      Unused       is_large   offset_sz
                                  ▲
                         (Should check Bit 2!)
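
To make the two layouts concrete, here is a minimal standalone sketch of reading is_large from each header. The constant and function names are hypothetical and are not taken from parquet/variant. The sketch assumes the value_header is the upper 6 bits of the first value byte and that basic_type 3 denotes an array.

```go
package main

import "fmt"

// Hypothetical names for this sketch only; they are not the library's identifiers.
const (
	basicTypeBits  = 2 // assumed: low 2 bits of the first byte hold basic_type
	objectLargeBit = 4 // is_large position in an object value_header
	arrayLargeBit  = 2 // is_large position in an array value_header
)

// valueHeader returns the upper 6 bits of the first value byte.
func valueHeader(first byte) byte { return first >> basicTypeBits }

// objectIsLarge reads bit 4, which is where objects store is_large.
func objectIsLarge(hdr byte) bool { return (hdr>>objectLargeBit)&0x1 == 1 }

// arrayIsLarge reads bit 2, which is where arrays store is_large.
func arrayIsLarge(hdr byte) bool { return (hdr>>arrayLargeBit)&0x1 == 1 }

func main() {
	// Array with is_large=1 and offset_sz bits 0b01: value_header = 0b000101.
	// The low two bits 0b11 are the assumed basic_type for arrays.
	first := byte(0b000101<<2 | 0b11)
	hdr := valueHeader(first)

	fmt.Println("array check (bit 2): ", arrayIsLarge(hdr))
	fmt.Println("object check (bit 4):", objectIsLarge(hdr))
}
```

Because bits 3 to 5 of an array header are unused, applying the object check to an array header never reports is_large, regardless of how the array was encoded.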

Evidence

The Bug (parquet/variant/utils.go):

   case byte(basicarray):
       var szbytes uint8 = 1
       if ((typeinfo >> 4) & 0x1) != 0 { // ❌ Error: Checks bit 4 instead of bit 2
           szbytes = 4
       }

The Correct Implementation (parquet/variant/variant.go):

   case basicarray:
       valuehdr := (v.value[0] >> basictypebits)
       fieldoffsetsz := (valuehdr & 0b11) + 1
       islarge := ((valuehdr >> 2) & 0b1) == 1 // Correct: Checks bit 2

Impact

This causes valuesize() to return an incorrect size for arrays whose element count is stored in 4 bytes (is_large = true). Since bit 4 of an array header is unused, the check never fires and the count is always read as a single byte. This can lead to silent data corruption or panics during writes and compactions, specifically when FinishObject() compacts duplicate keys whose values happen to be large arrays.
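
The size error can be reproduced with a small standalone sketch. Everything here is hypothetical: arraySize is a stand-in for the array branch of a size calculation, not the actual valuesize() code, and buildLargeArray hand-encodes an array following my reading of the Variant layout (header byte, num_elements, num_elements + 1 offsets, then element data). Elements are assumed to be 1-byte null primitives encoded as 0x00.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readLE reads a little-endian unsigned integer of the given width (1 to 4 bytes).
func readLE(b []byte, width int) uint32 {
	var buf [4]byte
	copy(buf[:], b[:width])
	return binary.LittleEndian.Uint32(buf[:])
}

// arraySize computes the encoded size of an array value. largeBit selects
// which value_header bit is treated as is_large (2 is correct, 4 is the bug).
func arraySize(value []byte, largeBit uint) int {
	hdr := value[0] >> 2
	offsetSz := int(hdr&0b11) + 1
	countSz := 1
	if (hdr>>largeBit)&0x1 != 0 {
		countSz = 4
	}
	n := int(readLE(value[1:], countSz))
	// The last of the n+1 offsets holds the total length of the element data.
	lastOffsetPos := 1 + countSz + n*offsetSz
	dataLen := int(readLE(value[lastOffsetPos:], offsetSz))
	return lastOffsetPos + offsetSz + dataLen
}

// buildLargeArray encodes n one-byte elements with is_large=1 and 2-byte offsets.
func buildLargeArray(n int) []byte {
	v := []byte{0b101<<2 | 0b11} // value_header 0b000101, assumed basic_type 3
	v = binary.LittleEndian.AppendUint32(v, uint32(n))
	for i := 0; i <= n; i++ {
		v = binary.LittleEndian.AppendUint16(v, uint16(i))
	}
	return append(v, make([]byte, n)...) // n elements, each assumed to be 0x00
}

func main() {
	v := buildLargeArray(256)
	fmt.Println("actual length:   ", len(v))
	fmt.Println("size using bit 2:", arraySize(v, 2))
	fmt.Println("size using bit 4:", arraySize(v, 4))
}
```

With 256 elements, the bit-2 reading matches the 775-byte buffer, while the bit-4 reading interprets the low byte of the count as the whole count and reports 5 bytes. A caller that copies or skips a value based on that size would truncate the array.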

Suggested Fix

Update the basicarray case in parquet/variant/utils.go to shift by 2 instead of 4:

case byte(basicarray):
	var szbytes uint8 = 1
	if ((typeinfo >> 2) & 0x1) != 0 { // Fix: Shift by 2 for arrays
		szbytes = 4
	}

Component(s)

Parquet

Labels

Type: bug