Describe the bug, including details regarding any error messages, version, and platform.
Summary:
The valuesize() function in parquet/variant/utils.go checks (typeinfo >> 4) & 0x1 to determine the is_large flag for both objects and arrays. While this is correct for objects, it is incorrect for arrays.
According to the Parquet Variant spec, the array layout shifts the is_large flag to bit position 2 of the value header, rather than bit 4.
Root Cause Analysis
The specification defines different header layouts to optimize space for objects vs. arrays:
Object value_header (6 bits):
    Bit Position:  [   5  ]  [   4    ]  [    3 2    ]  [   1 0   ]
    Data Stored:    Unused    is_large    field_id_sz    offset_sz
                                 ▲
                      (Correctly checks Bit 4)
Array value_header (6 bits):
    Bit Position:  [  5 4 3  ]  [   2    ]  [   1 0   ]
    Data Stored:     Unused      is_large    offset_sz
                                    ▲
                          (Should check Bit 2!)
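The two layouts above can be sketched as a pair of small helpers. This is a minimal illustration, not code from the library: `objectIsLarge`, `arrayIsLarge`, and the `basicTypeBits` constant are hypothetical names, and the only assumptions are that the low 2 bits of the first value byte hold the basic type and that array is basic type 3.

```go
package main

import "fmt"

// basicTypeBits is the width of the basic_type field in the low bits of the
// first value byte. The constant name is illustrative.
const basicTypeBits = 2

// objectIsLarge reads is_large for an object: bit 4 of the 6-bit value_header.
func objectIsLarge(first byte) bool {
	hdr := first >> basicTypeBits
	return (hdr>>4)&0x1 == 1
}

// arrayIsLarge reads is_large for an array: bit 2 of the 6-bit value_header.
func arrayIsLarge(first byte) bool {
	hdr := first >> basicTypeBits
	return (hdr>>2)&0x1 == 1
}

func main() {
	// Array (basic_type = 3) with is_large = 1 and offset_sz bits = 0b00.
	largeArray := byte(0b000100<<basicTypeBits | 0b11)
	fmt.Println(arrayIsLarge(largeArray))  // true: bit 2 is set
	fmt.Println(objectIsLarge(largeArray)) // false: a bit-4 check misses the flag
}
```

Applying the object check to an array header is exactly the mismatch described in this report: the flag is set, but the bit-4 test does not see it.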
Evidence
The Bug (parquet/variant/utils.go):
    case byte(basicarray):
        var szbytes uint8 = 1
        if ((typeinfo >> 4) & 0x1) != 0 { // ❌ Error: Checks bit 4 instead of bit 2
            szbytes = 4
        }
The Correct Implementation (parquet/variant/variant.go):
    case basicarray:
        valuehdr := (v.value[0] >> basictypebits)
        fieldoffsetsz := (valuehdr & 0b11) + 1
        islarge := ((valuehdr >> 2) & 0b1) == 1 // Correct: Checks bit 2
Impact
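This causes valuesize() to return an incorrect size for arrays with is_large = true, where the element count is stored in 4 bytes instead of 1. Because bit 4 is unused in the array header, the check never fires for arrays, so only the first byte of the 4-byte count is read. This can lead to silent data corruption or panics during writes and compactions, specifically when FinishObject() compacts duplicate keys whose values happen to be large arrays.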
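To make the size discrepancy concrete, here is a rough sketch that computes an array's encoded size from the spec layout (header byte, num_elements, num_elements + 1 offsets, then element data). It is not the library's valuesize(); `arraySize` and `buildLargeArray` are hypothetical helpers, and the bit position read as is_large is passed in so both behaviours can be compared on the same bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// arraySize computes the encoded size of an array value. isLargeShift is the
// bit position within value_header that is treated as is_large.
func arraySize(v []byte, isLargeShift uint) int {
	hdr := v[0] >> 2
	offsetSz := int(hdr&0b11) + 1
	numSz := 1
	if (hdr>>isLargeShift)&0x1 != 0 {
		numSz = 4
	}
	// num_elements is a little-endian unsigned integer of numSz bytes.
	var nbuf [4]byte
	copy(nbuf[:], v[1:1+numSz])
	n := int(binary.LittleEndian.Uint32(nbuf[:]))
	// The last of the n+1 offsets is the total length of the element data.
	offsetsStart := 1 + numSz
	lastOff := offsetsStart + n*offsetSz
	var obuf [4]byte
	copy(obuf[:], v[lastOff:lastOff+offsetSz])
	dataLen := int(binary.LittleEndian.Uint32(obuf[:]))
	return offsetsStart + (n+1)*offsetSz + dataLen
}

// buildLargeArray encodes an array of 256 null primitives (one 0x00 byte
// each), which needs is_large = 1 and 2-byte offsets.
func buildLargeArray() []byte {
	const n = 256
	v := []byte{0b000101<<2 | 0b11} // is_large = 1, offset_sz bits = 01, array
	v = binary.LittleEndian.AppendUint32(v, n)
	for i := 0; i <= n; i++ {
		v = binary.LittleEndian.AppendUint16(v, uint16(i))
	}
	return append(v, make([]byte, n)...)
}

func main() {
	v := buildLargeArray()
	fmt.Println(len(v))          // 775
	fmt.Println(arraySize(v, 2)) // 775: is_large read from bit 2
	fmt.Println(arraySize(v, 4)) // 5: is_large read from bit 4
}
```

With the bit-4 check, the count is read as a single byte (0), so the computed size covers only the first few bytes of the value. A caller that copies or skips that many bytes would truncate the array or misalign everything after it.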
Suggested Fix
Update the basicarray case in parquet/variant/utils.go to shift by 2 instead of 4:
    case byte(basicarray):
        var szbytes uint8 = 1
        if ((typeinfo >> 2) & 0x1) != 0 { // Fix: Shift by 2 for arrays
            szbytes = 4
        }
Component(s)
Parquet