Description
When writing a column of Parquet lists that contains a NULL list, WriteBatchSpaced either produces a file that cannot be read back (Case 1 below) or writes a wrong value: the last value of the last list comes out as the first value of the first list (Case 2 below).
CASE 1
Data (3 lists):
[
"one"
]
null
[
"two"
]
Parameters to `TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced`:
- num_values: 3
- def_levels: [3, 0, 3]
- rep_levels: [0, 0, 0]
- valid_bits: 0x05 (bit representation 101)
- valid_bits_offset: 0
- values: ["one", nullptr, "two"]
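For reference, the valid_bits value above can be derived from def_levels. The following is a minimal sketch, not part of the Parquet C++ API: `BuildValidBits` is a hypothetical helper, and it assumes one validity bit per level entry (as in this report), least-significant bit first, with a bit set only when the definition level equals the maximum definition level (3 here).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Parquet API): packs one validity bit per level
// entry, least-significant bit first. Bit i is set only when def_levels[i]
// reaches max_def_level, i.e. when slot i holds a real value.
std::vector<uint8_t> BuildValidBits(const std::vector<int16_t>& def_levels,
                                    int16_t max_def_level) {
  std::vector<uint8_t> bits((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

Under these assumptions, `BuildValidBits({3, 0, 3}, 3)` returns the single byte 0x05, matching the valid_bits passed in Case 1.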
When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values) with these arguments, running parquet-tools on the resulting Parquet file fails with an error.
Additionally, if I append another list to the data, the last element of that list is incorrectly written as the first element of the first list, as shown in Case 2.
CASE 2
Data (4 lists):
[
"one"
]
null
[
"two"
]
[
"three",
"four"
]
Parameters to `TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced`:
- num_values: 5
- def_levels: [3, 0, 3, 3, 3]
- rep_levels: [0, 0, 0, 0, 1]
- valid_bits: 0x1D (decimal 29, bit representation 11101)
- valid_bits_offset: 0
- values: ["one", nullptr, "two", "three", "four"]
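To make the intended result explicit, the sketch below rebuilds the lists from the levels and spaced values given above. It is only an illustration under stated assumptions: `Row` and `AssembleLists` are hypothetical names, the maximum definition level is taken to be 3, definition level 0 is taken to mean a null list, and `values` is assumed to hold one slot per level entry, as in this report.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row type: either a null list or a list of strings.
struct Row {
  bool is_null;
  std::vector<std::string> items;
};

// Hypothetical helper: rebuilds rows from definition/repetition levels.
// Assumes def level 0 = null list and def level 3 = present element;
// entries at intermediate levels add no element, to keep the sketch short.
std::vector<Row> AssembleLists(const std::vector<int16_t>& def_levels,
                               const std::vector<int16_t>& rep_levels,
                               const std::vector<std::string>& values) {
  const int16_t kMaxDefLevel = 3;
  std::vector<Row> rows;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // Repetition level 0 starts a new row (a new top-level list).
      Row row;
      row.is_null = (def_levels[i] == 0);
      rows.push_back(row);
    }
    if (def_levels[i] == kMaxDefLevel && !rows.empty() &&
        !rows.back().is_null) {
      // A fully defined entry contributes its value to the current list.
      rows.back().items.push_back(values[i]);
    }
  }
  return rows;
}
```

For the Case 2 input (with an empty string standing in for the nullptr slot), this yields four rows: ["one"], null, ["two"], and ["three", "four"], which is the layout the written file was expected to contain.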
Outputted Parquet File:
Here we see that the "four" in the last list actually shows up as "one".
Reporter: Ruta Dhaneshwar
Note: This issue was originally created as PARQUET-1936. Please see the migration documentation for further details.
