
[C++][Parquet] WriteBatchSpaced writes incorrect value for parquet when input contains NULL list #42969

@asfimport

Description


When trying to write a column of parquet lists, if there is a NULL list, WriteBatchSpaced will either throw an error (case 1 below) or incorrectly write the last value in the last list as the first value from the first list (case 2 below).

CASE 1
Data (3 lists):
[
   "one"
]
null
[
   "two"
]
 
Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

  1. num_values: 3

  2. def_levels: [3, 0, 3]

  3. rep_levels: [0, 0, 0]

  4. valid_bits: 0x05 (bit representation 101)

  5. valid_bits_offset: 0

  6. values: ["one", nullptr, "two"]
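The valid_bits value in the parameter list above can be reproduced from def_levels. The following is a minimal sketch, not part of the Parquet API: BuildValidBits is a hypothetical helper, and it assumes that every level entry gets one slot in values (as in this report) and that bits are packed least-significant bit first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Parquet function): packs one validity bit per
// level entry, least-significant bit first. A slot is marked valid only when
// its definition level equals max_def_level.
// Assumption: every entry gets a slot in the values array, as in this report.
std::vector<uint8_t> BuildValidBits(const std::vector<int16_t>& def_levels,
                                    int16_t max_def_level) {
  std::vector<uint8_t> bits((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bits;
}
```

With def_levels [3, 0, 3] and a maximum definition level of 3, the first byte is 0x05 (bits 101), which matches the valid_bits listed for Case 1.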

When I call WriteBatchSpaced(num_values, def_levels, rep_levels, valid_bits, valid_bits_offset, values), I get an error when running parquet-tools on the output Parquet file:

Additionally, if I add another list to the data, the last element of that additional list is incorrectly written with the value of the first element of the first list. See below.

CASE 2
Data (4 lists):
[
   "one"
]
null
[
   "two"
]
[
   "three",
   "four"
]

Parameters to TypedColumnWriter<PhysicalType<parquet::Type::BYTE_ARRAY>>::WriteBatchSpaced:

  1. num_values: 5

  2. def_levels: [3, 0, 3, 3, 3]

  3. rep_levels: [0, 0, 0, 0, 1]

  4. valid_bits: 0x1D (bit representation 11101)

  5. valid_bits_offset: 0

  6. values: ["one", nullptr, "two", "three", "four"]
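The def_levels and rep_levels above can be derived from the nested data. The sketch below is illustrative only: the issue does not show the schema, so it assumes the standard three-level LIST encoding with an optional list and an optional element (maximum definition level 3, maximum repetition level 1). ComputeLevels, Levels, Element, and List are hypothetical names, not Parquet types.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not part of the Parquet API.
using Element = std::optional<std::string>;
using List = std::optional<std::vector<Element>>;

struct Levels {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
};

// Assumed schema: optional list of optional strings.
// Definition levels: 0 = null list, 1 = empty list,
//                    2 = null element, 3 = present element.
// Repetition levels: 0 = first entry of a row, 1 = later entry in the same list.
Levels ComputeLevels(const std::vector<List>& rows) {
  Levels out;
  for (const List& row : rows) {
    if (!row.has_value()) {  // null list: one entry at def level 0
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
    } else if (row->empty()) {  // empty list: one entry at def level 1
      out.def_levels.push_back(1);
      out.rep_levels.push_back(0);
    } else {
      for (std::size_t i = 0; i < row->size(); ++i) {
        out.def_levels.push_back(
            static_cast<int16_t>((*row)[i].has_value() ? 3 : 2));
        out.rep_levels.push_back(static_cast<int16_t>(i == 0 ? 0 : 1));
      }
    }
  }
  return out;
}
```

For the four lists in Case 2, this produces def_levels [3, 0, 3, 3, 3] and rep_levels [0, 0, 0, 0, 1], the same values passed to WriteBatchSpaced in this report.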

Output Parquet file:

Here we see that the "four" in the last list actually shows up as "one".

Reporter: Ruta Dhaneshwar

Original Issue Attachments:

Note: This issue was originally created as PARQUET-1936. Please see the migration documentation for further details.
