
[C++][Parquet] Strange behavior when null is partition value in parquet datasets with filename and directory partitioning #39938

@rohanjain101


Describe the bug, including details regarding any error messages, version, and platform.

>>> import pyarrow as pa
>>> import pyarrow.dataset
>>> import pyarrow.parquet
>>> null_table = pa.Table.from_pydict({"A": ["", None], "B": [None, ""], "C": ["E", "F"]})
>>> partitioning = pa.dataset.FilenamePartitioning(pa.schema({"A": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir\filename", partitioning=partitioning)
>>> pa.parquet.read_table(r"dir\filename", partitioning=partitioning)
pyarrow.Table
A: string
----
A: []
>>>

A null does not round-trip when it is one of the partition values and filename or directory partitioning is used.

When mixing null and empty string with directory partitioning, it also doesn't roundtrip:

>>> partitioning = pa.dataset.DirectoryPartitioning(pa.schema({"A": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir\directory", partitioning=partitioning)
>>> pa.parquet.read_table(r"dir\directory", partitioning=partitioning)
pyarrow.Table
B: string
C: string
A: string
----
B: [[null],[""]]
C: [["E"],["F"]]
A: [[null],[null]]
>>>

I would expect A to be:

A: [[""],[null]]

That is what the original table had. Also, with two partition columns, a null in the first one raises an error, whereas with a single partition column no error is raised, which seems inconsistent:

>>> partitioning = pa.dataset.DirectoryPartitioning(pa.schema({"A": pa.string(), "B": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir", partitioning=partitioning)
pyarrow.lib.ArrowInvalid: No partition key for A but a key was provided subsequently for B.
>>>

What is the expected behavior when null is in a partition column? Is it expected to work, or should an error always be raised?
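
To make the question concrete, here is a toy model in plain Python of how a directory-style partitioning could map values to path segments. It is only a sketch built on my assumptions and is not Arrow's implementation. The helper names `format_directory_segments` and `parse_directory_segments` are hypothetical. The assumed rules are that a null contributes no segment, that an empty string yields an empty segment which disappears when the path is normalized, and that a non-null value after a null is rejected. Under those assumptions, null and "" produce the same path, so a reader cannot tell them apart.

```python
from typing import Dict, List, Optional, Sequence


def format_directory_segments(
    values: Sequence[Optional[str]], names: Sequence[str]
) -> str:
    """Toy model: turn partition values into a relative directory path.

    Assumption (not taken from Arrow's source): a null adds no segment, and a
    non-null value that follows a null is an error because its position in
    the path would be ambiguous.
    """
    segments: List[str] = []
    first_null_name: Optional[str] = None
    for name, value in zip(names, values):
        if value is None:
            if first_null_name is None:
                first_null_name = name
            continue
        if first_null_name is not None:
            raise ValueError(
                f"No partition key for {first_null_name} but a key was "
                f"provided subsequently for {name}."
            )
        segments.append(value)
    # Assumption: empty segments vanish when the path is normalized.
    return "/".join(segment for segment in segments if segment)


def parse_directory_segments(
    path: str, names: Sequence[str]
) -> Dict[str, Optional[str]]:
    """Toy model: recover partition values; missing segments become None."""
    parts = [part for part in path.split("/") if part]
    return {
        name: (parts[i] if i < len(parts) else None)
        for i, name in enumerate(names)
    }
```

In this model, `format_directory_segments([""], ["A"])` and `format_directory_segments([None], ["A"])` both return an empty path, and parsing that path gives `{"A": None}`, which mirrors the `A: [[null],[null]]` result above. With two keys, the row where A is null and B is "" hits the error branch, while a single-key schema never can.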

Component(s)

C++, Parquet, Python
