Describe the bug, including details regarding any error messages, version, and platform.
>>> import pyarrow as pa
>>> import pyarrow.dataset
>>> import pyarrow.parquet
>>> null_table = pa.Table.from_pydict({"A": ["", None], "B": [None, ""], "C": ["E", "F"]})
>>> partitioning = pa.dataset.FilenamePartitioning(pa.schema({"A": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir\filename", partitioning=partitioning)
>>> pa.parquet.read_table(r"dir\filename", partitioning=partitioning)
pyarrow.Table
A: string
----
A: []
>>>
Null does not round-trip when it is one of the partition values and either filename or directory partitioning is used.
When mixing null and empty string with directory partitioning, it also doesn't roundtrip:
>>> partitioning = pa.dataset.DirectoryPartitioning(pa.schema({"A": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir\directory", partitioning=partitioning)
>>> pa.parquet.read_table(r"dir\directory", partitioning=partitioning)
pyarrow.Table
B: string
C: string
A: string
----
B: [[null],[""]]
C: [["E"],["F"]]
A: [[null],[null]]
>>>
I would expect A to be:
A: [[""],[null]]
That is what the original table had. Also, with two partition columns, an error is raised when the first one has a null value, but with only one partition column no error is raised, which seems inconsistent:
>>> partitioning = pa.dataset.DirectoryPartitioning(pa.schema({"A": pa.string(), "B": pa.string()}))
>>> pa.parquet.write_to_dataset(null_table, r"dir", partitioning=partitioning)
pyarrow.lib.ArrowInvalid: No partition key for A but a key was provided subsequently for B.
>>>
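One possible way to picture these three observations is a purely path-based encoding, sketched below in plain Python. This is a toy model under my own assumptions, not pyarrow's implementation: `format_segments`, `parse_path`, and `roundtrip` are hypothetical names, and the real cause may differ.

```python
# Toy model only: NOT pyarrow's code. It mimics a path-based encoding to
# show why null and "" are hard to tell apart once written to disk.

def format_segments(values):
    """Encode partition values (in schema order) as directory segments."""
    # Trailing nulls are dropped, so a lone null produces no segment at all.
    while values and values[-1] is None:
        values = values[:-1]
    if None in values:
        # A null followed by a real key would leave a hole in the path.
        raise ValueError("no partition key before a later key")
    return list(values)


def parse_path(path, num_fields):
    """Decode a relative directory path back into partition values."""
    # Empty segments vanish when splitting, much like "a//b" acting as "a/b".
    segments = [s for s in path.split("/") if s]
    # Fields with no segment are read back as null.
    return segments + [None] * (num_fields - len(segments))


def roundtrip(values):
    """Write then read a single row's partition values in the toy model."""
    path = "/".join(format_segments(list(values)))
    return parse_path(path, len(values))
```

In this model `roundtrip([""])` and `roundtrip([None])` both come back as `[None]`, while `roundtrip([None, "b"])` raises, which has the same shape as the behaviour reported above.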
What is the expected behavior when null is in a partition column? Is it expected to work, or should an error always be raised?
Component(s)
C++, Parquet, Python