Description
When you have a hive-style partitioned dataset, with our current dataset(..) API, it's relatively easy to mess up the inferred partitioning and get confusing results.
For example, if you specify the partitioning field names with partitioning=[...] (which is not needed for hive style, since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" string as the data value for the field.
Doing a filter then gives a confusing empty result (because "value" doesn't match "key=value").
I am wondering if we can't relatively cheaply detect this case and, e.g., give an informative warning to the user.
Basically what happens is this:
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>
If the parsed value is a string that contains a "=" (and in this case also contains the field name), that is I think a clear sign that (in the large majority of cases) the user is doing something wrong.
I am not fully sure where and at what stage the check could be done though. Doing it for every path in the dataset might be too costly.
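As a rough illustration of what such a check could look like (a hypothetical helper, not an existing pyarrow function), the heuristic could flag parsed string values that themselves start with "<field name>=":

```python
def looks_like_hive_misparse(field_name: str, value: object) -> bool:
    # Hypothetical check: a partition value of the form "part=a" for a
    # field named "part" almost certainly means a hive-style path was
    # parsed with DirectoryPartitioning instead of HivePartitioning.
    return isinstance(value, str) and value.startswith(field_name + "=")

# Value that DirectoryPartitioning extracts from a hive-style path:
print(looks_like_hive_misparse("part", "part=a"))  # True -> warn the user
# A genuinely directory-partitioned value:
print(looks_like_hive_misparse("part", "a"))       # False
```

Running such a check once on the first discovered path, rather than on every path in the dataset, would keep the cost negligible.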
Illustrative code example:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pathlib
## constructing a small dataset with 1 hive-style partitioning level
basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)
(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)
table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")
table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")

Reading as-is (not specifying a partitioning, so it defaults to no partitioning) will at least give an error about a missing field:
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64

But specifying the partitioning field name (which currently gets silently interpreted as directory partitioning) gives a confusing empty result:
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []

This filter doesn't work because the values in the "part" column are not "a" but "part=a":
>>> dataset.to_table().to_pandas()
   a  b    part
0  1  1  part=a
1  2  2  part=a
2  3  3  part=a
3  4  1  part=b
4  5  2  part=b
5  6  3  part=b

Reporter: Joris Van den Bossche / @jorisvandenbossche
Related issues:
Note: This issue was originally created as ARROW-15310. Please see the migration documentation for further details.