[Python] Introspect partition keys and values in fragments. #33825

Closed
coady opened this issue Jan 21, 2023 · 5 comments · Fixed by #33862

Comments

@coady

coady commented Jan 21, 2023

Describe the enhancement requested

It's not possible to programmatically determine the values of partition keys in a fragment. Fragments have a partition_expression attribute, but the Expression type doesn't allow any further introspection. I don't want to have to parse the string representation of the expression.

In []: dataset.partitioning.schema
Out[]: 
year: int32
month: int32

In []: fragment = next(dataset.get_fragments())

In []: fragment.partition_expression
Out[]: <pyarrow.compute.Expression ((year == 2013) and (month == 1))>

My broader use case is more performant (speed and memory) aggregation of partitioned data. Using pc._group_by requires loaded arrays, so it ignores that the data is already partitioned. And iterating get_fragments is crippled if one can't identify the fragment.

Component(s)

Python

@lidavidm
Member

There is _get_partition_keys which is private, but has been kept around. I think there were past discussions about promoting it to a public API.

@westonpace
Member

That's an interesting broader use case. Can you add a new GH issue for the broader ask of a simpler way to do aggregations on partitions?

@jorisvandenbossche
Member

> There is _get_partition_keys which is private, but has been kept around. I think there were past discussions about promoting it to a public API.

Yes, we should probably do that (we did the same for pyarrow.parquet.filters_to_expression). I think we can re-scope this issue to do that.

Also @coady feel free to use the current "private" method already. It's private because it was thought not to be really user-facing, but we know it is used (e.g. dask uses it as well), so we promise some stability for it.

@jorisvandenbossche
Member

Something related that might be worth mentioning (though it doesn't solve your exact use case here): there is also dataset.partitioning.dictionaries if you want to inspect all possible values that a given partitioning field can take across all fragments.

@jorisvandenbossche
Member

Opened #33862 to make it public.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Apr 5, 2023
@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Apr 5, 2023
jorisvandenbossche added a commit that referenced this issue Apr 5, 2023
… (get key/value from partition expression) (#33862)

#### Rationale for this change

We have an existing "semi-private" `pyarrow.dataset._get_partition_keys` function (to get the partitioning field's key/value from the partition expression of a certain fragment). This is used by external projects (eg dask), and generally useful for advanced users, so let's just make it public. 

* Closes: #33825

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…blicly (get key/value from partition expression) (apache#33862)

rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…blicly (get key/value from partition expression) (apache#33862)
