Get statistics metadata #2233

Closed
mrocklin opened this issue Feb 29, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@mrocklin

Is it possible to get the statistics metadata on a per-file basis? In particular, I'm looking for the min/max/null_count for each column in each file. This data is available in the JSON files, but as far as I can tell from looking through the docs and poking around the API, it isn't readily available through the Python API. (I'd love to be wrong here, though.)
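For reference, here is a rough sketch of pulling those stats straight out of the JSON commit files under `_delta_log`, using only the standard library. The helper name `read_file_stats` is made up for illustration. It assumes the table is on the local filesystem and that every commit is still present as JSON; Parquet checkpoints are ignored, so it can miss files on tables whose older commits have been cleaned up.

```python
import json
from pathlib import Path


def read_file_stats(table_path):
    """Return {data_file_path: stats_dict} for files still live in the log.

    Hypothetical helper: reads only the JSON commit files and skips
    Parquet checkpoints, so treat the result as best-effort.
    """
    log_dir = Path(table_path) / "_delta_log"
    live = {}
    # Commit file names are zero-padded, so lexical order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                # "stats" is itself a JSON-encoded string and may be absent.
                raw = add.get("stats")
                live[add["path"]] = json.loads(raw) if raw else {}
            elif "remove" in action:
                # A later commit removed this file from the table.
                live.pop(action["remove"]["path"], None)
    return live
```

Each stats dict, when present, carries `numRecords`, `minValues`, `maxValues`, and `nullCount`, so something like `read_file_stats("mytable")[path]["minValues"]` would give the per-column minimums for one file.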

mrocklin added the enhancement label Feb 29, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Feb 29, 2024

I think what you are looking for is under DeltaTable()._table.dataset_partitions(). We use that to construct the stats for each fragment of the PyArrow dataset.

@mrocklin
Author

Oh cool. Thanks for the pointer. That gets me closer. Here's what I'm getting:

import deltalake
t = deltalake.DeltaTable("mytable")
filename, info = t._table.dataset_partitions(t.schema().to_pyarrow())[0]
info
<pyarrow.compute.Expression ((((((((((((((((((date == 2024-01-03) and is_valid(username)) and is_valid(repo)) and is_valid(sha)) and is_valid(message)) and is_valid(created_at)) and (username >= "000600")) and (repo >= "0-vortex/0-vortex")) and (sha >= "00002c6c6353df7b4429ea6f4ca8f7674df6e8a0")) and (message >= "")) and (created_at >= 2024-01-03 04:00:00.000000)) and (date >= null[date32[day]])) and (username <= "zzzzseong")) and (repo <= "zzzzseong/algorithm")) and (sha <= "ffff8a90f31cba8f163fdd723c3451a04a8c39a2")) and (message <= "🩺 Checked PDS Health")) and (created_at <= 2024-01-03 05:59:59.000000)) and (date <= null[date32[day]]))>

So clearly the stuff I want is in there; however, I'm not sure that it's actually programmatically accessible from Python. I guess I could raise this upstream with PyArrow to ask for compute expressions to be more introspectable, but this seems like the wrong path.

Any further thoughts, aside from parsing the string repr?

@sherlockbeard
Contributor

Maybe you are looking for get_add_actions:
https://delta-io.github.io/delta-rs/usage/examining-table/
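A rough sketch of how that could be used to get per-file, per-column stats. With `flatten=True` the add actions come back with columns named like `min.<column>`, `max.<column>`, and `null_count.<column>`. The helper names `stats_by_file` and `load_stats` are made up for illustration, and the `.to_pydict()` call assumes the result is a `pyarrow.RecordBatch`, as it was around the time of this issue; newer releases may return a different Arrow object that needs converting first.

```python
def stats_by_file(flat_actions):
    """Reshape flattened add actions into {path: {column: {stat: value}}}.

    `flat_actions` maps column name -> list of values, one entry per file
    (for example, the output of `RecordBatch.to_pydict()`).
    """
    out = {}
    for i, path in enumerate(flat_actions["path"]):
        per_column = {}
        for name, values in flat_actions.items():
            # Flattened stat columns look like "min.username", "max.sha", ...
            prefix, _, column = name.partition(".")
            if prefix in ("min", "max", "null_count") and column:
                per_column.setdefault(column, {})[prefix] = values[i]
        out[path] = per_column
    return out


def load_stats(table_uri):
    # Imported lazily so the reshaping helper above has no dependencies.
    from deltalake import DeltaTable

    actions = DeltaTable(table_uri).get_add_actions(flatten=True)
    # Assumption: `actions` is a pyarrow.RecordBatch with `to_pydict()`.
    return stats_by_file(actions.to_pydict())
```

With this, `load_stats("mytable")` would give one entry per data file, each holding the min, max, and null count for every column that has stats recorded.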

@mrocklin
Author

Oh cool. Yes, that seems likely to have the information I'm looking for. Thank you, @sherlockbeard!


3 participants