Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order data by partitions if available #70

Open
mrocklin opened this issue Feb 29, 2024 · 3 comments · Fixed by #71
Open

Order data by partitions if available #70

mrocklin opened this issue Feb 29, 2024 · 3 comments · Fixed by #71

Comments

@mrocklin
Copy link
Contributor

I've stored a bunch of data partitioned by date, and written it to delta using the deltalake package like so:

for df in dfs:
    write_deltalake("mytable", df, partition_by="date")

(although actually this was done in parallel, and so things maybe got written out of order

├── date=2024-01-01
│   ├── part-00001-5c3d1646-6a8b-4511-87f5-3cd0acf1c0e8-c000.zstd.parquet
│   ├── part-00001-869909e1-1079-49db-83d6-77a43d67370a-c000.zstd.parquet
│   └── part-00001-eeb823e8-b9ed-49bd-86d9-9f28f0f444b7-c000.zstd.parquet
└── date=2024-01-02
    ├── part-00001-32e6c973-6d2a-4132-82dd-e6b431cf5343-c000.zstd.parquet
    ├── part-00001-426c2af2-6e86-4cfe-86c9-853f243c35e6-c000.zstd.parquet
    ├── part-00001-6fa10ba1-6b14-4908-b328-f8a8fdaec258-c000.zstd.parquet
    └── part-00001-79cd49d4-dee1-4957-8cdf-86b6f86f95f6-c000.zstd.parquet

When I go to read it I find that the data isn't sorted by partition

df = ddt.read_deltalake("mytable")
df.date.head()
0    2024-01-02
1    2024-01-02
2    2024-01-02
3    2024-01-02
4    2024-01-02
Name: date, dtype: date32[day][pyarrow]
df.date.tail()
1392523    2024-01-01
1392524    2024-01-01
1392525    2024-01-01
1392526    2024-01-01
1392527    2024-01-01
Name: date, dtype: date32[day][pyarrow]

We should order things if we can I think. I propose the following:

  1. If partitions are available, order by partitions
  2. Maybe within that we can look at partition statistics? These are stored within the "stats" attribute of the deltalake metadata
  3. We could also think about setting an index with the partition value.

Probably both the effort and uncertainty increase as we go down that list. The first item seems pretty straightforward to me though.

mrocklin added a commit to mrocklin/dask-deltatable that referenced this issue Feb 29, 2024
@mrocklin mrocklin mentioned this issue Feb 29, 2024
@mrocklin
Copy link
Contributor Author

I'd also like to sort partitions by max value of the partition column, but couldn't find an easy way to get statistics out of the metadata

@mrocklin
Copy link
Contributor Author

The min/max values are available through the deltalake.DeltaTable.get_add_actions API mentioned in this issue: delta-io/delta-rs#2233 (comment)

@mrocklin
Copy link
Contributor Author

mrocklin commented Mar 8, 2024

I'm going to reopen this for now. I think that we could probably do better by looking at the data coming out of the get_add_actions function.

@mrocklin mrocklin reopened this Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant