Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] optimize COUNT(*) on partitioned tables #1916

Open
2 of 8 tasks
keen85 opened this issue Jul 17, 2023 · 4 comments · May be fixed by #3345
Open
2 of 8 tasks

[Feature Request] optimize COUNT(*) on partitioned tables #1916

keen85 opened this issue Jul 17, 2023 · 4 comments · May be fixed by #3345
Labels
enhancement New feature or request

Comments

@keen85
Copy link

keen85 commented Jul 17, 2023

Feature request

Running the query SELECT COUNT(*) FROM table WHERE partition_column = 1 should only read Delta log statistics.

@felipepessoto #1192 / 0c349da8 already introduced this feature for SELECT COUNT(*) FROM table in Delta 2.2.0.
I suggest further improving this feature so it also works for partitioned tables when filtering only on partition columns.

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Running the query SELECT COUNT(*) FROM table WHERE partition_column = 1 takes a lot of time for big tables, Spark scans the parquet files just to return the number of rows. But the row count is already available from Delta Logs.

Motivation

Significant performance improvement.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@keen85 keen85 added the enhancement New feature or request label Jul 17, 2023
@felipepessoto
Copy link
Contributor

I'd like to finish this one first: #1525 as the changes would conflict

@geoffrey-hashflow
Copy link

geoffrey-hashflow commented Nov 2, 2023

A related query that is also slow:
SELECT partition_column, COUNT(*) FROM table GROUP BY partition_column

I imagine this might use similar meta data to optimize.

@zzl-7
Copy link
Contributor

zzl-7 commented Apr 5, 2024

Hi @felipepessoto are you currently working on this feature, if not can I take a stab at it? :)

@felipepessoto
Copy link
Contributor

I’m not. Feel free. Thanks

@7mming7 7mming7 linked a pull request Jul 9, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants