[Feature Request] optimize `COUNT(*)` on partitioned tables #1916

keen85 · 2023-07-17T14:21:01Z

Feature request

Running the query `SELECT COUNT(*) FROM table WHERE partition_column = 1` should only read Delta log statistics.

@felipepessoto already introduced this feature for `SELECT COUNT(*) FROM table` in Delta 2.2.0 (#1192 / 0c349da8).
I suggest extending it so it also works for partitioned tables when the query filters only on partition columns.

Which Delta project/connector is this regarding?

Overview

Running the query `SELECT COUNT(*) FROM table WHERE partition_column = 1` takes a long time on big tables because Spark scans the Parquet files just to return the number of rows, even though the row count is already available from the Delta log.
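To illustrate the idea, here is a minimal sketch in plain Python (not Delta's actual implementation). It assumes the log is available as JSON action lines in which each `add` action carries `partitionValues` (serialized as strings) and a `stats` JSON string with `numRecords`. The function name `count_from_log` and the sample data are hypothetical, and the sketch ignores checkpoints and deletion vectors.

```python
import json
from typing import Iterable, Optional


def count_from_log(actions: Iterable[str], partition_filter: dict) -> Optional[int]:
    """Sum numRecords over active files whose partition values match the filter.

    Returns None if any matching file has no usable stats, so a caller
    could fall back to scanning the Parquet files.
    """
    # Replay add/remove actions to find the files that are still active.
    active = {}
    for line in actions:
        action = json.loads(line)
        if "add" in action:
            active[action["add"]["path"]] = action["add"]
        elif "remove" in action:
            active.pop(action["remove"]["path"], None)

    total = 0
    for add in active.values():
        values = add.get("partitionValues", {})
        # Skip files that belong to other partitions.
        if any(values.get(col) != val for col, val in partition_filter.items()):
            continue
        stats = add.get("stats")
        if not stats:
            return None
        num_records = json.loads(stats).get("numRecords")
        if num_records is None:
            return None
        total += num_records
    return total


def _add(path: str, partition: str, rows: int) -> str:
    """Build a hypothetical `add` action line for the example."""
    return json.dumps({"add": {
        "path": path,
        "partitionValues": {"partition_column": partition},
        "stats": json.dumps({"numRecords": rows}),
    }})


log_lines = [
    _add("partition_column=1/a.parquet", "1", 10),
    _add("partition_column=1/b.parquet", "1", 5),
    _add("partition_column=2/c.parquet", "2", 7),
    json.dumps({"remove": {"path": "partition_column=1/a.parquet"}}),
    _add("partition_column=1/d.parquet", "1", 3),
]

result = count_from_log(log_lines, {"partition_column": "1"})
print(result)
```

Because the filter touches only partition columns, every active file is either fully included or fully excluded, which is why per-file `numRecords` is enough and no data file has to be opened.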

Motivation

Significant performance improvement.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.

felipepessoto · 2023-07-17T17:43:37Z

I'd like to finish #1525 first, as the changes would conflict.

geoffrey-hashflow · 2023-11-02T22:54:18Z

A related query that is also slow:
`SELECT partition_column, COUNT(*) FROM table GROUP BY partition_column`

I imagine this could be optimized using similar metadata.
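As a rough sketch of how the same metadata could answer the grouped query, the Python below aggregates per-file `numRecords` by partition value. It assumes the active `add` actions have already been resolved from the log; `counts_by_partition` and the sample entries are hypothetical names for illustration, not part of Delta.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def counts_by_partition(active_adds: Iterable[dict], column: str) -> Optional[Counter]:
    """Group per-file numRecords by the value of one partition column.

    Returns None if any file lacks stats, signalling that a real scan is needed.
    """
    counts = Counter()
    for add in active_adds:
        stats = add.get("stats")
        if not stats:
            return None
        num_records = json.loads(stats).get("numRecords")
        if num_records is None:
            return None
        # Partition values are stored as strings in the log.
        counts[add["partitionValues"][column]] += num_records
    return counts


# Hypothetical active files of a table partitioned by partition_column.
active_adds = [
    {"partitionValues": {"partition_column": "1"}, "stats": json.dumps({"numRecords": 5})},
    {"partitionValues": {"partition_column": "1"}, "stats": json.dumps({"numRecords": 3})},
    {"partitionValues": {"partition_column": "2"}, "stats": json.dumps({"numRecords": 7})},
]

grouped = counts_by_partition(active_adds, "partition_column")
print(dict(grouped))
```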

zzl-7 · 2024-04-05T07:17:04Z

Hi @felipepessoto, are you currently working on this feature? If not, can I take a stab at it? :)

felipepessoto · 2024-04-05T07:34:37Z

I’m not. Feel free. Thanks

keen85 added the enhancement New feature or request label Jul 17, 2023

felipepessoto mentioned this issue Sep 22, 2023

Optimize Min/Max using Delta metadata #1525

Closed

felipepessoto mentioned this issue Jan 31, 2024

[Feature Request][Spark][WIP] Metadata only queries - Umbrella issue #2589

Open

8 tasks

7mming7 linked a pull request Jul 9, 2024 that will close this issue

[SPARK] Optimize : SELECT COUNT(*) FROM Table WHERE partitition=1 #3345

Open

5 tasks
