
Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 #1377

Closed
wants to merge 13 commits into from

Conversation

felipepessoto
Contributor

Description

Running the query "SELECT COUNT(*) FROM Table" takes a long time on big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is already available in the Delta log.

Resolves #1192
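The idea behind the optimization can be sketched outside Spark: each AddFile action in the Delta log carries a stats JSON with a numRecords field, so COUNT(*) is just the sum over the table's live files. A minimal Python illustration follows (the actual PR implements this as a Scala optimizer rule; the dict shape here is a simplification of AddFile):

```python
import json

def count_from_stats(add_files):
    """Sum numRecords from each AddFile's stats JSON.

    Returns None if any file is missing the statistic, in which case
    the optimization must fall back to scanning the Parquet files.
    """
    total = 0
    for add in add_files:
        stats = add.get("stats")
        if not stats:
            return None
        num = json.loads(stats).get("numRecords")
        if num is None:
            return None
        total += num
    return total

files = [
    {"path": "part-0.parquet", "stats": '{"numRecords": 100}'},
    {"path": "part-1.parquet", "stats": '{"numRecords": 42}'},
]
print(count_from_stats(files))  # 142
```

Note the all-or-nothing behavior: a single file without stats disables the shortcut, which is exactly the limitation discussed in the review below.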

How was this patch tested?

Created unit tests to validate the optimization works, including cases not covered by this optimization.

Does this PR introduce any user-facing changes?

No, only a performance improvement.

@felipepessoto felipepessoto changed the title Optimize common case: SELECT COUNT(*) FROM Table Resolves #1192 Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 Sep 12, 2022
@sezruby
Contributor

sezruby commented Sep 14, 2022

Ref) The feature may need to be revisited while delivering deletion vector #1367

Member

@zsxwing zsxwing left a comment


This is a great improvement. A high-level question: currently, if any file is missing numRecords, we skip the optimization. I'm wondering if we can split the files into two types:

  • files containing numRecords — we still apply the optimization.
  • files not containing numRecords — we read the files to get the result.

Finally, we can sum the results of the two steps.
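The two-bucket idea above can be sketched as follows (Python illustration; the field names are simplified from the AddFile stats, and scan_count is a stand-in for actually reading the stats-less Parquet files):

```python
def split_and_count(add_files, scan_count):
    """Combine log statistics with a fallback scan.

    Files whose stats carry numRecords are counted from the log;
    the rest are handed to scan_count, which represents a real
    Parquet scan of those files.
    """
    with_stats = [f for f in add_files if f.get("numRecords") is not None]
    without_stats = [f for f in add_files if f.get("numRecords") is None]
    from_log = sum(f["numRecords"] for f in with_stats)
    return from_log + scan_count(without_stats)

files = [
    {"path": "a.parquet", "numRecords": 50},
    {"path": "b.parquet"},  # missing stats
    {"path": "c.parquet", "numRecords": 7},
]
# Pretend a scan of the one stats-less file returns 3 rows.
total = split_and_count(files, lambda fs: 3 * len(fs))
print(total)  # 60
```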

@zsxwing
Member

zsxwing commented Sep 16, 2022

Ref) The feature may need to be revisited while delivering deletion vector #1367

Good point. Deletion Vector will introduce the following concept to solve this problem ( #1372 ):

In the presence of Deletion Vectors the statistics may be somewhat outdated, i.e. not reflecting deleted rows yet. The flag stats.tightBounds indicates whether we have tight bounds (i.e. the min/maxValue exists[^1] in the valid state of the file) or wide bounds (i.e. the minValue is <= all valid values in the file, and the maxValue >= all valid values in the file). These upper/lower bounds are sufficient information for data skipping.
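Under that model, the gate for this optimization could look like the following sketch (hypothetical dict-based stats for illustration; the authoritative field definitions are in #1372):

```python
def usable_num_records(stats):
    # numRecords is exact for COUNT(*) only with tight bounds, i.e.
    # when the stats reflect the file's valid (non-deleted) rows.
    # With wide bounds (e.g. a deletion vector was added without
    # recomputing stats), numRecords may still include deleted rows.
    if stats.get("tightBounds", True):
        return stats.get("numRecords")
    return None

print(usable_num_records({"numRecords": 10, "tightBounds": True}))   # 10
print(usable_num_records({"numRecords": 10, "tightBounds": False}))  # None
```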

@felipepessoto
Contributor Author

This is a great improvement. A high-level question: currently, if any file is missing numRecords, we skip the optimization. I'm wondering if we can split the files into two types:

  • files containing numRecords — we still apply the optimization.
  • files not containing numRecords — we read the files to get the result.

Finally, we can sum the results of the two steps.

I'm not sure that is possible, considering how Delta and Spark interact. In that case, would we read the Parquet data files during the plan rewrite?
It could make things too complicated, and the trade-off may not be worth it, especially if we later decide to implement GROUP BY on partition columns and other aggregations like MIN/MAX.

In my opinion, it would be better to recommend that users recompute stats if they are missing.

@Kimahriman
Contributor

Are there any plans to implement a v2 reader that has some of this type of capability more directly vs adding a bunch of custom analyzer/optimizer rules?

@felipepessoto
Contributor Author

@Kimahriman, I don't have any information about v2 reader plans. Could you please provide more details on how it would help?

@Kimahriman
Contributor

It was more a question for the maintainers; this seems like essentially what SupportsPushDownAggregates was created for.

@zsxwing
Member

zsxwing commented Sep 20, 2022

Are there any plans to implement a v2 reader that has some of this type of capability more directly vs adding a bunch of custom analyzer/optimizer rules?

No plan right now. It would be a huge effort, as we would need to rewrite a lot of code. TBH, I haven't looked at whether the v2 APIs are sufficient for Delta today. We may need to make more changes to Spark in order to do that (such as supporting generated columns and check constraints).

@zsxwing
Member

zsxwing commented Sep 20, 2022

I'm not sure if it is possible considering how Delta and Spark interact. In that case would we read the parquet data files during the plan rewrite?

I was thinking that we create two logical plans: one that uses a file index returning the files that don't have stats, and the other being the new one you create in this PR. Then we can just union them. But I totally agree that this is complicated, and we can still start from the simplest approach first: optimize only if all files contain numRecords.

By the way, just curious: do you think this optimization would give us a better TPC-DS benchmark result?

@felipepessoto
Contributor Author

I don't believe it will help with TPC-DS. I can't promise anything yet because I'm still investigating, but I plan to use Delta stats to do something similar to ANALYZE TABLE, which would probably help with TPC-DS. However, I'm facing issues with some queries: https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17582211.

I'll open other issues/PRs once I have more concrete information.

@felipepessoto
Contributor Author

Hi. Has anybody had a chance to review this?

@tdas tdas requested review from vkorukanti and removed request for moredatapls October 11, 2022 18:09
Member

@zsxwing zsxwing left a comment


Sorry for the delay. Left a few comments.

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
@felipepessoto felipepessoto force-pushed the datafromstats branch 2 times, most recently from abdec84 to 4b31d4a Compare October 15, 2022 09:51
@felipepessoto
Contributor Author

Sorry for the delay. Left a few comments.

Thanks @zsxwing. I believe I have addressed all the comments.

@felipepessoto
Contributor Author

Hi @zsxwing, just checking if you had a chance to validate the changes?
Thanks.

Member

@zsxwing zsxwing left a comment


Sorry for the delay. It looks much better. Left a few minor comments and questions.

@felipepessoto
Contributor Author

@zsxwing I've addressed the new comments. Thanks!

@scottsand-db scottsand-db self-requested a review November 11, 2022 19:07
Member

@zsxwing zsxwing left a comment


Thanks for your patience! Left one minor comment to clean up the code. Otherwise, LGTM.

@felipepessoto
Contributor Author

@zsxwing, done, thanks.

Member

@zsxwing zsxwing left a comment


LGTM! Thanks. Will merge this soon.


Successfully merging this pull request may close these issues.

[Feature Request] Optimize common case: SELECT COUNT(*) FROM Table
6 participants