Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 #1377
Ref: this feature may need to be revisited when deletion vectors are delivered (#1367).
This is a great improvement. A high-level question: currently, if any file is missing `numRecords`, we skip the optimization. I'm wondering if we can split the files into two types:
- files containing `numRecords`. We still apply the optimization.
- files not containing `numRecords`. We read the files to get the result.
Finally, we can sum the results from the two steps above.
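The split proposed in this comment can be sketched in a few lines. This is only an illustration, not the PR's implementation: it models each active file as a plain dict shaped like a Delta log `add` action (a `path` plus an optional `stats` JSON string), and `count_rows_in_file` is a hypothetical callback standing in for actually scanning a Parquet file.

```python
import json
from typing import Callable, Iterable, Mapping, Optional


def num_records(add_file: Mapping[str, object]) -> Optional[int]:
    """Return numRecords from the file's stats JSON, or None if it is absent."""
    stats = add_file.get("stats")
    if not stats:
        return None
    return json.loads(str(stats)).get("numRecords")


def hybrid_count(
    add_files: Iterable[Mapping[str, object]],
    count_rows_in_file: Callable[[str], int],  # hypothetical: scans one data file
) -> int:
    """Sum numRecords where available; fall back to scanning the other files."""
    total = 0
    for add_file in add_files:
        n = num_records(add_file)
        if n is not None:
            total += n  # answered from metadata only
        else:
            total += count_rows_in_file(str(add_file["path"]))  # needs a real scan
    return total
```

Only the files without `numRecords` reach the callback, so the cost of the query scales with the number of files that lack stats rather than with the whole table.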
Good point. Deletion Vector will introduce the following concept to solve this problem (#1372):
I'm not sure this is possible, considering how Delta and Spark interact. In that case, would we read the Parquet data files during the plan rewrite? In my opinion, it would be better to recommend that users recompute the stats if they are missing.
Are there any plans to implement a v2 reader that has some of this type of capability more directly, rather than adding a bunch of custom analyzer/optimizer rules?
@Kimahriman, I don't have any information about v2 reader plans. Could you please provide more details on how it would help?
It was more a question for the maintainers; this seems like essentially what SupportsPushDownAggregates was created for.
No plan right now. It would be a huge effort, as we would need to rewrite a lot of code. TBH, I haven't looked at whether the v2 APIs are sufficient for Delta today. We may need to make more changes to Spark in order to do that (such as supporting generated columns and check constraints).
I was thinking that we create two logical plans: one that uses a file index returning the files that don't have stats, and the other being the new one you create in this PR. Then we can just union them. But I totally agree that this is complicated, and we can still start from the simplest one first: optimize only if all files contain By the way, just curious: do you think this optimization would give us a better TPCDS benchmark result?
I don't believe it will help with TPCDS. I can't promise anything because I'm still investigating, but I plan to use Delta stats to do something similar to ANALYZE TABLE. That would probably help with TPCDS, but I'm facing issues with some queries: https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17582211. I'll open another issue/PR once I have more concrete information.
Hi. Has anybody had a chance to review it?
Sorry for the delay. Left a few comments.
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Thanks @zsxwing. I believe I have addressed all the comments.
Hi @zsxwing, just checking whether you've had a chance to validate the changes.
Sorry for the delay. It looks much better. Left a few minor comments and questions.
@zsxwing I've addressed the new comments. Thanks!
Thanks for your patience! Left one minor comment to clean up the code. Otherwise, LGTM.
@zsxwing, done, thanks.
LGTM! Thanks. Will merge this soon.
Description
Running the query "SELECT COUNT(*) FROM Table" takes a long time on big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is already available in the Delta log.
Resolves #1192
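To illustrate the idea, here is a minimal sketch of answering the count from metadata alone. It is not the code in this PR (which is a Spark optimizer rule written in Scala). It assumes the input is the set of `add` actions for the current snapshot, each serialized as one JSON line whose `stats` field is itself a JSON string that may contain `numRecords`. The function name and the all-or-nothing `None` return are illustrative choices that mirror the behaviour discussed in the review: if any file lacks `numRecords`, the optimization is skipped.

```python
import json
from typing import Iterable, Optional


def count_from_add_actions(log_lines: Iterable[str]) -> Optional[int]:
    """Sum numRecords over add actions; return None if any file has no numRecords.

    Assumes log_lines holds only the actions of the current snapshot, so
    removed files have already been filtered out by the caller.
    """
    total = 0
    for line in log_lines:
        add = json.loads(line).get("add")
        if add is None:
            continue  # not an add action (e.g. metaData or commitInfo)
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        n = stats.get("numRecords")
        if n is None:
            return None  # stats missing: the caller must fall back to a full scan
        total += n
    return total
```

Returning `None` rather than a partial sum keeps the result exact: the caller either gets the true row count or knows it has to run the regular scan.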
How was this patch tested?
Created unit tests to validate the optimization works, including cases not covered by this optimization.
Does this PR introduce any user-facing changes?
Only a performance improvement.