
Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 #1377

Closed
wants to merge 13 commits into from

Conversation

felipepessoto
Contributor

Description

Running the query "SELECT COUNT(*) FROM Table" takes a long time on big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is already available in the Delta log.

Resolves #1192
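The idea behind the optimization can be sketched outside Spark: each AddFile action in the Delta log carries a stats JSON with a numRecords field, so COUNT(*) is just the sum over the table's live files. A minimal Python illustration follows (the actual PR implements this as a Scala optimizer rule; the dict shape here is a simplification of AddFile):

```python
import json

def count_from_stats(add_files):
    """Sum numRecords from each AddFile's stats JSON.

    Returns None if any file is missing the statistic, in which case
    the optimization must fall back to scanning the Parquet files.
    """
    total = 0
    for add in add_files:
        stats = add.get("stats")
        if not stats:
            return None
        num = json.loads(stats).get("numRecords")
        if num is None:
            return None
        total += num
    return total

files = [
    {"path": "part-0.parquet", "stats": '{"numRecords": 100}'},
    {"path": "part-1.parquet", "stats": '{"numRecords": 42}'},
]
print(count_from_stats(files))  # 142
```

Note the all-or-nothing behavior: a single file without stats disables the shortcut, which is exactly the limitation discussed in the review below.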

How was this patch tested?

Created unit tests to validate the optimization works, including cases not covered by this optimization.

Does this PR introduce any user-facing changes?

No, only a performance improvement.

@felipepessoto felipepessoto changed the title Optimize common case: SELECT COUNT(*) FROM Table Resolves #1192 Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 Sep 12, 2022
@sezruby
Contributor

sezruby commented Sep 14, 2022

Ref) The feature may need to be revisited while delivering deletion vector #1367

Member

@zsxwing zsxwing left a comment


This is a great improvement. A high-level question: currently, if any file is missing numRecords, we skip the optimization. I'm wondering if we can split the files into two types:

  • files containing numRecords — we still apply the optimization.
  • files not containing numRecords — we read the files to get the result.

Finally, we can sum the results of the two steps.
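The two-bucket idea above can be sketched as follows (Python illustration; the field names are simplified from the AddFile stats, and scan_count is a stand-in for actually reading the stats-less Parquet files):

```python
def split_and_count(add_files, scan_count):
    """Combine log statistics with a fallback scan.

    Files whose stats carry numRecords are counted from the log;
    the rest are handed to scan_count, which represents a real
    Parquet scan of those files.
    """
    with_stats = [f for f in add_files if f.get("numRecords") is not None]
    without_stats = [f for f in add_files if f.get("numRecords") is None]
    from_log = sum(f["numRecords"] for f in with_stats)
    return from_log + scan_count(without_stats)

files = [
    {"path": "a.parquet", "numRecords": 50},
    {"path": "b.parquet"},  # missing stats
    {"path": "c.parquet", "numRecords": 7},
]
# Pretend a scan of the one stats-less file returns 3 rows.
total = split_and_count(files, lambda fs: 3 * len(fs))
print(total)  # 60
```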

@zsxwing
Member

zsxwing commented Sep 16, 2022

Ref) The feature may need to be revisited while delivering deletion vector #1367

Good point. Deletion Vector will introduce the following concept to solve this problem ( #1372 ):

In the presence of Deletion Vectors the statistics may be somewhat outdated, i.e. not reflecting deleted rows yet. The flag stats.tightBounds indicates whether we have tight bounds (i.e. the min/maxValue exists[^1] in the valid state of the file) or wide bounds (i.e. the minValue is <= all valid values in the file, and the maxValue >= all valid values in the file). These upper/lower bounds are sufficient information for data skipping.
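Under that model, the gate for this optimization could look like the following sketch (hypothetical dict-based stats for illustration; the authoritative field definitions are in #1372):

```python
def usable_num_records(stats):
    # numRecords is exact for COUNT(*) only with tight bounds, i.e.
    # when the stats reflect the file's valid (non-deleted) rows.
    # With wide bounds (e.g. a deletion vector was added without
    # recomputing stats), numRecords may still include deleted rows.
    if stats.get("tightBounds", True):
        return stats.get("numRecords")
    return None

print(usable_num_records({"numRecords": 10, "tightBounds": True}))   # 10
print(usable_num_records({"numRecords": 10, "tightBounds": False}))  # None
```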

@felipepessoto
Contributor Author

This is a great improvement. A high-level question: currently, if any file is missing numRecords, we skip the optimization. I'm wondering if we can split the files into two types:

  • files containing numRecords — we still apply the optimization.
  • files not containing numRecords — we read the files to get the result.

Finally, we can sum the results of the two steps.

I'm not sure that is possible, considering how Delta and Spark interact. In that case, would we read the Parquet data files during the plan rewrite?
It could make things too complicated, and the trade-off may not be worth it, especially if we later decide to implement GROUP BY on partition columns and other aggregations like MIN/MAX.

In my opinion, it would be better to recommend that users recompute stats if they are missing.

@Kimahriman
Contributor

Are there any plans to implement a v2 reader that has some of this type of capability more directly vs adding a bunch of custom analyzer/optimizer rules?

@felipepessoto
Contributor Author

@Kimahriman, I don't have any information about v2 reader plans. Could you please provide more details on how it would help?

@Kimahriman
Contributor

It was more a question for the maintainers; this seems like essentially what SupportsPushDownAggregates was created for.

@zsxwing
Member

zsxwing commented Sep 20, 2022

Are there any plans to implement a v2 reader that has some of this type of capability more directly vs adding a bunch of custom analyzer/optimizer rules?

No plan right now. It would be a huge effort, as we would need to rewrite a lot of code. TBH, I haven't looked at whether the v2 APIs are sufficient for Delta today. We may need to make more changes to Spark in order to do that (such as supporting generated columns and check constraints).

@zsxwing
Member

zsxwing commented Sep 20, 2022

I'm not sure if it is possible considering how Delta and Spark interact. In that case would we read the parquet data files during the plan rewrite?

I was thinking that we create two logical plans: one that uses a file index returning the files that don't have stats, and the other being the new one you create in this PR. Then we can just union them. But I totally agree that this is complicated, and we can still start from the simplest approach first: optimize only if all files contain numRecords.

By the way, just curious: do you think this optimization would give us a better TPC-DS benchmark result?

@felipepessoto
Contributor Author

I don't believe it will help with TPC-DS. I can't promise anything yet because I'm still investigating, but I plan to use Delta stats to do something similar to ANALYZE TABLE, which would probably help with TPC-DS. However, I'm facing issues with some queries: https://issues.apache.org/jira/browse/SPARK-39971?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17582211.

I'll open other issues/PRs once I have more concrete information.

@felipepessoto
Contributor Author

Hi. Has anybody had a chance to review this?

@tdas tdas requested review from vkorukanti and removed request for moredatapls October 11, 2022 18:09
Member

@zsxwing zsxwing left a comment


Sorry for the delay. Left a few comments.

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
@felipepessoto felipepessoto force-pushed the datafromstats branch 2 times, most recently from abdec84 to 4b31d4a Compare October 15, 2022 09:51
@felipepessoto
Contributor Author

Sorry for the delay. Left a few comments.

Thanks @zsxwing. I believe I have addressed all the comments.

@felipepessoto
Contributor Author

Hi @zsxwing, just checking if you had a chance to validate the changes?
Thanks.

Member

@zsxwing zsxwing left a comment


Sorry for the delay. It looks much better. Left a few minor comments and questions.

@felipepessoto
Contributor Author

@zsxwing I've addressed the new comments. Thanks!

@scottsand-db scottsand-db self-requested a review November 11, 2022 19:07
Member

@zsxwing zsxwing left a comment


Thanks for your patience! Left one minor comment to clean up the code. Otherwise, LGTM.

@felipepessoto
Contributor Author

@zsxwing, done, thanks.

Member

@zsxwing zsxwing left a comment


LGTM! Thanks. Will merge this soon.


Successfully merging this pull request may close these issues.

[Feature Request] Optimize common case: SELECT COUNT(*) FROM Table
6 participants