[SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows #38745

Closed
wants to merge 9 commits into from

beliefer
Contributor

@beliefer beliefer commented Nov 21, 2022

What changes were proposed in this pull request?

Sometimes a filter condition compares the result of a rank-like window function (e.g. row_number, rank, dense_rank) with a number. For example:

SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER(PARTITION BY key ORDER BY a) AS rn
      FROM Tab1)
WHERE rn <= 5

We can extract the limit value 5 for each window group and, inside WindowExec, skip the rows of a group that fall beyond that limit.
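To illustrate the idea of skipping rows per window group, here is a minimal sketch in plain Scala. It is not the code of this PR and does not use any Spark API: `GroupLimitIterator` and `limitPerGroup` are hypothetical names, and the input is assumed to be already sorted by partition key and then by the ordering column, as the input of a window operator would be. It only covers the row_number case.

```scala
object GroupLimitIterator {
  // Hypothetical helper: keeps at most `limit` rows per key.
  // Assumes `sorted` is ordered by key first, then by the ORDER BY column.
  def limitPerGroup[K, V](sorted: Iterator[(K, V)], limit: Int): Iterator[(K, V)] = {
    var currentKey: Option[K] = None // key of the group being read
    var seen = 0                     // rows seen so far in that group
    sorted.filter { case (key, _) =>
      if (!currentKey.contains(key)) {
        // A new group starts: reset the per-group counter.
        currentKey = Some(key)
        seen = 0
      }
      seen += 1
      seen <= limit // rows beyond the limit are skipped
    }
  }
}
```

With a limit of 5, only the first five rows of each key survive, which is what `rn <= 5` would keep for row_number.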

In short, it supports the following pattern:

SELECT (... (row_number|rank|dense_rank)()
    OVER (
        PARTITION BY ...
        ORDER BY ... ) AS rn)
WHERE rn (==|<|<=) k
    AND other conditions
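As a rough sketch of how a limit could be derived from the three supported comparisons, assuming k is an integer literal: the types `RankEq`, `RankLt`, `RankLe` and the function `extractLimit` are hypothetical and are not Catalyst classes or the code of this PR.

```scala
// Hypothetical model of a comparison between the rank alias and a literal k.
sealed trait RankPredicate
final case class RankEq(k: Int) extends RankPredicate // rn == k
final case class RankLt(k: Int) extends RankPredicate // rn < k
final case class RankLe(k: Int) extends RankPredicate // rn <= k

object LimitExtraction {
  // Largest rank that can still satisfy the predicate.
  // A result <= 0 means no row can match.
  def extractLimit(p: RankPredicate): Int = p match {
    case RankEq(k) => k     // ranks above k can never equal k
    case RankLt(k) => k - 1 // rn < k is the same as rn <= k - 1
    case RankLe(k) => k
  }
}
```

For `rn == k` the limit only removes rows with a rank above k, so the original filter still has to be evaluated afterwards to drop the rows with a rank below k.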

For these three rank-like functions (row_number|rank|dense_rank), the rank of a row computed on any subset of its partition is never greater than its rank computed on the whole partition, so we can safely discard rows with rank > k at any stage.
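The property can be checked on a small example. The sketch below is plain Scala with hypothetical names (`PartialRankDemo`, `topKPerKey`) and a made-up row model of (key, value) pairs; it covers row_number only. For rank and dense_rank, rows tied with the k-th row would also have to be kept.

```scala
object PartialRankDemo {
  // Hypothetical row model: (partition key, ordering value).
  type Row = (String, Int)

  // Keeps the rows whose row_number within their key, ordered by value, is <= k.
  // The result is sorted so that two results can be compared directly.
  def topKPerKey(rows: Seq[Row], k: Int): Seq[Row] =
    rows.groupBy(_._1).toSeq.flatMap { case (_, group) =>
      group.sortBy(_._2).take(k)
    }.sorted

  // Prunes every chunk locally first, then applies the final top-k.
  def topKWithEarlyPruning(chunks: Seq[Seq[Row]], k: Int): Seq[Row] =
    topKPerKey(chunks.flatMap(chunk => topKPerKey(chunk, k)), k)
}
```

Pruning each chunk before the final pass gives the same rows as computing the top k over the whole input, because a row dropped in a chunk already has k rows ahead of it in that chunk and therefore in the full dataset as well.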

This PR also takes over some functionality from #34367.

Why are the changes needed?

Improve the performance.

Micro Benchmark
TPC-DS data size: 2TB.
The improvement applies to TPC-DS q67, and there is no regression in the other test cases.

| TPC-DS Query | Default (Seconds) | After (Seconds) | Speedup (Percent) |
| --- | --- | --- | --- |
| q67 | 997.7585 | 882.8005 | 13.02% |
| All TPC-DS | 7076.6715 | 6918.309 | 1.47% |

Does this PR introduce any user-facing change?

No.
This only updates the internal implementation.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Nov 21, 2022
@beliefer beliefer changed the title [WIP][SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows [SPARK-37099][SQL] Optimize the filter based on rank-like window function by reduce not required rows Nov 23, 2022
@beliefer
Contributor Author

ping @zhengruifeng cc @cloud-fan

@@ -87,7 +88,8 @@ case class WindowExec(
     windowExpression: Seq[NamedExpression],
     partitionSpec: Seq[Expression],
     orderSpec: Seq[SortOrder],
-    child: SparkPlan)
+    child: SparkPlan,
+    groupLimitInfo: Option[(Int, Expression)] = None)
Contributor

I think it's overkill and very risky to make invasive changes to a fundamental physical operator like WindowExec. I like #34367 more, which adds a new physical node. Can you elaborate on why this is better than #34367?

Contributor Author

I think it is OK to add a physical node, but the amount of code is a little large, and the filtering and reduction of data occur a little late.

@beliefer
Contributor Author

This PR has been replaced by #38799

@beliefer beliefer closed this Nov 25, 2022