Skip to content

[SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic#37040

Closed
beliefer wants to merge 14 commits intoapache:masterfrom
beliefer:SPARK-39651
Closed

[SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic#37040
beliefer wants to merge 14 commits intoapache:masterfrom
beliefer:SPARK-39651

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented Jul 1, 2022

What changes were proposed in this pull request?

Currently, the SQL show below evaluate rand(1) < 2 for rows one by one.
SELECT * FROM tab WHERE rand(1) < 2

In fact, we can prune the filter condition.

Why are the changes needed?

Prune filter condition and improve the performance.

Does this PR introduce any user-facing change?

'No'.
The internal behavior.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Jul 1, 2022
@beliefer beliefer changed the title [SPARK-39651][SQL] Prune filter condition compare rand function with foldable expression [WIP][SPARK-39651][SQL] Prune filter condition compare rand function with foldable expression Jul 1, 2022
@beliefer beliefer changed the title [WIP][SPARK-39651][SQL] Prune filter condition compare rand function with foldable expression [WIP][SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic Jul 4, 2022
@beliefer
Copy link
Contributor Author

beliefer commented Jul 5, 2022

ping @cloud-fan

@beliefer beliefer changed the title [WIP][SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic [SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic Jul 5, 2022
@cloud-fan
Copy link
Contributor

Can we add a new rule OptimizeRand for this optimization? Basically it turns rand predicates to true or false literals.

@beliefer
Copy link
Contributor Author

beliefer commented Jul 8, 2022

Can we add a new rule OptimizeRand for this optimization? Basically it turns rand predicates to true or false literals.

OK

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put it in a new file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can match DoubleLiteral directly. Other optimizer rules will optimize foldable expressions to literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder.


/**
* Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so
* compare double literal value with 1.0 could eliminate Rand() in binary comparison.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0 or 0.0

def apply(plan: LogicalPlan): LogicalPlan =
plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we swap the comparison so that we don't need to handle each comparison twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that swap introduces additional complexity and reduces readability.

}

test("Nondeterministic predicate is not pruned") {
val originalQuery = testRelation.where(Rand(10) > 5).select($"a").where(Rand(10) > 5).analyze
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to change this file? The new rule is not invoked in this test suite.

EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
TrueLiteral
case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also handle the rand < 0.0 case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Thanks.

plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
case gt @ GreaterThan(DoubleLiteral(value), _: Rand) =>
if (value >= 1.0) TrueLiteral else if (value <= 0.0) FalseLiteral else gt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if value == 0.0, we can't optimize, as Rand may return 0.0.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c800d29 Jul 13, 2022
@beliefer
Copy link
Contributor Author

@cloud-fan Thank you !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants