[SPARK-39651][SQL] Prune filter condition if compare with rand is deterministic #37040
beliefer wants to merge 14 commits into apache:master
ping @cloud-fan
Can we add a new rule
OK
nit: put it in a new file
we can match DoubleLiteral directly. Other optimizer rules will optimize foldable expressions to literals.
Thank you for the reminder.
/**
 * Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so
 * compare double literal value with 1.0 could eliminate Rand() in binary comparison.
def apply(plan: LogicalPlan): LogicalPlan =
  plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
    EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
    case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
can we swap the comparison so that we don't need to handle each comparison twice?
I feel that swap introduces additional complexity and reduces readability.
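For readers following the exchange above, the swap idea could look roughly like this. It is a minimal, self-contained sketch and not the code in this PR: `Operand`, `Cmp`, `Comparison`, and `SwapSketch` are hypothetical stand-ins, not Catalyst classes. It only shows how mirroring the operator lets a rule match the `rand op literal` shape once instead of handling both operand orders.

```scala
// Hypothetical mini expression model; these are NOT Spark's Catalyst classes.
sealed trait Operand
case object RandOperand extends Operand                        // stands in for Rand()
final case class DoubleOperand(value: Double) extends Operand  // stands in for a double literal

sealed trait Cmp
case object Lt extends Cmp
case object Lte extends Cmp
case object Gt extends Cmp
case object Gte extends Cmp

final case class Comparison(left: Operand, op: Cmp, right: Operand)

object SwapSketch {
  // Mirror the operator so that `a op b` and `b mirror(op) a` mean the same thing.
  def mirror(op: Cmp): Cmp = op match {
    case Lt  => Gt
    case Lte => Gte
    case Gt  => Lt
    case Gte => Lte
  }

  // Move the random operand to the left when the literal is on the left;
  // every other shape is returned unchanged.
  def normalize(c: Comparison): Comparison = c match {
    case Comparison(lit: DoubleOperand, op, RandOperand) =>
      Comparison(RandOperand, mirror(op), lit)
    case other => other
  }
}
```

After normalization, a rule would only need cases of the form `Comparison(RandOperand, op, DoubleOperand(v))`. Whether that is more readable than spelling out both orders is the judgment call discussed above.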
test("Nondeterministic predicate is not pruned") {
  val originalQuery = testRelation.where(Rand(10) > 5).select($"a").where(Rand(10) > 5).analyze
why do we need to change this file? The new rule is not invoked in this test suite.
    EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
    case GreaterThan(DoubleLiteral(value), _: Rand) if value >= 1.0 =>
      TrueLiteral
    case GreaterThan(_: Rand, DoubleLiteral(value)) if value >= 1.0 =>
we should also handle the rand < 0.0 case
  plan.transformAllExpressionsWithPruning(_.containsAllPatterns(
    EXPRESSION_WITH_RANDOM_SEED, LITERAL, BINARY_COMPARISON), ruleId) {
    case gt @ GreaterThan(DoubleLiteral(value), _: Rand) =>
      if (value >= 1.0) TrueLiteral else if (value <= 0.0) FalseLiteral else gt
if value == 0.0, we can't optimize, as Rand may return 0.0.
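To make the boundary cases concrete, here is a small sketch derived only from the [0, 1) range stated in the rule's doc comment. It is an illustration, not the merged implementation: `RandBounds.fold` and its string-encoded operators are hypothetical, and each case reads as `rand op value`. The point about 0.0 shows up in the `<=` and `>` cases, where a draw of exactly 0.0 makes the result depend on the row.

```scala
object RandBounds {
  // Returns Some(result) when `rand op value` has the same result for every
  // draw of rand in [0, 1), and None when the result depends on the draw.
  // Hypothetical helper for illustration; not part of Spark.
  def fold(op: String, value: Double): Option[Boolean] = op match {
    // rand < value: always true from 1.0 upward, never true at or below 0.0.
    case "<"  => if (value >= 1.0) Some(true) else if (value <= 0.0) Some(false) else None
    // rand <= value: rand may be exactly 0.0, so value == 0.0 is undecided.
    case "<=" => if (value >= 1.0) Some(true) else if (value < 0.0) Some(false) else None
    // rand > value: rand may be exactly 0.0, so value == 0.0 is undecided.
    case ">"  => if (value < 0.0) Some(true) else if (value >= 1.0) Some(false) else None
    // rand >= value: always true at or below 0.0, never true from 1.0 upward.
    case ">=" => if (value <= 0.0) Some(true) else if (value >= 1.0) Some(false) else None
    case _    => None
  }
}
```

For example, `RandBounds.fold("<", 2.0)` folds to `Some(true)`, which corresponds to the `rand(1) < 2` query in the PR description, while `RandBounds.fold(">", 0.0)` stays `None`.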
thanks, merging to master!
@cloud-fan Thank you!
What changes were proposed in this pull request?
Currently, the SQL shown below evaluates rand(1) < 2 for each row, one by one.
SELECT * FROM tab WHERE rand(1) < 2
In fact, because rand returns values in [0, 1), the condition is always true and the filter condition can be pruned.
Why are the changes needed?
Pruning the filter condition avoids evaluating it for every row and improves performance.
Does this PR introduce any user-facing change?
No. This only changes internal optimizer behavior.
How was this patch tested?
New tests.