Skip to content

Incorrect common subexpression elimination of a subquery #8190

@helgikrs

Description

@helgikrs

Describe the bug

A query containing two distinct InSubquery expressions are merged (one eliminated) by common subexpression elimination optimization pass where the expressions are not the same.

To Reproduce

In datafusion-cli (v32.0.0)

create table a (a bigint);
create table b (b bigint);
create table c (c bigint);
explain select * from a where a in (select b from b) AND a > 0 OR a in (select c from c);
+---------------+-------------------------------------------------------------------------------+
| plan_type     | plan                                                                          |
+---------------+-------------------------------------------------------------------------------+
| logical_plan  | LeftSemi Join: a.a = __correlated_sq_1.b                                      |
|               |   TableScan: a projection=[a]                                                 |
|               |   SubqueryAlias: __correlated_sq_1                                            |
|               |     TableScan: b projection=[b]                                               |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                   |
|               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(a@0, b@0)]         |
|               |     CoalesceBatchesExec: target_batch_size=8192                               |
|               |       RepartitionExec: partitioning=Hash([a@0], 24), input_partitions=24      |
|               |         RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1 |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                       |
|               |     CoalesceBatchesExec: target_batch_size=8192                               |
|               |       RepartitionExec: partitioning=Hash([b@0], 24), input_partitions=24      |
|               |         RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1 |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                       |
|               |                                                                               |
+---------------+-------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

We can see the second table scan removed after common subexpression elimination

❯ explain verbose select * from a where a in (select b from b) AND a > 0 OR a in (select c from c);
+------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                            |
+------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Projection: a.a                                                                                 |
|                                                            |   Filter: a.a IN (<subquery>) AND a.a > Int64(0) OR a.a IN (<subquery>)                         |
|                                                            |     Subquery:                                                                                   |
|                                                            |       Projection: b.b                                                                           |
|                                                            |         TableScan: b                                                                            |
|                                                            |     Subquery:                                                                                   |
|                                                            |       Projection: c.c                                                                           |
|                                                            |         TableScan: c                                                                            |
|                                                            |     TableScan: a                                                                                |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                              |
| logical_plan after type_coercion                           | SAME TEXT AS ABOVE                                                                              |
| logical_plan after count_wildcard_rule                     | SAME TEXT AS ABOVE                                                                              |
| analyzed_logical_plan                                      | SAME TEXT AS ABOVE                                                                              |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                                              |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                                                                              |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                                                                              |
| logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                                                                              |
| logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                                                                              |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                                                                              |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                                                                              |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                                              |
| logical_plan after merge_projection                        | SAME TEXT AS ABOVE                                                                              |
| logical_plan after rewrite_disjunctive_predicate           | SAME TEXT AS ABOVE                                                                              |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                                                                              |
| logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                                                                              |
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                                                                              |
| logical_plan after common_sub_expression_eliminate         | Projection: a.a                                                                                 |
|                                                            |   Projection: a.a                                                                               |
|                                                            |     Filter: a.a IN (<subquery>)a.a AS IN AND a.a > Int64(0) OR a.a IN (<subquery>)a.a AS IN     |
|                                                            |       Projection: a.a IN (<subquery>) AS a.a IN (<subquery>)a.a, a.a                            |
|                                                            |         Subquery:                                                                               |
|                                                            |           Projection: b.b                                                                       |
|                                                            |             TableScan: b                                                                        |
|                                                            |         TableScan: a                                                                            |
... <truncated> ...

Expected behavior

The table scan for c should not be removed, the resulting logical plan should show the two subqueries.

Additional context

Changing the second InSubquery such that the lhs expression is different from the first InSubquery expression avoids triggering the bug (looks like maybe CSE is just using the expression, and not the subquery plan for evaluating equivalence?)

explain select * from a where a in (select b from b) AND a > 0 OR (a + 1) in (select c from c);
+--------------+----------------------------------------------------------------------------------+
| plan_type    | plan                                                                             |
+--------------+----------------------------------------------------------------------------------+
| logical_plan | Filter: a.a IN (<subquery>) AND a.a > Int64(0) OR a.a + Int64(1) IN (<subquery>) |
|              |   Subquery:                                                                      |
|              |     Projection: b.b                                                              |
|              |       TableScan: b                                                               |
|              |   Subquery:                                                                      |
|              |     Projection: c.c                                                              |
|              |       TableScan: c                                                               |
|              |   TableScan: a projection=[a]                                                    |
+--------------+----------------------------------------------------------------------------------+

P.S.

I think the AND a > 0 clause is immaterial, but without it a different bug is triggered, I'll file a separate issue for that.

select * from a where a in (select b from b) OR a in (select c from c);
Optimizer rule 'simplify_expressions' failed
caused by
Error during planning: Attempted to create Filter predicate with expression `a.a IN (<subquery>)` aliased as 'IN'. Filter predicates should not be aliased.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions