Skip to content

Bug in HashJoin output_partition cause the input from left input not fully executed #5738

@duongcongtoai

Description

@duongcongtoai

Describe the bug

The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90

If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition will depend on output_partition of the right child, this may cause missing execution on left child partitions, if left child has more partitions than right child partition: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413

To Reproduce

Code in this gist

Create 2 ExecutionPlan input from csv with only 1 field "id" and create a HashJoinExec from these inputs. Because during the execution, some parition from the left input is not executed on, they are never probed with associated rows in the right input, so result in a false join:

+----+----+
| id | id |
+----+----+
|    | 2  |
|    | 3  |
|    | 6  |
|    | 7  |
|    | 9  |
|    | 1  |
|    | 4  |
|    | 5  |
|    | 8  |
+----+----+

Expected behavior

HashJoin executes correctly

+----+----+
| id | id |
+----+----+
| 1  | 1  |
| 9  | 9  |
| 5  | 5  |
| 8  | 8  |
| 6  | 6  |
| 7  | 7  |
| 4  | 4  |
| 2  | 2  |
| 3  | 3  |
+----+----+

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions