-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Describe the bug
The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90
If PartitionMode is set to Partitioned, join_type is RIGHT, output_partition will depend on output_partition of the right child, this may cause missing execution on left child partitions, if left child has more partitions than right child partition: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413
To Reproduce
Create 2 ExecutionPlan input from csv with only 1 field "id" and create a HashJoinExec from these inputs. Because during the execution, some parition from the left input is not executed on, they are never probed with associated rows in the right input, so result in a false join:
+----+----+
| id | id |
+----+----+
| | 2 |
| | 3 |
| | 6 |
| | 7 |
| | 9 |
| | 1 |
| | 4 |
| | 5 |
| | 8 |
+----+----+
Expected behavior
HashJoin executes correctly
+----+----+
| id | id |
+----+----+
| 1 | 1 |
| 9 | 9 |
| 5 | 5 |
| 8 | 8 |
| 6 | 6 |
| 7 | 7 |
| 4 | 4 |
| 2 | 2 |
| 3 | 3 |
+----+----+
Additional context
No response