-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
From my reading of documentation a JOIN USING call should remove duplicate fields on the output.
In our current implementation DataFrame does not have a join_using function. I wrote a simple implementation based on the join_on approach. When I do so I get duplicate output columns.
To Reproduce
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
let rb1 = record_batch!(("a", Int32, [1, 2, 3]), ("b", UInt32, [4, 5, 6]))?;
let rb2 = record_batch!(("a", Int32, [1, 2, 3]), ("c", Utf8, ["x", "y", "z"]))?;
let _ = ctx.register_batch("left_table", rb1)?;
let _ = ctx.register_batch("right_table", rb2)?;
let (state, left_plan) = ctx.table("left_table").await?.into_parts();
let (_, right_plan) = ctx.table("right_table").await?.into_parts();
let plan = LogicalPlanBuilder::new(left_plan)
.join_using(right_plan, JoinType::Inner, vec![Column::from_name("a")])?
.build()?;
DataFrame::new(state, plan).show().await?;
Ok(())
}This generates the output:
+---+---+---+---+
| a | b | a | c |
+---+---+---+---+
| 1 | 4 | 1 | x |
| 2 | 5 | 2 | y |
| 3 | 6 | 3 | z |
+---+---+---+---+
Expected behavior
I would expect from reading the documentation to only get one column with name a
Additional context
Here is a relevant section from the PostgreSQL documentation: https://www.postgresql.org/docs/current/queries-table-expressions.html
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
Additionally in our documentation for the join_using function the docstring says:
/// Apply a join with using constraint, which duplicates all join columns in output schema.
pub fn join_using(
self,
right: LogicalPlan,
join_type: JoinType,
using_keys: Vec<Column>,
) -> Result<Self>It would appear the code is working as the authors of the function intended, but it also appears that this doesn't match industry expectations.