Skip to content

join_using creates duplicate output fields #18537

@timsaucer

Description

@timsaucer

Describe the bug

From my reading of documentation a JOIN USING call should remove duplicate fields on the output.

In our current implementation DataFrame does not have a join_using function. I wrote a simple implementation based on the join_on approach. When I do so I get duplicate output columns.

To Reproduce

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    let rb1 = record_batch!(("a", Int32, [1, 2, 3]), ("b", UInt32, [4, 5, 6]))?;
    let rb2 = record_batch!(("a", Int32, [1, 2, 3]), ("c", Utf8, ["x", "y", "z"]))?;

    let _ = ctx.register_batch("left_table", rb1)?;
    let _ = ctx.register_batch("right_table", rb2)?;

    let (state, left_plan) = ctx.table("left_table").await?.into_parts();
    let (_, right_plan) = ctx.table("right_table").await?.into_parts();

    let plan = LogicalPlanBuilder::new(left_plan)
        .join_using(right_plan, JoinType::Inner, vec![Column::from_name("a")])?
        .build()?;

    DataFrame::new(state, plan).show().await?;

    Ok(())
}

This generates the output:

+---+---+---+---+
| a | b | a | c |
+---+---+---+---+
| 1 | 4 | 1 | x |
| 2 | 5 | 2 | y |
| 3 | 6 | 3 | z |
+---+---+---+---+

Expected behavior

I would expect from reading the documentation to only get one column with name a

Additional context

Here is a relevant section from the PostgreSQL documentation: https://www.postgresql.org/docs/current/queries-table-expressions.html

Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.

Additionally in our documentation for the join_using function the docstring says:

    /// Apply a join with using constraint, which duplicates all join columns in output schema.
    pub fn join_using(
        self,
        right: LogicalPlan,
        join_type: JoinType,
        using_keys: Vec<Column>,
    ) -> Result<Self>

It would appear the code is working as the authors of the function intended, but it also appears that this doesn't match industry expectations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions