Skip to content

design: DataFrame joins (join, joinOn) and the Java Expr question #44

@andygrove

Description

@andygrove

Is your feature request related to a problem or challenge?

Joins are the largest missing piece of the DataFrame API, and they
force a design decision that the smaller methods have so far avoided:
how do we represent DataFusion Expr values from Java?

DataFusion's join surface:

  • join(right, JoinType, left_cols, right_cols, filter: Option<Expr>)
    — equi-join on named columns, optional residual predicate.
  • join_on(right, JoinType, on: impl IntoIterator<Item = Expr>)
    arbitrary join predicates as a list of expressions.

join (the column-name form) is buildable today via the SQL-string
pattern (parse_sql_expr for the optional filter). join_on is not —
it fundamentally takes Expr values.

Describe the solution you'd like

Split into two phases.

Phase 1 — join (column-name form). Ship this first, no Expr
model required:

  • DataFrame.join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)
  • Overload accepting a String filter parsed via parse_sql_expr for
    the residual predicate.
  • JoinType Java enum mirroring DataFusion's (INNER, LEFT,
    RIGHT, FULL, LEFT_SEMI, RIGHT_SEMI, LEFT_ANTI,
    RIGHT_ANTI, LEFT_MARK).

Phase 2 — joinOn (Expr form). Requires picking one of:

  1. SQL strings. df.joinOn(right, INNER, "l.a = r.b", "l.c > r.d").
    Each predicate parsed via parse_sql_expr after both sides are
    stitched into a synthetic schema. Cheapest; no Java-side model.
  2. Typed Expr builder. Java class hierarchy mirroring DataFusion's
    Expr (Column, BinaryExpr, Literal, ScalarUDF…). Discoverable but a
    large, ongoing maintenance surface that has to track DataFusion's
    Expr enum.
  3. Defer. Keep only join (Phase 1) until a concrete user needs
    non-equi joins.

Recommendation: do (1) for symmetry with filter(String) and the
proposed withColumn(String) / sort(String). (2) is its own
multi-PR effort that the project may or may not want to commit to.

Describe alternatives you've considered

SQL joins via ctx.sql(...). Works for everything, but requires
naming both sides — losing the DataFrame composition story.

Additional context

Filing this as a discussion issue, not an implementation issue.
Phase 1 (column-name join) can be implemented immediately once the
JoinType enum and lifecycle (does the right-hand DataFrame get
consumed?) are settled. The Phase 2 decision is the real ask here.

Related: the right-hand DataFrame consume-vs-clone question is
identical to the set-operations issue and should be answered the same
way in both.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions