Is your feature request related to a problem or challenge?
Joins are the largest missing piece of the DataFrame API, and they
force a design decision that the smaller methods have so far avoided:
how do we represent DataFusion Expr values from Java?
DataFusion's join surface:
- join(right, JoinType, left_cols, right_cols, filter: Option<Expr>): equi-join on named columns, optional residual predicate.
- join_on(right, JoinType, on: impl IntoIterator<Item = Expr>): arbitrary join predicates as a list of expressions.
join (the column-name form) is buildable today via the SQL-string
pattern (parse_sql_expr for the optional filter). join_on is not —
it fundamentally takes Expr values.
Describe the solution you'd like
Split into two phases.
Phase 1 — join (column-name form). Ship this first, no Expr
model required:
- DataFrame.join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)
- Overload accepting a String filter parsed via parse_sql_expr for the residual predicate.
- JoinType Java enum mirroring DataFusion's (INNER, LEFT, RIGHT, FULL, LEFT_SEMI, RIGHT_SEMI, LEFT_ANTI, RIGHT_ANTI, LEFT_MARK).
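To make the Phase 1 surface concrete, here is a minimal sketch of the enum and of the argument checking a Java-side join could do before calling into native code. JoinArgs and its of(...) factories are hypothetical names used only for illustration, and the equal-length check is an assumption about what the binding should enforce up front rather than a statement of what it does today.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical Java mirror of DataFusion's JoinType, using the constants listed above. */
enum JoinType {
    INNER, LEFT, RIGHT, FULL,
    LEFT_SEMI, RIGHT_SEMI, LEFT_ANTI, RIGHT_ANTI, LEFT_MARK
}

/** Hypothetical holder for validated Phase 1 join arguments; not part of any existing API. */
final class JoinArgs {
    final JoinType type;
    final List<String> leftCols;
    final List<String> rightCols;
    final Optional<String> filter; // residual predicate as SQL text, if any

    private JoinArgs(JoinType type, List<String> leftCols,
                     List<String> rightCols, Optional<String> filter) {
        this.type = type;
        this.leftCols = leftCols;
        this.rightCols = rightCols;
        this.filter = filter;
    }

    /** Column-name form without a residual predicate. */
    static JoinArgs of(JoinType type, String[] leftCols, String[] rightCols) {
        return of(type, leftCols, rightCols, null);
    }

    /** Overload carrying a SQL-string filter; a null filter means "no residual predicate". */
    static JoinArgs of(JoinType type, String[] leftCols, String[] rightCols, String filter) {
        if (type == null) {
            throw new IllegalArgumentException("join type must not be null");
        }
        if (leftCols.length != rightCols.length) {
            throw new IllegalArgumentException("leftCols and rightCols must have the same length");
        }
        // List.copyOf makes immutable copies and rejects null column names.
        return new JoinArgs(type,
                List.copyOf(Arrays.asList(leftCols)),
                List.copyOf(Arrays.asList(rightCols)),
                Optional.ofNullable(filter));
    }
}
```

Keeping the filter as an Optional<String> mirrors the Option<Expr> parameter on the Rust side while leaving the parsing to parse_sql_expr.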
Phase 2 — joinOn (Expr form). Requires picking one of:
1. SQL strings. df.joinOn(right, INNER, "l.a = r.b", "l.c > r.d"). Each predicate is parsed via parse_sql_expr after both sides are stitched into a synthetic schema. Cheapest; no Java-side model.
2. Typed Expr builder. Java class hierarchy mirroring DataFusion's Expr (Column, BinaryExpr, Literal, ScalarUDF…). Discoverable, but a large, ongoing maintenance surface that has to track DataFusion's Expr enum.
3. Defer. Keep only join (Phase 1) until a concrete user needs non-equi joins.
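For a sense of what the typed-builder option involves, the following is a deliberately tiny sketch covering three node types. Every class here is hypothetical, and rendering to SQL text via toSql() is only one possible way to hand the tree to the native side (an assumption, not a proposal). A real model would need many more node types to track DataFusion's Expr enum.

```java
/** Hypothetical root of a Java-side expression model. */
interface Expr {
    /** Renders the expression as SQL text. */
    String toSql();
}

/** A column qualified by a relation name, e.g. l.a. */
final class Column implements Expr {
    private final String relation;
    private final String name;

    Column(String relation, String name) {
        this.relation = relation;
        this.name = name;
    }

    @Override
    public String toSql() {
        return relation + "." + name;
    }
}

/** An integer literal; other literal types are omitted from this sketch. */
final class Literal implements Expr {
    private final long value;

    Literal(long value) {
        this.value = value;
    }

    @Override
    public String toSql() {
        return Long.toString(value);
    }
}

/** A binary operation such as l.a = r.b; the operator is kept as plain text here. */
final class BinaryExpr implements Expr {
    private final Expr left;
    private final String op;
    private final Expr right;

    BinaryExpr(Expr left, String op, Expr right) {
        this.left = left;
        this.op = op;
        this.right = right;
    }

    // Parenthesize nested binary expressions so precedence does not depend on the operator.
    private static String operand(Expr e) {
        return (e instanceof BinaryExpr) ? "(" + e.toSql() + ")" : e.toSql();
    }

    @Override
    public String toSql() {
        return operand(left) + " " + op + " " + operand(right);
    }
}
```

With these classes, new BinaryExpr(new Column("l", "a"), "=", new Column("r", "b")) expresses the same predicate as the string "l.a = r.b". The trade-off is that the typed form needs a class for each expression kind it supports.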
Recommendation: do (1) for symmetry with filter(String) and the
proposed withColumn(String) / sort(String). (2) is its own
multi-PR effort that the project may or may not want to commit to.
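If the SQL-string route is taken, the Java side stays small. The sketch below shows one way a joinOn(right, type, String... predicates) overload might pre-process its varargs before handing each string to the native parser. JoinPredicates and normalize are hypothetical names, and rejecting blank predicates is an assumed design choice rather than existing behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for a SQL-string joinOn overload; not part of any existing API. */
final class JoinPredicates {
    private JoinPredicates() {}

    /** Trims each predicate and rejects null or blank entries before they reach the parser. */
    static List<String> normalize(String... predicates) {
        List<String> out = new ArrayList<>();
        for (String p : predicates) {
            if (p == null || p.trim().isEmpty()) {
                throw new IllegalArgumentException("join predicate must not be blank");
            }
            out.add(p.trim());
        }
        return Collections.unmodifiableList(out);
    }
}
```

Failing fast in Java gives a clearer error than a parse failure surfacing from native code, though the same check could equally live on the Rust side.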
Describe alternatives you've considered
SQL joins via ctx.sql(...). Works for everything, but requires giving both sides table names, which loses the DataFrame composition story.
Additional context
Filing this as a discussion issue, not an implementation issue.
Phase 1 (column-name join) can be implemented immediately once the
JoinType enum and lifecycle (does the right-hand DataFrame get
consumed?) are settled. The Phase 2 decision is the real ask here.
Related: the right-hand DataFrame consume-vs-clone question is
identical to the set-operations issue and should be answered the same
way in both.