design: DataFrame joins (join, joinOn) and the Java Expr question

### Is your feature request related to a problem or challenge?

Joins are the largest missing piece of the DataFrame API, and they
force a design decision that the smaller methods have so far avoided:
how do we represent DataFusion `Expr` values from Java?

DataFusion's join surface:

- `join(right, JoinType, left_cols, right_cols, filter: Option<Expr>)`
  — equi-join on named columns, optional residual predicate.
- `join_on(right, JoinType, on: impl IntoIterator<Item = Expr>)` —
  arbitrary join predicates as a list of expressions.

`join` (the column-name form) is buildable today via the SQL-string
pattern (`parse_sql_expr` for the optional filter). `join_on` is not —
it fundamentally takes `Expr` values.

### Describe the solution you'd like

Split into two phases.

**Phase 1 — `join` (column-name form).** Ship this first, no `Expr`
model required:

- `DataFrame.join(DataFrame right, JoinType type, String[] leftCols, String[] rightCols)`
- Overload accepting a `String filter` parsed via `parse_sql_expr` for
  the residual predicate.
- `JoinType` Java enum mirroring DataFusion's (`INNER`, `LEFT`,
  `RIGHT`, `FULL`, `LEFT_SEMI`, `RIGHT_SEMI`, `LEFT_ANTI`,
  `RIGHT_ANTI`, `LEFT_MARK`).

**Phase 2 — `joinOn` (Expr form).** Requires picking one of:

1. **SQL strings.** `df.joinOn(right, INNER, "l.a = r.b", "l.c > r.d")`.
   Each predicate parsed via `parse_sql_expr` after both sides are
   stitched into a synthetic schema. Cheapest; no Java-side model.
2. **Typed `Expr` builder.** Java class hierarchy mirroring DataFusion's
   `Expr` (Column, BinaryExpr, Literal, ScalarUDF…). Discoverable but a
   large, ongoing maintenance surface that has to track DataFusion's
   `Expr` enum.
3. **Defer.** Keep only `join` (Phase 1) until a concrete user needs
   non-equi joins.

Recommendation: do (1) for symmetry with `filter(String)` and the
proposed `withColumn(String)` / `sort(String)`. (2) is its own
multi-PR effort that the project may or may not want to commit to.

### Describe alternatives you've considered

SQL joins via `ctx.sql(...)`. Works for everything, but requires
naming both sides — losing the DataFrame composition story.

### Additional context

Filing this as a **discussion** issue, not an implementation issue.
Phase 1 (column-name `join`) can be implemented immediately once the
`JoinType` enum and lifecycle (does the right-hand DataFrame get
consumed?) are settled. The Phase 2 decision is the real ask here.

Related: the right-hand `DataFrame` consume-vs-clone question is
identical to the set-operations issue and should be answered the same
way in both.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

design: DataFrame joins (join, joinOn) and the Java Expr question #44

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

design: DataFrame joins (join, joinOn) and the Java Expr question #44

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions