Is your feature request related to a problem or challenge?
Two ordering / layout primitives are missing from the DataFrame API:
- sort: there is no way to order a DataFrame today without dropping to SQL.
- repartition: there is no way to control the parallelism / partitioning of a DataFrame.
Describe the solution you'd like
sort. Two ergonomics options worth considering:
1. SQL-string flavour, matching filter(String) and the proposed withColumn: df.sort("a ASC, b DESC NULLS FIRST"). Parsed via parse_sql_expr plus an ORDER BY shim, or via the SQL parser's parse_order_by. Cheapest to implement; no Java-side model.
2. Typed: a SortExpr Java record (column, ascending, nullsFirst) and df.sort(SortExpr... exprs). Discoverable, IDE-friendly.
Suggest starting with (1) for consistency with filter, then layering (2) on top if/when an Expr builder lands for joins.
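To make the layering concrete, here is a minimal sketch of how option (2) could sit on top of option (1): a typed SortExpr record that renders itself to the ORDER BY string the SQL-string flavour would accept. Only the record's three components come from this issue. The asc/desc factories, the SortExprs.toOrderBy helper, the double-quoting of column names, and the null-ordering defaults (nulls last for ascending, nulls first for descending, as in PostgreSQL) are assumptions made for illustration, not an agreed design.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical shape for option (2); the component names follow the issue text.
record SortExpr(String column, boolean ascending, boolean nullsFirst) {

    // Compact constructor: reject inputs that could never name a column.
    SortExpr {
        Objects.requireNonNull(column, "column");
        if (column.isBlank()) {
            throw new IllegalArgumentException("column must not be blank");
        }
    }

    // Assumed defaults: ascending sorts put nulls last, descending sorts put nulls first.
    static SortExpr asc(String column) {
        return new SortExpr(column, true, false);
    }

    static SortExpr desc(String column) {
        return new SortExpr(column, false, true);
    }

    // Renders one ORDER BY term. The column is double-quoted (with embedded
    // quotes doubled) so names containing spaces or keywords survive parsing.
    String toSql() {
        String quoted = "\"" + column.replace("\"", "\"\"") + "\"";
        return quoted
                + (ascending ? " ASC" : " DESC")
                + (nullsFirst ? " NULLS FIRST" : " NULLS LAST");
    }
}

// Hypothetical helper that a df.sort(SortExpr...) overload could delegate to.
final class SortExprs {
    private SortExprs() {}

    static String toOrderBy(SortExpr... exprs) {
        if (exprs.length == 0) {
            throw new IllegalArgumentException("at least one sort expression is required");
        }
        return Arrays.stream(exprs)
                .map(SortExpr::toSql)
                .collect(Collectors.joining(", "));
    }
}

class SortExprDemo {
    public static void main(String[] args) {
        // Typed equivalent of df.sort("a ASC, b DESC NULLS FIRST") from option (1).
        System.out.println(SortExprs.toOrderBy(SortExpr.asc("a"), SortExpr.desc("b")));
    }
}
```

With this shape, the typed overload would need no native-side changes beyond what option (1) already requires, at the cost of making quoted column names case-sensitive.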
repartition. DataFusion's Partitioning enum has three variants:
- RoundRobinBatch(usize): exposed as df.repartitionRoundRobin(n)
- Hash(Vec<Expr>, usize): exposed as df.repartitionHash(int n, String... columns) (column-name flavour to start; expression variant later)
- UnknownPartitioning(usize): not user-facing.
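For comparison with the two-method shape above, here is a rough sketch of an alternative partitioning constructor shape: a single sealed Partitioning type that a hypothetical df.repartition(Partitioning) overload could accept. Every name in it (Partitioning, RoundRobin, Hash, roundRobin, hash) is an assumption made for illustration, and it covers only the two user-facing variants with column names rather than expressions.

```java
import java.util.List;

// Hypothetical Java-side model of the two user-facing partitioning schemes.
sealed interface Partitioning permits Partitioning.RoundRobin, Partitioning.Hash {

    // Target number of output partitions.
    int partitions();

    // Mirrors RoundRobinBatch(usize).
    record RoundRobin(int partitions) implements Partitioning {
        public RoundRobin {
            Partitioning.requirePositive(partitions);
        }
    }

    // Mirrors Hash(Vec<Expr>, usize), restricted to plain column names for now.
    record Hash(int partitions, List<String> columns) implements Partitioning {
        public Hash {
            Partitioning.requirePositive(partitions);
            if (columns == null || columns.isEmpty()) {
                throw new IllegalArgumentException("hash partitioning needs at least one column");
            }
            columns = List.copyOf(columns); // defensive, immutable copy
        }
    }

    static Partitioning roundRobin(int n) {
        return new RoundRobin(n);
    }

    static Partitioning hash(int n, String... columns) {
        return new Hash(n, List.of(columns));
    }

    private static void requirePositive(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("partition count must be positive, got " + n);
        }
    }
}

class PartitioningDemo {
    public static void main(String[] args) {
        System.out.println(Partitioning.roundRobin(4));
        System.out.println(Partitioning.hash(8, "a", "b"));
    }
}
```

The trade-off is one extra type in exchange for argument validation on the Java side and room to add an expression-based Hash variant later without new DataFrame methods. The two named methods remain the smaller change.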
Tests go in DataFrameTransformationsTest: a sort round-trip, plus a partition-count assertion via collect_partitioned once that's wired, or via plan inspection.
Describe alternatives you've considered
ORDER BY / DISTRIBUTE BY via SQL. Works but loses the lazy
DataFrame composition.
Additional context
sort and repartition each carry a small Java-side design choice (sort-expression shape, partitioning constructor shape), so it is fine to land them as two separate PRs under this issue if that's cleaner than one batched PR.