feat(dataframe): add select/filter/count/show methods #19

Merged: andygrove merged 7 commits into apache:main from andygrove:feat/dataframe-select-filter-count-show on May 13, 2026

@andygrove
Member

Summary

Adds four transformation/action methods to org.apache.datafusion.DataFrame, expanding the Java surface beyond sql/collect:

DataFrame select(String... columnNames);
DataFrame filter(String sqlPredicate);
long      count();
void      show();
void      show(int limit);

Callers can now chain small queries in Java without round-tripping through SQL strings for every step:

try (SessionContext ctx = new SessionContext()) {
  ctx.registerParquet("lineitem", "tpch-data/sf1/lineitem.parquet");
  try (DataFrame df = ctx.sql("SELECT * FROM lineitem")) {
    long n = df.filter("l_orderkey < 100").count();
    df.select("l_orderkey", "l_quantity").show(20);
  }
}

Design notes

  • Non-destructive on the Rust side. DataFusion's DataFrame::select_columns/filter/count/show all take self by value, so each new JNI fn clones the borrowed DataFrame (cheap: Arc<SessionState> + LogicalPlan) and operates on the clone. The caller's original Java DataFrame stays usable for further operations. This is intentionally different from collect(), which still consumes its receiver because it ships the actual execution stream out.
  • filter parsing. Uses DataFrame::parse_sql_expr so the predicate is parsed against the DataFrame's own schema.
  • show() output. Goes to native stdout via DataFusion's printer (same behavior as the Rust / Python APIs). This collides with Surefire's forked-JVM IPC stream and produces a Corrupted channel warning during tests — non-fatal, BUILD SUCCESS — but worth addressing in a follow-up (Surefire forkNode extension or a formatString() companion method).
  • Sync API. All four methods are synchronous to match the existing sql/collect shape. A future async refactor would touch all of DataFrame and SessionContext together.
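The handle semantics described in the first note can be sketched in plain Java. This is a minimal illustration only, not the binding's implementation: MockDataFrame, its string-list "plan", and describe() are hypothetical names invented here, and no JNI or query execution is involved. It shows only that transformations leave the receiver usable, while collect() and close() do not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real DataFrame. It models handle lifetimes only.
final class MockDataFrame implements AutoCloseable {
    private final List<String> plan; // stands in for the native LogicalPlan
    private boolean usable = true;   // becomes false after close() or collect()

    MockDataFrame(List<String> plan) {
        this.plan = List.copyOf(plan);
    }

    private void ensureUsable() {
        if (!usable) {
            throw new IllegalStateException("DataFrame is closed or already collected");
        }
    }

    // Transformations copy the plan and extend the copy; the receiver is untouched.
    private MockDataFrame derive(String step) {
        ensureUsable();
        List<String> next = new ArrayList<>(plan);
        next.add(step);
        return new MockDataFrame(next);
    }

    MockDataFrame select(String... columnNames) {
        return derive("select(" + String.join(", ", columnNames) + ")");
    }

    MockDataFrame filter(String sqlPredicate) {
        return derive("filter(" + sqlPredicate + ")");
    }

    // Assumption for the sketch: collect() consumes the receiver, as the note says.
    List<String> collect() {
        ensureUsable();
        usable = false;
        return plan;
    }

    // Hypothetical inspection helper, present only so the sketch can be checked.
    List<String> describe() {
        ensureUsable();
        return plan;
    }

    @Override
    public void close() {
        usable = false;
    }

    public static void main(String[] args) {
        try (MockDataFrame base = new MockDataFrame(List.of("scan(lineitem)"))) {
            MockDataFrame filtered = base.filter("l_orderkey < 100");
            System.out.println(base.describe());     // base is still usable here
            System.out.println(filtered.describe()); // derived handle has one extra step
        }
    }
}
```

In the mock, a derived handle is independent of its parent; whether the real binding behaves the same way after the parent is closed is not stated in this PR.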

Out of scope

sort, join, aggregate, limit, distinct, withColumn, typed Expr/Column API, async/CompletableFuture overloads. All can land in follow-ups using the same pattern.

Testing

12 new tests in DataFrameTransformationsTest cover:

  • Each new method on small inline VALUES tables.
  • Non-destructive semantics — original DataFrame still usable after select/filter/count/show.
  • Chained operations: filter().select().count().
  • IllegalStateException after close() and after collect().
  • RuntimeException on invalid column / malformed predicate.
  • TPC-H lineitem smoke: filter(...).count() matches SELECT COUNT(*) ... WHERE ... (guarded by Assumptions.assumeTrue on file presence — skipped when TPC-H data is absent, as it is on CI).
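The file-presence guard on the TPC-H smoke test can be sketched without a test framework. OptionalDataGuard and isAvailable are hypothetical names, and the check shown (a regular-file test) is an assumption about what "file presence" means here; in a JUnit 5 test the resulting boolean would be passed to Assumptions.assumeTrue so the test is reported as skipped rather than failed.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decides whether an optional data-dependent test should run.
final class OptionalDataGuard {
    private OptionalDataGuard() {}

    // True only when the path exists and is a regular file.
    static boolean isAvailable(Path dataFile) {
        return Files.isRegularFile(dataFile);
    }

    public static void main(String[] args) {
        Path lineitem = Path.of("tpch-data/sf1/lineitem.parquet");
        if (!isAvailable(lineitem)) {
            // Equivalent in spirit to a skipped test: report and do nothing else.
            System.out.println("skipped: " + lineitem + " not found");
            return;
        }
        System.out.println("running smoke test against " + lineitem);
    }
}
```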

make test runs both the existing 13 tests and the 12 new ones — all 25 pass. cargo clippy --all-targets -- -D warnings is clean. ./mvnw spotless:check and ./mvnw apache-rat:check are clean.

@andygrove andygrove merged commit 717b735 into apache:main May 13, 2026
1 check passed