feat: initial seed of Apache DataFusion Java bindings#1

Merged
andygrove merged 5 commits into apache:main from andygrove:initial-seed
May 13, 2026

Conversation

@andygrove
Member

@andygrove andygrove commented May 12, 2026

Summary

Seed the project with a minimal end-to-end JNI binding from the JVM to Apache DataFusion, plus the build, format, and license-check tooling needed for ongoing contribution.

What is in this PR

Java surface (org.apache.datafusion)

  • SessionContext: AutoCloseable session. sql(String) returns a lazy DataFrame; registerParquet(String, String) registers a local Parquet file as a SQL table.
  • DataFrame: lazy, AutoCloseable. collect(BufferAllocator) executes the plan and returns result batches as an Arrow ArrowReader via the Arrow C Data Interface. collect() consumes the DataFrame; close() releases the native plan if never collected and is idempotent.
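
The consume-on-collect and idempotent-close behaviour described above can be sketched in plain Java. This is a minimal illustration only, not the code in this PR: the class `LazyPlan`, the method `releasedByClose()`, and the use of a `long` as a stand-in for a native pointer are all hypothetical, and no JNI call is made.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a native-backed lazy plan; not the project's DataFrame. */
final class LazyPlan implements AutoCloseable {
    // 0 means "no native resource owned"; any other value is a pretend native pointer.
    private final AtomicLong handle;

    // Number of times close() actually released the handle (expected to be 0 or 1).
    private final AtomicLong releasedByClose = new AtomicLong();

    LazyPlan(long nativeHandle) {
        this.handle = new AtomicLong(nativeHandle);
    }

    /** Consumes the plan: ownership of the handle moves to the caller. */
    long collect() {
        long h = handle.getAndSet(0);
        if (h == 0) {
            throw new IllegalStateException("plan already collected or closed");
        }
        return h;
    }

    /** Releases the handle only if it is still owned; repeated calls are no-ops. */
    @Override
    public void close() {
        long h = handle.getAndSet(0);
        if (h != 0) {
            // A real binding would call into native code here to free the plan.
            releasedByClose.incrementAndGet();
        }
    }

    long releasedByClose() {
        return releasedByClose.get();
    }
}
```

Swapping the handle to 0 atomically is one way to make both `collect()` and `close()` safe to call in any order: whichever runs first takes ownership, and later calls see that nothing is left to release.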

Native side (native/, crate datafusion-jni)

  • JNI entry points for SessionContext create/close/registerParquet/createDataFrame and DataFrame collect/close.
  • Results are exported as FFI_ArrowArrayStream, so the JVM reads batches without per-row JNI crossings or row-by-row copies.

Build and contributor tooling

  • pom.xml with Maven wrapper, JUnit 5, Arrow 19, JDK 17 toolchain.
  • apache-rat-plugin (license-header check) and spotless-maven-plugin (google-java-format), both bound to the verify phase.
  • Makefile targets for native build, JVM build, test, clean, and TPC-H SF1 test data generation via tpchgen-cli.
  • GitHub Actions workflow running spotless:check and cargo fmt --check on push / PR to main.

Project status

This is the first code drop into a brand-new repository. The README labels the project as early development: the API is small and will change without notice, and there is no published release.

A Roadmap section in the README outlines near-term priorities: session configuration, full SessionContext/DataFrame API parity with the Rust side, JVM-side plan construction via DataFusion's Protobuf representation, and Java-defined vectorized expressions over Arrow.

Verification

Locally, on this branch:

  • ./mvnw verify — Java compile, unit tests (4 run, 0 fail, 1 skipped because TPC-H SF1 data is not generated), spotless:check, apache-rat:check (10 approved, 0 unapproved).
  • cargo fmt --all -- --check — clean.
  • cargo clippy --all-targets --workspace -- -D warnings — clean.

The optional TPC-H integration test runs after make tpch-data (requires tpchgen-cli); it reads lineitem.parquet via registerParquet and asserts SELECT COUNT(*) returns 6,001,215.

@andygrove andygrove marked this pull request as ready for review May 12, 2026 23:28
Comment thread native/Cargo.toml Outdated
Co-authored-by: Oleks V <comphead@users.noreply.github.com>

public static synchronized void loadLibrary() {
  if (!loaded) {
    System.loadLibrary("datafusion_jni");
Contributor

Let's make the string a constant?

Member Author

updated
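
For readers following the thread, the suggestion can be sketched as below. This is a hedged illustration, not the change that was committed: `loadLibrary`, `loaded`, and the `"datafusion_jni"` name come from the snippet under review, while the class name `NativeLoader`, the constant name `LIBRARY_NAME`, and the overload that takes a `Consumer<String>` (added here only so the load-once logic can be exercised without the native library present) are assumptions.

```java
import java.util.function.Consumer;

/** Hypothetical loader class; the names are illustrative, not taken from the PR. */
final class NativeLoader {
    /** Library base name kept in one place instead of an inline string literal. */
    static final String LIBRARY_NAME = "datafusion_jni";

    // Guarded by the class lock taken by the synchronized static methods below.
    private static boolean loaded = false;

    private NativeLoader() {}

    /** Loads the native library at most once via System.loadLibrary. */
    static synchronized void loadLibrary() {
        loadLibrary(System::loadLibrary);
    }

    /** Same load-once logic with an injectable loader, so it can be tested without the library. */
    static synchronized void loadLibrary(Consumer<String> loader) {
        if (!loaded) {
            loader.accept(LIBRARY_NAME);
            // Only mark as loaded after the loader returns without throwing.
            loaded = true;
        }
    }
}
```

Setting `loaded` after the call means a failed load (for example an `UnsatisfiedLinkError`) leaves the flag false, so a later call can retry.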

@andygrove
Member Author

Note that no GitHub workflows will run until after this PR is merged. Also, we cannot create any issues until this PR is merged, because it contains the .asf.yaml change that enables GitHub issues.

andygrove added 2 commits May 12, 2026 18:18
Re-enable datafusion's default features (parquet, sql) and add arrow
dependency with the ffi feature so FFI_ArrowArrayStream, ctx.sql, and
register_parquet compile again.

private static native long createSessionContext();

private static native long createDataFrame(long handle, String sql);
Contributor

Maybe we can specify what kind of handle is expected in these methods, e.g. sessionHandle?

Contributor

@comphead comphead left a comment

Thanks @andygrove, you actually don't need an approval to merge the first PR :)

@andygrove andygrove merged commit 26a3b4c into apache:main May 13, 2026
@andygrove andygrove deleted the initial-seed branch May 13, 2026 00:28
@alamb

alamb commented May 13, 2026

🎉

3 participants