feat: initial seed of Apache DataFusion Java bindings #1
Merged
Seed the project with a minimal end-to-end JNI binding from the JVM to Apache DataFusion, plus the build, format, and license-check tooling needed for ongoing contribution.

Java surface (org.apache.datafusion):
- SessionContext: AutoCloseable session, sql(String) returning a lazy DataFrame, registerParquet(String, String) for registering local Parquet files as SQL tables.
- DataFrame: AutoCloseable, collect(BufferAllocator) executes the plan and returns result batches as an Arrow ArrowReader via the Arrow C Data Interface. collect() consumes the DataFrame; close() releases the native plan if never collected.

Native side (native/, crate datafusion-jni):
- JNI entry points for SessionContext create/close/registerParquet/createDataFrame and DataFrame collect/close.
- Results are exported as FFI_ArrowArrayStream so the JVM reads batches without per-row JNI crossings or row-by-row copies.

Build and contributor tooling:
- pom.xml with Maven wrapper, JUnit 5, Arrow 19, JDK 17 toolchain.
- apache-rat-plugin (license-header check) and spotless-maven-plugin (google-java-format) both bound to the verify phase.
- Makefile targets for native build, JVM build, test, clean, and TPC-H SF1 test data generation via tpchgen-cli.
- GitHub Actions workflow running spotless:check and cargo fmt --check on push and pull_request to main.
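The ownership rule described for DataFrame (collect() consumes it, close() releases the native plan only if it was never collected) can be illustrated without any native code. The following is a minimal sketch under stated assumptions, not the code in this PR: the class name PlanHandleSketch, the RELEASED counter, the string returned by collect(), and the choice to throw IllegalStateException on a second collect() are all hypothetical stand-ins.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a JNI-backed DataFrame; no native code is involved.
final class PlanHandleSketch implements AutoCloseable {
    // Counts simulated native releases so the behaviour can be observed.
    static final AtomicLong RELEASED = new AtomicLong();

    // 0 means "this wrapper no longer owns a native plan".
    private long handle;

    PlanHandleSketch(long handle) {
        this.handle = handle;
    }

    // Simulates collect(): ownership of the plan moves away from the wrapper,
    // so the handle is cleared and a second collect() is rejected (assumption).
    String collect() {
        if (handle == 0) {
            throw new IllegalStateException("DataFrame already collected or closed");
        }
        long consumed = handle;
        handle = 0;
        return "stream-for-plan-" + consumed;
    }

    // Releases the plan only if it was never collected; repeated calls are no-ops.
    @Override
    public void close() {
        if (handle != 0) {
            handle = 0;
            RELEASED.incrementAndGet();
        }
    }
}
```

Clearing the handle in both paths is what makes close() safe inside a try-with-resources block regardless of whether collect() ran first.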
comphead
reviewed
May 13, 2026
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
comphead
reviewed
May 13, 2026
public static synchronized void loadLibrary() {
  if (!loaded) {
    System.loadLibrary("datafusion_jni");
Contributor
Let's make the string a constant?
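One way to apply this suggestion is sketched below. It is an illustration only: the class name NativeLibraryLoaderSketch and the constant name LIBRARY_NAME are hypothetical, and the loaded = true assignment is assumed because the quoted hunk is cut off before the end of the method.

```java
// Hypothetical loader class; only the library name string comes from the reviewed code.
final class NativeLibraryLoaderSketch {
    // Single definition of the native library name, as the review comment suggests.
    static final String LIBRARY_NAME = "datafusion_jni";

    private static boolean loaded = false;

    private NativeLibraryLoaderSketch() {}

    // Loads the native library at most once per class loader.
    static synchronized void loadLibrary() {
        if (!loaded) {
            System.loadLibrary(LIBRARY_NAME);
            loaded = true; // assumed; not visible in the quoted hunk
        }
    }
}
```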
Member
Author
Note that no GitHub workflows will run until after this PR is merged. Also, we cannot create any issues until this PR is merged because it contains the
Re-enable datafusion's default features (parquet, sql) and add arrow dependency with the ffi feature so FFI_ArrowArrayStream, ctx.sql, and register_parquet compile again.
comphead
reviewed
May 13, 2026
private static native long createSessionContext();

private static native long createDataFrame(long handle, String sql);
Contributor
Maybe we can specify what kind of handle is expected in these methods, e.g. sessionHandle?
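A possible shape for that rename is sketched below. The class name SessionContextNativesSketch is hypothetical; the two method signatures mirror the quoted hunk, with only the parameter name changed. Declaring native methods does not require the native library to be present until they are invoked, so the sketch compiles on its own.

```java
// Hypothetical holder class; the signatures mirror the quoted hunk, with the
// handle parameter renamed to say which native object it must point to.
final class SessionContextNativesSketch {
    private SessionContextNativesSketch() {}

    // Returns an opaque pointer to a native SessionContext.
    private static native long createSessionContext();

    // sessionHandle must be a value previously returned by createSessionContext().
    private static native long createDataFrame(long sessionHandle, String sql);
}
```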
comphead
approved these changes
May 13, 2026
Contributor
comphead
left a comment
Thanks @andygrove, you actually don't need an approval to merge the first PR :)
Summary
Seed the project with a minimal end-to-end JNI binding from the JVM to Apache DataFusion, plus the build, format, and license-check tooling needed for ongoing contribution.
What is in this PR

Java surface (org.apache.datafusion)
- SessionContext: AutoCloseable session. sql(String) returns a lazy DataFrame; registerParquet(String, String) registers a local Parquet file as a SQL table.
- DataFrame: lazy, AutoCloseable. collect(BufferAllocator) executes the plan and returns result batches as an Arrow ArrowReader via the Arrow C Data Interface. collect() consumes the DataFrame; close() releases the native plan if never collected and is idempotent.

Native side (native/, crate datafusion-jni)
- JNI entry points for SessionContext create/close/registerParquet/createDataFrame and DataFrame collect/close.
- Results are exported as FFI_ArrowArrayStream, so the JVM reads batches without per-row JNI crossings or row-by-row copies.

Build and contributor tooling
- pom.xml with Maven wrapper, JUnit 5, Arrow 19, JDK 17 toolchain.
- apache-rat-plugin (license-header check) and spotless-maven-plugin (google-java-format), both bound to the verify phase.
- Makefile targets for native build, JVM build, test, clean, and TPC-H SF1 test data generation via tpchgen-cli.
- GitHub Actions workflow running spotless:check and cargo fmt --check on push / PR to main.

Project status
This is the first code drop into a brand-new repository. The README labels the project as early development: the API is small and will change without notice, and there is no published release.
A Roadmap section in the README outlines near-term priorities: session configuration, full SessionContext/DataFrame API parity with the Rust side, JVM-side plan construction via DataFusion's Protobuf representation, and Java-defined vectorized expressions over Arrow.

Verification

Locally, on this branch:
- ./mvnw verify: Java compile, unit tests (4 run, 0 fail, 1 skipped because TPC-H SF1 data is not generated), spotless:check, apache-rat:check (10 approved, 0 unapproved).
- cargo fmt --all -- --check: clean.
- cargo clippy --all-targets --workspace -- -D warnings: clean.

The optional TPC-H integration test runs after make tpch-data (requires tpchgen-cli); it reads lineitem.parquet via registerParquet and asserts SELECT COUNT(*) returns 6,001,215.