feat(datasource): add Java-implemented data sources by andygrove · Pull Request #65 · apache/datafusion-java

andygrove · 2026-05-18T20:38:12Z

Which issue does this PR close?

Closes #63 (Add support for Java data sources).

Rationale for this change

Java users have no way to expose custom in-process tables (JDBC scans, in-memory
collections, custom file formats, etc.) to DataFusion. This adds a minimal
DataSource interface and the JNI wiring to register it on a SessionContext.
The implementation mirrors the existing scalar-UDF JNI pattern.

What changes are included in this PR?

- New public `DataSource` interface in `org.apache.datafusion` with `Schema schema()` and `ArrowReader scan(BufferAllocator)`.
- `SessionContext.registerDataSource(name, source)` registers a Java-backed table; the schema is captured at registration time.
- `JniBridge.invokeDataSourceScan` exports the user's `ArrowReader` through the Arrow C Data Interface (zero-copy).
- Native: `JavaDataSource` (implements `TableProvider`) and `JavaScanExec` (implements `ExecutionPlan`) in `native/src/data_source.rs`, plus the JNI entry point.
- Shared `jthrowable_to_string` helper lifted into `native/src/jni_util.rs` so the UDF and data-source paths share Java-exception formatting.
- New `JdbcExample` in the examples module demonstrating an end-to-end JDBC-backed `DataSource`: it populates an H2 in-memory table, wraps a JDBC query in a `JdbcDataSource`, registers it, and runs an aggregation query. Batches stream via arrow-jdbc's `ArrowVectorIterator` wrapped in a small `ArrowReader` subclass, with no IPC re-serialisation. Adds `arrow-jdbc` and `com.h2database:h2` as examples-module deps.
- v1 scope: single partition, no projection or filter pushdown into Java (DataFusion projects/filters on top), no `deregisterTable`. Multi-partition, pushdown, and deregistration are listed as follow-ups in the user guide.

Run the JDBC example with:

./mvnw -pl examples exec:exec -Dexec.mainClass=org.apache.datafusion.examples.JdbcExample

Are these changes tested?

Yes — eight integration tests in
`core/src/test/java/org/apache/datafusion/DataSourceTest.java`:

- `SELECT *` happy path
- `UNION ALL` over the same registered table (multi-scan)
- Empty stream
- Column projection through DataFusion
- Two registered tables joinable in one query
- Schema-mismatch surfaces a readable error
- `scan()` throwing propagates the Java exception class and message
- `scan()` returning null is rejected with `IllegalStateException`
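For readers who want a feel for the contract these tests pin down, here is a minimal, JDK-only sketch. Everything in it is hypothetical: `StubSource` and `StubReader` stand in for the real interface and for Arrow's `ArrowReader`, a list of column names stands in for an Arrow schema, and the checks are written in Java purely for illustration (in the PR some of them may well live on the native side). The exception type used for the schema-mismatch case is also an assumption; the PR only says the error is readable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: StubReader and StubSource are hypothetical stand-ins, not the
// Arrow or DataFusion types. Column names stand in for a full Arrow schema.
final class ScanContractSketch {

    static final class StubReader {
        final List<String> columns;
        StubReader(List<String> columns) { this.columns = columns; }
    }

    interface StubSource {
        List<String> schema();
        StubReader scan();
    }

    // Convenience factory so callers can pass a lambda for scan().
    static StubSource source(List<String> schema, Supplier<StubReader> scan) {
        return new StubSource() {
            @Override public List<String> schema() { return schema; }
            @Override public StubReader scan() { return scan.get(); }
        };
    }

    private final Map<String, List<String>> schemas = new HashMap<>();
    private final Map<String, StubSource> sources = new HashMap<>();

    // schema() is read once here; later scans are checked against this snapshot.
    void register(String name, StubSource source) {
        schemas.put(name, new ArrayList<>(source.schema()));
        sources.put(name, source);
    }

    // Called once per physical scan, so a UNION ALL over one table calls it twice.
    StubReader openScan(String name) {
        StubSource source = sources.get(name);
        if (source == null) {
            throw new IllegalArgumentException("unknown table: " + name);
        }
        StubReader reader = source.scan(); // exceptions thrown by scan() propagate unchanged
        if (reader == null) {
            throw new IllegalStateException("scan() returned null for table '" + name + "'");
        }
        List<String> expected = schemas.get(name);
        if (!expected.equals(reader.columns)) {
            // Assumption: the real error type for this case is not stated in the PR.
            throw new IllegalStateException("schema mismatch for '" + name
                + "': registered " + expected + ", scan produced " + reader.columns);
        }
        return reader;
    }

    public static void main(String[] args) {
        ScanContractSketch registry = new ScanContractSketch();
        registry.register("orders", source(Arrays.asList("customer", "total"),
            () -> new StubReader(Arrays.asList("customer", "total"))));
        System.out.println(registry.openScan("orders").columns);

        registry.register("broken", source(Arrays.asList("customer"), () -> null));
        try {
            registry.openScan("broken");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of snapshotting the schema at registration is that every later scan can be validated against one fixed reference, rather than trusting `schema()` to return the same thing each time.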

The JDBC example is exercised end-to-end manually (output verified: aggregation
produces `alice → 119.99` and `bob → 7.50` over the H2 fixture). It compiles
as part of the standard `mvn package` build alongside the other example
classes (`AddOneExample`, `DataFrameExample`, etc.) — none of which carry
JUnit tests, by convention.

Are there any user-facing changes?

Yes — new public `DataSource` interface and `SessionContext.registerDataSource`
method, plus a new user-guide page at `docs/source/user-guide/data-source.md`
covering the API, contract, threading, errors, and v1 limitations. The
runnable `JdbcExample` shows the API in action against an embedded H2.

feat(datasource): add DataSource interface and SessionContext.registerDataSource Java API

Arrow Java's Data.exportArrayStream requires the reader's buffers to share the same allocator root as the export allocator. The previous workaround re-serialised every batch through IPC bytes, defeating zero-copy. The correct fix is to require DataSource.scan to accept a BufferAllocator argument (the framework's own ALLOCATOR) and allocate its reader's buffers from it. This mirrors the ScalarFunction.evaluate(BufferAllocator, ...) API.
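To make the allocator constraint concrete, the following JDK-only sketch models an allocator tree and the "same root" precondition. It does not use the real Arrow classes: `RootedAllocator` and `OwnedReader` are hypothetical stand-ins, and throwing `IllegalArgumentException` on a root mismatch is an assumption made for illustration, not a description of how Arrow Java actually fails.

```java
import java.util.Objects;

// Hypothetical stand-in for an Arrow-style allocator tree; NOT org.apache.arrow.memory.
final class RootedAllocator {
    private final String name;
    private final RootedAllocator parent; // null for a root allocator

    private RootedAllocator(String name, RootedAllocator parent) {
        this.name = name;
        this.parent = parent;
    }

    static RootedAllocator newRoot(String name) {
        return new RootedAllocator(name, null);
    }

    RootedAllocator newChild(String childName) {
        return new RootedAllocator(childName, this);
    }

    // Walk up the parent chain until the root is reached.
    RootedAllocator root() {
        RootedAllocator current = this;
        while (current.parent != null) {
            current = current.parent;
        }
        return current;
    }

    @Override public String toString() { return name; }
}

// Hypothetical stand-in for a reader that remembers which allocator owns its buffers.
final class OwnedReader {
    final RootedAllocator owner;
    OwnedReader(RootedAllocator owner) { this.owner = Objects.requireNonNull(owner); }
}

final class AllocatorRootSketch {
    // Models the precondition: the reader's buffers and the export allocator share a root.
    static void export(RootedAllocator exportAllocator, OwnedReader reader) {
        if (reader.owner.root() != exportAllocator.root()) {
            throw new IllegalArgumentException("reader buffers belong to root '"
                + reader.owner.root() + "', export allocator belongs to root '"
                + exportAllocator.root() + "'");
        }
    }

    static boolean canExport(RootedAllocator exportAllocator, OwnedReader reader) {
        try {
            export(exportAllocator, reader);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        RootedAllocator framework = RootedAllocator.newRoot("framework");
        RootedAllocator userOwned = RootedAllocator.newRoot("user");

        // scan(allocator) style: the reader allocates from the allocator it was handed.
        System.out.println(canExport(framework, new OwnedReader(framework.newChild("scan"))));
        // Reader that brought its own root: the case the API change is meant to rule out.
        System.out.println(canExport(framework, new OwnedReader(userOwned)));
    }
}
```

Passing the allocator into `scan` puts the framework in control of the root, which is the design choice the commit message describes.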

andygrove · 2026-05-18T20:59:34Z

@pgwhalen could you review?

pgwhalen · 2026-05-19T13:58:22Z

+   *     is closed.
+   * @throws RuntimeException if native registration fails.
+   */
+  public void registerDataSource(String name, DataSource source) {


Since this is basically a simplified API on top of the SessionContext::register_table Rust function, what if we gave the Java function the same name (registerTable) and made the interface it accepts TableProvider?

I get that this PR is basically barebones support for custom table registration in Java, and that data_source.rs is handling a lot so the Java user gets a simple scan() callback. I think only providing that for now makes sense as a first step (and will always be useful for simple cases), but I'd like to make sure this can evolve towards all the flexibility of the TableProvider trait that interacts with ExecutionPlan and ultimately an ArrowReader. The LiteralGuaranteeTest from my bindings demonstrates what this could look like and what it enables (filter pushdown).

To keep things minimal for this PR, maybe we could just:

- rename registerDataSource to registerTable
- rename the DataSource interface to TableProvider
- provide a simple implementation of TableProvider that just holds what the current DataSource does (not sure about a name for that, but maybe SimpleTableProvider or FullScanTableProvider or something)

Then we can make TableProvider more featured over time. Totally open to other ideas too.

Part of my motivation in renaming is that in the back of my head I'm thinking about eventual support for the separate DataSource, so don't want to clash on naming.

Thanks @pgwhalen. I have addressed your feedback.

Address PR apache#65 review: align Java-side naming with DataFusion's Rust TableProvider trait and free up the DataSource name for the separate datafusion-datasource concept in the future. Add SimpleTableProvider as a convenience wrapper for the (schema, scan-fn) case.

- DataSource -> TableProvider (Java interface)
- SessionContext.registerDataSource -> registerTable
- JniBridge.invokeDataSourceScan -> invokeTableScan
- Native JavaDataSource struct + module renamed to JavaTableProvider / table_provider.rs; JNI entry point + signature updated accordingly
- New SimpleTableProvider class wraps a Schema and a Function<BufferAllocator, ArrowReader> for the common no-pushdown case
- Test, example, and user-guide docs updated to match
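As a rough illustration of the (schema, scan-fn) wrapper described in this commit message, here is a JDK-only sketch. It is not the project's source: `StubAllocator`, `StubSchema`, and `StubBatchReader` are hypothetical stand-ins for Arrow's `BufferAllocator`, `Schema`, and `ArrowReader`, and the constructor shape of `SimpleTableProvider` is an assumption. Only the two interface methods and the "wraps a Schema and a Function" idea come from the PR text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical stand-ins for Arrow's BufferAllocator, Schema and ArrowReader.
final class StubAllocator {}

final class StubSchema {
    final List<String> fields;
    StubSchema(String... fields) { this.fields = Arrays.asList(fields); }
}

final class StubBatchReader {
    final StubAllocator allocator; // the allocator this reader would allocate from
    final List<String> rows;
    StubBatchReader(StubAllocator allocator, List<String> rows) {
        this.allocator = allocator;
        this.rows = rows;
    }
}

// Sketch of the renamed interface: the same two methods the PR lists for DataSource.
interface TableProvider {
    StubSchema schema();
    StubBatchReader scan(StubAllocator allocator);
}

// Sketch of the convenience wrapper for the no-pushdown case (constructor is assumed).
final class SimpleTableProvider implements TableProvider {
    private final StubSchema schema;
    private final Function<StubAllocator, StubBatchReader> scanFn;

    SimpleTableProvider(StubSchema schema, Function<StubAllocator, StubBatchReader> scanFn) {
        this.schema = Objects.requireNonNull(schema, "schema");
        this.scanFn = Objects.requireNonNull(scanFn, "scanFn");
    }

    @Override public StubSchema schema() { return schema; }

    // Delegates to the function, so each call produces whatever reader the function builds.
    @Override public StubBatchReader scan(StubAllocator allocator) {
        return scanFn.apply(allocator);
    }
}

final class SimpleTableProviderDemo {
    static TableProvider orders() {
        return new SimpleTableProvider(
            new StubSchema("customer", "total"),
            allocator -> new StubBatchReader(allocator, Arrays.asList("row-1", "row-2")));
    }

    public static void main(String[] args) {
        TableProvider provider = orders();
        StubBatchReader reader = provider.scan(new StubAllocator());
        System.out.println(provider.schema().fields + " / " + reader.rows.size() + " rows");
    }
}
```

Keeping `TableProvider` as an interface and putting the simple case in a separate class leaves room to grow the interface (for example with pushdown hooks) without changing callers that only need a full scan.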

pgwhalen

Looks good, thanks!

andygrove added 16 commits May 18, 2026 14:28

build(native): add async-trait and futures deps for Java data sources

b64accb

refactor(native): lift jthrowable_to_string into shared jni_util module

484cd12

feat(datasource): add DataSource interface and SessionContext.registerDataSource Java API

b699291

feat(native): add JavaDataSource TableProvider and JNI registration

e16a99e

docs(native): clarify JavaScanExec safety + schema check + JVM attach

79213dc

test(datasource): cover repeated scans within a single query

cd03d90

test(datasource): cover empty-stream scan

1004f6c

test(datasource): cover column projection through DataFusion

bf9c435

test(datasource): reject scan whose schema differs from registered schema

0ff2d8c

test(datasource): surface Java exception class and message from scan()

248dc70

test(datasource): reject null ArrowReader from scan()

82d13fb

test(datasource): cover joining two registered Java data sources

953fcf2

docs(datasource): document SessionContext.registerDataSource

af57098

docs(datasource): clarify scan() is per-physical-scan, not per-query

a4eb41e

feat(examples): add JDBC-backed DataSource example using H2 + arrow-jdbc

82c740a

pgwhalen reviewed May 19, 2026

andygrove added 2 commits May 19, 2026 08:38

Merge remote-tracking branch 'apache/main' into feat/columnar-value-udf

9e8279d

# Conflicts: # native/Cargo.toml

pgwhalen approved these changes May 19, 2026

andygrove merged commit 89d5496 into apache:main May 19, 2026
2 checks passed

andygrove merged 18 commits into apache:main from andygrove:feat/columnar-value-udf
