Merged
21 commits
95cccca
docs: scaffold Sphinx documentation site
andygrove May 13, 2026
0dcab13
docs: add ASF license header to make.bat
andygrove May 13, 2026
19d1666
docs: write landing page with toctree
andygrove May 13, 2026
da500e7
docs: add user guide installation page
andygrove May 13, 2026
dbc55e3
docs: write user guide landing page
andygrove May 13, 2026
86eb8fb
docs: add user guide quickstart
andygrove May 13, 2026
b5e0c96
docs: add user guide sessioncontext page
andygrove May 13, 2026
abbaf09
docs: add user guide dataframe page
andygrove May 13, 2026
b7b692f
docs: add user guide parquet page
andygrove May 13, 2026
f276700
docs: add user guide project status page
andygrove May 13, 2026
4552569
docs: write contributor guide landing page
andygrove May 13, 2026
056d487
docs: add contributor guide development page
andygrove May 13, 2026
a947eab
docs: add contributor guide code style page
andygrove May 13, 2026
b3e7de8
docs: add contributor guide releasing placeholder
andygrove May 13, 2026
0c05dc7
docs: add contributor guide datafusion bump recipe
andygrove May 13, 2026
7cf3531
docs: trim README and link to docs site
andygrove May 13, 2026
62216f8
docs: trim CONTRIBUTING and link to docs site
andygrove May 13, 2026
9cb7dc3
docs: fix incorrect ParquetReadOptions API and tighten development page
andygrove May 13, 2026
a691900
docs: nest toctrees in section index pages for sidebar nav
andygrove May 13, 2026
c8018e4
docs: add user guide page on building plans via datafusion-proto
andygrove May 13, 2026
19c6409
Merge remote-tracking branch 'apache/main' into worktree-docs-site
andygrove May 13, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,3 +7,5 @@ target/
tpch-data/
.claude
docs/superpowers
docs/build/
docs/venv/
105 changes: 6 additions & 99 deletions CONTRIBUTING.md
@@ -22,105 +22,12 @@ under the License.
Bug reports, design discussion, and patches are welcome. This project follows
the Apache DataFusion contribution model.

## Filing issues and discussing changes

- File bugs and feature requests on [GitHub issues](https://github.com/apache/datafusion-java/issues).
- File bugs and feature requests on
[GitHub issues](https://github.com/apache/datafusion-java/issues).
- For larger or design-level discussion, the mailing list is
[dev@datafusion.apache.org](mailto:dev@datafusion.apache.org).
- Please open an issue before sending a PR for any significant change so the
approach can be agreed on first.

## Development workflow

Branch from `main`, write changes with [conventional commit](https://www.conventionalcommits.org/)
messages in the imperative mood (e.g. `feat: add foo`, `fix(native): handle bar`),
and open a pull request targeting `main`.

The first build in a fresh checkout reaches out to `raw.githubusercontent.com`
to fetch the DataFusion protobuf schemas (see *Updating the DataFusion /
protobuf schema version* below). Subsequent builds are offline — the
`download-maven-plugin` cache under `~/.m2/repository/.cache/` satisfies them.

## Build prerequisites

- JDK 17 or newer.
- Rust toolchain (stable, installed via [rustup]).
- [`tpchgen-cli`] — only needed to generate test data for the Parquet
integration test (`cargo install tpchgen-cli`).

Maven is bundled via the `./mvnw` wrapper; no separate Maven install required.

[rustup]: https://rustup.rs/
[`tpchgen-cli`]: https://github.com/clflushopt/tpchgen-rs

## Build and test

make test

This builds the native Rust crate and runs the JUnit tests. The steps can be
run individually:

cd native && cargo build
./mvnw test

The native library must be built before running JVM tests.

## Test data

The Parquet integration test reads TPC-H SF1 data (~345 MB across 8 tables in
Snappy-compressed Parquet). Generate it once with:

make tpch-data

Tests that need this data skip cleanly if it is missing. `make clean` does
**not** remove `tpch-data/` — delete it manually to reclaim the disk space.

## Code style

- Java: run `./mvnw spotless:apply` before committing. CI fails the build if
formatting drifts.
- Rust: run `cargo fmt` and `cargo clippy --all-targets -- -D warnings` inside
`native/`.
- New source files need the Apache 2.0 license header. Apache RAT enforces this
during `verify`.

## Updating the DataFusion / protobuf schema version

Three things must move together when bumping DataFusion:

1. `native/Cargo.toml` — the `datafusion` crate dependency.
2. `pom.xml` — the `<datafusion.version>` Maven property. **Must equal the
Cargo version**; a mismatch means JVM-built protobuf plans won't deserialize
on the native side.
3. `pom.xml` — the `<sha512>` checksums on the two `download-maven-plugin`
executions. These pin the downloaded `.proto` files; the build fails if
upstream silently re-tags them, which is the desired behavior.

Recipe:

```sh
# 1. Bump the Cargo dep
$EDITOR native/Cargo.toml # set datafusion = "<new>"
(cd native && cargo update -p datafusion)

# 2. Bump the Maven property to match
$EDITOR pom.xml # set <datafusion.version>

# 3. Compute the new SHA-512 hashes for both `.proto` files from the upstream
# tag you just set in step 2, then paste them into the two <sha512> elements
# in pom.xml.
NEW=$(grep -m1 -oE '<datafusion.version>[^<]+' pom.xml | cut -d'>' -f2)
curl -sL "https://raw.githubusercontent.com/apache/datafusion/$NEW/datafusion/proto-common/proto/datafusion_common.proto" | shasum -a 512 | awk '{print $1}'
curl -sL "https://raw.githubusercontent.com/apache/datafusion/$NEW/datafusion/proto/proto/datafusion.proto" | shasum -a 512 | awk '{print $1}'
$EDITOR pom.xml # paste the two hashes into the <sha512> elements

# Drop the local download cache so the next build re-downloads against the new hashes.
rm -rf ~/.m2/repository/.cache/download-maven-plugin target/proto

# 4. Verify
make && make test
```
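
Item 2 of the list above requires the Maven property to equal the Cargo version. A minimal sketch of a check for that rule could look like the following. The function names are hypothetical and not part of this repository, and the sketch assumes the dependency is written as a plain `datafusion = "<version>"` line, as the recipe comment suggests.

```sh
# Hypothetical helper: print the DataFusion version pinned in a Cargo manifest.
# Assumes a simple `datafusion = "X"` line; a table-style dependency
# (`datafusion = { version = ... }`) is not handled by this sketch.
cargo_datafusion_version() {
  sed -n 's/^datafusion[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Print the <datafusion.version> Maven property, using the same grep as the recipe.
maven_datafusion_version() {
  grep -m1 -oE '<datafusion.version>[^<]+' "$1" | cut -d'>' -f2
}

# Succeed only when both files pin the same, non-empty version.
versions_match() {
  cargo_v=$(cargo_datafusion_version "$1")
  maven_v=$(maven_datafusion_version "$2")
  [ -n "$cargo_v" ] && [ "$cargo_v" = "$maven_v" ]
}
```

Run from the repository root, `versions_match native/Cargo.toml pom.xml` exits non-zero on a mismatch, so it could be run just before the verify step of the recipe.
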
- Please open an issue before sending a PR for any significant change so
the approach can be agreed on first.

The protobuf runtime version (`<protobuf.version>` in `pom.xml`) tracks the
Java ecosystem (security and JDK compatibility), not DataFusion. Bump it
independently when there is a reason.
For build, test, code style, and version-bump workflows, see the
[contributor guide](docs/source/contributor-guide/index.md).
45 changes: 12 additions & 33 deletions README.md
@@ -36,47 +36,26 @@ try (var allocator = new RootAllocator();

`SessionContext` and `DataFrame` are `AutoCloseable` and not thread-safe.

## Project status
## Documentation

Query interfaces:
The full documentation lives under [`docs/source/`](docs/source/index.md)
and is built with Sphinx (see [`docs/README.md`](docs/README.md) for the
build steps):

- [x] SQL: `SessionContext.sql(String)`
- [x] DataFrame: `select`, `filter` (other transformations TBD)
- [x] DataFusion-Proto `LogicalPlanNode`: `SessionContext.fromProto(byte[])`.
The `datafusion-proto` Java classes are generated by the build.

Data sources:

- [x] Parquet via `registerParquet` / `readParquet`, with `ParquetReadOptions`
- [x] CSV via `registerCsv` / `readCsv`, with `CsvReadOptions`
- [ ] JSON, Avro
- [ ] Custom catalog and table providers

Results:

- [x] `DataFrame.collect(allocator)` — Arrow C Data Interface stream
- [x] `DataFrame.count()`, `show()`, `show(int)`
- [x] `SessionContext.tableSchema(String)`

Not yet:

- [ ] `SessionConfig` / `RuntimeEnv` knobs
- [ ] Java UDFs
- [ ] `write_*` outputs
- [User guide](docs/source/user-guide/index.md) — installation, the
DataFrame and SQL APIs, Parquet ingestion, project status.
- [Contributor guide](docs/source/contributor-guide/index.md) — build,
test, code style, and how to bump the DataFusion version.

## Requirements

JDK 17+. Building from source: see [CONTRIBUTING.md](CONTRIBUTING.md).

## Layout

- `src/` — Java sources and tests
- `native/` — Rust crate (JNI + Arrow C Data Interface)
JDK 17+. Building from source: see
[`docs/source/contributor-guide/development.md`](docs/source/contributor-guide/development.md).

## Contributing

Open an issue to discuss non-trivial changes before sending a PR.
See [CONTRIBUTING.md](CONTRIBUTING.md).
Open an issue to discuss non-trivial changes before sending a PR. See the
[contributor guide](docs/source/contributor-guide/index.md).

## License

31 changes: 31 additions & 0 deletions docs/Makefile
@@ -0,0 +1,31 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Minimal makefile for Sphinx documentation

SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
49 changes: 49 additions & 0 deletions docs/README.md
@@ -0,0 +1,49 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache DataFusion Java Documentation

This directory contains the Sphinx source for the Apache DataFusion Java
documentation site.

## Build

Building the docs requires Python 3.9 or newer. A virtual environment under
`docs/venv/` is the recommended workflow.

```sh
cd docs
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./build.sh
```

The generated site is written to `docs/build/html/`. Open
`docs/build/html/index.html` in a browser to preview.

Subsequent builds need only:

```sh
cd docs
source venv/bin/activate
./build.sh
```

`./build.sh` runs `sphinx-build` with `-W` so warnings fail the build.
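
The Python 3.9 requirement stated under Build can be checked before creating the virtual environment. The following is a minimal sketch, not part of this repository: `version_ok` is a hypothetical helper name, and it assumes a dotted numeric version string such as `3.11.4`.

```sh
# Hypothetical helper: succeed only if the dotted version string in $1
# is 3.9 or newer. Non-numeric components are not handled by this sketch.
version_ok() {
  major=${1%%.*}   # text before the first dot
  rest=${1#*.}     # text after the first dot
  minor=${rest%%.*}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}
```

One way to use it is `version_ok "$(python3 -c 'import platform; print(platform.python_version())')" || echo "Python 3.9 or newer is required" >&2`, run before `python3 -m venv venv`.
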
31 changes: 31 additions & 0 deletions docs/build.sh
@@ -0,0 +1,31 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

set -e

cd "$(dirname "$0")"

rm -rf build

if [ -d venv ]; then
# shellcheck disable=SC1091
source venv/bin/activate
fi

sphinx-build -b html -W --keep-going source build/html
52 changes: 52 additions & 0 deletions docs/make.bat
@@ -0,0 +1,52 @@
@ECHO OFF

@rem Licensed to the Apache Software Foundation (ASF) under one
@rem or more contributor license agreements. See the NOTICE file
@rem distributed with this work for additional information
@rem regarding copyright ownership. The ASF licenses this file
@rem to you under the Apache License, Version 2.0 (the
@rem "License"); you may not use this file except in compliance
@rem with the License. You may obtain a copy of the License at
@rem
@rem http://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing,
@rem software distributed under the License is distributed on an
@rem "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@rem KIND, either express or implied. See the License for the
@rem specific language governing permissions and limitations
@rem under the License.

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
20 changes: 20 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

sphinx>=7.0,<8.0
myst-parser>=2.0,<4.0
pydata-sphinx-theme>=0.16.1,<0.17.0
Empty file added docs/source/_static/.gitkeep
Empty file.
Empty file added docs/source/_templates/.gitkeep
Empty file.