feat(json): expose NdJsonReadOptions via registerJson and readJson #47
Open
LantaoJin wants to merge 1 commit into
Mirror the parquet/csv reader pattern for newline-delimited JSON. Adds NdJsonReadOptionsProto in proto/json_read_options.proto (reusing FileCompressionType from csv_read_options.proto), an NdJsonReadOptions Java builder, registerJson/readJson overloads on SessionContext, and a native/src/json.rs JNI module that decodes the proto and forwards to SessionContext::register_json / read_json. Surface is intentionally aligned with the CSV PR (apache#21): builder fields are fileExtension, fileCompressionType, schemaInferMaxRecords, and an explicit Arrow schema. tablePartitionCols, fileSortOrder, and newline_delimited are left for follow-ups -- none have a parquet/csv counterpart on the Java side yet. The Rust side imports JsonReadOptions from prelude; NdJsonReadOptions is a deprecated alias in DataFusion 53.x. The Java/proto/issue surface keeps the user-facing NdJsonReadOptions name from apache#35.
andygrove
reviewed
May 14, 2026
public final class NdJsonReadOptions {

  private String fileExtension = ".json";
  private CsvReadOptions.FileCompressionType fileCompressionType =
Member
If FileCompressionType is no longer specific to CSV, we should move it to a shared location.
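A minimal sketch of what a shared location could look like, assuming the enum is simply lifted to a top-level type that both option classes reference. All names here (SharedFileCompressionType, CsvOptionsSketch, JsonOptionsSketch) are hypothetical, and the variant list mirrors upstream DataFusion rather than anything shown in this diff.

```java
// Hypothetical top-level enum, no longer nested inside the CSV options class.
// Variant names are an assumption based on upstream DataFusion's FileCompressionType.
enum SharedFileCompressionType { UNCOMPRESSED, GZIP, BZIP2, XZ, ZSTD }

// Hypothetical stand-in for the CSV options class.
final class CsvOptionsSketch {
  SharedFileCompressionType compression = SharedFileCompressionType.UNCOMPRESSED;
}

// Hypothetical stand-in for the JSON options class; it depends on the shared enum,
// not on anything CSV-specific.
final class JsonOptionsSketch {
  SharedFileCompressionType compression = SharedFileCompressionType.UNCOMPRESSED;
}
```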
andygrove
reviewed
May 14, 2026
 *
 * @throws RuntimeException if registration fails (path not found, schema inference error, etc.).
 */
public void registerJson(String name, String path, NdJsonReadOptions options) {
Member
could you add null checks for the arguments
andygrove
reviewed
May 14, 2026
 *
 * @throws RuntimeException if the read fails.
 */
public DataFrame readJson(String path, NdJsonReadOptions options) {
Member
please add null checks for args
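A minimal sketch of the kind of null checks being asked for on registerJson and readJson, assuming java.util.Objects.requireNonNull fits the project's conventions. JsonEntryPointsSketch and OptionsStub are hypothetical stand-ins, not the PR's SessionContext or NdJsonReadOptions, and the return values are placeholders for the native call.

```java
import java.util.Objects;

// Hypothetical stand-in for the options type; the real class lives in the PR.
final class OptionsStub {}

// Hypothetical stand-in for SessionContext, showing only argument validation.
final class JsonEntryPointsSketch {

  // Fails fast with a named message before any native call would be made.
  String registerJson(String name, String path, OptionsStub options) {
    Objects.requireNonNull(name, "name must not be null");
    Objects.requireNonNull(path, "path must not be null");
    Objects.requireNonNull(options, "options must not be null");
    return name + " -> " + path; // placeholder for the JNI call
  }

  String readJson(String path, OptionsStub options) {
    Objects.requireNonNull(path, "path must not be null");
    Objects.requireNonNull(options, "options must not be null");
    return path; // placeholder for the JNI call
  }
}
```

Checking on the Java side means a null argument surfaces as a NullPointerException that names the parameter, rather than as a failure somewhere inside the native layer.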
andygrove
reviewed
May 14, 2026
package datafusion_java;

import "csv_read_options.proto";
Member
same point as earlier - json proto should not reference csv proto
Member
Thanks @LantaoJin. Looks good overall. Left some comments.
Which issue does this PR close?
Closes #35
Rationale for this change
DataFusion 53.x supports newline-delimited JSON via SessionContext::read_json/register_json, but the Java bindings only expose Parquet and CSV readers today. Users with NDJSON input have to fall back to CREATE EXTERNAL TABLE … STORED AS JSON through SessionContext.sql, which works but loses the typed-builder ergonomics the parquet/CSV bindings already provide. Issue #35 tracks closing that gap; this PR is the implementation.

What changes are included in this PR?
- proto/json_read_options.proto: new NdJsonReadOptionsProto message. Reuses FileCompressionType from csv_read_options.proto (CSV and JSON accept the same compression set in DataFusion).
- NdJsonReadOptions Java builder with fileExtension, fileCompressionType, schemaInferMaxRecords, and an explicit Arrow schema(Schema). Defaults match the Rust struct (.json, UNCOMPRESSED, infer from data).
- SessionContext.registerJson(name, path[, options]) and readJson(path[, options]) overloads, structurally identical to the parquet/CSV entry points (Java builds the proto, JNI hands a byte[] to native).
- native/src/json.rs: JNI module that decodes NdJsonReadOptionsProto, constructs the upstream JsonReadOptions, and forwards to register_json/read_json. Imports prelude::JsonReadOptions rather than the deprecated NdJsonReadOptions alias; the user-facing Java/proto name still matches the issue ask.

Out of scope (kept for follow-ups so each PR stays small):
- tablePartitionCols, fileSortOrder: neither parquet nor CSV exposes these in the Java surface today; adding them only for JSON would diverge.
- newline_delimited: DataFusion 53.x exposes the knob, but the JSON-array reader path is not yet stable upstream. Both the issue title and the Rust API name (NdJson) imply newline-delimited.

Are these changes tested?
Yes.
- NdJsonReadOptionsTest (4 tests): schema(Schema) is held by reference and not embedded in proto bytes, FileCompressionType variant.
- SessionContextJsonTest (3 tests): registerJson + SQL COUNT(*) and projection on an inferred-schema NDJSON file, readJson with an explicit Arrow schema, registerJson with a custom .ndjson file extension.
- make test is green: 68 tests, 0 failures, 0 errors. The 12 skipped cases are pre-existing parquet/TPC-H data-dependent tests unaffected by this PR.
- cargo clippy --all-targets -- -D warnings, cargo fmt -- --check, and ./mvnw spotless:apply are all clean.

Are there any user-facing changes?
Yes, purely additive. New public API:
- org.apache.datafusion.NdJsonReadOptions
- SessionContext.registerJson(String, String)
- SessionContext.registerJson(String, String, NdJsonReadOptions)
- SessionContext.readJson(String) → DataFrame
- SessionContext.readJson(String, NdJsonReadOptions) → DataFrame

No existing API changes; no deprecations.
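
For readers unfamiliar with the builder pattern the description refers to, the following is a rough, self-contained sketch of a fluent options object with the stated defaults (.json, UNCOMPRESSED, inferred schema). It is not the PR's code: NdJsonReadOptionsSketch, CompressionSketch, the accessor names, and the use of OptionalLong for "no explicit limit" are assumptions made for illustration, and the explicit Arrow schema field is left out to avoid an Arrow dependency.

```java
import java.util.Objects;
import java.util.OptionalLong;

// Hypothetical mirror of the compression enum; variant names follow upstream DataFusion.
enum CompressionSketch { UNCOMPRESSED, GZIP, BZIP2, XZ, ZSTD }

// Hypothetical fluent options object; each setter returns this so calls can be chained.
final class NdJsonReadOptionsSketch {
  private String fileExtension = ".json";
  private CompressionSketch fileCompressionType = CompressionSketch.UNCOMPRESSED;
  // Empty means "no explicit limit set"; this representation is an assumption.
  private OptionalLong schemaInferMaxRecords = OptionalLong.empty();

  NdJsonReadOptionsSketch fileExtension(String fileExtension) {
    this.fileExtension = Objects.requireNonNull(fileExtension, "fileExtension");
    return this;
  }

  NdJsonReadOptionsSketch fileCompressionType(CompressionSketch type) {
    this.fileCompressionType = Objects.requireNonNull(type, "fileCompressionType");
    return this;
  }

  NdJsonReadOptionsSketch schemaInferMaxRecords(long maxRecords) {
    if (maxRecords <= 0) {
      throw new IllegalArgumentException("maxRecords must be positive");
    }
    this.schemaInferMaxRecords = OptionalLong.of(maxRecords);
    return this;
  }

  String fileExtension() { return fileExtension; }
  CompressionSketch fileCompressionType() { return fileCompressionType; }
  OptionalLong schemaInferMaxRecords() { return schemaInferMaxRecords; }
}
```

Under these assumptions, an options object for gzip-compressed .ndjson input would be built as new NdJsonReadOptionsSketch().fileExtension(".ndjson").fileCompressionType(CompressionSketch.GZIP) and then passed to the three-argument registerJson or two-argument readJson overload listed above.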