feat(json): expose NdJsonReadOptions via registerJson and readJson #47

Open
LantaoJin wants to merge 1 commit into apache:main from LantaoJin:feat/json-source-v2

@LantaoJin
Contributor

Which issue does this PR close?

Closes #35

Rationale for this change

DataFusion 53.x supports newline-delimited JSON via SessionContext::read_json / register_json, but the Java bindings only expose Parquet and CSV readers today. Users with NDJSON input have to fall back to CREATE EXTERNAL TABLE … STORED AS JSON through SessionContext.sql, which works but loses the typed-builder ergonomics the parquet/CSV bindings already provide. Issue #35 tracks closing that gap; this PR is the implementation.

What changes are included in this PR?

  • proto/json_read_options.proto — new NdJsonReadOptionsProto message. Reuses FileCompressionType from csv_read_options.proto (CSV and JSON accept the same compression set in DataFusion).
  • NdJsonReadOptions Java builder with fileExtension, fileCompressionType, schemaInferMaxRecords, and an explicit Arrow schema(Schema). Defaults match the Rust struct (.json, UNCOMPRESSED, infer from data).
  • SessionContext.registerJson(name, path[, options]) and readJson(path[, options]) overloads, structurally identical to the parquet/CSV entry points (Java builds the proto, JNI hands a byte[] to native).
  • native/src/json.rs — JNI module that decodes NdJsonReadOptionsProto, constructs the upstream JsonReadOptions, and forwards to register_json / read_json. Imports prelude::JsonReadOptions rather than the deprecated NdJsonReadOptions alias; the user-facing Java/proto name still matches the issue ask.
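The builder shape described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the class added by this PR: `NdJsonReadOptionsSketch`, its nested `Compression` enum, and the use of `null` to mean "no explicit inference limit" are all assumptions made for the example, and the Arrow `schema(Schema)` setter is left out so the snippet has no external dependencies.

```java
/** Hypothetical stand-in for illustration; not the class added by this PR. */
final class NdJsonReadOptionsSketch {
    /** Assumed variant names, following DataFusion's FileCompressionType. */
    enum Compression { UNCOMPRESSED, GZIP, BZIP2, XZ, ZSTD }

    // Defaults as described in the PR text: ".json", UNCOMPRESSED, infer from data.
    private String fileExtension = ".json";
    private Compression fileCompressionType = Compression.UNCOMPRESSED;
    // null is this sketch's way of saying "no explicit limit was set".
    private Long schemaInferMaxRecords;

    // Fluent setters return this so calls can be chained.
    NdJsonReadOptionsSketch fileExtension(String ext) {
        this.fileExtension = ext;
        return this;
    }

    NdJsonReadOptionsSketch fileCompressionType(Compression compression) {
        this.fileCompressionType = compression;
        return this;
    }

    NdJsonReadOptionsSketch schemaInferMaxRecords(long maxRecords) {
        this.schemaInferMaxRecords = maxRecords;
        return this;
    }

    String fileExtension() { return fileExtension; }
    Compression fileCompressionType() { return fileCompressionType; }
    Long schemaInferMaxRecords() { return schemaInferMaxRecords; }

    public static void main(String[] args) {
        NdJsonReadOptionsSketch opts = new NdJsonReadOptionsSketch()
                .fileExtension(".ndjson")
                .fileCompressionType(Compression.GZIP)
                .schemaInferMaxRecords(500);
        System.out.println(opts.fileExtension() + " " + opts.fileCompressionType());
    }
}
```

In the real bindings the configured options would be serialized to the proto message and handed to native code; the sketch stops at the Java-side state.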

Out of scope (kept for follow-ups so each PR stays small):

  • tablePartitionCols, fileSortOrder — neither parquet nor CSV exposes these in the Java surface today; adding them only for JSON would diverge.
  • newline_delimited — DataFusion 53.x exposes the knob, but the JSON-array reader path is not yet stable upstream. Both the issue title and the Rust API name (NdJson) imply newline-delimited.
  • AVRO source — separate issue.

Are these changes tested?

Yes.

  • NdJsonReadOptionsTest (4 tests):
    • defaults round-trip through proto,
    • fully-configured options round-trip through proto,
    • schema(Schema) is held by reference and not embedded in proto bytes,
    • sweep over every FileCompressionType variant.
  • SessionContextJsonTest (3 tests):
    • registerJson + SQL COUNT(*) and projection on an inferred-schema
      NDJSON file,
    • readJson with an explicit Arrow schema,
    • registerJson with a custom .ndjson file extension.
  • make test is green: 68 tests, 0 failures, 0 errors. The 12 skipped
    cases are pre-existing parquet/TPC-H data-dependent tests unaffected
    by this PR.
  • cargo clippy --all-targets -- -D warnings, cargo fmt -- --check,
    and ./mvnw spotless:apply are all clean.

Are there any user-facing changes?

Yes — purely additive. New public API:

  • org.apache.datafusion.NdJsonReadOptions
  • SessionContext.registerJson(String, String)
  • SessionContext.registerJson(String, String, NdJsonReadOptions)
  • SessionContext.readJson(String) → DataFrame
  • SessionContext.readJson(String, NdJsonReadOptions) → DataFrame

No existing API changes; no deprecations.

Mirror the parquet/csv reader pattern for newline-delimited JSON.
Adds NdJsonReadOptionsProto in proto/json_read_options.proto (reusing
FileCompressionType from csv_read_options.proto), an NdJsonReadOptions
Java builder, registerJson/readJson overloads on SessionContext, and a
native/src/json.rs JNI module that decodes the proto and forwards to
SessionContext::register_json / read_json.

Surface is intentionally aligned with the CSV PR (apache#21): builder fields
are fileExtension, fileCompressionType, schemaInferMaxRecords, and an
explicit Arrow schema. tablePartitionCols, fileSortOrder, and
newline_delimited are left for follow-ups -- none have a parquet/csv
counterpart on the Java side yet.

The Rust side imports JsonReadOptions from prelude; NdJsonReadOptions
is a deprecated alias in DataFusion 53.x. The Java/proto/issue surface
keeps the user-facing NdJsonReadOptions name from apache#35.
public final class NdJsonReadOptions {

private String fileExtension = ".json";
private CsvReadOptions.FileCompressionType fileCompressionType =
Member

If FileCompressionType is no longer specific to CSV, we should move it to a shared location.
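One possible shape for that move, sketched with hypothetical names: the enum becomes a top-level type and both option classes refer to it, so the JSON options no longer reach into the CSV class. `FileCompressionTypeSketch`, `CsvOptionsSketch`, and `JsonOptionsSketch` are illustrative stand-ins, and the variant names are assumed to follow DataFusion's `FileCompressionType`.

```java
/** Small demo entry point; declared first so the file can be run directly. */
final class SharedEnumDemo {
    public static void main(String[] args) {
        System.out.println(new CsvOptionsSketch().compression);
        System.out.println(new JsonOptionsSketch().compression);
    }
}

/** Hypothetical shared enum, no longer nested under the CSV options class. */
enum FileCompressionTypeSketch { UNCOMPRESSED, GZIP, BZIP2, XZ, ZSTD }

/** Stand-in for the CSV options; depends only on the shared enum. */
final class CsvOptionsSketch {
    FileCompressionTypeSketch compression = FileCompressionTypeSketch.UNCOMPRESSED;
}

/** Stand-in for the JSON options; no reference to any CSV type. */
final class JsonOptionsSketch {
    FileCompressionTypeSketch compression = FileCompressionTypeSketch.UNCOMPRESSED;
}
```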

*
* @throws RuntimeException if registration fails (path not found, schema inference error, etc.).
*/
public void registerJson(String name, String path, NdJsonReadOptions options) {
Member

Could you add null checks for the arguments?
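A minimal sketch of what such checks could look like, using `java.util.Objects.requireNonNull` so a null argument fails fast in Java instead of surfacing as an opaque error from native code. `NullCheckSketch` and its method body are hypothetical; the real method would build the proto and call into JNI after the checks.

```java
import java.util.Objects;

/** Hypothetical stand-in showing argument validation only. */
final class NullCheckSketch {
    static String registerJson(String name, String path, Object options) {
        // Fail fast with a message that names the offending argument.
        Objects.requireNonNull(name, "name must not be null");
        Objects.requireNonNull(path, "path must not be null");
        Objects.requireNonNull(options, "options must not be null");
        // Placeholder for the real work (proto encoding and the native call).
        return name + " -> " + path;
    }

    public static void main(String[] args) {
        System.out.println(registerJson("events", "/tmp/events.json", new Object()));
        try {
            registerJson(null, "/tmp/events.json", new Object());
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same three-line pattern would apply to `readJson(path, options)` with its two arguments.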

*
* @throws RuntimeException if the read fails.
*/
public DataFrame readJson(String path, NdJsonReadOptions options) {
Member

Please add null checks for the arguments.


package datafusion_java;

import "csv_read_options.proto";
Member

Same point as earlier: the JSON proto should not reference the CSV proto.

@andygrove
Member

Thanks @LantaoJin. Looks good overall. Left some comments.

Development

Successfully merging this pull request may close these issues.

feat: expose JSON reader via registerJson and readJson