feat: expose ConfigOptions.set/get as generic SessionContextBuilder.setOption / SessionContext.getOption #48

@LantaoJin

Is your feature request related to a problem or challenge?

SessionContextBuilder (introduced in #28) exposes a typed setter for each of the six most common SessionConfig knobs: batchSize, targetPartitions, collectStatistics, informationSchema, memoryLimit, and tempDirectory. No other option can be set from Java, and there is currently no Java surface to read any config value back.

The Rust ConfigOptions struct that backs SessionConfig carries roughly 200 keys split across seven sections (datafusion.catalog.*, datafusion.execution.*, datafusion.optimizer.*, datafusion.sql_parser.*, datafusion.explain.*, datafusion.format.*, plus user extensions). The Java builder reaches none of these except the six it names explicitly.

A few representative gaps that the typed approach would have to fill one at a time, but that a generic setOption(key, value) covers in a single PR:

  • datafusion.execution.parquet.pushdown_filters — decode-time predicate pushdown via RowFilter. Workload-dependent default; many embedders flip this per-context.
  • datafusion.execution.parquet.bloom_filter_on_read, …parquet.reorder_filters, …parquet.schema_force_view_types, …parquet.binary_as_string — the rest of the parquet family.
  • datafusion.optimizer.prefer_hash_join, …optimizer.default_filter_selectivity, …optimizer.repartition_joins, …optimizer.expand_views_at_output — optimizer dials almost every embedder ends up tuning.
  • datafusion.execution.time_zone — required for any timestamp arithmetic that needs a non-UTC zone.
  • datafusion.format.timestamp_format, …format.date_format, …format.binary_format — affect every DataFrame.show() and any text rendering.
  • datafusion.sql_parser.dialect, …sql_parser.enable_ident_normalization — needed to integrate with non-PostgreSQL frontends (MySQL, Snowflake, Hive, …).
  • datafusion.explain.show_statistics, …explain.show_sizes, …explain.format — needed by tooling/UI built on top of EXPLAIN.
  • User extensions set via Extensions::insert(...) — currently unreachable from Java at any granularity.

DataFusion already exposes string-keyed get/set on its config (ConfigOptions::set(key, value) and ConfigOptions::entries()). The Java binding can mirror both directly:

config.options_mut().set(key, value)?;       // ConfigOptions::set
ctx.copied_config().options().entries();     // ConfigOptions::entries

This issue tracks adding the matching Java surface for both directions.

Describe the solution you'd like

Three new methods, two on SessionContextBuilder and one on SessionContext:

ctx.builder()
    .batchSize(8192)                                       // existing typed
    .targetPartitions(16)                                  // existing typed
    .setOption("datafusion.execution.parquet.pushdown_filters", "true")
    .setOption("datafusion.optimizer.prefer_hash_join", "true")
    .setOptions(Map.of(                                    // bulk overload
        "datafusion.execution.time_zone", "UTC",
        "datafusion.optimizer.default_filter_selectivity", "10"))
    .build();

// read side, on the constructed context:
String tz = ctx.getOption("datafusion.execution.time_zone");

Wire format

Purely additive — one new field on the existing SessionOptions proto:

message SessionOptions {
  optional uint64 batch_size = 1;
  // ... unchanged ...
  optional string temp_directory = 6;
  map<string, string> options = 7;     // new
}

map<string, string> over repeated KeyValue because:

  • The Java/Rust generated APIs are nicer (putOptions / HashMap<String,String> rather than a wrapper struct).
  • Each key is unique by definition, which is the constraint we want.
  • Wire bytes are identical to a repeated KeyValue message (with key = 1 and value = 2) under proto3.
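
The wire-equivalence point can be checked by hand. The sketch below is not the generated protobuf code: it hand-encodes a single options entry using only the JDK, and it assumes every key, value, and entry is shorter than 128 bytes so each length fits in a one-byte varint. The class and method names are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MapEntryWireSketch {
    // Protobuf wire type 2 = length-delimited (strings and nested messages).
    private static final int LEN = 2;

    private static int tag(int fieldNumber) {
        return (fieldNumber << 3) | LEN;
    }

    // Writes tag, one-byte length, then payload. Sketch-only limit: payload < 128 bytes.
    private static void writeLen(ByteArrayOutputStream out, int fieldNumber, byte[] payload) {
        if (payload.length >= 128) {
            throw new IllegalArgumentException("sketch only handles payloads under 128 bytes");
        }
        out.write(tag(fieldNumber));
        out.write(payload.length);
        out.write(payload, 0, payload.length);
    }

    // Encodes one entry of `map<string, string> options = 7`:
    // a nested message in field 7 holding key (field 1) and value (field 2).
    static byte[] encodeEntry(String key, String value) {
        ByteArrayOutputStream entry = new ByteArrayOutputStream();
        writeLen(entry, 1, key.getBytes(StandardCharsets.UTF_8));
        writeLen(entry, 2, value.getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLen(out, 7, entry.toByteArray());
        return out.toByteArray();
    }
}
```

Each map entry is an ordinary length-delimited submessage, which is exactly what a repeated key/value message in field 7 would put on the wire, provided that message numbers its fields key = 1 and value = 2.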

setOption / setOptions (write side)

public SessionContextBuilder setOption(String key, String value) {
    if (key == null || value == null) {
        throw new IllegalArgumentException("setOption key and value must be non-null");
    }
    this.options.put(key, value);   // LinkedHashMap<String, String>
    return this;
}

public SessionContextBuilder setOptions(Map<String, String> entries) { ... }

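Stored in a LinkedHashMap<String, String> so caller order is preserved on the Java side; serialized into the proto via b.putAllOptions(...). Proto map fields do not guarantee iteration order once serialized, so the Rust side must not depend on that order.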
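
One possible shape for the elided setOptions body, as a minimal sketch on a stand-in class rather than the real SessionContextBuilder. The validate-before-mutate behavior and the pendingOptions() accessor are assumptions made for illustration; the issue does not specify either.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the option-handling part of SessionContextBuilder (hypothetical class).
final class OptionsBuilderSketch {
    private final Map<String, String> options = new LinkedHashMap<>();

    OptionsBuilderSketch setOption(String key, String value) {
        if (key == null || value == null) {
            throw new IllegalArgumentException("setOption key and value must be non-null");
        }
        options.put(key, value);
        return this;
    }

    OptionsBuilderSketch setOptions(Map<String, String> entries) {
        if (entries == null) {
            throw new IllegalArgumentException("setOptions entries must be non-null");
        }
        // Assumption: validate every entry first so a bad entry leaves the builder unchanged.
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (e.getKey() == null || e.getValue() == null) {
                throw new IllegalArgumentException("setOptions keys and values must be non-null");
            }
        }
        entries.forEach(this::setOption);
        return this;
    }

    // Hypothetical accessor: a read-only snapshot of what would go to putAllOptions(...).
    Map<String, String> pendingOptions() {
        return Collections.unmodifiableMap(new LinkedHashMap<>(options));
    }
}
```

Setting the same key twice keeps its original position in the LinkedHashMap and replaces only the value, so the last write wins without reordering.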

getOption (read side)

public String getOption(String key) { ... }

Lives on SessionContext rather than the builder, because the value it returns is "what DataFusion actually compiled" — only knowable post-construction. Returns the value as a String, or null if the key is recognised but has no value set and no default. Unknown keys throw RuntimeException to mirror setOption's strictness, and a closed-context call throws IllegalStateException like every other method on the class.
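
The three outcomes described above (value, null for known-but-unset, exception for unknown or closed) can be sketched without any JNI. The class below is a hypothetical stand-in that models ConfigOptions::entries() as an in-memory map; the null-key check is an assumption, since the issue does not say how getOption treats a null key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for SessionContext; no native code involved.
final class GetOptionSketch implements AutoCloseable {
    // Models the entries list: every recognised key is present,
    // and Optional.empty() means "known but no value and no default".
    private final Map<String, Optional<String>> entries = new HashMap<>();
    private boolean closed;

    GetOptionSketch(Map<String, Optional<String>> entries) {
        this.entries.putAll(entries);
    }

    String getOption(String key) {
        if (closed) {
            throw new IllegalStateException("SessionContext is closed");
        }
        if (key == null) { // assumption: reject null keys eagerly
            throw new IllegalArgumentException("getOption key must be non-null");
        }
        Optional<String> value = entries.get(key);
        if (value == null) {
            // Unknown key: surfaced as a plain RuntimeException, as on the write side.
            throw new RuntimeException("unknown config key: " + key);
        }
        return value.orElse(null); // null only for known-but-unset
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

Keeping "unknown" as an exception and "unset" as null means a caller that receives null can trust that the key itself was spelled correctly.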

Returning a plain String rather than a richer ConfigEntry { key, value, description } because the Java side has no ConfigEntry mirror today; introducing one is review surface for another day, and a richer return type is purely additive.

Rust surface

In native/src/lib.rs::createSessionContextWithOptions, after the existing typed-field decoding:

let mut config = SessionConfig::new();
// ... existing typed setters unchanged ...
for (k, v) in &opts.options {
    config.options_mut().set(k, v)?;
}

Plus a new getOptionNative JNI handler that walks ctx.copied_config().options().entries() and returns the matching entry.value as a JString, or null for known-but-unset.

ConfigOptions::set returns a DataFusionError on unknown keys or unparseable values, which try_unwrap_or_throw already surfaces as a RuntimeException with the underlying message — same error path as the rest of the JNI crate.

Behavior precedence

Map entries are applied after the typed setters, so a caller that sets both batchSize(8192) and setOption("datafusion.execution.batch_size", "1") ends up with a batch size of 1: the setOption value overrides the typed setter. This is the only semantically defensible order (if the typed setter won, the explicit setOption call would be silently dropped), but it is flagged here for explicit reviewer sign-off.
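
The precedence rule can be illustrated with a small stand-in that resolves the effective config in the same two passes. This is a sketch under the assumption that the typed batchSize setter maps to datafusion.execution.batch_size; the class and the effectiveConfig() method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the builder; models only batchSize and the generic map.
final class PrecedenceSketch {
    private Long batchSize; // null until the typed setter is called
    private final Map<String, String> options = new LinkedHashMap<>();

    PrecedenceSketch batchSize(long n) {
        this.batchSize = n;
        return this;
    }

    PrecedenceSketch setOption(String key, String value) {
        options.put(key, value);
        return this;
    }

    // Pass 1: typed fields. Pass 2: generic map entries, which overwrite pass 1.
    Map<String, String> effectiveConfig() {
        Map<String, String> config = new LinkedHashMap<>();
        if (batchSize != null) {
            config.put("datafusion.execution.batch_size", Long.toString(batchSize));
        }
        config.putAll(options);
        return config;
    }
}
```

Because the two passes run at build time, the order in which the caller invokes batchSize(...) and setOption(...) on the builder does not matter: the map entry wins either way.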

Out of scope (separate follow-ups)

  • Read-time options like CsvReadOptions.has_header — per-call, already typed, not session-level.
  • Compile-time validation of keys. Java has no way to know whether a key string is valid in the pinned DataFusion version; we rely on the runtime error from ConfigOptions::set. Same UX as the Rust string-keyed API.
  • A typed ConfigException hierarchy. Today every JNI error collapses to RuntimeException. Promoting these to typed exceptions is a separate cross-cutting concern (see related work in prs/CONTRIB_ISSUES.md Add test section to README #12 in the OpenSearch-sourced backlog).
  • A listOptions() enumerator returning all keys. Useful for tooling but not needed by users; easy follow-up if requested.

Describe alternatives you've considered

  1. One named setter per key, e.g. parquetPushdownFilters(boolean). Originally drafted as the first cut. Rejected because it does not scale: it would need ~200 setters and a separate PR for every new dial DataFusion adds upstream. Named setters for the common knobs (the six already in #28) are the right call; a generic escape hatch covers the long tail.

  2. A bundled parquetOptions(ParquetSessionOptions) builder per-section. Decent ergonomics but loses everything outside parquet, and forces us to re-design the surface every time DataFusion adds a section. The setOption(key, value) shape tracks the upstream API exactly and needs no follow-up when new keys are added.

  3. getOption on the builder instead of on SessionContext. Considered but rejected: a builder-side getter would only know what's pending in the local map, which is strictly worse than reading what DataFusion actually compiled. The post-construction SessionContext.getOption is the meaningful contract.

  4. Returning null for unknown keys instead of throwing. Conflates "no such key" (caller typo) with "known-but-unset" (legitimate state). Throwing on unknown keys matches setOption's strictness and gives faster feedback on typos.

Additional context

No response
