Is your feature request related to a problem or challenge?
SessionContextBuilder (introduced in #28) exposes typed setters for the six most common SessionConfig knobs: batchSize, targetPartitions, collectStatistics, informationSchema, memoryLimit, and tempDirectory. There is also currently no Java surface to read any config value back.
The Rust ConfigOptions struct that backs SessionConfig carries roughly 200 keys split across seven sections (datafusion.catalog.*, datafusion.execution.*, datafusion.optimizer.*, datafusion.sql_parser.*, datafusion.explain.*, datafusion.format.*, plus user extensions). The Java builder reaches none of these except the six it names explicitly.
A few representative gaps that the typed approach would have to fill one at a time, but that a generic setOption(key, value) covers in a single PR:
datafusion.execution.parquet.pushdown_filters — decode-time predicate pushdown via RowFilter. Workload-dependent default; many embedders flip this per-context.
datafusion.execution.parquet.bloom_filter_on_read, …parquet.reorder_filters, …parquet.schema_force_view_types, …parquet.binary_as_string — the rest of the parquet family.
datafusion.optimizer.prefer_hash_join, …optimizer.default_filter_selectivity, …optimizer.repartition_joins, …optimizer.expand_views_at_output — optimizer dials almost every embedder ends up tuning.
datafusion.execution.time_zone — required for any timestamp arithmetic that needs a non-UTC zone.
datafusion.format.timestamp_format, …format.date_format, …format.binary_format — affect every DataFrame.show() and any text rendering.
datafusion.sql_parser.dialect, …sql_parser.enable_ident_normalization — needed to integrate with non-PostgreSQL frontends (MySQL, Snowflake, Hive, …).
datafusion.explain.show_statistics, …explain.show_sizes, …explain.format — needed by tooling/UI built on top of EXPLAIN.
User extensions set via Extensions::insert(...) — currently unreachable from Java at any granularity.
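DataFusion already exposes string-keyed get/set on its config (ConfigOptions::set(key, value) and ConfigOptions::entries()), and the Java binding can mirror both directly. This issue tracks adding the matching Java surface for both directions.
Describe the solution you'd like
Three new methods: setOption and setOptions on SessionContextBuilder, and getOption on SessionContext.
Wire format
Purely additive: one new map<string, string> field on the existing SessionOptions proto. map<string, string> is preferred over repeated KeyValue because: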
The Java/Rust generated APIs are nicer (putOptions / HashMap<String,String> rather than a wrapper struct).
Each key is unique by definition, which is the constraint we want.
Wire bytes are identical to a repeated KV under proto3.
setOption / setOptions (write side)
```java
public SessionContextBuilder setOption(String key, String value) {
  if (key == null || value == null) {
    throw new IllegalArgumentException("setOption key and value must be non-null");
  }
  this.options.put(key, value); // LinkedHashMap<String, String>
  return this;
}

public SessionContextBuilder setOptions(Map<String, String> entries) { ... }
```
Stored into a LinkedHashMap<String, String> so caller-order is preserved on the Java side; serialized via b.putAllOptions(...) into the proto.
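The write side can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the binding's actual code: OptionsBuilderSketch and pendingOptions() are hypothetical names, the setOptions body is a guess at the elided implementation, and the real builder would pass the map to the generated proto builder via putAllOptions(...) rather than expose it.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the options-related part of SessionContextBuilder.
final class OptionsBuilderSketch {
  // LinkedHashMap keeps first-insertion order, so keys are serialized in the
  // order the caller first supplied them.
  private final Map<String, String> options = new LinkedHashMap<>();

  OptionsBuilderSketch setOption(String key, String value) {
    if (key == null || value == null) {
      throw new IllegalArgumentException("setOption key and value must be non-null");
    }
    options.put(key, value); // a repeated key overwrites the earlier value
    return this;
  }

  // Assumed behaviour: validate and apply each entry exactly like setOption.
  OptionsBuilderSketch setOptions(Map<String, String> entries) {
    if (entries == null) {
      throw new IllegalArgumentException("setOptions entries must be non-null");
    }
    for (Map.Entry<String, String> e : entries.entrySet()) {
      setOption(e.getKey(), e.getValue());
    }
    return this;
  }

  // Hypothetical accessor, present only so the sketch can be inspected.
  Map<String, String> pendingOptions() {
    return Collections.unmodifiableMap(options);
  }
}
```

Setting the same key twice keeps its original position in the map but replaces the value, so the last write for a key is the one that gets serialized.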
getOption (read side)
publicStringgetOption(Stringkey) { ... }
Lives on SessionContext rather than the builder, because the value it returns is "what DataFusion actually compiled" — only knowable post-construction. Returns the value as a String, or null if the key is recognised but has no value set and no default. Unknown keys throw RuntimeException to mirror setOption's strictness, and a closed-context call throws IllegalStateException like every other method on the class.
Returning a plain String rather than a richer ConfigEntry { key, value, description } because the Java side has no ConfigEntry mirror today; introducing one is review surface for another day, and a richer return type is purely additive.
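The four outcomes of the read side (value, null, RuntimeException, IllegalStateException) can be modelled with a small in-memory sketch. Everything here is hypothetical: GetOptionContractSketch is not part of the binding, a plain map stands in for the native lookup, and the exception messages are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the proposed SessionContext.getOption contract.
// The real method would call into the native side; a map stands in for it here.
final class GetOptionContractSketch implements AutoCloseable {
  // A key mapped to null models "recognised, but no value set and no default".
  private final Map<String, String> entries = new HashMap<>();
  private boolean closed = false;

  GetOptionContractSketch(Map<String, String> knownEntries) {
    entries.putAll(knownEntries);
  }

  String getOption(String key) {
    if (closed) {
      throw new IllegalStateException("SessionContext is closed");
    }
    if (!entries.containsKey(key)) {
      // Unknown key: most likely a caller typo, so fail loudly.
      throw new RuntimeException("unknown config key: " + key);
    }
    return entries.get(key); // may be null for known-but-unset keys
  }

  @Override
  public void close() {
    closed = true;
  }
}
```

Keeping null for known-but-unset keys and an exception for unknown keys lets a caller tell a typo apart from a legitimately empty setting.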
Rust surface
In native/src/lib.rs::createSessionContextWithOptions, after the existing typed-field decoding, each map entry is applied through ConfigOptions::set(key, value).
Plus a new getOptionNative JNI handler that walks ctx.copied_config().options().entries() and returns the matching entry.value as a JString, or null for known-but-unset.
ConfigOptions::set returns a DataFusionError on unknown keys or unparseable values, which try_unwrap_or_throw already surfaces as a RuntimeException with the underlying message — same error path as the rest of the JNI crate.
Behavior precedence
Map entries are applied after the typed setters, so a caller that sets both batchSize(8192) and setOption("datafusion.execution.batch_size", "1") ends up with a batch size of 1: the string-keyed entry overrides the typed setter. This is the only semantically defensible order (otherwise the typed-setter-wins case would silently drop the explicit setOption call), but it is flagged here for explicit reviewer sign-off.
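The application order can be illustrated with a small model. This is a sketch under assumptions, not the binding's code: PrecedenceSketch and effectiveConfig are hypothetical names, and a plain map stands in for the native ConfigOptions, which the proposal updates on the Rust side.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the proposed application order.
final class PrecedenceSketch {
  static Map<String, String> effectiveConfig(
      Map<String, String> typedSetterValues, Map<String, String> setOptionEntries) {
    Map<String, String> effective = new LinkedHashMap<>();
    effective.putAll(typedSetterValues); // step 1: values from the typed setters
    effective.putAll(setOptionEntries);  // step 2: string-keyed entries override
    return effective;
  }

  public static void main(String[] args) {
    Map<String, String> typed = new LinkedHashMap<>();
    typed.put("datafusion.execution.batch_size", "8192"); // from batchSize(8192)
    typed.put("datafusion.execution.target_partitions", "4");

    Map<String, String> generic = new LinkedHashMap<>();
    generic.put("datafusion.execution.batch_size", "1"); // from setOption(...)

    Map<String, String> effective = effectiveConfig(typed, generic);
    System.out.println(effective.get("datafusion.execution.batch_size")); // prints 1
  }
}
```

Keys that only the typed setters touch (target_partitions in this example) are unaffected; only overlapping keys take the setOption value.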
Out of scope (separate follow-ups)
Read-time options like CsvReadOptions.has_header: per-call, already typed, not session-level.
Compile-time validation of keys. Java has no way to know whether a key string is valid in the pinned DataFusion version; we rely on the runtime error from ConfigOptions::set. Same UX as the Rust string-keyed API.
A typed ConfigException hierarchy. Today every JNI error collapses to RuntimeException. Promoting these to typed exceptions is a separate cross-cutting concern (see related work in prs/CONTRIB_ISSUES.md ("Add test section to README" #12) in the OpenSearch-sourced backlog).
A listOptions() enumerator returning all keys. Useful for tooling but not needed by users; easy follow-up if requested.
Describe alternatives you've considered
One named setter per key, e.g. parquetPushdownFilters(boolean). Originally drafted as the first cut. Rejected because it scales to ~200 setters and forces a separate PR for every new dial DataFusion adds upstream. Named setters for the common knobs (the six already in #28, "feat: configure SessionContext and RuntimeEnv via builder") are the right call; a generic escape hatch covers the long tail.
A bundled parquetOptions(ParquetSessionOptions) builder per-section. Decent ergonomics but loses everything outside parquet, and forces us to re-design the surface every time DataFusion adds a section. The setOption(key, value) shape tracks the upstream API exactly and needs no follow-up when new keys are added.
getOption on the builder instead of on SessionContext. Considered but rejected: a builder-side getter would only know what's pending in the local map, which is strictly worse than reading what DataFusion actually compiled. The post-construction SessionContext.getOption is the meaningful contract.
Returning null for unknown keys instead of throwing. Conflates "no such key" (caller typo) with "known-but-unset" (legitimate state). Throwing on unknown keys matches setOption's strictness and gives faster feedback on typos.
Additional context
No response