feat(builder): expose ConfigOptions.set/get as setOption / setOptions / getOption#49
Merged
Merged
Conversation
… / getOption DataFusion's SessionConfig carries roughly 200 keys split across seven sections (datafusion.execution.*, datafusion.optimizer.*, etc). The Java SessionContextBuilder introduced in apache#28 covers six of them with named setters, and there is no Java surface to read any config value back at all. Rather than ship ~200 named get/set pairs one at a time, mirror DataFusion's existing string-keyed surface (ConfigOptions::set) as a generic escape hatch on the builder + context. Adds two additions to session_options.proto: a `repeated ConfigOption options = 7;` field plus the matching ConfigOption message. `repeated` is used instead of `map<string,string>` because protobuf maps decode into a Rust HashMap whose iteration order is randomized -- that would silently break overlapping-key cases like `datafusion.optimizer.enable_dynamic_filter_pushdown` (whose setter has side effects on the per-operator `enable_*_dynamic_filter_pushdown` flags). The Java builder gains setOption(key, value) and setOptions(Map). Java-side storage is a LinkedHashMap so caller insertion order is preserved end to end. On the native side, free-form options are applied via config.options_mut().set(k, v)? before SessionContext construction. Map entries are applied after the typed setters so an explicit setOption call wins over a typed setter for the same knob, and within the entry list the caller's last write wins -- both for same-key duplicates (LinkedHashMap dedups) and for overlapping side-effect keys. The ? form (rather than SessionConfig::set_str's .unwrap()) means unknown keys or unparseable values surface as a RuntimeException with DataFusion's error message instead of panicking the JVM. Adds SessionContext.getOption(key) on the constructed context (not on the builder, since the value reflects "what DataFusion actually compiled" -- only knowable post-construction). The native side walks ctx.copied_config().options().entries() and returns ConfigEntry.value as a String, or null if the key is known but unset and has no default. Unknown keys throw RuntimeException to mirror setOption's strictness. datafusion.runtime.* keys (memory limit, temp directory, cache sizes) live on a separate RuntimeEnv config object and have several upstream-shaped round-trip pitfalls that don't apply to the SessionConfig subtree (lazy default tempdir, per-session datafusion-Xxxxxx spill suffixes, OS-specific path separators, integer K/M/G truncation, sub-1KB byte formatting, the unlimited sentinel, multi-statement SET clobbering). Both setOption and getOption reject runtime keys with a clear error pointing at the typed memoryLimit() / tempDirectory() setters; round-trippable runtime support is tracked as a follow-up PR that needs a per-context side-cache. Tests cover the proto round-trip with explicit on-the-wire ordering assertions, bulk setOptions, null rejection, override-typed-setter semantics (asserted by reading the value back via getOption), last-write-wins for repeated keys, an unknown-key error path on both set and get, default-fallback on get, a closed-context guard on get, the runtime-key rejection on both set and get with messages that point at the typed setters, and the order-preservation case where setting the umbrella `enable_dynamic_filter_pushdown` flag followed by an explicit `enable_topk_dynamic_filter_pushdown=false` correctly leaves topk disabled (the override winning over the umbrella's side effect). Common knobs this unlocks include parquet pushdown_filters / bloom_filter_on_read, optimizer prefer_hash_join, execution time_zone, sql_parser dialect, and explain show_statistics -- previously inaccessible from Java without adding a named get/set per key.
30595c0 to
304ef6e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
SessionContextBuilder(introduced in #28) currently exposes typed setters for the six most-common session knobs:batchSize,targetPartitions,collectStatistics,informationSchema,memoryLimit, andtempDirectory. The Rust DataFusion configuration surface backing these is roughly 200 keys split across seven sections underSessionConfig(datafusion.catalog.*,datafusion.execution.*,datafusion.optimizer.*,datafusion.sql_parser.*,datafusion.explain.*,datafusion.format.*, plus userextensions). The Java builder reaches none of those except the six it names explicitly, and there is no Java surface to read any config value back at all.DataFusion already exposes a string-keyed setter for these (
ConfigOptions::set(key, value)). This PR mirrors that surface in Java rather than adding ~200 named setters one at a time.What changes are included in this PR?
SessionOptionsproto: a newrepeated ConfigOption options = 7;field plus theConfigOption { key, value }message. No existing field changes; wire format stays backward and forward compatible.repeatedis used instead ofmap<string, string>because protobuf maps decode intoHashMapon the Rust side, whose iteration order is randomized — that would silently break overlapping-key cases likedatafusion.optimizer.enable_dynamic_filter_pushdown(which has side effects on the per-operatorenable_*_dynamic_filter_pushdownflags).SessionContextBuilder:setOption(String key, String value)— single entry.setOptions(Map<String, String> entries)— bulk apply.SessionContext:getOption(String key) → String— read the current value of anydatafusion.*config key. Returns the value as a string, ornullif the key is recognised but has no value set and no default. ThrowsRuntimeExceptionon unknown keys (mirrorssetOption's strictness for fast feedback on typos), andIllegalStateExceptionwhen called on a closed context.config.options_mut().set(k, v)?before theSessionContextis constructed, so the typedbatchSize/targetPartitions/etc setters can be overridden by an explicitsetOptionfor the same key.getOptionNativewalksctx.copied_config().options().entries().The Rust calls use the
?form (rather thanSessionConfig::set_str(...).unwrap()) so unknown keys or unparseable values surface as aRuntimeExceptionon the Java side carrying DataFusion's error message, instead of panicking the JVM.What's intentionally out of scope: datafusion.runtime.* keys
datafusion.runtime.*keys (memory limit, temp directory, cache sizes, list-files-cache TTL) live on a separateRuntimeEnvconfig object. Round-tripping them throughgetOption/setOptionhas several upstream-shaped correctness pitfalls that don't apply to the session-config subtree (lazy default tempdir creation that materializes a path the user never set, per-sessiondatafusion-Xxxxxxspill suffixes, OS-specific path separators, integer K/M/G truncation that loses fractional capacities, sub-1KB values formatted as bare bytes, theunlimitedsentinel, and disk-manager state being clobbered when multiple runtime keys are routed throughSET). Doing them right needs a per-context side-cache of the user's verbatim values.This PR rejects
datafusion.runtime.*in bothsetOptionandgetOptionwith a clear error pointing at the typedmemoryLimit/tempDirectorysetters. Adding round-trippable runtime support is tracked as a separate follow-up.Apply order for setOption
Map entries are applied after typed setters by design — otherwise a caller writing both
batchSize(8192)andsetOption("datafusion.execution.batch_size", "1024")would have their explicitsetOptionsilently dropped. The opposite order would makesetOptionstrictly weaker than the typed setters and give callers no way to override.Among
setOptioncalls themselves, entries are applied in caller insertion order (first call applied first). The Java side stores entries in aLinkedHashMap, the proto field isrepeated(preserves order on the wire), and the Rust side iterates aVec(preserves order on apply). Caller's last write wins for both same-key duplicates and overlapping side-effect keys.Why getOption lives on SessionContext, not the builder
The value
getOptionreturns is "what DataFusion actually compiled". That's only knowable post-construction, so the read-side method belongs on the constructedSessionContext. A builder-side getter would only return what's pending in the local map, which is a strictly worse signal.Are these changes tested?
Yes — 14 new tests in
SessionContextBuilderTest, bringing the file from 5 to 19 tests total. All pass.Are there any user-facing changes?
Yes — purely additive. New public API:
SessionContextBuilder.setOption(String, String)SessionContextBuilder.setOptions(Map<String, String>)SessionContextBuilder.setOptions(LinkedHashMap<String, String>)SessionContext.getOption(String) → StringThe new
optionsfield is also exposed on the existingorg.apache.datafusion.protobuf.SessionOptionsgenerated class (getOptionsMap()etc.). No API removals, no deprecations, no behavior change for callers who don't call any of the new methods.