[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant#55433
Closed
xiaonanyang-db wants to merge 5 commits into
Closed
[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant#55433xiaonanyang-db wants to merge 5 commits into
xiaonanyang-db wants to merge 5 commits into
Conversation
…iant ### What changes were proposed in this pull request? The XML-to-Variant parser in `StaxXmlParser` always inferred primitive types (boolean/long/decimal) for leaf text and attribute values, regardless of the `inferSchema` option on the XML data source. When `inferSchema=false`, users reasonably expect the raw XML text to be preserved as strings inside the Variant, but the option was silently ignored on the Variant code path (the StructType path already honors it). This PR adds an `options.inferSchema` check in `appendXMLCharacterToVariant`: when inference is disabled, the function appends the leaf value as a string and skips the boolean/long/decimal path. Null and `nullValue` semantics are unchanged -- those are explicit user rules, not type inference. The default behavior (`inferSchema=true`) is preserved. ### Why are the changes needed? The current behavior silently corrupts data. Concrete examples reported by users: - Integer product codes with leading zeros lose the zeros: `<code>0501</code>` becomes `501`, breaking downstream lookup logic. - German-locale decimals with `,` as the decimal separator are misread: `<price>37,77</price>` (= 37.77) becomes `3777` because the decimal parser treats `,` as a thousands separator. Passing `inferSchema=false` is the expected knob for avoiding these, and it should work on the Variant path the same way it works on the StructType path. ### Does this PR introduce _any_ user-facing change? Yes, but only when the user explicitly sets `inferSchema=false` while reading XML as Variant. Previously the option had no effect on the Variant path; after this PR, primitive leaf values are preserved as strings instead of being type-inferred. The default behavior (`inferSchema=true`) is unchanged. ### How was this patch tested? New unit tests in `XmlVariantSuite` cover boolean/integer/decimal leaves, attribute values, value tags, and null/`nullValue` handling under `inferSchema=false`, including the two customer-reported cases (leading-zero integers and German-locale decimals). Tests run against both the default parser and the legacy parser via `XmlVariantSuiteWithLegacyParser`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic Claude Opus 4.7)
Because disabling primitive type inference for Variant leaves is a user- visible behavior change for any existing workload that passes inferSchema=false while reading XML as Variant, gate the fix behind a new opt-in SQLConf so the default behavior is unchanged. New conf: spark.sql.xml.variant.respectInferSchema (internal, default false). When true, XmlOptions surfaces the conf as options.respectVariantInferSchema, and the parser short-circuits to appendString only when both the conf is on and inferSchema is false. Defaults preserve the pre-SPARK-56554 behavior. Update XmlVariantSuite: wrap the inferSchema=false cases in withSQLConf with the conf enabled, and add a new case verifying that with the conf disabled inferSchema=false still type-infers (leading-zero integers, German-locale decimals, and booleans all return inferred values). Co-authored-by: Isaac
Drop the .internal() marker so users can see and enable the conf directly via SET or the Spark config page, without relying on internal-only tooling. Co-authored-by: Isaac
Flip the default of spark.sql.xml.variant.respectInferSchema from false to true so the SPARK-56554 fix is on by default, matching Spark's convention of making correctness fixes the default and gating the old behavior behind a conf. Users who were silently relying on the pre-fix behavior can set the conf to false to opt back into it. Updates the conf's doc string to describe the kill-switch role, and renames the "conf off" XmlVariantSuite test to reflect that it now covers the opt-out path instead of the default. Co-authored-by: Isaac
sandip-db
approved these changes
Apr 21, 2026
HyukjinKwon
approved these changes
Apr 21, 2026
SparkConfigBindingPolicySuite requires every dynamic session config to declare an explicit ConfigBindingPolicy. spark.sql.xml.variant.respectInferSchema is tunable per session, so SESSION is the right policy. Co-authored-by: Isaac
Member
|
Merged to master. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
The XML-to-Variant parser in
StaxXmlParseralways inferred primitive types (boolean/long/decimal) for leaf text and attribute values, regardless of theinferSchemaoption on the XML data source. WheninferSchema=false, users reasonably expect the raw XML text to be preserved as strings inside the Variant, but the option was silently ignored on the Variant code path (the StructType path already honors it).This PR adds an
options.inferSchemacheck inappendXMLCharacterToVariant: when inference is disabled, the function appends the leaf value as a string and skips the boolean/long/decimal path. Null andnullValuesemantics are unchanged -- those are explicit user rules, not type inference. The default behavior (inferSchema=true) is preserved.Why are the changes needed?
The current behavior silently corrupts data. Concrete examples reported by users:
<code>0501</code>becomes501, breaking downstream lookup logic.,as the decimal separator are misread:<price>37,77</price>(= 37.77) becomes3777because the decimal parser treats,as a thousands separator.Passing
inferSchema=falseis the expected knob for avoiding these, and it should work on the Variant path the same way it works on the StructType path.Does this PR introduce any user-facing change?
None
How was this patch tested?
New unit tests in
XmlVariantSuitecover boolean/integer/decimal leaves, attribute values, value tags, and null/nullValuehandling underinferSchema=false, including the two customer-reported cases (leading-zero integers and German-locale decimals). Tests run against both the default parser and the legacy parser viaXmlVariantSuiteWithLegacyParser.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus 4.7)