Skip to content

[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant#55433

Closed
xiaonanyang-db wants to merge 5 commits into
apache:masterfrom
xiaonanyang-db:SPARK-56554
Closed

[SPARK-56554][SQL] Respect inferSchema option when parsing XML as variant#55433
xiaonanyang-db wants to merge 5 commits into
apache:masterfrom
xiaonanyang-db:SPARK-56554

Conversation

@xiaonanyang-db
Copy link
Copy Markdown
Contributor

@xiaonanyang-db xiaonanyang-db commented Apr 20, 2026

What changes were proposed in this pull request?

The XML-to-Variant parser in StaxXmlParser always inferred primitive types (boolean/long/decimal) for leaf text and attribute values, regardless of the inferSchema option on the XML data source. When inferSchema=false, users reasonably expect the raw XML text to be preserved as strings inside the Variant, but the option was silently ignored on the Variant code path (the StructType path already honors it).

This PR adds an options.inferSchema check in appendXMLCharacterToVariant: when inference is disabled, the function appends the leaf value as a string and skips the boolean/long/decimal path. Null and nullValue semantics are unchanged -- those are explicit user rules, not type inference. The default behavior (inferSchema=true) is preserved.

Why are the changes needed?

The current behavior silently corrupts data. Concrete examples reported by users:

  • Integer product codes with leading zeros lose the zeros: <code>0501</code> becomes 501, breaking downstream lookup logic.
  • German-locale decimals with , as the decimal separator are misread: <price>37,77</price> (= 37.77) becomes 3777 because the decimal parser treats , as a thousands separator.

Passing inferSchema=false is the expected knob for avoiding these, and it should work on the Variant path the same way it works on the StructType path.

Does this PR introduce any user-facing change?

None

How was this patch tested?

New unit tests in XmlVariantSuite cover boolean/integer/decimal leaves, attribute values, value tags, and null/nullValue handling under inferSchema=false, including the two customer-reported cases (leading-zero integers and German-locale decimals). Tests run against both the default parser and the legacy parser via XmlVariantSuiteWithLegacyParser.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.7)

…iant

### What changes were proposed in this pull request?

The XML-to-Variant parser in `StaxXmlParser` always inferred primitive types
(boolean/long/decimal) for leaf text and attribute values, regardless of the
`inferSchema` option on the XML data source. When `inferSchema=false`, users
reasonably expect the raw XML text to be preserved as strings inside the
Variant, but the option was silently ignored on the Variant code path (the
StructType path already honors it).

This PR adds an `options.inferSchema` check in `appendXMLCharacterToVariant`:
when inference is disabled, the function appends the leaf value as a string
and skips the boolean/long/decimal path. Null and `nullValue` semantics are
unchanged -- those are explicit user rules, not type inference. The default
behavior (`inferSchema=true`) is preserved.

### Why are the changes needed?

The current behavior silently corrupts data. Concrete examples reported by
users:

- Integer product codes with leading zeros lose the zeros: `<code>0501</code>`
  becomes `501`, breaking downstream lookup logic.
- German-locale decimals with `,` as the decimal separator are misread:
  `<price>37,77</price>` (= 37.77) becomes `3777` because the decimal parser
  treats `,` as a thousands separator.

Passing `inferSchema=false` is the expected knob for avoiding these, and it
should work on the Variant path the same way it works on the StructType path.

### Does this PR introduce _any_ user-facing change?

Yes, but only when the user explicitly sets `inferSchema=false` while reading
XML as Variant. Previously the option had no effect on the Variant path;
after this PR, primitive leaf values are preserved as strings instead of being
type-inferred. The default behavior (`inferSchema=true`) is unchanged.

### How was this patch tested?

New unit tests in `XmlVariantSuite` cover boolean/integer/decimal leaves,
attribute values, value tags, and null/`nullValue` handling under
`inferSchema=false`, including the two customer-reported cases (leading-zero
integers and German-locale decimals). Tests run against both the default
parser and the legacy parser via `XmlVariantSuiteWithLegacyParser`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus 4.7)
Because disabling primitive type inference for Variant leaves is a user-
visible behavior change for any existing workload that passes
inferSchema=false while reading XML as Variant, gate the fix behind a new
opt-in SQLConf so the default behavior is unchanged.

New conf: spark.sql.xml.variant.respectInferSchema (internal, default false).
When true, XmlOptions surfaces the conf as options.respectVariantInferSchema,
and the parser short-circuits to appendString only when both the conf is on
and inferSchema is false. Defaults preserve the pre-SPARK-56554 behavior.

Update XmlVariantSuite: wrap the inferSchema=false cases in withSQLConf with
the conf enabled, and add a new case verifying that with the conf disabled
inferSchema=false still type-infers (leading-zero integers, German-locale
decimals, and booleans all return inferred values).

Co-authored-by: Isaac
Drop the .internal() marker so users can see and enable the conf directly
via SET or the Spark config page, without relying on internal-only tooling.

Co-authored-by: Isaac
Flip the default of spark.sql.xml.variant.respectInferSchema from false to
true so the SPARK-56554 fix is on by default, matching Spark's convention of
making correctness fixes the default and gating the old behavior behind a
conf. Users who were silently relying on the pre-fix behavior can set the
conf to false to opt back into it.

Updates the conf's doc string to describe the kill-switch role, and renames
the "conf off" XmlVariantSuite test to reflect that it now covers the
opt-out path instead of the default.

Co-authored-by: Isaac
SparkConfigBindingPolicySuite requires every dynamic session config to
declare an explicit ConfigBindingPolicy. spark.sql.xml.variant.respectInferSchema
is tunable per session, so SESSION is the right policy.

Co-authored-by: Isaac
@HyukjinKwon
Copy link
Copy Markdown
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants