[SPARK-57186][SQL] Handle NullType in ExtractValue to return NULL instead of throwing#56237
[SPARK-57186][SQL] Handle NullType in ExtractValue to return NULL instead of throwing#56237dejankrak-db wants to merge 5 commits into
Conversation
…tead of throwing ### What changes were proposed in this pull request? Add a `case (NullType, _)` handler in `ExtractValue.extractValue()` that returns `Literal(null, NullType)` instead of falling through to the `INVALID_EXTRACT_BASE_FIELD_TYPE` error. This aligns with standard SQL semantics where any operation on NULL yields NULL. ### Why are the changes needed? Extracting a field/element/key from a `NullType` base expression (which can occur when a column has `NullType`, e.g. from schema evolution with missing columns) previously threw an `INVALID_EXTRACT_BASE_FIELD_TYPE` analysis error. It should instead return NULL, consistent with NULL-propagation semantics. ### Does this PR introduce _any_ user-facing change? Yes. Queries that extract a struct field, array element, or map key from a `NullType` column now return NULL instead of failing with an `INVALID_EXTRACT_BASE_FIELD_TYPE` error. This also makes `isExtractable` return true for `NullType` attributes. ### How was this patch tested? Added a golden-file test `extract-value-nulltype` covering struct field access, array indexing, map key access, and a HAVING clause referencing a field of a `NullType` grouped column.
cloud-fan
left a comment
There was a problem hiding this comment.
Review summary — 2 blocking, 3 non-blocking, 1 nit.
The simple "extract from a NullType column returns NULL" behavior is reasonable, but it's implemented in a broadly-shared utility without examining the other consumers, and the HAVING test silently swallows a name-resolution mis-binding that every other base type surfaces as an error.
Design / architecture (1)
- The change lives in the shared
ExtractValueutility (~10 consumers); only the SELECT path is exercised.isExtractablenow returns true for NullType, broadeningNameScopecandidate filtering. Please justify NULL-propagation for the non-SELECT consumers or scope the fix — see inline oncomplexTypeExtractors.scala.
Correctness (1)
- The HAVING test returns an empty result because
col1binds to the NullType input column, not the aliased struct — the same mis-binding that the STRING/ARRAY/MAP cases inhaving-and-order-by-recursive-type-name-resolution.sqlthrowINVALID_EXTRACT_BASE_FIELD_TYPEfor. NullType becomes the lone outlier — see inline onextract-value-nulltype.sql.
Suggestions (3)
- Test placement: the HAVING case belongs in
having-and-order-by-recursive-type-name-resolution.sql; the simple struct/array/map cases fitextract-value-resolution-edge-cases.sql/extract.sql— see inline. - Coverage: no test exercises the single-pass resolver (
NameScope.isExtractableis changed but SQLQueryTestSuite runs the legacy analyzer only), the ambiguity-broadening case (two same-named candidates, one NullType + one struct), or the non-SELECT consumers (union-by-name / lambda / pivot / variable). - The NullType arm discards the
extractionexpression, so forcol[expr]/col[keyExpr]the index/key is never evaluated — unlike the siblingGetArrayItem/GetMapValuearms. Benign under NULL-propagation but diverges from the siblings — see inline.
Nits: SELECT col[0]/col['key'] produce a column literally named NULL (vs col.a -> a), because the resolved expression is a bare Literal(null) that loses the col[...] shape.
Verification
Traced why the HAVING case yields an empty result: VALUES (NULL) produces an aggregate input column default-named col1 of NullType, which shadows the NAMED_STRUCT('a', 1) AS col1 SELECT alias. HAVING resolves col1 against the input (via TempResolvedColumn), so col1.a = extractValue(NullType, "a") — pre-PR this threw, and now yields Literal(null) -> cast(null as int) = 1 -> false. The analogous STRING/ARRAY/MAP input bases in having-and-order-by-recursive-type-name-resolution.sql still throw INVALID_EXTRACT_BASE_FIELD_TYPE (base: tempresolvedcolumn(col1), other: STRING), confirming NullType is now the only base type that silently swallows this mis-binding.
| // Extracting from NULL should return NULL, not throw an error. This can happen when a | ||
| // column has NullType (e.g. from schema evolution with missing columns), and aligns with | ||
| // standard SQL semantics where any operation on NULL yields NULL. | ||
| case (NullType, _) => Left(Literal(null, NullType)) |
There was a problem hiding this comment.
This is a behavior change to the shared ExtractValue utility, but only the user-facing SELECT path is tested. extractValue/isExtractable/ExtractValue.apply are consumed by ~10 call sites — legacy ResolveReferences/TempResolvedColumn (HAVING/ORDER BY), single-pass NameScope candidate filtering, ResolveUnion.mergeFields (union by name), higherOrderFunctions, PivotTransformer, VariableResolution, DeserializerBuildHelper, NestedColumnAliasing — several of which previously relied on the throw as a validation signal. The most consequential side effect: isExtractable now returns true for NullType, so in NameScope (NameScope.scala:761,826) a NullType column becomes a valid extraction candidate, broadening the candidate set and potentially changing which attribute a multipart name binds to (or introducing new ambiguity). Could you justify that NULL-propagation rather than the prior error is correct for the non-SELECT consumers, or scope the fix to the user-facing resolution point?
Secondary note: this arm discards extraction, so for col[expr]/col[keyExpr] the index/key expression is never evaluated — unlike the sibling GetArrayItem/GetMapValue arms above. The result is NULL regardless under null-propagation, so this is benign, but it diverges from the siblings.
| SELECT NAMED_STRUCT('a', 1) AS col1 | ||
| FROM VALUES (NULL) | ||
| GROUP BY col1 | ||
| HAVING col1.a == 1; |
There was a problem hiding this comment.
This case returns an empty result, but not for the reason it looks like. VALUES (NULL) produces an aggregate input column default-named col1 of NullType, which shadows the NAMED_STRUCT('a', 1) AS col1 SELECT alias. In HAVING, col1 resolves against the input column (via TempResolvedColumn), so col1.a = extractValue(NullType, "a") -> Literal(null) -> cast(null as int) = 1 -> false -> empty.
The existing having-and-order-by-recursive-type-name-resolution.sql is dedicated to exactly this shadowing scenario and deliberately throws INVALID_EXTRACT_BASE_FIELD_TYPE for STRING/ARRAY/MAP input bases (e.g. SELECT NAMED_STRUCT('a',1) AS col1 FROM VALUES('a') ... HAVING col1.a > 0) precisely to surface the mis-binding. This PR makes NullType the lone base type that silently swallows it — a user who writes HAVING col1.a == 1 expecting the row to pass gets empty output with no signal. Is that intended? If so, please document the divergence and move this case next to the existing throwing cases so the contrast is locked in. The simple struct/array/map cases above would fit extract-value-resolution-edge-cases.sql or extract.sql.
…tipart resolution Per review feedback, rework the fix so it no longer changes the shared `ExtractValue.extractValue` utility (and therefore does not alter `ExtractValue.isExtractable`, which the single-pass `NameScope` candidate filtering and ~10 other consumers depend on). - Revert the `ExtractValue.extractValue` change: extraction from a NullType base throws `INVALID_EXTRACT_BASE_FIELD_TYPE` again, as before. This leaves `isExtractable` (and union-by-name / lambda / pivot / variable / nested-column consumers) untouched, so there is no candidate-set or ambiguity broadening. - NULL-propagate only in dotted multipart name resolution (`AttributeSeq.resolveCandidates`): extracting a field from a NullType base (`col.a`) yields NULL instead of throwing. The `col['key']` / `col[0]` forms go through `UnresolvedExtractValue` and still throw, matching prior behavior. Tests: - Remove the dedicated `extract-value-nulltype.*` files. - Add the field-access (NULL) plus array-index/map-key (throw) cases to `extract-value-resolution-edge-cases.sql`. - Add the NullType HAVING shadowing case to `having-and-order-by-recursive-type-name-resolution.sql`, documented as the one base type that propagates NULL (and so yields an empty result) rather than throwing like STRING/ARRAY/MAP. Verified: SQLQueryTestSuite (extract/struct/having golden files) and AnalysisSuite pass; no existing golden regresses.
cloud-fan
left a comment
There was a problem hiding this comment.
5 addressed, 0 remaining, 2 new. (2 newly introduced, 0 late catches.)
1 blocking, 1 non-blocking, 0 nits. The round-2 restructure resolved every round-1 finding, but moving the fix into resolveCandidates newly introduced a legacy-vs-single-pass analyzer divergence that should be resolved before merge.
Design / architecture (1)
- package.scala:404: the NullType fold is reached only by the legacy analyzer; the single-pass resolver filters the candidate via the unchanged
isExtractableand throws — soSELECT col.aon a NullType column returns NULL under legacy but errors under single-pass / dual-run — see inline
Suggestions (1)
- extract-value-resolution-edge-cases.sql:18:
col.areturns NULL while equivalentcol['a']/col[0]still throw — confirm the asymmetry is acceptable — see inline
Verification
Traced SELECT col.a FROM (SELECT null AS col) through both analyzers. Legacy: getCandidatesForResolution (package.scala:361-377) does not filter by isExtractable, so the NullType candidate reaches the patched fold -> Literal(null) -> NULL row. Single-pass: NameScope.getCandidatesForResolution (NameScope.scala:759-766) filters via isExtractable(NullType, ...) = false -> candidate dropped -> resolveCandidates gets an empty set -> name unresolved -> throws. HybridAnalyzer dual-run (HybridAnalyzer.scala:43-51) throws the single-pass failure even when the legacy run succeeds. So the fix is legacy-only and the two analyzers disagree (NULL vs error).
PR description suggestions
- Update: it states the PR adds
case (NullType, _)toExtractValue.extractValue()and "makesisExtractablereturn true for NullType" — neither is true after the round-2 revert. The fix now lives inAttributeSeq.resolveCandidates. - Document: the fix is scoped to dotted
col.aonly;col[0]/col['key']still throw — and the behavior currently applies to the legacy analyzer only.
| // e.g. from schema evolution with missing columns. This is scoped to dotted | ||
| // multipart field access (`col.a`); `col['key']`/`col[0]` go through | ||
| // UnresolvedExtractValue and still throw. | ||
| if (e.dataType == NullType) Literal(null, NullType) |
There was a problem hiding this comment.
This fold is reached by the legacy analyzer, but the single-pass Resolver takes a different route to the same method and never reaches this branch, so the fix is legacy-only.
- Legacy:
AttributeSeq.getCandidatesForResolution(package.scala:361-377) does not filter candidates byisExtractable, so the NullType candidate reaches this fold and returnsLiteral(null). - Single-pass:
NameScope.getCandidatesForResolution(NameScope.scala:759-766) filters candidates throughExtractValue.isExtractablebefore callingresolveCandidates(NameScope.scala:729).isExtractable(NullType, ["a"])isfalse(theextractValue(NullType, _)arm returns the error, complexTypeExtractors.scala:98), so the candidate is dropped,resolveCandidatesreceives an empty set, and the name fails to resolve — the single-pass analyzer throws.
Net effect: SELECT col.a FROM (SELECT null AS col) returns a NULL row under the legacy analyzer (default) but throws under spark.sql.analyzer.singlePassResolver.enabled, and under dual-run sampling HybridAnalyzer (HybridAnalyzer.scala:43-51) throws the single-pass failure even though the legacy run succeeded. Round 1's isExtractable-changing approach didn't have this divergence — both analyzers would have returned NULL.
Could you apply the fix consistently to both analyzers (e.g. make isExtractable treat a fully-NullType extraction chain as extractable so the single-pass path reaches the same NULL result, or handle NullType in the single-pass multipart path), or — if legacy-only is intended — justify it and add single-pass coverage that locks in the behavior?
There was a problem hiding this comment.
Thanks @cloud-fan! Fixed by making the change consistent across analyzers. Introduced ExtractValue.applyOrNull (resolution-time NULL-propagation; extractValue unchanged) and route both the multipart fold and ColumnResolutionHelper's UnresolvedExtractValue through it, and made isExtractable keep a NullType candidate so single-pass NameScope reaches the same NULL result for col.a. col.a now returns NULL under both analyzers - added extract-value-nulltype-single-pass.sql (dual-run) to lock it.
| -- `col['key']` and `col[0]` forms go through UnresolvedExtractValue and still throw, so the | ||
| -- shared ExtractValue utility and the single-pass resolver are unaffected. | ||
| SELECT col.a FROM (SELECT null AS col) t; | ||
| SELECT col[0] FROM (SELECT null AS col) t; |
There was a problem hiding this comment.
Non-blocking: for a NullType base, col.a now returns NULL while the equivalent subscript forms col[0]/col['key'] still throw INVALID_EXTRACT_BASE_FIELD_TYPE. col.a and col['a'] are equivalent struct-field-access syntaxes elsewhere in Spark, so this asymmetry is worth confirming as intentional. It's largely tied to the analyzer-divergence finding above — a consistent cross-analyzer fix is the natural point to decide whether the subscript forms should follow col.a here.
There was a problem hiding this comment.
Under the legacy analyzer all three (col.a, col[0], col['key']) now propagate NULL - the asymmetry is gone there. They're only dual-run for col.a because the single-pass resolver doesn't resolve subscript extraction at all today (normal a[0]/m['k'] fail under single-pass too; ExtractValueResolver is unwired) - a pre-existing limitation independent of NullType, so the subscript forms run under the legacy analyzer only.
…tractValue.applyOrNull Round-2 review found that scoping the NullType fix to the legacy multipart fold (AttributeSeq.resolveCandidates) left a legacy-vs-single-pass divergence: the single-pass NameScope filters the NullType candidate via the unchanged isExtractable before reaching that fold, so `col.a` on a NullType column returned NULL under the legacy analyzer but threw under single-pass / dual-run. Rework: - Add `ExtractValue.applyOrNull`, a resolution-time variant of `apply` that returns `Literal(null, NullType)` for a NullType base instead of throwing INVALID_EXTRACT_BASE_FIELD_TYPE. `extractValue` itself is unchanged, so the direct consumers (ResolveUnion, higherOrderFunctions, PivotTransformer, VariableResolution, deserializers) keep their prior behavior. - Route the user-facing extraction resolution sites through it: the multipart fold in `AttributeSeq.resolveCandidates` and the `UnresolvedExtractValue` resolution in `ColumnResolutionHelper`. So `col.a`, `col[0]` and `col['key']` all propagate NULL under the legacy analyzer (resolving the round-2 col.a vs subscript asymmetry). - Make `isExtractable` treat a NullType base as extractable so single-pass NameScope keeps the candidate and `col.a` resolves to NULL there too, consistent with legacy (a dual-run golden test locks this in). The `col[0]`/`col['key']` subscript forms are covered only under the legacy analyzer: the single-pass resolver does not resolve subscript extraction (`UnresolvedExtractValue`) at all -- a pre-existing limitation independent of NullType (normal `a[0]`/`m['k']` fail under single-pass too) -- so they are not dual-run. Tests: extract-value-resolution-edge-cases.sql covers col.a/col[0]/col['key'] -> NULL (legacy); a new extract-value-nulltype-single-pass.sql dual-runs col.a to lock the cross-analyzer consistency; the HAVING NullType shadowing case stays in having-and-order-by-recursive-type-name-resolution.sql. SQLQueryTestSuite (extract / struct / having) and AnalysisSuite pass; no existing golden regresses.
Correct the scaladoc for TableOutputResolver.applyDefaultWithLengthCheck per review: the helper itself unwraps the default's outer alias before the CHAR/VARCHAR length check (so the check wraps the default value, not the alias); applyColumnMetadata then re-adds the required alias and metadata. The prior wording attributed the unwrap to applyColumnMetadata, which is backwards. Comment-only; no behavior change.
cloud-fan
left a comment
There was a problem hiding this comment.
2 addressed, 1 remaining, 1 new. (1 newly introduced, 0 late catches.)
The round-2 blocking cross-analyzer divergence for col.a is fully resolved and verified; no blocking code issues. One cosmetic nit plus a stale PR description.
Nits: 1 minor item (see inline comments).
Verification
Traced col.a on a NullType base through both analyzers. Legacy: AttributeSeq.resolveCandidates fold -> applyOrNull -> Literal(null). Single-pass: isExtractable now keeps the NullType candidate (NameScope.getCandidatesForResolution), which then calls the same AttributeSeq.resolveCandidates (attributesForResolution is an AttributeSeq) -> same applyOrNull -> NULL. Both converge; the dual-run golden test confirms no HYBRID_ANALYZER_EXCEPTION. For col[0]/col['key']: confirmed ExtractValueResolver is instantiated nowhere and ExpressionResolver has no UnresolvedExtractValue dispatch case, so single-pass fails to resolve subscript extraction for any base type -- the author's justification for legacy-only testing holds.
PR description suggestions
- Update: the description says the fix adds
case (NullType, _)toExtractValue.extractValue(); after the round-2 revert the fix lives in the newExtractValue.applyOrNullandextractValueis unchanged. - Update: the named test
extract-value-nulltypedoesn't exist; tests areextract-value-nulltype-single-pass.sqlplus additions toextract-value-resolution-edge-cases.sqlandhaving-and-order-by-recursive-type-name-resolution.sql. - Document: subscript forms (
col[0]/col['key']) return NULL under the legacy analyzer only -- single-pass doesn't resolve subscript extraction for any type (pre-existing).
| -- user-facing resolution sites (ExtractValue.applyOrNull) without changing the shared | ||
| -- ExtractValue.extractValue utility. | ||
| SELECT col.a FROM (SELECT null AS col) t; | ||
| SELECT col[0] FROM (SELECT null AS col) t; |
There was a problem hiding this comment.
Nit: col[0] / col['key'] on a NullType base now produce an output column named NULL (analyzer plan: Project [null AS NULL#x]), whereas col.a is named a and a non-null arr[0] would be named arr[0]. Cosmetic only and an extreme edge case (NullType column + subscript), so not blocking -- just flagging the naming asymmetry in case a stable column name is preferable here.
What changes were proposed in this pull request?
Add a
case (NullType, _)handler inExtractValue.extractValue()that returnsLiteral(null, NullType)instead of falling through to theINVALID_EXTRACT_BASE_FIELD_TYPEerror. This aligns with standard SQL semantics where any operation on NULL yields NULL.Why are the changes needed?
Extracting a field/element/key from a
NullTypebase expression (which can occur when a column hasNullType, e.g. from schema evolution with missing columns) previously threw anINVALID_EXTRACT_BASE_FIELD_TYPEanalysis error. It should instead return NULL, consistent with NULL-propagation semantics.Does this PR introduce any user-facing change?
Yes. Queries that extract a struct field, array element, or map key from a
NullTypecolumn now return NULL instead of failing with anINVALID_EXTRACT_BASE_FIELD_TYPEerror. This also makesisExtractablereturn true forNullTypeattributes.How was this patch tested?
Added a golden-file test
extract-value-nulltypecovering struct field access, array indexing, map key access, and a HAVING clause referencing a field of aNullTypegrouped column.Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored using Claude code.