Skip to content

Commit

Permalink
[SPARK-38030][SQL][3.2] Canonicalization should not remove nullabilit…
Browse files Browse the repository at this point in the history
…y of AttributeReference dataType

This is a backport of #35332 to branch 3.2

### What changes were proposed in this pull request?
Canonicalization of AttributeReference should not remove nullability information of its dataType.

### Why are the changes needed?
SPARK-38030 lists an issue where canonicalization of cast resulted in an unresolved expression, thus causing query failure. The issue was that the child AttributeReference's dataType was converted to nullable during canonicalization and hence the Cast's `checkInputDataTypes` fails. Although the exact repro listed in SPARK-38030 no longer works in branch-3.2 due to an unrelated change (details in the JIRA), some other codepaths which depend on canonicalized representations can trigger the same issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit test to ensure that canonicalization preserves nullability of AttributeReference and does not result in an unresolved cast

Closes #35446 from shardulm94/SPARK-38030-3.2.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
  • Loading branch information
shardulm94 authored and dongjoon-hyun committed Feb 9, 2022
1 parent ba3c8c5 commit d62735d
Show file tree
Hide file tree
Showing 2 changed files with 17 additions and 5 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,7 @@ package org.apache.spark.sql.catalyst.expressions
* return the same answer given any input (i.e. false negatives are possible).
*
* The following rules are applied:
* - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
* - Names for [[GetStructField]] are stripped.
* - Names for [[org.apache.spark.sql.types.DataType]]s and [[GetStructField]] are stripped.
* - TimeZoneId for [[Cast]] and [[AnsiCast]] are stripped if `needsTimeZone` is false.
* - Commutative and associative operations ([[Add]] and [[Multiply]]) have their children ordered
* by `hashCode`.
Expand All @@ -39,10 +38,10 @@ object Canonicalize {
expressionReorder(ignoreTimeZone(ignoreNamesTypes(e)))
}

/** Remove names and nullability from types, and names from `GetStructField`. */
/** Remove names from types and `GetStructField`. */
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
case a: AttributeReference =>
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
AttributeReference("none", a.dataType)(exprId = a.exprId)
case GetStructField(child, ordinal, Some(_)) => GetStructField(child, ordinal, None)
case _ => e
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.Range
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}

class CanonicalizeSuite extends SparkFunSuite {

Expand Down Expand Up @@ -177,4 +177,17 @@ class CanonicalizeSuite extends SparkFunSuite {
assert(expr.semanticEquals(attr))
assert(attr.semanticEquals(expr))
}

test("SPARK-38030: Canonicalization should not remove nullability of AttributeReference" +
" dataType") {
val structType = StructType(Seq(StructField("name", StringType, nullable = false)))
val attr = AttributeReference("col", structType)()
// AttributeReference dataType should not be converted to nullable
assert(attr.canonicalized.dataType === structType)

val cast = Cast(attr, structType)
assert(cast.resolved)
// canonicalization should not converted resolved cast to unresolved
assert(cast.canonicalized.resolved)
}
}

0 comments on commit d62735d

Please sign in to comment.