
[SPARK-40615][SQL] Check unsupported data types when decorrelating subqueries #38050

Conversation

@allisonwang-db (Contributor) commented Sep 29, 2022

What changes were proposed in this pull request?

This PR checks for unsupported data types when decorrelating subqueries, and throws a more user-friendly error message when one is found.

Why are the changes needed?

Certain data types (e.g. MapType) do not support ordering. This causes the join conditions added by `DecorrelateInnerQuery` to remain unresolved. We want to improve the error message in this case.
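The gist of the new check can be sketched independently of Spark: recursively reject a correlated column whose type is, or contains, a map. The `DType` ADT and `hasMapType` below are illustrative stand-ins, not Catalyst's actual `DataType` API:

```scala
// Illustrative stand-in for Catalyst's DataType hierarchy (not Spark's API).
sealed trait DType
case object IntType extends DType
case object StringType extends DType
case class ArrayType(elem: DType) extends DType
case class MapType(key: DType, value: DType) extends DType
case class StructType(fields: List[DType]) extends DType

// A correlated column is rejected if its type is, or contains, a map:
// map values have no ordering, so the equality/ordering predicates that
// decorrelation adds to the join condition cannot be resolved.
def hasMapType(dt: DType): Boolean = dt match {
  case MapType(_, _)      => true
  case ArrayType(e)       => hasMapType(e)
  case StructType(fields) => fields.exists(hasMapType)
  case _                  => false
}
```

Note that the check has to recurse: a map nested inside an array or struct is just as unorderable as a top-level one.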

Does this PR introduce any user-facing change?

Yes. This PR introduces a new error message. For example:

```sql
-- Suppose x is a map type column
select (select a + a from (select upper(x['a']) as a)) from v1
```

Before this PR, this will throw an exception:

```
After applying rule org.apache.spark.sql.catalyst.optimizer.PullupCorrelatedPredicates in batch Pullup Correlated Expressions, the structural integrity of the plan is broken.
```

After this PR, this will throw an exception with a better error message:

```
Correlated column reference 'v1.x' cannot be map type
```

How was this patch tested?

Unit tests

```
},
"UNSUPPORTED_OUTER_REFERENCE_DATA_TYPE" : {
  "message" : [
    "Correlated column references do not support data type <dataType>: <expr>"
```
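The `<dataType>` and `<expr>` placeholders in this template are substituted when the error is raised; a minimal sketch of that substitution (illustrative only, not Spark's actual error-class machinery):

```scala
// Toy placeholder substitution for error-class message templates.
def formatMessage(template: String, params: Map[String, String]): String =
  params.foldLeft(template) { case (msg, (key, value)) =>
    msg.replace(s"<$key>", value)
  }

val rendered = formatMessage(
  "Correlated column references do not support data type <dataType>: <expr>",
  Map("dataType" -> "map<string,int>", "expr" -> "v1.x"))
// rendered == "Correlated column references do not support data type map<string,int>: v1.x"
```

This renders exactly the message instance discussed in the review comments below.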
Contributor:
Nit: I found it was a bit confusing to read the error message instance: Correlated column references do not support data type map<string,int>: v1.x.

Basically v1.x as an expression was not clear in the error message.

For example, this is clearer to me: `Correlated column references do not support expression v1.x due to unsupported data type map<string,int>`. Something like this.

Contributor:

Maybe even Correlated column references do not support data type map<string,int> of expression v1.x is better.

Contributor:

Maybe just say:

The data type of correlated column references cannot be/contain map type.

Contributor (Author):

Thanks @amaliujia @cloud-fan I will update the error message.

@allisonwang-db (Contributor, Author): cc @cloud-fan

@allisonwang-db allisonwang-db force-pushed the spark-40615-check-subquery-data-type branch from 078a1a2 to dacf544 Compare October 6, 2022 19:25
```scala
throw QueryCompilationErrors.unsupportedCorrelatedReferenceDataTypeError(
  o, a.dataType, plan.origin)
} else {
  throw new IllegalStateException(s"Unable to decorrelate subquery: " +
```
Contributor:

If this should never happen, let's use `SparkException.internalError`.

```
@@ -3138,4 +3138,4 @@
    "<className> must override either <m1> or <m2>"
  ]
}
}
}
```
@amaliujia (Contributor) commented Oct 7, 2022:
Nit: revert this.

```diff
@@ -369,7 +370,7 @@ object DecorrelateInnerQuery extends PredicateHelper {
         throw QueryCompilationErrors.unsupportedCorrelatedReferenceDataTypeError(
           o, a.dataType, plan.origin)
       } else {
-        throw new IllegalStateException(s"Unable to decorrelate subquery: " +
+        throw SparkException.internalError(s"Unable to decorrelate subquery: " +
```
Contributor:

Why this change?

Contributor:

oh I saw Wenchen's comment above.

Contributor (Author):

```diff
@@ -879,6 +879,11 @@
       "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses<treeNode>"
     ]
   },
+  "UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE" : {
+    "message" : [
+      "Correlated column reference '<expr>' cannot be <dataType> type"
```
Contributor:

+1 this is better!

@amaliujia (Contributor): LGTM

@cloud-fan (Contributor): thanks, merging to master!

@cloud-fan cloud-fan closed this in 480ca17 Oct 18, 2022
HyukjinKwon added a commit that referenced this pull request Oct 21, 2022
…bled

### What changes were proposed in this pull request?

This PR proposes to make the tests added in #38050 pass with ANSI mode enabled, by avoiding string binary operations.

### Why are the changes needed?

To make the tests pass with ANSI mode enabled. Currently, they fail as below (https://github.com/apache/spark/actions/runs/3286184541/jobs/5414029918):

```
[info] - SPARK-40615: Check unsupported data type when decorrelating subqueries *** FAILED *** (118 milliseconds)
[info]   "[DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE] Cannot resolve "(a + a)" due to data type mismatch: the binary operator requires the input type ("NUMERIC" or "INTERVAL DAY TO SECOND" or "INTERVAL YEAR TO MONTH" or "INTERVAL"), not "STRING".; line 1 pos 15;
[info]   'Project [unresolvedalias(scalar-subquery#426412 [], None)]
[info]   :  +- 'Project [unresolvedalias((a#426411 + a#426411), None)]
[info]   :     +- SubqueryAlias __auto_generated_subquery_name
[info]   :        +- Project [upper(cast(outer(x#426413)[a] as string)) AS a#426411]
[info]   :           +- OneRowRelation
[info]   +- SubqueryAlias v1
[info]      +- View (`v1`, [x#426413])
[info]         +- Project [cast(x#426414 as map<string,int>) AS x#426413]
[info]            +- SubqueryAlias t
[info]               +- LocalRelation [x#426414]
[info]   " did not contain "Correlated column reference 'v1.x' cannot be map type" (SubquerySuite.scala:2480)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.SubquerySuite.$anonfun$new$320(SubquerySuite.scala:2480)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView(SQLTestUtils.scala:276)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView$(SQLTestUtils.scala:274)
[info]   at org.apache.spark.sql.SubquerySuite.withTempView(SubquerySuite.scala:32)
[info]   at org.apache.spark.sql.SubquerySuite.$anonfun$new$319(SubquerySuite.scala:2459)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
```
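The failure comes down to `+` being rejected for string operands under ANSI mode. A toy version of that binary-operator type check (illustrative, not Spark's actual resolver) looks like:

```scala
// Toy ANSI-style type check for the binary '+' operator: only numeric inputs
// are accepted here; string operands yield a DATATYPE_MISMATCH-style error.
sealed trait SqlType
case object IntegerType extends SqlType
case object DoubleType extends SqlType
case object StringType extends SqlType

def checkPlus(left: SqlType, right: SqlType): Either[String, SqlType] =
  (left, right) match {
    case (StringType, _) | (_, StringType) =>
      Left("""Cannot resolve "(a + a)" due to data type mismatch: """ +
        """the binary operator requires "NUMERIC", not "STRING"""")
    case (DoubleType, _) | (_, DoubleType) => Right(DoubleType)
    case _                                 => Right(IntegerType)
  }
```

Under a rule like this, the original test query fails to resolve because its inner `a` is the string `upper(x['a'])`; the follow-up keeps the test's intent while avoiding a string `+`.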

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran the tests and verified that they pass.

Closes #38325 from HyukjinKwon/SPARK-40615-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed two commits to SandishKumarHN/spark that referenced this pull request on Dec 12, 2022.