
[SPARK-40615][SQL] Check unsupported data types when decorrelating subqueries #38050

Conversation

@allisonwang-db (Contributor) commented Sep 29, 2022

What changes were proposed in this pull request?

This PR checks for unsupported data types when decorrelating subqueries, and throws a more user-friendly error message when one is found.

Why are the changes needed?

Certain data types (e.g. MapType) do not support ordering. This causes the join conditions added by `DecorrelateInnerQuery` to remain unresolved. We want to improve the error message in this case.
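The gist of the new check can be sketched independently of Spark: recursively reject a correlated column whose type is, or contains, a map. The `DType` ADT and `hasMapType` below are illustrative stand-ins, not Catalyst's actual `DataType` API:

```scala
// Illustrative stand-in for Catalyst's DataType hierarchy (not Spark's API).
sealed trait DType
case object IntType extends DType
case object StringType extends DType
case class ArrayType(elem: DType) extends DType
case class MapType(key: DType, value: DType) extends DType
case class StructType(fields: List[DType]) extends DType

// A correlated column is rejected if its type is, or contains, a map:
// map values have no ordering, so the equality/ordering predicates that
// decorrelation adds to the join condition cannot be resolved.
def hasMapType(dt: DType): Boolean = dt match {
  case MapType(_, _)      => true
  case ArrayType(e)       => hasMapType(e)
  case StructType(fields) => fields.exists(hasMapType)
  case _                  => false
}
```

Note that the check has to recurse: a map nested inside an array or struct is just as unorderable as a top-level one.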

Does this PR introduce any user-facing change?

Yes. This PR introduces a new error message. For example:

```sql
-- Suppose x is a map type column
select (select a + a from (select upper(x['a']) as a)) from v1
```

Before this PR, this will throw an exception:

```
After applying rule org.apache.spark.sql.catalyst.optimizer.PullupCorrelatedPredicates in batch Pullup Correlated Expressions, the structural integrity of the plan is broken.
```

After this PR, this will throw an exception with a better error message:

```
Correlated column reference 'v1.x' cannot be map type
```

How was this patch tested?

Unit tests

```
},
"UNSUPPORTED_OUTER_REFERENCE_DATA_TYPE" : {
  "message" : [
    "Correlated column references do not support data type <dataType>: <expr>"
```
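The `<dataType>` and `<expr>` placeholders in this template are substituted when the error is raised; a minimal sketch of that substitution (illustrative only, not Spark's actual error-class machinery):

```scala
// Toy placeholder substitution for error-class message templates.
def formatMessage(template: String, params: Map[String, String]): String =
  params.foldLeft(template) { case (msg, (key, value)) =>
    msg.replace(s"<$key>", value)
  }

val rendered = formatMessage(
  "Correlated column references do not support data type <dataType>: <expr>",
  Map("dataType" -> "map<string,int>", "expr" -> "v1.x"))
// rendered == "Correlated column references do not support data type map<string,int>: v1.x"
```

This renders exactly the message instance discussed in the review comments below.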
Contributor:
Nit: I found it was a bit confusing to read the error message instance: Correlated column references do not support data type map<string,int>: v1.x.

Basically v1.x as an expression was not clear in the error message.

For example, this is clearer to me: `Correlated column references do not support expression v1.x due to unsupported data type map<string,int>`. Something like this.

Contributor:

Maybe even Correlated column references do not support data type map<string,int> of expression v1.x is better.

Contributor:

Maybe just say:

The data type of correlated column references cannot be/contain map type.

Contributor (Author):

Thanks @amaliujia @cloud-fan I will update the error message.

@allisonwang-db (Contributor, Author): cc @cloud-fan

@allisonwang-db allisonwang-db force-pushed the spark-40615-check-subquery-data-type branch from 078a1a2 to dacf544 Compare October 6, 2022 19:25
```scala
throw QueryCompilationErrors.unsupportedCorrelatedReferenceDataTypeError(
  o, a.dataType, plan.origin)
} else {
  throw new IllegalStateException(s"Unable to decorrelate subquery: " +
```
Contributor:

If this should never happen, let's use `SparkException.internalError`.

```
@@ -3138,4 +3138,4 @@
    "<className> must override either <m1> or <m2>"
  ]
}
}
}
```
@amaliujia (Contributor) commented Oct 7, 2022:
Nit: revert this.

```diff
@@ -369,7 +370,7 @@ object DecorrelateInnerQuery extends PredicateHelper {
         throw QueryCompilationErrors.unsupportedCorrelatedReferenceDataTypeError(
           o, a.dataType, plan.origin)
       } else {
-        throw new IllegalStateException(s"Unable to decorrelate subquery: " +
+        throw SparkException.internalError(s"Unable to decorrelate subquery: " +
```
Contributor:

Why this change?

Contributor:

oh I saw Wenchen's comment above.

Contributor (Author):

```diff
@@ -879,6 +879,11 @@
       "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses<treeNode>"
     ]
   },
+  "UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE" : {
+    "message" : [
+      "Correlated column reference '<expr>' cannot be <dataType> type"
```
Contributor:

+1 this is better!

@amaliujia (Contributor): LGTM

@cloud-fan (Contributor): thanks, merging to master!

@cloud-fan cloud-fan closed this in 480ca17 Oct 18, 2022
HyukjinKwon added a commit that referenced this pull request Oct 21, 2022
…bled

### What changes were proposed in this pull request?

This PR proposes to make the tests added in #38050 pass with ANSI mode enabled, by avoiding string binary operations.

### Why are the changes needed?

To make the tests pass with ANSI mode enabled. Currently, they fail as below (https://github.com/apache/spark/actions/runs/3286184541/jobs/5414029918):

```
[info] - SPARK-40615: Check unsupported data type when decorrelating subqueries *** FAILED *** (118 milliseconds)
[info]   "[DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE] Cannot resolve "(a + a)" due to data type mismatch: the binary operator requires the input type ("NUMERIC" or "INTERVAL DAY TO SECOND" or "INTERVAL YEAR TO MONTH" or "INTERVAL"), not "STRING".; line 1 pos 15;
[info]   'Project [unresolvedalias(scalar-subquery#426412 [], None)]
[info]   :  +- 'Project [unresolvedalias((a#426411 + a#426411), None)]
[info]   :     +- SubqueryAlias __auto_generated_subquery_name
[info]   :        +- Project [upper(cast(outer(x#426413)[a] as string)) AS a#426411]
[info]   :           +- OneRowRelation
[info]   +- SubqueryAlias v1
[info]      +- View (`v1`, [x#426413])
[info]         +- Project [cast(x#426414 as map<string,int>) AS x#426413]
[info]            +- SubqueryAlias t
[info]               +- LocalRelation [x#426414]
[info]   " did not contain "Correlated column reference 'v1.x' cannot be map type" (SubquerySuite.scala:2480)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.SubquerySuite.$anonfun$new$320(SubquerySuite.scala:2480)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView(SQLTestUtils.scala:276)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTempView$(SQLTestUtils.scala:274)
[info]   at org.apache.spark.sql.SubquerySuite.withTempView(SubquerySuite.scala:32)
[info]   at org.apache.spark.sql.SubquerySuite.$anonfun$new$319(SubquerySuite.scala:2459)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
```
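The failure comes down to `+` being rejected for string operands under ANSI mode. A toy version of that binary-operator type check (illustrative, not Spark's actual resolver) looks like:

```scala
// Toy ANSI-style type check for the binary '+' operator: only numeric inputs
// are accepted here; string operands yield a DATATYPE_MISMATCH-style error.
sealed trait SqlType
case object IntegerType extends SqlType
case object DoubleType extends SqlType
case object StringType extends SqlType

def checkPlus(left: SqlType, right: SqlType): Either[String, SqlType] =
  (left, right) match {
    case (StringType, _) | (_, StringType) =>
      Left("""Cannot resolve "(a + a)" due to data type mismatch: """ +
        """the binary operator requires "NUMERIC", not "STRING"""")
    case (DoubleType, _) | (_, DoubleType) => Right(DoubleType)
    case _                                 => Right(IntegerType)
  }
```

Under a rule like this, the original test query fails to resolve because its inner `a` is the string `upper(x['a'])`; the follow-up keeps the test's intent while avoiding a string `+`.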

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran the tests and verified that they pass.

Closes #38325 from HyukjinKwon/SPARK-40615-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
SandishKumarHN pushed two commits to SandishKumarHN/spark that referenced this pull request on Dec 12, 2022.