[SPARK-38977][SQL] Fix schema pruning with correlated subqueries #36303

aokolnychyi · 2022-04-21T16:41:03Z

What changes were proposed in this pull request?

This PR fixes schema pruning for queries with multiple correlated subqueries. Previously, Spark would throw an exception while determining root fields in SchemaPruning$identifyRootFields. This happened because expressions in predicates that referenced attributes from subqueries were not ignored. As a result, attributes from multiple subqueries could conflict with each other (e.g. because of incompatible types), even though they should have been ignored.

For instance, the following query would throw a runtime exception.

SELECT name FROM contacts c
WHERE
 EXISTS (SELECT 1 FROM ids i WHERE i.value = c.id)
 AND
 EXISTS (SELECT 1 FROM first_names n WHERE c.name.first = n.value)

[info]   org.apache.spark.SparkException: Failed to merge fields 'value' and 'value'. Failed to merge incompatible data types int and string
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.failedMergingFieldsError(QueryExecutionErrors.scala:936)
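
To make the failure mode concrete, here is a minimal toy model of merging fields by name. It is only a sketch: `MergeSketch`, `Field`, `merge`, and the type objects are hypothetical names invented for illustration and are not Spark's classes. It assumes that `value` from `ids` is an int and `value` from `first_names` is a string, as the error message above suggests.

```scala
object MergeSketch {
  // Hypothetical miniature type system; not Spark's DataType hierarchy.
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  final case class StructType(fields: Seq[Field]) extends DataType

  final case class Field(name: String, dataType: DataType)

  // Merge fields by name. Two fields that share a name must agree on type,
  // otherwise the merge reports an error instead of producing a schema.
  def merge(fields: Seq[Field]): Either[String, Map[String, DataType]] =
    fields.foldLeft[Either[String, Map[String, DataType]]](Right(Map.empty)) {
      case (Right(acc), f) =>
        acc.get(f.name) match {
          case Some(t) if t != f.dataType =>
            Left(s"Failed to merge fields '${f.name}' and '${f.name}': $t vs ${f.dataType}")
          case _ => Right(acc + (f.name -> f.dataType))
        }
      case (err, _) => err
    }

  // Attributes leaking out of the two subqueries: same name, different types.
  val leaked: Seq[Field] = Seq(Field("value", IntType), Field("value", StringType))

  // What remains when only attributes of `contacts` are considered.
  val contactsOnly: Seq[Field] = Seq(
    Field("id", IntType),
    Field("name", StructType(Seq(Field("first", StringType)))))
}
```

In this model, `MergeSketch.merge(MergeSketch.leaked)` yields a `Left`, while `MergeSketch.merge(MergeSketch.contactsOnly)` yields a `Right`, which mirrors why the subquery attributes need to be left out of the merge.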

Why are the changes needed?

These changes are needed to avoid exceptions for some queries with multiple correlated subqueries.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR comes with tests.

aokolnychyi · 2022-04-21T16:45:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala

@@ -152,6 +152,10 @@ object SchemaPruning extends SQLConfHelper {
        RootField(field, derivedFromAtt = false, prunedIfAnyChildAccessed = true) :: Nil
      case IsNotNull(_: Attribute) | IsNull(_: Attribute) =>
        expr.children.flatMap(getRootFields).map(_.copy(prunedIfAnyChildAccessed = true))
+      case s: SubqueryExpression =>


Initially, I tried another approach: I passed an AttributeSet with the table attributes and checked above whether an attribute belongs to the table output. However, that required changes in many places. This change is much smaller. Let me know if there are cases where this will not work.

huaxingao

This change looks reasonable to me. I am not aware of cases where this will not work. Let's wait for feedback from others.
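
The two approaches discussed in this thread can be contrasted with a small sketch. Everything here is hypothetical: `Expr`, `Attr`, `Exists`, and the three functions are toy stand-ins, not Catalyst classes, and the body of the new `case s: SubqueryExpression` branch is not shown in the hunk above. The sketch assumes that a subquery node exposes its outer references separately from its correlated condition.

```scala
object RootFieldSketch {
  // Hypothetical miniature expression tree; not Catalyst's Expression hierarchy.
  sealed trait Expr { def children: Seq[Expr] }
  final case class Attr(relation: String, name: String) extends Expr {
    def children: Seq[Expr] = Nil
  }
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  // Assumption: outer references are tracked apart from the correlated condition.
  final case class Exists(outerRefs: Seq[Attr], condition: Expr) extends Expr {
    def children: Seq[Expr] = Seq(condition)
  }

  // Descends everywhere, so attributes of other relations leak into the result.
  def naiveRoots(e: Expr): Seq[Attr] = e match {
    case a: Attr => Seq(a)
    case other => other.children.flatMap(naiveRoots)
  }

  // Smaller change: one extra case that does not descend into the subquery.
  def rootsViaOuterRefs(e: Expr): Seq[Attr] = e match {
    case a: Attr => Seq(a)
    case s: Exists => s.outerRefs
    case other => other.children.flatMap(rootsViaOuterRefs)
  }

  // Alternative: thread the table output through every call and filter on it.
  def rootsFilteredByOutput(e: Expr, tableOutput: Set[Attr]): Seq[Attr] = e match {
    case a: Attr => if (tableOutput.contains(a)) Seq(a) else Nil
    case other => other.children.flatMap(rootsFilteredByOutput(_, tableOutput))
  }

  val cId: Attr = Attr("contacts", "id")
  val cName: Attr = Attr("contacts", "name")
  val predicate: Expr = And(
    Exists(Seq(cId), EqualTo(Attr("ids", "value"), cId)),
    Exists(Seq(cName), EqualTo(cName, Attr("first_names", "value"))))
}
```

Both non-naive variants return only the `contacts` attributes for `predicate`. The second one needs the extra `tableOutput` parameter at every call site, which is the kind of wider change the comment above describes.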

aokolnychyi · 2022-04-21T16:46:24Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -935,4 +935,106 @@ abstract class SchemaPruningSuite
      .count()
    assert(count == 0)
  }
+
+  testSchemaPruning("SPARK-38977: schema pruning with correlated EXISTS subquery") {


All of these queries would previously fail for V2 tables.

cloud-fan

Does this bug happen only for V2 tables, and not for file source tables?

aokolnychyi

I guess it would fail for both, since the same method is used. The tests cover V1 and V2, so it should work for both now.

aokolnychyi · 2022-04-21T17:30:28Z

@huaxingao @sunchao @viirya @HyukjinKwon @cloud-fan @dongjoon-hyun @parthchandra, could you take a look whenever you have a minute?

viirya · 2022-04-21T18:36:08Z

This looks similar to #36216?

aokolnychyi · 2022-04-21T19:21:44Z

@viirya, it looks similar, but I am afraid #36216 does not address the problem that causes the queries in this PR to fail.

As far as I can see, it updates ProjectionOverSchema, which is used after calling SchemaPruning$identifyRootFields. In my case, the failure happened while merging the schema in identifyRootFields. I am not sure whether my fix covers the other case, though.

aokolnychyi · 2022-04-21T19:46:23Z

We may need both. Let me quickly check.

aokolnychyi · 2022-04-21T20:49:23Z

Alright, I think we will need both.

PR #36216 ([SPARK-38918][SQL] Nested column pruning should filter out attributes that do not belong to the current relation) does not solve the problem in identifyRootFields, and the tests from this PR still fail with only that change applied.
This PR does not address the problem of conflicting names, which is solved by #36216.

After this PR, the output returned by PushDownUtils$pruneColumns will only include columns from one relation. However, we still apply ProjectionOverSchema on filters with subqueries that may reference other relations too. That's why we need both changes.

@allisonwang-db @viirya, what do you think?

huaxingao · 2022-04-21T22:26:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+      df2.createOrReplaceTempView("first_names")
+
+      val query = sql(
+        s"""SELECT name FROM contacts c


nit: remove the s prefix?

huaxingao · 2022-04-21T22:26:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+
+      val query = sql(
+        s"""SELECT name FROM contacts c
+           |WHERE


nit: This seems to be a 3-space indentation?

huaxingao · 2022-04-21T22:27:17Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+      val query = sql(
+        s"""SELECT name FROM contacts c
+           |WHERE
+           | EXISTS (SELECT 1 FROM ids i WHERE i.value = c.id)


nit: 2-space after |?
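
Taken together, the three nits appear to ask for something like the following. This is only a sketch of one reading of the comments: the final formatting in the PR is not shown here, and `QueryTextSketch` is a hypothetical wrapper added so the snippet stands on its own.

```scala
object QueryTextSketch {
  // No `s` prefix: the string contains nothing to interpolate.
  // Each `|` marks the left margin; nested clauses use two spaces after it.
  val query: String =
    """SELECT name FROM contacts c
      |WHERE
      |  EXISTS (SELECT 1 FROM ids i WHERE i.value = c.id)
      |  AND
      |  EXISTS (SELECT 1 FROM first_names n WHERE c.name.first = n.value)
      |""".stripMargin
}
```

`stripMargin` drops everything up to and including the `|` on each line, so the indentation of the Scala source does not end up in the SQL text.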

viirya

LGTM. I agree that we need both PRs. Thanks, @aokolnychyi.

viirya · 2022-04-22T17:01:08Z

I'm going to merge this once CI passes.

aokolnychyi · 2022-04-22T17:01:22Z

I think I addressed the indentation comments in all tests. @huaxingao, could you double check, please?

aokolnychyi · 2022-04-22T17:03:05Z

Thanks for reviewing, @huaxingao @viirya @cloud-fan!

allisonwang-db

Thanks for the fix!

viirya · 2022-04-22T21:11:08Z

Thanks @aokolnychyi and all. Merging to master/3.3/3.2.

Closes #36303 from aokolnychyi/spark-38977.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 0c9947d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>


Commit 48b49a2: [SPARK-38977][SQL] Fix schema pruning with correlated subqueries

github-actions bot added the SQL label Apr 21, 2022

aokolnychyi mentioned this pull request Apr 21, 2022: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands #36304 (Closed)

huaxingao reviewed Apr 21, 2022

cloud-fan approved these changes Apr 22, 2022

viirya approved these changes Apr 22, 2022

Commit 011039e: Review

huaxingao approved these changes Apr 22, 2022

allisonwang-db approved these changes Apr 22, 2022

viirya closed this in 0c9947d Apr 22, 2022
