[SPARK-30761][SQL] Nested column pruning should not prune on required child outputs in Generate #27503

viirya · 2020-02-09T04:12:29Z

What changes were proposed in this pull request?

We currently prune nested fields from Generate. If a child output is required by an operator above Generate, we should not prune nested fields from it. Otherwise, the accessors in that upper operator could become unresolved.

Why are the changes needed?

A required child output is one that an operator above Generate refers to, either as a whole or through its nested fields. If the rule prunes other nested fields from it, the accessors in that upper operator will be unresolved.
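To make the failure mode concrete, here is a minimal sketch in plain Scala. It is a toy model and not Spark's API: `StructColumn`, `FieldAccess`, `pruneChild`, and `resolves` are hypothetical names introduced only for illustration, and the column and field names are made up. It assumes the generator reads one nested field while an upper operator reads others from the same column.

```scala
// Toy model only; none of these types exist in Spark.
// A struct column is a name plus the set of nested field names it carries.
final case class StructColumn(name: String, fields: Set[String])

// A nested accessor such as `contact.first`.
final case class FieldAccess(column: String, field: String)

object GeneratePruningSketch {
  /** Keeps only the nested fields the generator reads, except on columns
    * listed in `requiredChildOutput`, which are left intact. Columns the
    * generator does not touch are also left intact. */
  def pruneChild(
      child: Seq[StructColumn],
      generatorAccesses: Seq[FieldAccess],
      requiredChildOutput: Set[String]): Seq[StructColumn] =
    child.map { col =>
      val used = generatorAccesses.filter(_.column == col.name).map(_.field).toSet
      if (requiredChildOutput.contains(col.name) || used.isEmpty) col
      else col.copy(fields = col.fields.intersect(used))
    }

  /** True when every accessor still finds its field in the given schema. */
  def resolves(schema: Seq[StructColumn], accesses: Seq[FieldAccess]): Boolean =
    accesses.forall { a =>
      schema.exists(c => c.name == a.column && c.fields.contains(a.field))
    }
}
```

In this sketch, pruning with an empty `requiredChildOutput` drops every field the generator does not read, so an upper accessor on one of those fields no longer resolves. Passing the column name in `requiredChildOutput` skips pruning on it and keeps those accessors resolvable.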

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

viirya · 2020-02-09T04:16:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

+      val requiredAttrs = AttributeSet(g.requiredChildOutput)
+      NestedColumnAliasing.getAliasSubMap(g.generator.children, requiredAttrs).map {


This case should normally be handled by the case pattern above (Project + Generate). But if all nested fields are selected in the top Project, the case above won't prune. Then, when the Optimizer transforms down to the underlying Generate, only the referenced nested columns are kept and the others are pruned from the child. This leaves the accessors in the top Project unresolved.

gatorsmile · 2020-02-09T05:08:20Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

@@ -333,6 +333,14 @@ abstract class SchemaPruningSuite
    }
  }

+  testSchemaPruning("select explode of nested field of array of struct and " +
+      "all remaining nested fields") {


Instead of fixing this case by case, can we try to find all the possible cases and ensure we cover all the possible query plans, including both negative and positive cases?

Also, we need unit test cases for these optimizer rules.

gatorsmile · 2020-02-09T05:11:15Z

If a child output is required in a top operator of Generate, we should not prune nested fields on it. Otherwise, the accessors on top operator could be unresolved.

Do we traverse all the ancestor nodes?

gatorsmile · 2020-02-09T05:13:09Z

@viirya How about reverting the original commit a0e63b6 and then consider how to improve the rule?

viirya · 2020-02-09T05:20:58Z

@gatorsmile I'm OK with reverting it.

SparkQA · 2020-02-09T08:05:02Z

Test build #118085 has finished for PR 27503 at commit 2e2302f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

For required child outputs, we don't want to prune it from Generate.

2e2302f

viirya mentioned this pull request Feb 9, 2020

[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project #26978

Closed


gatorsmile closed this Feb 9, 2020

gatorsmile reopened this Feb 9, 2020

viirya closed this Feb 9, 2020

viirya deleted the fix-generate-pruning branch December 27, 2023 18:38
