
[SPARK-34331][SQL] Speed up DS v2 metadata col resolution #31440

Closed
wants to merge 1 commit into master from cloud-fan/metadata-col

Conversation

cloud-fan (Contributor) commented Feb 2, 2021

What changes were proposed in this pull request?

This is a follow-up of #28027.

#28027 added a DS v2 API that lets data sources produce metadata/hidden columns, which are only visible when explicitly selected. The way we integrate this API into Spark is:

  1. The v2 relation gets normal output and metadata output from the data source, and the metadata output is excluded from the plan output by default.
  2. Column resolution can resolve `UnresolvedAttribute`s against metadata columns, even if the child plan doesn't output them (see the sketch just after this list).
  3. An analyzer rule searches the query plan for nodes with missing inputs. If such a node is found, it transforms the sub-plan of that node and updates the v2 relation to include the metadata output.
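
As a rough illustration of the "only visible when explicitly selected" behavior (the catalog/table/column names here are hypothetical, except `index`, the metadata column defined by the test `InMemoryTable` later in this PR):

```scala
// The metadata column does not appear in the default output:
spark.table("testcat.ns.t").printSchema()          // no "index" field

// An explicit reference, however, resolves against the metadata output:
spark.table("testcat.ns.t").select("id", "index")  // "index" is a metadata column
```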

The analyzer rule in step 3 brings a perf regression for queries that do not read v2 tables at all. The rule calculates `QueryPlan.inputSet` (which builds an `AttributeSet` from the outputs of all children) and `QueryPlan.missingInput` (which does a set exclusion and creates a new `AttributeSet`) for every plan node in the query plan. In our benchmark, TPCDS query compilation time increases by more than 10%.

This PR proposes a simple way to improve it: we add a special metadata entry to the metadata attribute, which allows us to quickly check whether a plan needs metadata columns added. Instead of calculating `QueryPlan.missingInput`, we just scan the references of the plan and check whether any attribute carries the special metadata entry.
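
A minimal sketch of the fast check, assuming names that appear elsewhere in this PR (`METADATA_COL_ATTR_KEY` and `MetadataColumnHelper` are in the diff); the rule body is paraphrased, not the literal patch:

```scala
// Attributes created from metadata columns carry a boolean flag in their
// attribute metadata (see the asStruct snippet reviewed below).
implicit class MetadataColumnHelper(attr: Attribute) {
  def isMetadataCol: Boolean =
    attr.metadata.contains(METADATA_COL_ATTR_KEY) &&
      attr.metadata.getBoolean(METADATA_COL_ATTR_KEY)
}

// The rule can now bail out after a cheap scan of a node's expression
// references, instead of building inputSet/missingInput AttributeSets:
def metadataColRefs(node: LogicalPlan): Seq[Attribute] =
  node.expressions.flatMap(_.references).filter(_.isMetadataCol)
```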

This PR also fixes one bug: we should not change the final output schema of the plan if metadata columns are only used in operators like filter, sort, etc.
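
A sketch of the fix, mirroring the `addMetadataCol` snippets reviewed later in this thread: after adding metadata columns under a node, wrap the result in a `Project` over the node's original output, so the top-level schema is unchanged:

```scala
if (metaCols.isEmpty) {
  node
} else {
  val newNode = addMetadataCol(node)
  // Project away the extra metadata columns: operators below (filter,
  // sort, ...) can still see them, but the plan's output schema stays
  // exactly what the user asked for.
  Project(node.output, newNode)
}
```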

Why are the changes needed?

Fix perf regression in SQL query compilation, and fix a bug.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Run `org.apache.spark.sql.TPCDSQuerySuite`. Before this PR, `AddMetadataColumns` ranked fourth among analyzer/optimizer rules by total running time:

```
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 407641
Total time: 47.257239779 seconds

Rule                                  Effective Time / Total Time                     Effective Runs / Total Runs

OptimizeSubqueries                      4157690003 / 8485444626                         49 / 2778
Analyzer$ResolveAggregateFunctions      1238968711 / 3369351761                         49 / 2141
ColumnPruning                           660038236 / 2924755292                          338 / 6391
Analyzer$AddMetadataColumns             0 / 2918352992                                  0 / 2151
```

After this PR:

```
Analyzer$AddMetadataColumns             0 / 122885629                                   0 / 2151
```

This rule is now more than 20 times faster, and its cost is negligible relative to the total compilation time.

This PR also adds new tests to verify the bug fix.



SparkQA commented Feb 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39373/


SparkQA commented Feb 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39373/

Comment on lines 4040 to 4042
case a: AppendData => a.withNewTable(removeMetaCol(a.table))
case o: OverwriteByExpression => o.withNewTable(removeMetaCol(o.table))
case o: OverwritePartitionsDynamic => o.withNewTable(removeMetaCol(o.table))
Contributor

Can these be replaced with `case v: V2WriteCommand => v.withNewTable(removeMetaCol(v.table))`, or do we need to match these specific types?

Contributor Author

good point.

* This rule removes metadata columns from `DataSourceV2Relation` in two cases:
* - A single v2 scan (which can be produced by `spark.table`): this is similar to star expansion, and
*   metadata columns should only be picked up by explicit references.
* - V2 scans under write commands, as we can't insert into metadata columns.
Member

This is for the table in `InsertIntoStatement`. How about the query? E.g. `spark.table(...).write.insertInto(...)`. Do we need to remove metadata columns for the query here too, if it is also a v2 scan?


SparkQA commented Feb 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39384/

github-actions bot added the SQL label Feb 2, 2021

SparkQA commented Feb 2, 2021

Test build #134786 has finished for PR 31440 at commit 461f42a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class MetadataColumnHelper(attr: Attribute)


SparkQA commented Feb 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39384/


SparkQA commented Feb 2, 2021

Test build #134796 has finished for PR 31440 at commit 51086e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def removeMetaCol(tbl: NamedRelation): NamedRelation = tbl match {
  case r: DataSourceV2Relation =>
    if (r.output.exists(_.isMetadataCol)) {
Contributor

Why guard `r.copy(output = r.output.filterNot(_.isMetadataCol))` behind this check? Why not do it unconditionally?
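
For concreteness, the unconditional version this question suggests might look like the sketch below (the remaining match cases are elided in the diff above; presumably the guard in the patch avoids `copy`-ing the relation when there is nothing to strip):

```scala
private def removeMetaCol(tbl: NamedRelation): NamedRelation = tbl match {
  case r: DataSourceV2Relation =>
    // filterNot is a no-op when no metadata columns are present,
    // but copy would still allocate a new relation node.
    r.copy(output = r.output.filterNot(_.isMetadataCol))
  case _ => tbl
}
```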

@@ -61,7 +61,7 @@ class InMemoryTable(

 private object IndexColumn extends MetadataColumn {
   override def name: String = "index"
-  override def dataType: DataType = StringType
+  override def dataType: DataType = IntegerType
Contributor Author

The actual data is an int.

cloud-fan force-pushed the metadata-col branch 2 times, most recently from 849862b to bd7b479, on February 3, 2021 15:52
cloud-fan (Contributor Author)

My new approach doesn't work for DataFrame queries, so I went back to the original approach, with some improvements to fix the perf regression. The patch is much smaller now; please take another look, thanks!


SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39428/


SparkQA commented Feb 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39428/


SparkQA commented Feb 3, 2021

Test build #134842 has finished for PR 31440 at commit be4aefe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class MetadataColumnHelper(attr: Attribute)

cloud-fan (Contributor Author)

Tests pass and it's ready for review. @rdblue @brkyvz @viirya

HyukjinKwon (Member) left a comment

This change makes sense to me.

if (metaCols.isEmpty) {
  node
} else {
  val newNode = addMetadataCol(node)
Member

No matter how many metadata columns we actually reference, we always add all of them, right?

Contributor Author

Yea, it's good enough because:

  1. We guarantee that the outer plan will project away the extra columns, so the final output schema won't change.
  2. Column pruning still applies, so eventually the data source doesn't need to produce unreferenced columns (hypothetical illustration below).
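
A hypothetical illustration of point 2 (names made up, except `index`, the test metadata column):

```scala
// Sorting by a metadata column pulls the metadata output into the scan:
spark.table("testcat.ns.t").orderBy("index").select("id")
// Conceptually: Project(id, Sort(index, Relation(id, data, index, ...))).
// The outer Project keeps the user-visible schema to just "id", and
// ColumnPruning later narrows the scan, so metadata columns that are
// never referenced are not produced by the source at all.
```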

viirya (Member) left a comment

The current approach looks reasonable. Just one question.

cloud-fan (Contributor Author)

thanks for the review, merging to master/3.1!

cloud-fan closed this in 989eb68 on Feb 5, 2021
cloud-fan added a commit that referenced this pull request Feb 5, 2021
Closes #31440 from cloud-fan/metadata-col.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 989eb68)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
imback82 (Contributor) left a comment

Late +1

  node
} else {
  val newNode = addMetadataCol(node)
  // We should not change the output schema of the plan. We should project away the extr
Contributor

nit: extr -> extra

@@ -83,7 +85,8 @@ object DataSourceV2Implicits {

 implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn]) {
   def asStruct: StructType = {
     val fields = metadata.map { metaCol =>
       val field = StructField(metaCol.name, metaCol.dataType, metaCol.isNullable)
       val fieldMeta = new MetadataBuilder().putBoolean(METADATA_COL_ATTR_KEY, true).build()
imback82 (Contributor) commented Feb 5, 2021

nit: this can be created outside the loop (or even at object level, as a new metadata constant to go with `METADATA_COL_ATTR_KEY`); see the sketch below.
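
A sketch of the suggested refactor (the constant name `METADATA_COL_FIELD_METADATA` is made up): the metadata value never depends on the loop variable, so it can be built once:

```scala
// Hypothetical object-level constant next to METADATA_COL_ATTR_KEY:
private val METADATA_COL_FIELD_METADATA: Metadata =
  new MetadataBuilder().putBoolean(METADATA_COL_ATTR_KEY, true).build()

implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn]) {
  def asStruct: StructType = {
    val fields = metadata.map { metaCol =>
      // StructField accepts the attribute metadata directly, so no
      // per-iteration MetadataBuilder is needed.
      StructField(metaCol.name, metaCol.dataType, metaCol.isNullable, METADATA_COL_FIELD_METADATA)
    }
    StructType(fields)
  }
}
```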
