[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables#30881

Closed
imback82 wants to merge 20 commits into apache:master from imback82:describe_col_v2

Conversation

@imback82
Contributor

What changes were proposed in this pull request?

This PR proposes to implement DESCRIBE COLUMN for v2 tables.

Note that the isExtended option is not implemented in this PR.

Why are the changes needed?

Parity with v1 tables.

Does this PR introduce any user-facing change?

Yes, now, DESCRIBE COLUMN works for v2 tables.

sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|      data|
|data_type|    string|
|  comment|     hello|
+---------+----------+

Before this PR, the command would fail with: Describing columns is not supported for v2 tables.

How was this patch tested?

Added new test.

@github-actions github-actions bot added the SQL label Dec 22, 2020

SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37789/


SparkQA commented Dec 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37789/


SparkQA commented Dec 22, 2020

Test build #133191 has finished for PR 30881 at commit ec2b57b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37795/


SparkQA commented Dec 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37795/


SparkQA commented Dec 22, 2020

Test build #133197 has finished for PR 30881 at commit 66fa611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 25, 2020

Test build #133365 has started for PR 30881 at commit 9c26c49.


SparkQA commented Dec 25, 2020

Test build #133366 has started for PR 30881 at commit db934c1.


SparkQA commented Dec 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37957/


SparkQA commented Dec 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37957/

case d @ DescribeColumn(rt: ResolvedTable, _, _) =>
  rt.table match {
    // References for v1 tables are resolved in DescribeColumnCommand.
    case _: V1Table => d
Contributor

Shall we change the v1 command to take a resolved Attribute directly? Then we don't need the hack here.

This also reminds me that, for all v1 commands, we now look up the table twice: once in the framework, and once inside the v1 command. This is kind of a perf regression: previously we simply looked up the catalog and fell back to the v1 command if it's the session catalog, so the table lookup still happened only once, inside the v1 command. We can also update the v1 commands to take the resolved table directly.

Contributor

Or, if we don't care too much about DDL perf, then I think resolving the column twice is also fine and we don't need this hack either.

Contributor Author

Since the v1 command supports views as well, updating the v1 command to take a resolved Attribute doesn't solve the issue completely. We would have to pass both the Attribute and the original column name to the v1 command (or have two different v1 commands that take either a resolved attribute or a column name). Do you still prefer resolving attributes here for v1 tables?

Contributor

Then we shouldn't change the v1 command.

@cloud-fan (Contributor) commented Dec 28, 2020

But it seems fine to resolve the column twice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can re-construct the multi-part column name from the Attribute: attr.qualifier :+ attr.name
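A minimal sketch of that reconstruction, using a stand-in `Attr` case class for Spark's `Attribute` (names here are hypothetical, not Spark's API):

```scala
// Stand-in for a resolved Attribute: a qualifier (catalog/namespace/table
// parts) plus the column name itself.
case class Attr(qualifier: Seq[String], name: String)

// Rebuild the multi-part column name by appending the name to the qualifier.
def multiPartName(attr: Attr): Seq[String] = attr.qualifier :+ attr.name

val attr = Attr(Seq("testcat", "tbl"), "data")
assert(multiPartName(attr) == Seq("testcat", "tbl", "data"))
```

Note `:+` (append) rather than `+:` (prepend): the column name goes after the qualifier parts.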

Comment on lines +108 to +112
object ResolvedTable {
  def create(
      catalog: TableCatalog,
      identifier: Identifier,
      table: Table): ResolvedTable = {
Member

Usually this is apply?

Contributor

This follows DataSourceV2Relation.create


SparkQA commented Dec 25, 2020

Test build #133373 has finished for PR 30881 at commit 50e7f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37977/

case a: Attribute =>
  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
  throw QueryCompilationErrors.commandNotSupportNestedColumnError("DESC TABLE COLUMN")
Contributor Author

One disadvantage of this approach is that the exception message for views will be different when a nested column is specified; it will include the original name parts:

s"DESC TABLE COLUMN command does not support nested data types: $colName")

We can do one of the following:

  • Make the exception message the same even for views by dropping column name in DescribeColumnCommand
  • Store the original column name in DescribeColumn (then there would be no matching logic for the column in ResolveSessionCatalog, but it seems duplicated because we already have UnresolvedAttribute to store the original column name).
  • Construct the original name from GetStructField, GetArrayStructFields, etc.

WDYT, @cloud-fan ?

Contributor

Construct the original name from GetStructField, GetArrayStructFields, etc.

Is it simply nested.sql?

Contributor Author

For DESC desc_complex_col_table col.x, it will be:

DESC TABLE COLUMN command does not support nested data types: col.x

vs.

DESC TABLE COLUMN does not support nested column: spark_catalog.default.desc_complex_col_table.`col`.`x` AS `x`

Contributor

How about we strip the Alias and then call toPrettySQL, which is defined in org.apache.spark.sql.catalyst.util?
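A rough sketch of the strip-the-Alias idea, with simplified stand-ins for the catalyst expression classes (these are not Spark's real Alias/toPrettySQL, just enough structure to show why stripping the top-level alias cleans up the message):

```scala
// Simplified stand-ins for catalyst expressions.
sealed trait Expr { def sql: String }
case class Field(parts: Seq[String]) extends Expr {
  // Renders a multi-part column reference, e.g. `col`.`x`
  def sql: String = parts.map(p => s"`$p`").mkString(".")
}
case class Alias(child: Expr, name: String) extends Expr {
  def sql: String = s"${child.sql} AS `$name`"
}

// Strip a top-level Alias before pretty-printing for the error message.
def stripAlias(e: Expr): Expr = e match {
  case Alias(child, _) => child
  case other           => other
}

val resolved: Expr = Alias(Field(Seq("col", "x")), "x")
assert(resolved.sql == "`col`.`x` AS `x`")      // what the raw message showed
assert(stripAlias(resolved).sql == "`col`.`x`") // what we want to report
```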

Contributor Author

That seems to work pretty well. Thanks!


SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38050/


SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38050/

override def output: Seq[Attribute] = Nil
override def output: Seq[Attribute] = {
  val qualifier = catalog.name +: identifier.namespace :+ identifier.name
  outputAttributes.map(_.withQualifier(qualifier))
Contributor Author

Or we could wrap this in SubqueryAlias, similar to how DataSourceV2Relation is wrapped, but then we would need to update everywhere ResolvedTable is matched.
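The qualifier construction in the diff above can be sketched in isolation (the values here are hypothetical): the parts go catalog name, then namespace, then table name.

```scala
// Build the fully qualified prefix used to qualify the output attributes:
// catalog name, then the namespace parts, then the table name.
val catalogName = "testcat"
val namespace   = Seq("ns1", "ns2")
val tableName   = "tbl"

// Parenthesized to make the mixed prepend/append unambiguous.
val qualifier = (catalogName +: namespace) :+ tableName
assert(qualifier == Seq("testcat", "ns1", "ns2", "tbl"))
```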

Comment on lines 241 to 244
case u: UnresolvedAttribute =>
  // For views, the column will not be resolved by `ResolveReferences` because
  // `ResolvedView` stores only the identifier.
  DescribeColumnCommand(ident.asTableIdentifier, u.nameParts, isExtended)
Member

Is it possible that there is an unresolved attribute but the relation of DescribeColumn is a v1 table?

Contributor

It's possible when the column name doesn't exist in the table, and we should give a clear error message: Column $colName does not exist

Contributor Author

The comment was confusing since DescribeColumnCommand resolves the column again. I updated the code to separate the view and table matching to make the intention clear. Thanks.


SparkQA commented Dec 29, 2020

Test build #133461 has finished for PR 30881 at commit 8570d39.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.


assertAnalysisError(
  s"DESCRIBE $t invalid_col",
  "cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]")
Contributor Author

The error message is different for v1 / v2 tables when the column does not exist.
v1: Column invalid_col does not exist
v2: cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]
CheckAnalysis handles UnresolvedAttribute automatically for v2. Should we make this consistent (i.e., make v2 emit messages like v1)?

Contributor

I think v2 is better. Let's keep it.


SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38102/


SparkQA commented Dec 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38102/

  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
  throw QueryCompilationErrors.commandNotSupportNestedColumnError(
    "DESC TABLE COLUMN", toPrettySQL(nested, removeAlias = true))
Contributor

Can we manually remove the Alias here instead of changing the toPrettySQL method?

  throw QueryCompilationErrors.columnDoesNotExistError(u.name)
case a: Attribute =>
  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
Contributor

We can match Alias directly here, as other expressions won't be produced when resolving columns. e.g.

case Alias(child, _) => throw ... toPrettySQL(child)
case other => throw ...("[BUG] unexpected column expression: " + other)

throw new AnalysisException(
  s"DESC TABLE COLUMN command does not support nested data types: $colName")
throw QueryCompilationErrors.commandNotSupportNestedColumnError(
  "DESC TABLE COLUMN", toPrettySQL(field, removeAlias = true))
Contributor

We won't hit this branch anymore, as we fail earlier in ResolveSessionCatalog. We can simply use assert(field.isInstanceOf[Attribute])

Contributor Author

I think this can still be hit for views.


SparkQA commented Dec 30, 2020

Test build #133516 has finished for PR 30881 at commit 0196462.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 30, 2020

Test build #133513 has finished for PR 30881 at commit ae23ee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please


SparkQA commented Dec 30, 2020

Test build #133520 has finished for PR 30881 at commit 0196462.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 30, 2020

Test build #133540 has finished for PR 30881 at commit a19f5f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StringTrim(srcStr: Expression, trimStr: Option[Expression] = None)
  • case class StringTrimLeft(srcStr: Expression, trimStr: Option[Expression] = None)
  • case class StringTrimRight(srcStr: Expression, trimStr: Option[Expression] = None)

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ddc0d51 Jan 4, 2021