[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables#30881

Closed
imback82 wants to merge 20 commits into apache:master from imback82:describe_col_v2

Conversation

@imback82
Contributor

What changes were proposed in this pull request?

This PR proposes to implement DESCRIBE COLUMN for v2 tables.

Note that the isExtended option is not implemented in this PR.

Why are the changes needed?

Parity with v1 tables.

Does this PR introduce any user-facing change?

Yes, now, DESCRIBE COLUMN works for v2 tables.

sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|      data|
|data_type|    string|
|  comment|     hello|
+---------+----------+

Before this PR, the command would fail with: Describing columns is not supported for v2 tables.

How was this patch tested?

Added new test.

@github-actions github-actions bot added the SQL label Dec 22, 2020

SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37789/


SparkQA commented Dec 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37789/


SparkQA commented Dec 22, 2020

Test build #133191 has finished for PR 30881 at commit ec2b57b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37795/


SparkQA commented Dec 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37795/


SparkQA commented Dec 22, 2020

Test build #133197 has finished for PR 30881 at commit 66fa611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 25, 2020

Test build #133365 has started for PR 30881 at commit 9c26c49.


SparkQA commented Dec 25, 2020

Test build #133366 has started for PR 30881 at commit db934c1.


SparkQA commented Dec 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37957/


SparkQA commented Dec 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37957/

case d @ DescribeColumn(rt: ResolvedTable, _, _) =>
  rt.table match {
    // References for v1 tables are resolved in DescribeColumnCommand.
    case _: V1Table => d
Contributor

Shall we change the v1 command to take a resolved Attribute directly? Then we don't need the hack here.

This also reminds me that, for all v1 commands, we now look up the table twice: once in the framework, and once inside the v1 command. This is kind of a perf regression: previously we simply looked up the catalog and fell back to the v1 command if it's the session catalog, so the table lookup still happened only once, inside the v1 command. We can also update the v1 commands to take the resolved table directly.

Contributor

Or, if we don't care too much about DDL perf, then I think resolving the column twice is also fine and we don't need this hack either.

Contributor Author

Since the v1 command supports views as well, updating the v1 command to take a resolved Attribute doesn't solve the issue completely. We would have to pass both the Attribute and the original column name to the v1 command (or have two different v1 commands that take either a resolved attribute or a column name). Do you still prefer resolving attributes here for v1 tables?

Contributor

Then we shouldn't change the v1 command.

@cloud-fan (Contributor) commented Dec 28, 2020

But it seems fine to resolve the column twice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can re-construct the multi-part column name from the Attribute: attr.qualifier :+ attr.name
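A minimal sketch of that reconstruction, using a stand-in `Attr` case class for Spark's `Attribute` (names here are hypothetical, not Spark's API):

```scala
// Stand-in for a resolved Attribute: a qualifier (catalog/namespace/table
// parts) plus the column name itself.
case class Attr(qualifier: Seq[String], name: String)

// Rebuild the multi-part column name by appending the name to the qualifier.
def multiPartName(attr: Attr): Seq[String] = attr.qualifier :+ attr.name

val attr = Attr(Seq("testcat", "tbl"), "data")
assert(multiPartName(attr) == Seq("testcat", "tbl", "data"))
```

Note `:+` (append) rather than `+:` (prepend): the column name goes after the qualifier parts.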

Comment on lines +108 to +112
object ResolvedTable {
  def create(
      catalog: TableCatalog,
      identifier: Identifier,
      table: Table): ResolvedTable = {
Member

Usually this is apply?

Contributor

This follows DataSourceV2Relation.create


SparkQA commented Dec 25, 2020

Test build #133373 has finished for PR 30881 at commit 50e7f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37977/

case a: Attribute =>
  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
  throw QueryCompilationErrors.commandNotSupportNestedColumnError("DESC TABLE COLUMN")
Contributor Author

One disadvantage of this approach is that the exception message for views will be different when a nested column is specified; it will include the original name parts:

s"DESC TABLE COLUMN command does not support nested data types: $colName")

We can do one of the following:

  • Make the exception message the same even for views by dropping column name in DescribeColumnCommand
  • Store the original column name in DescribeColumn (then there would be no matching logic for the column in ResolveSessionCatalog, but it seems duplicated because we already have UnresolvedAttribute to store the original column name).
  • Construct the original name from GetStructField, GetArrayStructFields, etc.

WDYT, @cloud-fan ?

Contributor

Construct the original name from GetStructField, GetArrayStructFields, etc.

Is it simply nested.sql?

Contributor Author

For DESC desc_complex_col_table col.x, it will be:

DESC TABLE COLUMN command does not support nested data types: col.x

vs.

DESC TABLE COLUMN does not support nested column: spark_catalog.default.desc_complex_col_table.`col`.`x` AS `x`

Contributor

How about we strip the Alias and then call toPrettySQL, which is defined in org.apache.spark.sql.catalyst.util?
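A rough sketch of the strip-the-Alias idea, with simplified stand-ins for the catalyst expression classes (these are not Spark's real Alias/toPrettySQL, just enough structure to show why stripping the top-level alias cleans up the message):

```scala
// Simplified stand-ins for catalyst expressions.
sealed trait Expr { def sql: String }
case class Field(parts: Seq[String]) extends Expr {
  // Renders a multi-part column reference, e.g. `col`.`x`
  def sql: String = parts.map(p => s"`$p`").mkString(".")
}
case class Alias(child: Expr, name: String) extends Expr {
  def sql: String = s"${child.sql} AS `$name`"
}

// Strip a top-level Alias before pretty-printing for the error message.
def stripAlias(e: Expr): Expr = e match {
  case Alias(child, _) => child
  case other           => other
}

val resolved: Expr = Alias(Field(Seq("col", "x")), "x")
assert(resolved.sql == "`col`.`x` AS `x`")      // what the raw message showed
assert(stripAlias(resolved).sql == "`col`.`x`") // what we want to report
```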

Contributor Author

That seems to work pretty well. Thanks!


SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38050/


SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38050/

override def output: Seq[Attribute] = Nil
override def output: Seq[Attribute] = {
  val qualifier = catalog.name +: identifier.namespace :+ identifier.name
  outputAttributes.map(_.withQualifier(qualifier))
Contributor Author

Or we could wrap this in SubqueryAlias, similar to how DataSourceV2Relation is wrapped, but then we would need to update everywhere ResolvedTable is matched.
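The qualifier construction in the diff above can be sketched in isolation (the values here are hypothetical): the parts go catalog name, then namespace, then table name.

```scala
// Build the fully qualified prefix used to qualify the output attributes:
// catalog name, then the namespace parts, then the table name.
val catalogName = "testcat"
val namespace   = Seq("ns1", "ns2")
val tableName   = "tbl"

// Parenthesized to make the mixed prepend/append unambiguous.
val qualifier = (catalogName +: namespace) :+ tableName
assert(qualifier == Seq("testcat", "ns1", "ns2", "tbl"))
```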

Comment on lines 241 to 244
case u: UnresolvedAttribute =>
  // For views, the column will not be resolved by `ResolveReferences` because
  // `ResolvedView` stores only the identifier.
  DescribeColumnCommand(ident.asTableIdentifier, u.nameParts, isExtended)
Member

Is it possible that there is an unresolved attribute but the relation of DescribeColumn is a v1 table?

Contributor

It's possible when the column name doesn't exist in the table, and we should give a clear error message: Column $colName does not exist

Contributor Author

The comment was confusing since DescribeColumnCommand resolves the column again. I updated the code to separate the view and table matching to make the intention clear. Thanks.


SparkQA commented Dec 29, 2020

Test build #133461 has finished for PR 30881 at commit 8570d39.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.


assertAnalysisError(
  s"DESCRIBE $t invalid_col",
  "cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]")
Contributor Author

The error message is different for v1 / v2 tables when the column does not exist.
v1: Column invalid_col does not exist
v2: cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]
CheckAnalysis handles UnresolvedAttribute automatically for v2. Should we make this consistent (i.e., make v2 emit messages like v1)?

Contributor

I think v2 is better. Let's keep it.


SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38102/


SparkQA commented Dec 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38102/

  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
  throw QueryCompilationErrors.commandNotSupportNestedColumnError(
    "DESC TABLE COLUMN", toPrettySQL(nested, removeAlias = true))
Contributor

Can we manually remove the Alias here instead of changing the toPrettySQL method?

  throw QueryCompilationErrors.columnDoesNotExistError(u.name)
case a: Attribute =>
  DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
case nested =>
Contributor

We can match Alias directly here, as other expressions won't be produced when resolving columns. e.g.

case Alias(child, _) => throw ... toPrettySQL(child)
case other => throw ...("[BUG] unexpected column expression: " + other)

throw new AnalysisException(
  s"DESC TABLE COLUMN command does not support nested data types: $colName")
throw QueryCompilationErrors.commandNotSupportNestedColumnError(
  "DESC TABLE COLUMN", toPrettySQL(field, removeAlias = true))
Contributor

We won't hit this branch anymore, as we fail earlier in ResolveSessionCatalog. We can simply use assert(field.isInstanceOf[Attribute])

Contributor Author

I think this can still be hit for views.


SparkQA commented Dec 30, 2020

Test build #133516 has finished for PR 30881 at commit 0196462.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 30, 2020

Test build #133513 has finished for PR 30881 at commit ae23ee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please


SparkQA commented Dec 30, 2020

Test build #133520 has finished for PR 30881 at commit 0196462.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 30, 2020

Test build #133540 has finished for PR 30881 at commit a19f5f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StringTrim(srcStr: Expression, trimStr: Option[Expression] = None)
  • case class StringTrimLeft(srcStr: Expression, trimStr: Option[Expression] = None)
  • case class StringTrimRight(srcStr: Expression, trimStr: Option[Expression] = None)

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ddc0d51 Jan 4, 2021