[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables #30881
imback82 wants to merge 20 commits into apache:master from
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #133191 has finished for PR 30881 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveAttribute.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #133197 has finished for PR 30881 at commit
Test build #133365 has started for PR 30881 at commit
Test build #133366 has started for PR 30881 at commit
Kubernetes integration test starting
Kubernetes integration test status success
...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
    case d @ DescribeColumn(rt: ResolvedTable, _, _) =>
      rt.table match {
        // References for v1 tables are resolved in DescribeColumnCommand.
        case _: V1Table => d
Shall we change the v1 command to take the resolved Attribute directly? Then we don't need the hack here.
This also reminds me that, for all v1 commands, we look up the table twice: once in the framework, and once inside the v1 command. This is kind of a perf regression: previously we simply looked up the catalog and fell back to the v1 command if it was the session catalog, so the table lookup happened only once, inside the v1 command. We can also update the v1 command to take the resolved table directly.
Or, if we don't care too much about DDL perf, then I think resolving the column twice is also fine and we don't need this hack either.
Since the v1 command supports views as well, updating the v1 command to take a resolved Attribute doesn't solve the issue completely. We would have to pass both the Attribute and the original column name to the v1 command (or have two different v1 commands that take either a resolved attribute or a column name). Do you still prefer resolving attributes here for v1 tables?
Then we shouldn't change the v1 command.
But it seems fine to resolve the column twice?
We can reconstruct the multi-part column name from the Attribute: attr.qualifier :+ attr.name
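A minimal sketch of that reconstruction, assuming a simplified `Attr` case class as a hypothetical stand-in for Spark's `Attribute` (only the `qualifier` and `name` fields are modelled). Because the qualifier is a sequence and the name is a single string, the append operator `:+` is the one that yields the full list of name parts:

```scala
object MultipartName {
  // Hypothetical stand-in for Spark's Attribute; only the two fields used here.
  final case class Attr(qualifier: Seq[String], name: String)

  // Append the column name to its qualifier to obtain the full multi-part name.
  def nameParts(attr: Attr): Seq[String] = attr.qualifier :+ attr.name
}
```

For an attribute `id` qualified by `testcat.tbl`, this gives the parts `testcat`, `tbl`, `id`, in that order.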
    object ResolvedTable {
      def create(
          catalog: TableCatalog,
          identifier: Identifier,
          table: Table): ResolvedTable = {
This follows DataSourceV2Relation.create
...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
Test build #133373 has finished for PR 30881 at commit
Kubernetes integration test starting
    case a: Attribute =>
      DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
    case nested =>
      throw QueryCompilationErrors.commandNotSupportNestedColumnError("DESC TABLE COLUMN")
One disadvantage of this approach is that the exception message for views will be different when a nested column is specified; it will have the original name parts:
We can do one of the following:
- Make the exception message the same even for views by dropping the column name in DescribeColumnCommand
- Store the original column name in DescribeColumn (there would then be no matching logic for the column in ResolveSessionCatalog, but this seems duplicated because we already have UnresolvedAttribute to store the original column name)
- Construct the original name from GetStructField, GetArrayStructFields, etc.
WDYT, @cloud-fan?
Construct the original name from GetStructField, GetArrayStructFields, etc.
Is it simply nested.sql?
For DESC desc_complex_col_table col.x,
It will be:
DESC TABLE COLUMN command does not support nested data types: col.x
vs.
DESC TABLE COLUMN does not support nested column: spark_catalog.default.desc_complex_col_table.`col`.`x` AS `x`
How about we strip the Alias and then call toPrettySQL which is defined in org.apache.spark.sql.catalyst.util?
That seems to work pretty well. Thanks!
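The "strip the Alias, then pretty-print" idea can be sketched roughly as follows. The expression tree here is a hypothetical miniature (`Attr`, `GetStructField`, `Alias` are simplified stand-ins, not Spark's real classes), and `prettyName` only mimics the intent of producing a short dotted name; it is not the actual behaviour of `toPrettySQL`:

```scala
object NestedColumnName {
  // Hypothetical miniature expression tree; Spark's Expression hierarchy is far larger.
  sealed trait Expr
  final case class Attr(qualifier: Seq[String], name: String) extends Expr
  final case class GetStructField(child: Expr, fieldName: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  // Drop the outermost Alias, if there is one.
  def stripAlias(e: Expr): Expr = e match {
    case Alias(child, _) => child
    case other           => other
  }

  // Render an unqualified dotted name such as "col.x" (illustrative only).
  def prettyName(e: Expr): String = e match {
    case Attr(_, name)                => name
    case GetStructField(child, field) => s"${prettyName(child)}.$field"
    case Alias(child, name)           => s"${prettyName(child)} AS $name"
  }
}
```

With the alias stripped first, the nested column from `DESC desc_complex_col_table col.x` renders as `col.x` in this sketch, without the `AS x` suffix or the table qualifier.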
Kubernetes integration test starting
Kubernetes integration test status success
-   override def output: Seq[Attribute] = Nil
+   override def output: Seq[Attribute] = {
+     val qualifier = catalog.name +: identifier.namespace :+ identifier.name
+     outputAttributes.map(_.withQualifier(qualifier))
Or we can wrap this with SubqueryAlias similar to how DataSourceV2Relation is wrapped, but we need to update everywhere ResolvedTable is matched.
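To illustrate how the qualifier in the snippet above is assembled, here is a small sketch using hypothetical simplified models (`Identifier`, `Attr` and `ResolvedTable` below are illustrative case classes, not Spark's real types). The catalog name is prepended and the table name appended to the namespace, and every output attribute receives that qualifier:

```scala
object QualifiedOutput {
  // Hypothetical simplified models; the real Spark classes carry much more state.
  final case class Identifier(namespace: Seq[String], name: String)

  final case class Attr(name: String, qualifier: Seq[String] = Nil) {
    def withQualifier(q: Seq[String]): Attr = copy(qualifier = q)
  }

  final case class ResolvedTable(
      catalogName: String,
      identifier: Identifier,
      outputAttributes: Seq[Attr]) {
    // Qualify each output attribute with catalog, namespace parts and table name.
    def output: Seq[Attr] = {
      val qualifier = catalogName +: identifier.namespace :+ identifier.name
      outputAttributes.map(_.withQualifier(qualifier))
    }
  }
}
```

Since `+:` binds more tightly than `:+`, the expression parses as `(catalogName +: namespace) :+ name`, which gives the expected ordering.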
    case u: UnresolvedAttribute =>
      // For views, the column will not be resolved by `ResolveReferences` because
      // `ResolvedView` stores only the identifier.
      DescribeColumnCommand(ident.asTableIdentifier, u.nameParts, isExtended)
Is it possible that there is an unresolved attribute while the relation of DescribeColumn is a v1 table?
It's possible when the column name doesn't exist in the table, and we should give a clear error message: Column $colName does not exist
The comment was confusing since DescribeColumnCommand resolves the column again. I updated the code to separate view and table matching to make the intention clear. Thanks.
...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
Test build #133461 has finished for PR 30881 at commit
    assertAnalysisError(
      s"DESCRIBE $t invalid_col",
      "cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]")
The error message is different for v1 / v2 tables when the column does not exist.
v1: Column invalid_col does not exist
v2: cannot resolve '`invalid_col`' given input columns: [testcat.tbl.data, testcat.tbl.id]
CheckAnalysis handles UnresolvedAttribute automatically for v2. Should we make this consistent (i.e., make v2 emit messages like v1)?
I think v2 is better. Let's keep it.
Kubernetes integration test starting
Kubernetes integration test status success
      DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
    case nested =>
      throw QueryCompilationErrors.commandNotSupportNestedColumnError(
        "DESC TABLE COLUMN", toPrettySQL(nested, removeAlias = true))
Can we manually remove Alias here instead of changing the toPrettySQL method?
      throw QueryCompilationErrors.columnDoesNotExistError(u.name)
    case a: Attribute =>
      DescribeColumnCommand(ident.asTableIdentifier, a.qualifier :+ a.name, isExtended)
    case nested =>
We can match Alias directly here, as other expressions won't be produced when resolving columns. e.g.
case Alias(child, _) => throw ... toPrettySQL(child)
case other => throw ...("[BUG] unexpected column expression: " + other)
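Put together, the case analysis suggested in this thread might look roughly like the sketch below. All types are hypothetical miniatures rather than Spark's classes, and the function returns an `Either` instead of throwing, purely so the branches are easy to exercise; the error strings follow the messages quoted in this conversation:

```scala
object DescribeColumnMatch {
  // Hypothetical miniature expression tree for illustration only.
  sealed trait Expr
  final case class UnresolvedAttr(nameParts: Seq[String]) extends Expr
  final case class Attr(qualifier: Seq[String], name: String) extends Expr
  final case class GetStructField(child: Expr, fieldName: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Literal(value: Int) extends Expr // stands in for any unexpected expression

  // Illustrative short rendering of a (possibly nested) column reference.
  private def render(e: Expr): String = e match {
    case Attr(_, name)                => name
    case GetStructField(child, field) => s"${render(child)}.$field"
    case other                        => other.toString
  }

  // Right(name parts to hand to the v1 command) or Left(error message).
  def resolve(column: Expr): Either[String, Seq[String]] = column match {
    case UnresolvedAttr(parts) =>
      Left(s"Column ${parts.mkString(".")} does not exist")
    case Attr(qualifier, name) =>
      Right(qualifier :+ name)
    case Alias(child, _) =>
      // A resolved nested field arrives wrapped in an Alias; report the child only.
      Left(s"DESC TABLE COLUMN does not support nested column: ${render(child)}")
    case other =>
      Left(s"[BUG] unexpected column expression: $other")
  }
}
```

Matching `Alias` explicitly keeps the nested-column branch narrow, so anything else that reaches the final case is reported as a bug rather than as a user error.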
-   throw new AnalysisException(
-     s"DESC TABLE COLUMN command does not support nested data types: $colName")
+   throw QueryCompilationErrors.commandNotSupportNestedColumnError(
+     "DESC TABLE COLUMN", toPrettySQL(field, removeAlias = true))
We won't hit this branch anymore, as we fail earlier in ResolveSessionCatalog. We can simply use assert(field.isInstanceOf[Attribute])
I think this can still be hit for views.
Test build #133516 has finished for PR 30881 at commit
Test build #133513 has finished for PR 30881 at commit
retest this please
Test build #133520 has finished for PR 30881 at commit
Test build #133540 has finished for PR 30881 at commit
thanks, merging to master!
What changes were proposed in this pull request?
This PR proposes to implement DESCRIBE COLUMN for v2 tables.
Note that the isExtended option is not implemented in this PR.
Why are the changes needed?
Parity with v1 tables.
Does this PR introduce any user-facing change?
Yes, now, DESCRIBE COLUMN works for v2 tables.
Before this PR, the command would fail with: Describing columns is not supported for v2 tables.
How was this patch tested?
Added new test.