[SPARK-35415][SQL] Change `information` to map type for SHOW TABLE EXTENDED command #32563

wangyum · 2021-05-16T15:33:42Z

What changes were proposed in this pull request?

Change the information column to map type for the SHOW TABLE EXTENDED command.

Why are the changes needed?

Usually we do not need all of the information, and the string output has poor readability. After SPARK-35283 and this PR, we can get the information we need by key:

WITH s AS (SHOW TABLE EXTENDED LIKE '*') SELECT tableName, information['Provider'] FROM s

+------------+---------------------+
|tableName   |information[Provider]|
+------------+---------------------+
|test_delta  |delta                |
|test_parquet|parquet              |
+------------+---------------------+
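To illustrate why key-based access is convenient, here is a minimal plain-Scala sketch that contrasts looking up a value in a multi-line string with looking it up in a map. It does not use Spark; the object name, the helper `lookupInLegacy`, and the sample values are assumptions made for illustration only.

```scala
object InformationLookupSketch {
  // Hypothetical sample of the legacy string form: one "Key: value" pair per line.
  val legacyInformation: String =
    "Database: default\nTable: test_delta\nProvider: delta\n"

  // The same hypothetical data in the proposed map form.
  val mapInformation: Map[String, String] =
    Map("Database" -> "default", "Table" -> "test_delta", "Provider" -> "delta")

  // With a string, the caller has to split lines and parse "Key: value" by hand.
  def lookupInLegacy(information: String, key: String): Option[String] =
    information
      .split("\n")
      .iterator
      .map(_.split(":", 2))
      .collectFirst { case Array(k, v) if k.trim == key => v.trim }

  def main(args: Array[String]): Unit = {
    // String form: manual parsing.
    println(lookupInLegacy(legacyInformation, "Provider"))
    // Map form: a direct lookup, analogous to information['Provider'] in SQL.
    println(mapInformation.get("Provider"))
  }
}
```

Both lookups return the same value, but only the map form avoids depending on the exact text layout of the string.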

Does this PR introduce any user-facing change?

Yes. The type of the information column changes from string to map.

How was this patch tested?

Unit test.

wangyum · 2021-05-16T15:34:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

Hive output:

hive> SHOW TABLE EXTENDED LIKE '*';
OK
tableName:spark_32976
owner:yumwang
location:file:/tmp/spark/spark_32976
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 id, string name}
partitioned:true
partitionColumns:struct partition_columns { string part}
totalNumberFiles:unknown
totalFileSize:unknown
maxFileSize:unknown
minFileSize:unknown
lastAccessTime:unknown
lastUpdateTime:unknown

tableName:t1
owner:yumwang
location:file:/tmp/hive/t1
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string id}
partitioned:true
partitionColumns:struct partition_columns { date part}
totalNumberFiles:unknown
totalFileSize:unknown
maxFileSize:unknown
minFileSize:unknown
lastAccessTime:unknown
lastUpdateTime:unknown

SparkQA · 2021-05-16T16:20:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43110/

SparkQA · 2021-05-16T16:25:50Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43110/

SparkQA · 2021-05-16T16:35:19Z

Test build #138589 has finished for PR 32563 at commit d2e87b8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Given the change in JDBCSuite, we need to add this change to the SQL migration guide. Could you add a note, please, @wangyum?

docs/sql-migration-guide.md

SparkQA · 2021-05-17T08:29:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43136/

SparkQA · 2021-05-17T08:29:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43136/

SparkQA · 2021-05-17T10:29:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43142/

SparkQA · 2021-05-17T10:29:50Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43142/

SparkQA · 2021-05-17T11:02:01Z

Test build #138615 has finished for PR 32563 at commit f936eff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-17T13:54:30Z

Test build #138622 has finished for PR 32563 at commit c8c4853.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-18T15:11:04Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43207/

SparkQA · 2021-05-18T15:11:50Z

Test build #138686 has finished for PR 32563 at commit 0ffbce4.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-18T16:52:05Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43208/

SparkQA · 2021-05-18T18:19:47Z

Test build #138687 has finished for PR 32563 at commit ee70e0b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-19T03:41:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43219/

SparkQA · 2021-05-19T04:12:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43219/

wangyum · 2021-05-19T05:33:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

Move the spark.sql.legacy.keepCommandOutputSchema logic from ResolveSessionCatalog to v2Commands.

SparkQA · 2021-05-19T07:08:07Z

Test build #138698 has finished for PR 32563 at commit 8c1e5ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-20T04:43:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

We shouldn't do this, as the output expr ID becomes unstable and it changes after the plan is copied.

This is the reason why we created object DescribeNamespace and others. We shouldn't revert it.
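The concern about unstable output IDs can be modelled in plain Scala. The sketch below is a simplified illustration, not Spark's code: `Attr`, `freshId`, `UnstableShowTables`, and `StableShowTables` are hypothetical names. It assumes that an ID is allocated whenever an output attribute is created, and shows that an output built in the class body gets a new ID on every `copy`, while an output passed as a constructor parameter (with a default supplied by the companion object) is carried over unchanged.

```scala
import java.util.concurrent.atomic.AtomicLong

object ExprIdSketch {
  // Hypothetical stand-in for an expression ID generator.
  private val nextId = new AtomicLong(0L)
  def freshId(): Long = nextId.getAndIncrement()

  // Hypothetical stand-in for an output attribute.
  final case class Attr(name: String, exprId: Long)

  // Output is built in the class body, so every new instance
  // (including the one created by copy) allocates a fresh ID.
  final case class UnstableShowTables(pattern: String) {
    val output: Seq[Attr] = Seq(Attr("information", freshId()))
  }

  // Output is a constructor parameter whose default comes from the companion.
  // copy() reuses the current value, so the ID survives plan copies.
  object StableShowTables {
    def getOutputAttrs: Seq[Attr] = Seq(Attr("information", freshId()))
  }
  final case class StableShowTables(
      pattern: String,
      output: Seq[Attr] = StableShowTables.getOutputAttrs)

  def main(args: Array[String]): Unit = {
    val unstable = UnstableShowTables("*")
    println(unstable.output == unstable.copy(pattern = "t*").output)

    val stable = StableShowTables("*")
    println(stable.output == stable.copy(pattern = "t*").output)
  }
}
```

In this model, anything that holds a reference to the first plan's output ID would no longer resolve against the copied `UnstableShowTables`, which is the kind of breakage the comment warns about.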

cloud-fan · 2021-05-20T04:47:46Z

sql/core/src/test/resources/sql-tests/results/show-tables.sql.out

Which is the information column?

The output only contains the information column.

This is the change of HiveResult.

Example of Hive output:

hive> SHOW TABLE EXTENDED LIKE '*';
OK
tableName:spark_32976
owner:yumwang
location:file:/tmp/spark/spark_32976
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 id, string name}
partitioned:true
partitionColumns:struct partition_columns { string part}
totalNumberFiles:unknown
totalFileSize:unknown
maxFileSize:unknown
minFileSize:unknown
lastAccessTime:unknown
lastUpdateTime:unknown

tableName:t1
owner:yumwang
location:file:/tmp/hive/t1
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string id}
partitioned:true
partitionColumns:struct partition_columns { date part}
totalNumberFiles:unknown
totalFileSize:unknown
maxFileSize:unknown
minFileSize:unknown
lastAccessTime:unknown
lastUpdateTime:unknown

The result doesn't match the command output schema?

Yes. Hive only contains one column:
https://github.com/apache/hive/blob/45b48d5fdc3527e56347b297d1a41902d1313220/ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/TextMetaDataFormatter.java#L248-L263.
The output is similar to Spark's CatalogTable.toLinkedHashMap.

Sorry I'm confused. How can we allow a mismatch between the schema and data?

HiveResult only gets the last column: https://github.com/apache/spark/blob/ca976cdb091c2ba308db37f25968318c3f942ed0/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala#L69-L72

It's only for thrift-server and the SQL shell, but not sql(...).show?

This is only used in the SQL shell, not by the thrift server. Otherwise, the JDBC ResultSet would map the metadata to the column results incorrectly.

If we turn on spark.sql.cli.print.header=true for the SQL shell:

w/o this PR, the information column matches the first element of the map, then EOL. The rest of the map will print line by line, with wrong/no indentation.

w/ this PR, I guess the schema header and the result are even more mismatched.

spark-sql before:

spark-sql> set spark.sql.legacy.keepCommandOutputSchema=true;
spark.sql.legacy.keepCommandOutputSchema true
Time taken: 0.012 seconds, Fetched 1 row(s)
spark-sql> SHOW TABLE EXTENDED LIKE '*';
default test_parquet false CatalogTable(
Database: default
Table: test_parquet
Owner: yumwang
Created Time: Mon May 24 11:16:33 CST 2021
Last Access: UNKNOWN
Created By: Spark 3.2.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1621826201]
Statistics: 290 bytes
Location: file:/Users/yumwang/tmp/xxxx/spark/spark-warehouse/test_parquet
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: long (nullable = false)
)
Time taken: 0.031 seconds, Fetched 1 row(s)

spark-sql after:

spark-sql> SHOW TABLE EXTENDED LIKE '*';
default test_parquet false {"Created By":"Spark 3.2.0-SNAPSHOT","Created Time":"Mon May 24 11:16:33 CST 2021","Database":"default","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","Last Access":"UNKNOWN","Location":"file:/Users/yumwang/tmp/xxxx/spark/spark-warehouse/test_parquet","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","Owner":"yumwang","Partition Provider":"Catalog","Provider":"hive","Schema":"root |-- id: long (nullable = false) ","Serde Library":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","Statistics":"290 bytes","Storage Properties":"[serialization.format=1]","Table Properties":"[transient_lastDdlTime=1621826201]","Table":"test_parquet","Type":"MANAGED"}
Time taken: 0.043 seconds, Fetched 1 row(s)

beeline before:

0: jdbc:hive2://localhost:10000> set spark.sql.legacy.keepCommandOutputSchema=true;
+-------------------------------------------+--------+
|                    key                    | value  |
+-------------------------------------------+--------+
| spark.sql.legacy.keepCommandOutputSchema  | true   |
+-------------------------------------------+--------+
1 row selected (0.055 seconds)
0: jdbc:hive2://localhost:10000> SHOW TABLE EXTENDED LIKE '*';
+-----------+---------------+--------------+----------------------------------------------------+
| database  |   tableName   | isTemporary  |                    information                     |
+-----------+---------------+--------------+----------------------------------------------------+
| default   | test_parquet  | false        | CatalogTable(
Database: default
Table: test_parquet
Owner: yumwang
Created Time: Mon May 24 11:16:33 CST 2021
Last Access: UNKNOWN
Created By: Spark 3.2.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1621826201]
Statistics: 290 bytes
Location: file:/Users/yumwang/tmp/xxxx/spark/spark-warehouse/test_parquet
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: long (nullable = false)
) |
+-----------+---------------+--------------+----------------------------------------------------+
1 row selected (0.086 seconds)

beeline after:

0: jdbc:hive2://localhost:10000> SHOW TABLE EXTENDED LIKE '*';
+------------+---------------+--------------+----------------------------------------------------+
| namespace  |   tableName   | isTemporary  |                    information                     |
+------------+---------------+--------------+----------------------------------------------------+
| default    | test_parquet  | false        | {"Created By":"Spark 3.2.0-SNAPSHOT","Created Time":"Mon May 24 11:16:33 CST 2021","Database":"default","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","Last Access":"UNKNOWN","Location":"file:/Users/yumwang/tmp/xxxx/spark/spark-warehouse/test_parquet","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","Owner":"yumwang","Partition Provider":"Catalog","Provider":"hive","Schema":"root |-- id: long (nullable = false) ","Serde Library":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","Statistics":"290 bytes","Storage Properties":"[serialization.format=1]","Table Properties":"[transient_lastDdlTime=1621826201]","Table":"test_parquet","Type":"MANAGED"} |
+------------+---------------+--------------+----------------------------------------------------+
1 row selected (0.903 seconds)

SparkQA · 2021-05-26T07:17:39Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43485/

SparkQA · 2021-05-26T08:27:13Z

Test build #138966 has finished for PR 32563 at commit 44265e7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-26T10:33:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43499/

SparkQA · 2021-05-26T11:13:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43499/

SparkQA · 2021-05-26T14:13:37Z

Test build #138980 has finished for PR 32563 at commit ede029c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-05-27T09:20:11Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

      command.executeCollect().map(_.getString(1))
+    // SHOW TABLE EXTENDED in Hive only output the information column.
+    case command @ ExecutedCommandExec(s: ShowTablesCommand) if s.isExtended =>
+      if (s.conf.getConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA)) {


nit: I think it's more robust to check the data type of s.output(3)
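One way to read this suggestion is to branch on the declared type of the fourth output attribute rather than on the config flag, so the rendering stays correct however the schema was chosen. The sketch below models that idea in plain Scala; `DataType`, `StringType`, `MapType`, `Attribute`, and `renderInformation` are simplified hypothetical stand-ins, not Spark's classes or the code in HiveResult.

```scala
object OutputTypeCheckSketch {
  // Minimal hypothetical model of the two data types involved.
  sealed trait DataType
  case object StringType extends DataType
  final case class MapType(keyType: DataType, valueType: DataType) extends DataType

  // Hypothetical stand-in for an output attribute.
  final case class Attribute(name: String, dataType: DataType)

  // Decide how to print the information value from the declared type of
  // output(3), instead of consulting a legacy config flag.
  def renderInformation(output: Seq[Attribute], value: Any): String =
    (output(3).dataType, value) match {
      case (StringType, s: String) =>
        s // legacy schema: already a display string
      case (MapType(_, _), m: Map[_, _]) =>
        m.map { case (k, v) => s"$k: $v" }.mkString("\n") // map schema: one pair per line
      case (other, _) =>
        throw new IllegalArgumentException(s"Unexpected information type: $other")
    }
}
```

The design point is that the schema itself becomes the single source of truth: if the flag and the actual output type ever disagree, a type-based check still picks the right branch.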

cloud-fan · 2021-05-27T09:20:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

    isExtended: Boolean = false,
    partitionSpec: Option[TablePartitionSpec] = None) extends LeafRunnableCommand {

+  private val keepLegacySchema = conf.getConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA)


cloud-fan · 2021-05-27T09:29:51Z

sql/core/src/test/resources/sql-tests/results/show-tables.sql.out

-Provider: parquet
-Location [not included in comparison]/{warehouse_dir}/showdb.db/show_t2
-Schema: root
- |-- b: string (nullable = true)


Can we check other databases and see how they indicate nullable columns?

SparkQA · 2021-05-27T15:19:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43540/

SparkQA · 2021-05-27T15:57:16Z

Test build #139023 has finished for PR 32563 at commit 521396e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-27T16:47:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43540/

SparkQA · 2021-05-27T17:20:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43542/

SparkQA · 2021-05-27T17:53:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43542/

SparkQA · 2021-05-27T21:04:27Z

Test build #139027 has finished for PR 32563 at commit 320e72b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-06-02T13:53:52Z

sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out

-- key: integer (nullable = true)
-- val: integer (nullable = true)
-), false
+Schema: struct<key:int,val:int>), false


My last concern is the loss of the nullable info. How do other databases handle this?

It seems other databases do not include Schema in the output of SHOW TABLE EXTENDED:

hive> SHOW TABLE EXTENDED LIKE '*';
OK
tableName:spark_32976
owner:yumwang
location:file:/tmp/spark/spark_32976
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 id, string name}
partitioned:true
partitionColumns:struct partition_columns { string part}
totalNumberFiles:unknown
totalFileSize:unknown
maxFileSize:unknown
minFileSize:unknown
lastAccessTime:unknown
lastUpdateTime:unknown

SparkQA · 2021-06-02T16:50:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43757/

SparkQA · 2021-06-02T17:25:16Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43757/

SparkQA · 2021-06-02T20:12:18Z

Test build #139234 has finished for PR 32563 at commit 7b1520e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2021-06-07T05:00:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

    if (tracksPartitionsInCatalog) map.put("Partition Provider", "Catalog")
    if (partitionColumnNames.nonEmpty) map.put("Partition Columns", partitionColumns)
-    if (schema.nonEmpty) map.put("Schema", schema.treeString)
+    if (schema.nonEmpty) map.put("Schema", schema.catalogString)


How about removing this line? It seems other DBs do not contain Schema.

There's no harm in keeping it.
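The trade-off between the two schema renderings in the diff above can be sketched as follows. This is a hand-written imitation of the two output formats quoted in this thread (the tree form and the `struct<...>` form), not Spark's implementation; `Field` and both helper functions are hypothetical.

```scala
object SchemaRenderingSketch {
  // typeName is the name printed by the tree form (e.g. "integer"),
  // catalogName the name printed by the compact form (e.g. "int").
  final case class Field(
      name: String,
      typeName: String,
      catalogName: String,
      nullable: Boolean)

  // Multi-line tree form: includes nullability for each column.
  def treeString(fields: Seq[Field]): String =
    fields
      .map(f => s" |-- ${f.name}: ${f.typeName} (nullable = ${f.nullable})")
      .mkString("root\n", "\n", "\n")

  // Single-line compact form: column names and types only.
  def catalogString(fields: Seq[Field]): String =
    fields.map(f => s"${f.name}:${f.catalogName}").mkString("struct<", ",", ">")

  def main(args: Array[String]): Unit = {
    val fields = Seq(
      Field("key", "integer", "int", nullable = true),
      Field("val", "integer", "int", nullable = true))
    print(treeString(fields))
    println(catalogString(fields))
  }
}
```

The compact form fits on one line, which suits a map value, but it drops the nullable flag that the tree form carries. That is the information loss raised in the review.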

cloud-fan · 2021-06-07T08:00:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

@@ -419,7 +419,7 @@ case class CatalogTable(
    map ++= storage.toLinkedHashMap
    if (tracksPartitionsInCatalog) map.put("Partition Provider", "Catalog")


CatalogTable.toLinkedHashMap was used for display only, not lookup. Shall we revisit all the key names here?

cloud-fan · 2021-06-07T08:01:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

    AttributeReference("tableName", StringType, nullable = false)(),
    AttributeReference("isTemporary", BooleanType, nullable = false)(),
-    AttributeReference("information", StringType, nullable = false)())
+    AttributeReference("information", MapType(StringType, StringType, false), nullable = false)())


I feel it's better to output a struct type here, so that users can easily know what fields they can query.

I think it's a general problem that most command outputs are not queryable. DESCRIBE EXTENDED tbl is also for display only.

github-actions · 2021-09-16T00:09:13Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label May 16, 2021

dongjoon-hyun changed the title to [SPARK-35415][SQL] Change information to map type for SHOW TABLE EXTENDED command May 16, 2021

dongjoon-hyun reviewed May 16, 2021

github-actions bot added the DOCS label May 17, 2021

wangyum requested review from HyukjinKwon, MaxGekk, cloud-fan, dongjoon-hyun and yaooqinn May 20, 2021 00:10

cloud-fan reviewed May 20, 2021

Fix 44265e7

Fix ede029c

cloud-fan reviewed May 27, 2021

Fix 521396e

Fix 320e72b

cloud-fan reviewed Jun 2, 2021

Merge branch 'master' into SPARK-35415 7b1520e

cloud-fan reviewed Jun 7, 2021

AngersZhuuuu mentioned this pull request Jun 13, 2021: [SPARK-35415][SQL] Change information to struct type for SHOW TABLE EXTENDED command #32897 (Closed)

github-actions bot added the Stale label Sep 16, 2021

github-actions bot closed this Sep 17, 2021
