Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables #31804

Closed
wants to merge 13 commits into from
Closed

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Mar 11, 2021

What changes were proposed in this pull request?

SHOW TABLES has an output column isTemporary which only indicates whether a view is a local temp view or not, that is not enough for users if they want a pipeline of Spark commands, such as SHOW TABLES foreach { case view => DROP VIEW; case table => DROP TABLE}

Why are the changes needed?

distinguish view and tables

Usually, most modern databases store tableType as a string column in INFORMATION_SCHEMA.TABLES.

On the reading side, they also return a string value for it.

FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0

Besides, as SHOW TABLES is not ANSI-standard, so it might be good for us to follow the JDBC standard. Then we can make our command and JDBC meta operation consistent

Does this PR introduce any user-facing change?

yes, show tables and show table extended will have one new column called tableType at the end

SHOW TABLES ...

# before
struct<namespace,tableName:string,isTemporary:boolean>

#after
struct<namespace:string,tableName:string,isTemporary:boolean,tableType:string>

SHOW TABLE EXTENDED ...

# before
struct<namespace:string,tableName:string,isTemporary:boolean,information:string>

#after
struct<namespace:string,tableName:string,isTemporary:boolean,information:string,tableType:string>

How was this patch tested?

ut modified and added

@github-actions github-actions bot added the SQL label Mar 11, 2021
@@ -0,0 +1,2 @@
--SET spark.sql.legacy.keepCommandOutputSchema=true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be useful to verify the legacy schema

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it relevant to this PR's isView column addition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code change is related, spark.sql.legacy.keepCommandOutputSchema=true will omit the isView column

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yaooqinn
Copy link
Member Author

cc @cloud-fan @dongjoon-hyun @HyukjinKwon @maropu thanks for the review.

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40542/

@@ -533,7 +534,8 @@ object ShowTableExtended {
AttributeReference("namespace", StringType, nullable = false)(),
AttributeReference("tableName", StringType, nullable = false)(),
AttributeReference("isTemporary", BooleanType, nullable = false)(),
AttributeReference("information", StringType, nullable = false)())
AttributeReference("information", StringType, nullable = false)(),
AttributeReference("isView", BooleanType, nullable = false)())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it added at the last to reduce the breaking change effect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -366,13 +366,13 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager)
partitionSpec @ (None | Some(UnresolvedPartitionSpec(_, _))),
output) =>
val newOutput = if (conf.getConf(SQLConf.LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA)) {
assert(output.length == 4)
output.head.withName("database") +: output.tail
assert(output.length == 5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this seems inconsistent with the doc. The current document means spark.sql.legacy.keepCommandOutputSchema means the 3.1 or earlier schema, doesn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing a new legacy conf for this behavior change seems kind of trivial and might bring cognition burdens for users. So the config is reused for now and the doc will be updated if it is the right way to go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not reused technically. If we reuse the existing conf, this should be output.length == 4 because it disable this PR and the previous commit simultaneously.

So the config is reused for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be output.length == 4

indeed, this is true. output.head.withName("database") +: output.slice(1, 4) will cut the isView off

if (output.size == 4) {
Row(database, tableName, isTemp, isView)
} else {
Row(database, tableName, isTemp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still have a test coverage for this line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the new show-tables-legacy.sql will import the corresponding tests to cover. I can add some cases in v1.ShowTablesSuite if show-tables-legacy.sql is unintuitive

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40543/

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40542/

@@ -514,7 +514,8 @@ object ShowTables {
def getOutputAttrs: Seq[Attribute] = Seq(
AttributeReference("namespace", StringType, nullable = false)(),
AttributeReference("tableName", StringType, nullable = false)(),
AttributeReference("isTemporary", BooleanType, nullable = false)())
AttributeReference("isTemporary", BooleanType, nullable = false)(),
AttributeReference("isView", BooleanType, nullable = false)())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new column isView: Boolean is more efficient, but I'm wondering if a new column tableType: String is more user-friendly. The value can be TABLE or VIEW.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, most modern databases store tableType as a string column in INFORMATION_SCHEMA.TABLES.

On the reading side, they also return a string value for it.

FYI, https://docs.google.com/spreadsheets/d/1LeHYbGCDjgr-rYwQMlHBhJbeDxjOVuUatozgvKdxq6Y/edit#gid=0

Besides, as SHOW TABLES is not ANSI standard, so it might be good for us to follow the JDBC standard. Then we can make our command and JDBC meta operation consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDBC protocol also use table type, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, another benefit for using string is if we decide to subdivide tables to something like SYSTEM TABLE/BASE TABLE, xxx TABLE e.t.c, we won't break then

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40543/

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Test build #135958 has finished for PR 31804 at commit 2546812.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn yaooqinn changed the title [SPARK-34710][SQL] Add isView for SHOW TABLES to distinguish view and tables [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables Mar 11, 2021
@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Test build #135975 has finished for PR 31804 at commit 6bc1614.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ShowTablesLegacyHelper

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40561/

@SparkQA
Copy link

SparkQA commented Mar 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40561/

@maropu
Copy link
Member

maropu commented Mar 12, 2021

Could you add some query examples for this behaviour change in the PR description? Also, I think we need to update the migration guide, too.

@yaooqinn
Copy link
Member Author

Could you add some query examples for this behaviour change in the PR description? Also, I think we need to update the migration guide, too.

Thanks, @maropu, your suggestions sound good~

@github-actions github-actions bot added the DOCS label Mar 12, 2021
@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Test build #136002 has finished for PR 31804 at commit 5e9dbcc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40589/

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40589/

@yaooqinn
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Test build #136005 has finished for PR 31804 at commit bee8cbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -22,7 +22,7 @@ log4j.rootLogger=INFO, CA, FA
log4j.appender.CA=org.apache.log4j.ConsoleAppender
log4j.appender.CA.layout=org.apache.log4j.PatternLayout
log4j.appender.CA.layout.ConversionPattern=%d{HH:mm:ss.SSS} %p %c: %m%n
log4j.appender.CA.Threshold = WARN
log4j.appender.CA.Threshold = FATAL
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oversized log for GA console output cause truncation fo useful test errors, andthe unit-tests.log is enough

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, Is it better to backport this change into the previous branches? cc: @HyukjinKwon @dongjoon-hyun

Copy link
Member Author

@yaooqinn yaooqinn Mar 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I guess so. we can mute console output for most modules. we call o.s.Assertions.intercept frequently which produces a lot of unnecessary error logs.

When we mute them, the error stacktraces for failed tests can still be kept.

I can make a separate PR to fix it if it makes senses to the CCers too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current threshold (WARN) can help catch a bug like the following early: #31273 (comment)

These warning messages still can be found in the unit-tests.logs. I don't see much difference as a warning message can still be simply ignored.

But when we encounter test failures but got omitted by the CI, it is hard for us to locate to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay as long as unit-tests.log contains.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll send a PR then

@@ -510,11 +510,21 @@ case class ShowTables(
override def children: Seq[LogicalPlan] = Seq(namespace)
}

object ShowTables {
trait ShowTablesLegacyHelper {
def getOutputAttrs: Seq[Attribute]
Copy link
Member

@maropu maropu Mar 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good idea that the code to handle this legacy behaviour depends on trait? If we remove this legacy behaviour in far future and we remove this trait, the change can lead to binary-incompatibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callers are bundled in the catalyst module, I guess it's safe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry for my late reply...) How about simply inlining getLegacyOutputAttrs? It seems there are the only two places where getLegacyOutputAttrs is used, so I'm not sure that we need this trait.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Test build #136544 has finished for PR 31804 at commit eaba257.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Test build #136543 has finished for PR 31804 at commit bee8cbe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41129/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41129/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Test build #136545 has finished for PR 31804 at commit f50d87b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine otherwise.

@SparkQA
Copy link

SparkQA commented Mar 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41286/

@SparkQA
Copy link

SparkQA commented Mar 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41286/

@SparkQA
Copy link

SparkQA commented Mar 30, 2021

Test build #136705 has finished for PR 31804 at commit 3ef5b5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

cc @cloud-fan @HyukjinKwon PTAL, thanks

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Test build #137568 has finished for PR 31804 at commit 3ef5b5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Test build #137604 has finished for PR 31804 at commit 3ef5b5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 29, 2021
@github-actions github-actions bot closed this Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants