[SPARK-30799][SQL] `spark_catalog.t` should not be resolved to temp view (#27550)
cloud-fan wants to merge 2 commits into apache:master
Conversation
This is dead code: `val v1TableName = parseV1Table(tbl, sql).asTableIdentifier` fails earlier if length > 2.
This check should be done earlier; otherwise it's nonsensical to check whether the database names are specified consistently.
If you look at the test, `t` refers to `spark_catalog.default.t` while `testcat.t` refers to `testcat.t`. The namespaces are different, so the error messages are also slightly different.
Hmm, this adds the current namespace to the returned identifier? So we cannot tell whether a returned identifier is for a temp view?
I think `ResolveTempViews` needs to be updated to look up twice (once with the namespace and once without).
If you look at `SessionCatalog.setCurrentDatabase`, we forbid setting `global_temp` as the current database.
This keeps temp view resolution simple: we look up temp views by the name parts exactly as the user wrote them. No name expansion is allowed.
The general rule of relation resolution is: we try to look up a temp view first, then tables/permanent views. By the time we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller doesn't want to resolve temp views.
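The precedence rule described above can be sketched as a tiny stand-alone model. This is a hypothetical simplification: the in-memory sets stand in for the real temp view registry and `SessionCatalog`, and the namespace handling is reduced to a single default database.

```scala
// Sketch of the resolution rule: temp views are matched on the user-given
// name parts literally, with no namespace expansion; only if that fails do
// we fall back to catalog lookup, which DOES expand with the current
// namespace. All names here are hypothetical stand-ins.
object ResolutionSketch {
  val tempViews: Set[Seq[String]] = Set(Seq("t"))
  val sessionCatalogTables: Set[Seq[String]] = Set(Seq("default", "t"))

  def resolve(nameParts: Seq[String]): String = {
    // 1. Temp view lookup: literal match only, no current-namespace extension.
    if (tempViews.contains(nameParts)) {
      s"temp view ${nameParts.mkString(".")}"
    } else {
      // 2. Catalog lookup: a bare name is expanded with the current database.
      val qualified = if (nameParts.length == 1) "default" +: nameParts else nameParts
      if (sessionCatalogTables.contains(qualified)) s"table ${qualified.mkString(".")}"
      else s"unresolved ${nameParts.mkString(".")}"
    }
  }
}
```

With this model, `resolve(Seq("t"))` hits the temp view, while `resolve(Seq("spark_catalog", "t"))` never can, because the literal name parts don't match any registered temp view.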
> When we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller doesn't want to resolve temp views.

So, `ResolveCatalogs` should be applied after `ResolveTables` and `ResolveRelations`?
These 2 are different frameworks, so I don't see a way to guarantee the order. I think what we should do is migrate to the new framework.
Should we mention that this happens for the SHOW COLUMNS command?
Test build #118318 has finished for PR 27550 at commit
I've updated the PR description as I found a real bug.
This is the consequence of removing the hack. Now `ResolvedTable` can get a qualified table name even from the session catalog.
Test build #118360 has finished for PR 27550 at commit
Test build #118361 has finished for PR 27550 at commit
```scala
if (nameParts.length == 1) {
  // If there is only one name part, it means the current catalog is the session catalog.
  // Here we return the original name part, to keep the error message unchanged for
  // v1 commands.
```
I can remove this, but then I'd need to update many tests slightly to adjust the error message. We can leave it to 3.1.
```diff
   DescribeColumnCommand(tbl.asTableIdentifier, colNameParts, isExtended)
 }.getOrElse {
-  if (isView(tbl)) {
+  if (isTempView(tbl)) {
```
Since `tbl` is from `SessionCatalogAndTable`, `tbl` may carry the current namespace? And if the current namespace is not empty, this will always return false?
You have a good point here. I think this is an existing bug: here we look up the table first (calling `loadTable`) and then look up the temp view. We should look up the temp view first to retain the behavior of Spark 2.4. @imback82 can you help fix it later?
Yes, will do after this PR.
```scala
case _ if isTempView(nameParts) => Some(nameParts)
case SessionCatalogAndTable(_, tbl) =>
  if (nameParts.head == CatalogManager.SESSION_CATALOG_NAME && tbl.length == 1) {
    // For name parts like `spark_catalog.t`, we need to fill in the default database so
```
Unrelated question: what was the reason for allowing `spark_catalog.t` to resolve as `spark_catalog.<cur_db>.t`? (different behavior from v2 catalogs)
I noticed this as well and I don't think it was intentional. But this is a different topic, and we have many tests using `spark_catalog.t`.
We can open another PR to forbid it if we think we shouldn't support this feature.
Got it, thanks for the explanation. It makes more sense to forbid it to keep the behavior consistent.
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"${nameParts.quoted} is not a valid function name.")
```
`... is not a valid function name` sounds confusing? Maybe we should add "temporary"?
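The check under discussion can be sketched as follows. This is a hypothetical, simplified stand-in: `AnalysisException` and the `quoted` helper are replaced with plain standard-library equivalents, and the object/method names are invented for illustration.

```scala
// Sketch of the temp-function name check: a temporary function name may
// carry at most one database qualifier, so three or more name parts is an
// error. The error message follows the review suggestion of mentioning
// "temporary" explicitly.
object TempFunctionNameCheck {
  def checkTempFunctionName(nameParts: Seq[String]): Option[String] = {
    if (nameParts.length > 2) {
      throw new IllegalArgumentException(
        s"${nameParts.mkString(".")} is not a valid temporary function name.")
    }
    // Return the optional database qualifier, if present.
    if (nameParts.length == 2) Some(nameParts.head) else None
  }
}
```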
```scala
 * precedence over catalog name.
 *
 * Note that, this pattern is used to look up catalog objects like table, function, permanent
 * view, etc. If you need to look up temp views, please do it separately before calling this
```
Temp functions are also not looked up here, right?
```scala
nameParts match {
  case SessionCatalogAndTable(_, funcName) => funcName match {
```
nit: we use `SessionCatalogAndTable` to extract a function name, which sounds a bit confusing.
viirya left a comment:
Looks good with a few minor comments.
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"Unsupported function name '${nameParts.quoted}'")
```
This is consistent with https://github.com/apache/spark/pull/27550/files#diff-2e07be4d73605cb1941153441a0c0c14L536
Test build #118422 has finished for PR 27550 at commit
retest this please
Test build #118454 has finished for PR 27550 at commit
thanks for the review, merging to master/3.0!
### What changes were proposed in this pull request?

No v2 command supports temp views, and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption. However, `ResolveSessionCatalog` needs to fall back to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we added a hack in `CatalogAndIdentifier`, which does not expand the given identifier with the current namespace if the catalog is the session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to a temp view correctly, or resolve it to table "default.t" if the temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to a temp view, as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by:
1. removing the hack in `CatalogAndIdentifier`, and clearly documenting that it shouldn't be used to resolve temp views;
2. updating `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support. Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

Yes, it is now not allowed to refer to a temp view with the `spark_catalog` prefix.

### How was this patch tested?

New tests.

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab07c63)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
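The caller-side pattern the description proposes (check temp views before catalog resolution, with no hack in the extractor) can be sketched like this. Everything here is a hypothetical simplification: the object, the in-memory temp view set, and the reduced namespace handling are invented stand-ins for the real `ResolveSessionCatalog` and `CatalogAndIdentifier`.

```scala
// Sketch: v1 commands that support temp views consult the temp-view registry
// first; only on a miss do they fall through to catalog resolution, which
// (after removing the hack) always fully qualifies a bare name. A name like
// `spark_catalog.t` can therefore never land on a temp view.
object SessionCatalogSketch {
  val tempViews: Set[String] = Set("tv")
  val currentNamespace: Seq[String] = Seq("default")

  // Stand-in for CatalogAndIdentifier: always qualifies, no special case.
  def catalogAndIdentifier(nameParts: Seq[String]): (String, Seq[String]) =
    nameParts match {
      case Seq(single)             => ("spark_catalog", currentNamespace :+ single)
      case "spark_catalog" +: rest => ("spark_catalog", rest)
      case catalog +: rest         => (catalog, rest)
    }

  def resolveForV1Command(nameParts: Seq[String]): String =
    nameParts match {
      // Temp views match only on the literal, unqualified name.
      case Seq(name) if tempViews.contains(name) => s"temp view $name"
      case _ =>
        val (catalog, ident) = catalogAndIdentifier(nameParts)
        s"$catalog table ${ident.mkString(".")}"
    }
}
```

Here `Seq("tv")` resolves to the temp view, while `Seq("spark_catalog", "tv")` skips the temp-view branch entirely and is handed to catalog resolution, which is exactly the user-facing change the PR describes.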
```scala
 *
 * Note that, this pattern is used to look up permanent catalog objects like table, view,
 * function, etc. If you need to look up temp objects like temp view, please do it separately
 * before calling this pattern, as temp objects don't belong to any catalog.
```
The term "catalog" is used in many places, though. For example, our internal `SessionCatalog` also manages temp objects.
> our internal SessionCatalog also manages temp objects

We can move them out and put them in a temp view manager like the `GlobalTempViewManager`. I'm talking more about the theory.