[SPARK-30799][SQL] `spark_catalog.t` should not be resolved to temp view (#27550)
cloud-fan wants to merge 2 commits into apache:master
Conversation
This is dead code: `val v1TableName = parseV1Table(tbl, sql).asTableIdentifier` fails earlier if length > 2.
This check should be done earlier; otherwise it's nonsensical to check whether the database names are specified consistently.
If you look at the test, `t` refers to `spark_catalog.default.t` while `testcat.t` refers to `testcat.t`. The namespaces are different, so the error messages are also slightly different.
Hmm, this adds the current namespace to the returned identifier? So we cannot tell whether a returned identifier is for a temp view?
I think `ResolveTempViews` needs to be updated to look up twice (once with the namespace and once without).
If you look at `SessionCatalog.setCurrentDatabase`, we forbid setting `global_temp` as the current database.
This keeps temp view resolution simple: we look up temp views by the name parts exactly as the user wrote them. No name expansion is allowed.
The general rule of relation resolution is: we try to look up a temp view first, then tables/permanent views. By the time we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller doesn't want to resolve temp views.
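The precedence rule described above can be sketched as a tiny stand-alone model. This is a hypothetical simplification: the in-memory sets stand in for the real temp view registry and `SessionCatalog`, and the namespace handling is reduced to a single default database.

```scala
// Sketch of the resolution rule: temp views are matched on the user-given
// name parts literally, with no namespace expansion; only if that fails do
// we fall back to catalog lookup, which DOES expand with the current
// namespace. All names here are hypothetical stand-ins.
object ResolutionSketch {
  val tempViews: Set[Seq[String]] = Set(Seq("t"))
  val sessionCatalogTables: Set[Seq[String]] = Set(Seq("default", "t"))

  def resolve(nameParts: Seq[String]): String = {
    // 1. Temp view lookup: literal match only, no current-namespace extension.
    if (tempViews.contains(nameParts)) {
      s"temp view ${nameParts.mkString(".")}"
    } else {
      // 2. Catalog lookup: a bare name is expanded with the current database.
      val qualified = if (nameParts.length == 1) "default" +: nameParts else nameParts
      if (sessionCatalogTables.contains(qualified)) s"table ${qualified.mkString(".")}"
      else s"unresolved ${nameParts.mkString(".")}"
    }
  }
}
```

With this model, `resolve(Seq("t"))` hits the temp view, while `resolve(Seq("spark_catalog", "t"))` never can, because the literal name parts don't match any registered temp view.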
> When we call `CatalogAndIdentifier`, the temp view lookup should already have been tried, or the caller doesn't want to resolve temp views.

So, `ResolveCatalogs` should be applied after `ResolveTables` and `ResolveRelations`?
These 2 are different frameworks, so I don't see a way to guarantee the order. I think what we should do is migrate to the new framework.
Should we mention that this happens for the SHOW COLUMNS command?
Test build #118318 has finished for PR 27550 at commit
I've updated the PR description as I found a real bug.
This is the consequence of removing the hack. Now `ResolvedTable` can get a qualified table name even from the session catalog.
Test build #118360 has finished for PR 27550 at commit
Test build #118361 has finished for PR 27550 at commit
```scala
if (nameParts.length == 1) {
  // If there is only one name part, it means the current catalog is the session catalog.
  // Here we return the original name part, to keep the error message unchanged for
  // v1 commands.
```
I can remove this, but then I'd need to update many tests slightly to adjust the error message. We can leave it to 3.1.
```diff
   DescribeColumnCommand(tbl.asTableIdentifier, colNameParts, isExtended)
 }.getOrElse {
-  if (isView(tbl)) {
+  if (isTempView(tbl)) {
```
Since `tbl` is from `SessionCatalogAndTable`, `tbl` may carry the current namespace? And if the current namespace is not empty, this will always return false?
You have a good point here. I think this is an existing bug: here we look up the table first (calling `loadTable`) and then look up the temp view. We should look up the temp view first to retain the behavior of Spark 2.4. @imback82 can you help fix it later?
Yes, will do after this PR.
```scala
case _ if isTempView(nameParts) => Some(nameParts)
case SessionCatalogAndTable(_, tbl) =>
  if (nameParts.head == CatalogManager.SESSION_CATALOG_NAME && tbl.length == 1) {
    // For name parts like `spark_catalog.t`, we need to fill in the default database so
```
Unrelated question: what was the reason for allowing `spark_catalog.t` to resolve as `spark_catalog.<cur_db>.t`? (different behavior from v2 catalogs)
I noticed this as well and I don't think it was intentional. But this is a different topic, and we have many tests using `spark_catalog.t`.
We can open another PR to forbid it if we think we shouldn't support this feature.
Got it, thanks for the explanation. It makes more sense to forbid it to keep the behavior consistent.
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"${nameParts.quoted} is not a valid function name.")
```
`... is not a valid function name` sounds confusing? Maybe we should add "temporary"?
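The check under discussion can be sketched as follows. This is a hypothetical, simplified stand-in: `AnalysisException` and the `quoted` helper are replaced with plain standard-library equivalents, and the object/method names are invented for illustration.

```scala
// Sketch of the temp-function name check: a temporary function name may
// carry at most one database qualifier, so three or more name parts is an
// error. The error message follows the review suggestion of mentioning
// "temporary" explicitly.
object TempFunctionNameCheck {
  def checkTempFunctionName(nameParts: Seq[String]): Option[String] = {
    if (nameParts.length > 2) {
      throw new IllegalArgumentException(
        s"${nameParts.mkString(".")} is not a valid temporary function name.")
    }
    // Return the optional database qualifier, if present.
    if (nameParts.length == 2) Some(nameParts.head) else None
  }
}
```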
```scala
 * precedence over catalog name.
 *
 * Note that, this pattern is used to look up catalog objects like table, function, permanent
 * view, etc. If you need to look up temp views, please do it separately before calling this
```
Temp functions are also not looked up here, right?
```scala
nameParts match {
  case SessionCatalogAndTable(_, funcName) => funcName match {
```
nit: we use `SessionCatalogAndTable` to extract a function name, which sounds a bit confusing.
viirya left a comment:
Looks good with a few minor comments.
```scala
if (isTemp) {
  // temp func doesn't belong to any catalog and we shouldn't resolve catalog in the name.
  val database = if (nameParts.length > 2) {
    throw new AnalysisException(s"Unsupported function name '${nameParts.quoted}'")
```
This is consistent with https://github.com/apache/spark/pull/27550/files#diff-2e07be4d73605cb1941153441a0c0c14L536
Test build #118422 has finished for PR 27550 at commit
retest this please
Test build #118454 has finished for PR 27550 at commit
thanks for the review, merging to master/3.0!
### What changes were proposed in this pull request?

No v2 command supports temp views, and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption. However, `ResolveSessionCatalog` needs to fall back to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we added a hack in `CatalogAndIdentifier`, which does not expand the given identifier with the current namespace if the catalog is the session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to a temp view correctly, or resolve it to table "default.t" if the temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to a temp view, as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by:
1. removing the hack in `CatalogAndIdentifier`, and clearly documenting that it shouldn't be used to resolve temp views;
2. updating `ResolveSessionCatalog` to explicitly look up temp views first before calling `CatalogAndIdentifier`, for v1 commands that support temp views.

### Why are the changes needed?

To avoid releasing a behavior that we should not support. Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

Yes, it is now not allowed to refer to a temp view with the `spark_catalog` prefix.

### How was this patch tested?

New tests.

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ab07c63)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
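The caller-side pattern the description proposes (check temp views before catalog resolution, with no hack in the extractor) can be sketched like this. Everything here is a hypothetical simplification: the object, the in-memory temp view set, and the reduced namespace handling are invented stand-ins for the real `ResolveSessionCatalog` and `CatalogAndIdentifier`.

```scala
// Sketch: v1 commands that support temp views consult the temp-view registry
// first; only on a miss do they fall through to catalog resolution, which
// (after removing the hack) always fully qualifies a bare name. A name like
// `spark_catalog.t` can therefore never land on a temp view.
object SessionCatalogSketch {
  val tempViews: Set[String] = Set("tv")
  val currentNamespace: Seq[String] = Seq("default")

  // Stand-in for CatalogAndIdentifier: always qualifies, no special case.
  def catalogAndIdentifier(nameParts: Seq[String]): (String, Seq[String]) =
    nameParts match {
      case Seq(single)             => ("spark_catalog", currentNamespace :+ single)
      case "spark_catalog" +: rest => ("spark_catalog", rest)
      case catalog +: rest         => (catalog, rest)
    }

  def resolveForV1Command(nameParts: Seq[String]): String =
    nameParts match {
      // Temp views match only on the literal, unqualified name.
      case Seq(name) if tempViews.contains(name) => s"temp view $name"
      case _ =>
        val (catalog, ident) = catalogAndIdentifier(nameParts)
        s"$catalog table ${ident.mkString(".")}"
    }
}
```

Here `Seq("tv")` resolves to the temp view, while `Seq("spark_catalog", "tv")` skips the temp-view branch entirely and is handed to catalog resolution, which is exactly the user-facing change the PR describes.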
```scala
 *
 * Note that, this pattern is used to look up permanent catalog objects like table, view,
 * function, etc. If you need to look up temp objects like temp view, please do it separately
 * before calling this pattern, as temp objects don't belong to any catalog.
```
The term "catalog" is used in many places, though. For example, our internal `SessionCatalog` also manages temp objects.
> our internal SessionCatalog also manages temp objects

We can move them out and put them in a temp view manager like the `GlobalTempViewManager`. I'm talking more about the theory.