[SPARK-29412][SQL] refine the document of v2 session catalog config #26071
Conversation
"so that it can delegate calls to Spark Catalog. Otherwise, the implementation " + | ||
"should figure out a way to access the Spark Catalog or its underlying meta-store " + | ||
"by itself. It's important to make the implementation share the underlying meta-store " + | ||
"of the Spark Catalog and act as an extension, instead of a separated catalog.") |
This is a complicated config and I'm trying my best to explain it.
cc some SQL guys to get fresh eyes and see if they can understand it. @dongjoon-hyun @viirya @maropu @HyukjinKwon
I think you've identified the right things to point out:
- This controls the implementation that is a v2 interface to the built-in v1 catalog
- This catalog and the built-in v1 catalog share an identifier namespace and must be consistent
- To delegate to the default implementation, use CatalogExtension
I'd change the wording to be a bit shorter, though. This doesn't need to explain what the catalog interface does, or what the built-in catalog can do. How about this?
A catalog implementation that will be used as the v2 interface to Spark's built-in v1 catalog, spark_catalog. This catalog shares its identifier namespace with the v1 Spark catalog and must be consistent with it; for example, if a table can be loaded by the v1 catalog, this catalog must also return the table metadata. To delegate operations to the built-in catalog, implementations can extend CatalogExtension.
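To make the CatalogExtension route concrete, here is a minimal sketch of a custom session catalog, assuming the Spark 3.x API where DelegatingCatalogExtension forwards every call to the built-in catalog by default (the class name and the validation hook are hypothetical, not part of this PR):

```scala
import java.util.{Map => JMap}

import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Hypothetical extension: intercepts table creation, delegates everything
// else to the built-in spark_catalog via DelegatingCatalogExtension.
class MySessionCatalog extends DelegatingCatalogExtension {

  override def createTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: JMap[String, String]): Table = {
    // Custom logic would go here (e.g. validating table properties),
    // then fall back to the built-in catalog through the delegate.
    super.createTable(ident, schema, partitions, properties)
  }
}
```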
thanks for the suggestion!
@@ -36,8 +36,7 @@ import org.apache.spark.sql.internal.SQLConf
  */
 // TODO: all commands should look up table from the current catalog. The `SessionCatalog` doesn't
 // need to track current database at all.
-private[sql]
 class CatalogManager(
Nit: this isn't a related change.
Looks good to me. I'd shorten the config description a bit by removing information that doesn't need to be there and focusing on how the v2 implementation is used by Spark.
LGTM
+1 Unless there are other comments, I'll merge this later today.
LGTM as well
Retest this please.
@@ -144,13 +144,13 @@ private[connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2Sessio
   protected val catalogClassName: String = classOf[InMemoryTableSessionCatalog].getName

   before {
-    spark.conf.set(V2_SESSION_CATALOG.key, catalogClassName)
+    spark.conf.set(V2_SESSION_CATALOG_IMPLEMENTATION.key, catalogClassName)
   }

   override def afterEach(): Unit = {
     super.afterEach()
     catalog("session").asInstanceOf[Catalog].clearTables()
This seems to be the last remaining instance.
- catalog("session").asInstanceOf[Catalog].clearTables()
+ catalog(SESSION_CATALOG_NAME).asInstanceOf[Catalog].clearTables()
This test suite fails without this.
Good catch!
@@ -26,7 +26,7 @@ import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{NoSuchTableException, TableAlreadyExistsException}
 import org.apache.spark.sql.connector.catalog._
+ import org.apache.spark.sql.connector.catalog.CatalogManager.SESSION_CATALOG_NAME
I slightly updated the PR description, @cloud-fan.
@@ -146,5 +146,5 @@ class CatalogManager(
 }

 private[sql] object CatalogManager {
-  val SESSION_CATALOG_NAME: String = "session"
+  val SESSION_CATALOG_NAME: String = "spark_catalog"
As we see, this causes a test case failure. Could you describe a little why we change this in this PR, since this is not a documentation change ("refine the document of v2 session catalog config")?
The config name/document includes the catalog name, so I put them together.
The new name was discussed in a DS v2 community meeting. I'm proposing it in this PR to get more visibility.
"passed the Spark built-in session catalog, so that it may delegate calls to the " + | ||
"built-in session catalog.") | ||
val V2_SESSION_CATALOG_IMPLEMENTATION = | ||
buildConf(s"spark.sql.catalog.$SESSION_CATALOG_NAME") |
Just a side note: eventually, from the user's perspective, the exposed configuration names are different.
- spark.sql.catalog.session
+ spark.sql.catalog.spark_catalog
yes, and it will be hard to change after 3.0 is released.
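For illustration, under the new name a user would wire in a custom implementation like this (the com.example class name is hypothetical):

```scala
// In spark-defaults.conf:
//   spark.sql.catalog.spark_catalog=com.example.MySessionCatalog

// Or programmatically, on an existing SparkSession:
spark.conf.set("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
```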
+1 for this change, @cloud-fan. Thank you.
I left a few comments including the test failure fixes.
@@ -77,7 +77,7 @@ class CatalogManager(
 // If the V2_SESSION_CATALOG config is specified, we try to instantiate the user-specified v2
V2_SESSION_CATALOG -> V2_SESSION_CATALOG_IMPLEMENTATION?
.doc("A catalog implementation that will be used as the v2 interface to Spark's built-in " + | ||
s"v1 catalog: $SESSION_CATALOG_NAME. This catalog shares its identifier namespace with " + | ||
s"the $SESSION_CATALOG_NAME and must be consistent with it; for example, if a table can " + | ||
s"be loaded by the $SESSION_CATALOG_NAME, this catalog must also return the table " + | ||
s"metadata. To delegate operations to the $SESSION_CATALOG_NAME, implementations can " + | ||
"extend 'CatalogExtension'.") |
Thanks for pinging me. Almost missed this one.
Does it mean any implementation configured here must be the v2 interface to the built-in v1 catalog? It sounds like it can either delegate or not delegate operations to $SESSION_CATALOG_NAME. If it does not delegate, is it still the v2 interface to the built-in v1 catalog?
A bit confused here.
It must be a v2 interface, but we can't fully control how it is implemented. We expect it to delegate to the SESSION_CATALOG_NAME, but there is no way to guarantee it.
If the implementation doesn't delegate, then the behavior is undefined.
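For readers following along, a rough sketch of that delegation contract, assuming the Spark 3.x CatalogExtension interface, under which Spark calls setDelegateCatalog with the built-in catalog after instantiation (the class name is hypothetical):

```scala
import org.apache.spark.sql.connector.catalog.{CatalogExtension, CatalogPlugin}

// Left abstract: the TableCatalog / SupportsNamespaces methods are expected
// to consult `delegate` so this catalog stays consistent with spark_catalog.
// If an implementation ignores the delegate, the two catalogs can diverge
// and, as noted above, the behavior is undefined.
abstract class MyCatalog extends CatalogExtension {
  private var delegate: CatalogPlugin = _

  // Spark invokes this once, passing the built-in v1 session catalog.
  override def setDelegateCatalog(delegate: CatalogPlugin): Unit = {
    this.delegate = delegate
  }
}
```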
Test build #111983 has finished for PR 26071 at commit
Test build #112012 has finished for PR 26071 at commit
Test build #112031 has finished for PR 26071 at commit
thanks for the review, merging to master!
Thank you!
What changes were proposed in this pull request?
Refine the document of the v2 session catalog config, to clearly explain what it is, when it should be used, and how to implement it.
Why are the changes needed?
Make this config more understandable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass the Jenkins with the newly updated test cases.