
[SPARK-29412][SQL] refine the document of v2 session catalog config #26071

Closed
wants to merge 4 commits

Conversation

@cloud-fan (Contributor) commented Oct 9, 2019

What changes were proposed in this pull request?

Refine the documentation of the v2 session catalog config to clearly explain what it is, when it should be used, and how to implement it.

Why are the changes needed?

Make this config more understandable

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passes Jenkins with the newly updated test cases.

@cloud-fan (Contributor, Author)

cc @rdblue @brkyvz

"so that it can delegate calls to Spark Catalog. Otherwise, the implementation " +
"should figure out a way to access the Spark Catalog or its underlying meta-store " +
"by itself. It's important to make the implementation share the underlying meta-store " +
"of the Spark Catalog and act as an extension, instead of a separated catalog.")
@cloud-fan (Contributor, Author)

This is a complicated config and I'm trying my best to explain it.

cc some SQL guys to get fresh eyes and see if they can understand it. @dongjoon-hyun @viirya @maropu @HyukjinKwon

@rdblue (Contributor) commented Oct 9, 2019

I think you've identified the right things to point out:

  • This controls the implementation that is a v2 interface to the built-in v1 catalog
  • This catalog and the built-in v1 catalog share an identifier namespace and must be consistent
  • To delegate to the default implementation, use CatalogExtension

I'd change the wording to be a bit shorter, though. This doesn't need to explain what the catalog interface does, or what the built-in catalog can do. How about this?

A catalog implementation that will be used as the v2 interface to Spark's built-in v1 catalog, spark_catalog. This catalog shares its identifier namespace with the v1 Spark catalog and must be consistent with it; for example, if a table can be loaded by the v1 catalog, this catalog must also return the table metadata. To delegate operations to the built-in catalog, implementations can extend CatalogExtension.
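For illustration, here is a minimal sketch of what such an implementation could look like, assuming Spark's DelegatingCatalogExtension helper (an implementation of CatalogExtension that forwards all calls to the built-in catalog); the class name MySessionCatalog is hypothetical:

import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}

// Hypothetical v2 session catalog. DelegatingCatalogExtension implements
// CatalogExtension and forwards every call to the built-in spark_catalog,
// so the implementation stays consistent with the v1 catalog by default.
class MySessionCatalog extends DelegatingCatalogExtension {
  override def loadTable(ident: Identifier): Table = {
    // Custom behavior could be added here; super delegates the lookup to
    // the built-in session catalog.
    super.loadTable(ident)
  }
}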

@cloud-fan (Contributor, Author)

thanks for the suggestion!

@@ -36,8 +36,7 @@ import org.apache.spark.sql.internal.SQLConf
  */
 // TODO: all commands should look up table from the current catalog. The `SessionCatalog` doesn't
 // need to track current database at all.
-private[sql]
 class CatalogManager(
Contributor:

Nit: this isn't a related change.

@rdblue (Contributor) commented Oct 9, 2019

Looks good to me. I'd shorten the config description a bit by removing information that doesn't need to be there and focusing on how the v2 implementation is used by Spark.

@imback82 (Contributor) left a comment

LGTM

@rdblue (Contributor) commented Oct 10, 2019

+1

Unless there are other comments, I'll merge this later today.

@brkyvz (Contributor) commented Oct 10, 2019

LGTM as well

@dongjoon-hyun (Member)

Retest this please.

@@ -144,13 +144,13 @@ private[connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2Sessio
   protected val catalogClassName: String = classOf[InMemoryTableSessionCatalog].getName

   before {
-    spark.conf.set(V2_SESSION_CATALOG.key, catalogClassName)
+    spark.conf.set(V2_SESSION_CATALOG_IMPLEMENTATION.key, catalogClassName)
   }

   override def afterEach(): Unit = {
     super.afterEach()
     catalog("session").asInstanceOf[Catalog].clearTables()
Member:

This seems to be the last remaining instance.

-    catalog("session").asInstanceOf[Catalog].clearTables()
+    catalog(SESSION_CATALOG_NAME).asInstanceOf[Catalog].clearTables()

Member:

This test suite fails without this.

Contributor:

Good catch!

@@ -26,7 +26,7 @@ import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.{NoSuchTableException, TableAlreadyExistsException}
import org.apache.spark.sql.connector.catalog._
@dongjoon-hyun (Member) commented Oct 12, 2019

+  import org.apache.spark.sql.connector.catalog.CatalogManager.SESSION_CATALOG_NAME

@dongjoon-hyun (Member)

I slightly updated the PR description, @cloud-fan.

@@ -146,5 +146,5 @@ class CatalogManager(
 }

 private[sql] object CatalogManager {
-  val SESSION_CATALOG_NAME: String = "session"
+  val SESSION_CATALOG_NAME: String = "spark_catalog"
@dongjoon-hyun (Member) commented Oct 12, 2019

As we can see, this causes a test case failure. Could you describe a little why we change this in this PR, since it is not a documentation change (the PR is about refining the document of the v2 session catalog config)?

@cloud-fan (Contributor, Author)

The config name/document includes the catalog name, so I put them together.

The new name was discussed in a DS v2 community meeting. I'm proposing it in this PR to get more visibility.

"passed the Spark built-in session catalog, so that it may delegate calls to the " +
"built-in session catalog.")
val V2_SESSION_CATALOG_IMPLEMENTATION =
buildConf(s"spark.sql.catalog.$SESSION_CATALOG_NAME")
Member:

Just a side note: eventually, from the user's perspective, the exposed configuration names are different.

- spark.sql.catalog.session
+ spark.sql.catalog.spark_catalog

@cloud-fan (Contributor, Author)

yes, and it will be hard to change after 3.0 is released.
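In other words, an end user would register a custom implementation under the new name roughly like this (the implementation class name is hypothetical):

// Hypothetical usage: plug a custom v2 session catalog in front of the
// built-in spark_catalog by class name.
spark.conf.set("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")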

@dongjoon-hyun (Member) left a comment

+1 for this change, @cloud-fan. Thank you.
I left a few comments, including the test failure fixes.

@@ -77,7 +77,7 @@ class CatalogManager(
// If the V2_SESSION_CATALOG config is specified, we try to instantiate the user-specified v2
Member:

V2_SESSION_CATALOG -> V2_SESSION_CATALOG_IMPLEMENTATION?

Comment on lines +1982 to +1987
.doc("A catalog implementation that will be used as the v2 interface to Spark's built-in " +
s"v1 catalog: $SESSION_CATALOG_NAME. This catalog shares its identifier namespace with " +
s"the $SESSION_CATALOG_NAME and must be consistent with it; for example, if a table can " +
s"be loaded by the $SESSION_CATALOG_NAME, this catalog must also return the table " +
s"metadata. To delegate operations to the $SESSION_CATALOG_NAME, implementations can " +
"extend 'CatalogExtension'.")
Member:

Thanks for pinging me. I almost missed this one.

Does it mean any implementation configured here must be the v2 interface to the built-in v1 catalog? It sounds like it can either delegate or not delegate operations to $SESSION_CATALOG_NAME. If it does not delegate, is it still the v2 interface to the built-in v1 catalog?

I'm a bit confused here.

@cloud-fan (Contributor, Author)

It must be a v2 interface, but we can't fully control how it is implemented. We expect it to delegate to the SESSION_CATALOG_NAME, but there is no way to guarantee it.

If the implementation doesn't delegate, then the behavior is undefined.
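As a rough sketch of that consistency expectation (the table name is hypothetical, and which catalog path a given command takes depends on the command):

// Created through the built-in v1 session catalog path.
spark.sql("CREATE TABLE t (id INT) USING parquet")

// If the configured v2 catalog resolves this lookup, it must return the
// same table metadata as the v1 catalog; otherwise the behavior is undefined.
spark.table("t").printSchema()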

@SparkQA commented Oct 13, 2019

Test build #111983 has finished for PR 26071 at commit c54b156.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(

@SparkQA commented Oct 14, 2019

Test build #112012 has finished for PR 26071 at commit 18f7ff8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 14, 2019

Test build #112031 has finished for PR 26071 at commit e973e41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan closed this in 9407fba Oct 15, 2019

@cloud-fan (Contributor, Author)

thanks for the review, merging to master!

@dongjoon-hyun (Member)

Thank you!
