
[SPARK-33430][SQL] Support namespaces in JDBC v2 Table Catalog#30473

Closed
huaxingao wants to merge 5 commits into apache:master from huaxingao:name_space

Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Add namespace support to the JDBC v2 Table Catalog by making JDBCTableCatalog extend SupportsNamespaces.

Why are the changes needed?

Make the v2 JDBC implementation complete.

Does this PR introduce any user-facing change?

Yes. This PR adds the following methods to JDBCTableCatalog:

  • listNamespaces
  • listNamespaces(String[] namespace)
  • namespaceExists(String[] namespace)
  • loadNamespaceMetadata(String[] namespace)
  • createNamespace
  • alterNamespace
  • dropNamespace
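As a rough illustration of the user-facing effect (the catalog name and connection URL below are hypothetical placeholders, not taken from this PR), namespace commands against a registered JDBC catalog now route to the methods listed above:

```scala
// Hypothetical setup: the catalog name "h2" and the in-memory H2 URL are placeholders.
spark.conf.set("spark.sql.catalog.h2",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")

spark.sql("CREATE NAMESPACE h2.foo")  // handled by createNamespace
spark.sql("SHOW NAMESPACES IN h2")    // handled by listNamespaces
spark.sql("DROP NAMESPACE h2.foo")    // handled by dropNamespace
```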

How was this patch tested?

Add new Docker tests.

@github-actions github-actions bot added the SQL label Nov 23, 2020
@huaxingao
Contributor Author

I have a problem testing the new namespace APIs using H2: after creating a new namespace, somehow I can't see it in listNamespaces. Testing against Postgres works OK. That's why I didn't add a regular JDBC test; I added a Postgres Docker test instead.

Contributor Author


It seems to me the only property we can support for a namespace is the schema comment. I don't have a good way to retrieve the schema comment, so I will return an empty map for now.
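A minimal sketch of that behavior against plain java.sql (the PR's withConnection helper and Spark's NoSuchNamespaceException are replaced with stand-ins here, so this is illustrative only):

```scala
import java.sql.Connection
import java.util.Collections

// Sketch only: JDBC offers no portable way to read a schema's comment,
// so namespace metadata is reported as an empty map for now.
def loadNamespaceMetadata(conn: Connection, db: String): java.util.Map[String, String] = {
  val rs = conn.getMetaData.getSchemas(null, db)
  if (!rs.next()) {
    // the real code would throw Spark's NoSuchNamespaceException
    throw new IllegalArgumentException(s"Namespace $db does not exist")
  }
  Collections.emptyMap[String, String]()
}
```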

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36168/

Member

@MaxGekk MaxGekk left a comment


I have problem testing the new namespace APIs using H2

Does Derby have the same issue? If not, you could put your tests in a separate test suite.

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36168/

@SparkQA

SparkQA commented Nov 23, 2020

Test build #131567 has finished for PR 30473 at commit ec5f279.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131580 has finished for PR 30473 at commit ec5f279.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131620 has finished for PR 30473 at commit ec5f279.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging

@huaxingao
Contributor Author

retest this please

Contributor


shall we check that catalogs only have one element? otherwise it's weird to see we pick the first catalog only.

Contributor


this seems expensive. is there a better way?

Contributor Author


You mean it's expensive to fetch namespace info from the underlying databases, right? We can't save the previously fetched info and reuse it, because somebody else might have created or dropped namespaces after the last fetch. I guess we have to fetch again every time we need the info?

Contributor

@cloud-fan cloud-fan Dec 1, 2020


One way is to use conn.getMetaData.getSchemas(catalog, db) to avoid listing all databases.
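A sketch of the suggested lookup, assuming a plain java.sql.Connection in place of the PR's withConnection helper:

```scala
import java.sql.Connection

// Ask the driver only for schemas whose name matches `db`,
// instead of listing every schema and filtering on the client side.
def schemaExists(conn: Connection, db: String): Boolean = {
  val rs = conn.getMetaData.getSchemas(null, db)
  rs.next() // at least one row means a matching schema exists
}
```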

Contributor


shall we make sure it exists?

Contributor


CREATE NAMESPACE

@SparkQA

SparkQA commented Nov 30, 2020

Test build #131966 has finished for PR 30473 at commit ec5f279.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JDBCTableCatalog extends TableCatalog with SupportsNamespaces with Logging

Contributor


can we avoid hardcoding it? SupportsNamespaces.PROP_COMMENT

override def namespaceExists(namespace: Array[String]): Boolean = namespace match {
  case Array(db) =>
    withConnection { conn =>
      val rs = conn.getMetaData.getSchemas(null, db)
      rs.next() // a matching schema exists iff the driver returned a row
    }
  case _ => false
}
Contributor


does it return schemas only from the current catalog?

Contributor Author


The DB2 JDBC driver implements this to return schemas from the current catalog (the current database the JDBC driver connects to). I'm not sure how other JDBC drivers implement this. I tested with the Postgres JDBC driver; it also returns schemas from the current catalog. In listTables (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala#L62), we also assume that getTables with null as the catalog value returns tables from the current catalog.

override def listNamespaces(namespace: Array[String]): Array[Array[String]] = namespace match {
  case Array() =>
    listNamespaces()
  case Array(db) if namespaceExists(namespace) =>
    Array()
  case _ =>
    throw new NoSuchNamespaceException(namespace)
}
Contributor


shall we return the input namespace here?

Contributor Author


I am following the implementation in V2SessionCatalog (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L211). According to the method definition ("List namespaces in a namespace"), the method returns the namespaces inside the input namespace, which is empty here.

Contributor


shall we fail to match the behavior of unknown table properties?

Contributor Author


Sure. Fixed this along with the others.

Contributor


ditto, shall we fail?

Contributor


ditto

Contributor


not exists

Contributor


and shall we fail?

Contributor Author


I think we should fail. Fixed.
I was following the implementation in V2SessionCatalog (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L273). Do I need to change this too?

Contributor


this seems pgsql specific.

Contributor Author


Fixed.

@SparkQA

SparkQA commented Dec 2, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36672/

@SparkQA

SparkQA commented Dec 2, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36672/

@SparkQA

SparkQA commented Dec 3, 2020

Test build #132073 has finished for PR 30473 at commit 8317cd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor


How about def builtinNamespaces: Seq[Seq[String]]? Then in the test:

assert(catalog.listNamespaces() === Array(Array("foo")) ++ builtinNamespaces)

Contributor

@cloud-fan cloud-fan left a comment


LGTM except one comment

@SparkQA

SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36746/

@SparkQA

SparkQA commented Dec 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36746/

@SparkQA

SparkQA commented Dec 3, 2020

Test build #132145 has finished for PR 30473 at commit 68838c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 15579ba Dec 4, 2020
@huaxingao
Contributor Author

Thank you!

@huaxingao huaxingao deleted the name_space branch December 4, 2020 07:37