Hive: Support connecting to multiple Hive-Catalog #7441
Conversation
@@ -64,10 +64,15 @@ public class HiveCatalog extends BaseMetastoreCatalog implements SupportsNamespa
  public static final String LIST_ALL_TABLES = "list-all-tables";
  public static final String LIST_ALL_TABLES_DEFAULT = "false";

+ public static final String HMS_CATALOG = "hive-catalog";
Do we have to document this?
Good point, will do in a follow-up PR.
Actually, I decided not to add this extra config for now, as users can use spark.sql.catalog.$myCatalog.hadoop.metastore.catalog.default. We can revisit this later, but for now there's no new flag introduced.
It looks good. I just have a quick question: when we use multiple Hive catalogs, can we perform complex queries such as join/union between tables that are completely isolated by catalog (even in different clusters)?
Thanks. Yes, it depends on the engine; I believe Spark has this support.
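As a hypothetical sketch of what a cross-catalog query could look like in Spark SQL, assuming two Iceberg catalogs `cat1` and `cat2` are already configured in the session (the database and table names below are illustrative only, not from this PR):

```shell
# Spark resolves fully qualified names as catalog.database.table, so a single
# query can join tables from two otherwise isolated catalogs.
# cat1/cat2 and the sales/crm tables are placeholder names.
spark-sql -e "
  SELECT o.id, c.name
  FROM cat1.sales.orders o
  JOIN cat2.crm.customers c ON o.customer_id = c.id
"
```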
The test failure looks unrelated, kicking it off again.
The last test cleanup is not related to the failure (it is just good to clear state for the next test). Thanks @pvary @hililiwei for the review.
Note: As the term "Hive catalog" is overloaded, I'll use "Iceberg-HiveCatalog" to refer to Iceberg's HiveCatalog class and "HMS catalog" to refer to the Hive-specific concept.
https://issues.apache.org/jira/browse/HIVE-18685 added support for HMS catalogs, i.e. different namespaces for dbs/tables in the metastore.
However, the Iceberg-HiveCatalog uses a global cache of HMS connections keyed on the Hive-Metastore URI. This prevents different Iceberg-HiveCatalogs in the same JVM from talking to different HMS catalogs on the same HMS. See issue: #5378. This matters, for example, in Spark, where a user may try to configure different SparkCatalogs pointing to separate catalogs on the same HMS.
This change takes advantage of #6698 to fix this. It does two things:
From the Spark point of view, a user can configure 'spark.sql.catalog.$cat1.hive-catalog=foo', 'spark.sql.catalog.$cat2.hive-catalog=bar'.
Note: @RussellSpitzer pointed out to me that setting spark.sql.catalog.$myCatalog.hadoop.metastore.catalog.default=foo seems to work to configure each SparkCatalog. But it is a bit clunky and undocumented, hence the proposal to add the 'hive-catalog' shorthand.
In either case, the second change is needed to allow different cached connections to the same HMS to go to different HMS catalogs.
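To make the setup concrete, here is a hypothetical sketch of two SparkCatalogs pointing at different HMS catalogs on the same metastore, using the existing (undocumented) hadoop.metastore.catalog.default pass-through described above. The catalog names (cat1, cat2, foo, bar) and the metastore URI are placeholders, not values from this PR:

```shell
# Two Iceberg SparkCatalogs against the same HMS, each bound to a
# different HMS catalog via metastore.catalog.default.
spark-sql \
  --conf spark.sql.catalog.cat1=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.cat1.type=hive \
  --conf spark.sql.catalog.cat1.uri=thrift://metastore-host:9083 \
  --conf spark.sql.catalog.cat1.hadoop.metastore.catalog.default=foo \
  --conf spark.sql.catalog.cat2=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.cat2.type=hive \
  --conf spark.sql.catalog.cat2.uri=thrift://metastore-host:9083 \
  --conf spark.sql.catalog.cat2.hadoop.metastore.catalog.default=bar
```

With this change, connections cached for cat1 and cat2 no longer collide even though both point at the same metastore URI.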