
Hive: unify catalog experience across engines #2565

Merged (9 commits) on May 18, 2021

Conversation

jackye1995 (Contributor):

As discussed in #2544, this makes the Hive catalog experience consistent with Spark and Flink. Specifically:

  1. When type is null, look at catalog-impl; if that is also null, use HiveCatalog as the default.
  2. The CATALOG_WAREHOUSE_TEMPLATE and CATALOG_CLASS_TEMPLATE templates are no longer needed, because they are unified in CatalogProperties.

I also rewrote the tests so that each case is a separate test instead of one giant test; please take a look in case I missed any edge case.

@pvary @marton-bod @lcspinter @aokolnychyi @rdblue
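For illustration, the resolution order in point 1 could be sketched as follows. This is a simplified, hypothetical sketch — the class, method name, and return values are illustrative, not the actual Catalogs implementation:

```java
// Simplified, hypothetical sketch of the catalog resolution order described
// above; not the actual org.apache.iceberg.mr.Catalogs implementation.
public class CatalogResolutionSketch {

  // Returns the catalog implementation to load for the given config values.
  static String resolveCatalogImpl(String type, String catalogImpl) {
    if (type != null) {
      // an explicit type (e.g. "hive" or "hadoop") wins
      return type;
    }
    if (catalogImpl != null) {
      // fall back to a custom catalog class when type is unset
      return catalogImpl;
    }
    // default to HiveCatalog when both type and catalog-impl are null
    return "hive";
  }

  public static void main(String[] args) {
    System.out.println(resolveCatalogImpl(null, null));                    // hive
    System.out.println(resolveCatalogImpl("hadoop", null));                // hadoop
    System.out.println(resolveCatalogImpl(null, "com.example.MyCatalog")); // com.example.MyCatalog
  }
}
```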

Assert.assertTrue(hadoopCatalog.isPresent());
Assert.assertTrue(hadoopCatalog.get() instanceof HadoopCatalog);
Assert.assertEquals("HadoopCatalog{name=barCatalog, location=/tmp/mylocation}", hadoopCatalog.get().toString());
Collaborator:

I think hardcoding the toString in the assert can become brittle and difficult to maintain over time. Shouldn't we add a warehouseLocation() method to HadoopCatalog, and then assert on catalog.name() and ((HadoopCatalog) catalog).warehouseLocation() for the same effect?

jackye1995 (author):

This was in the original test, so sorry, I did not look too closely into it. I agree it would be better to test with those methods, but unfortunately HadoopCatalog.warehouseLocation() does not exist, so we cannot use it. I suppose that is why the test obtains the warehouse location this way.

- String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
- if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
+ String catalogType = conf.get(InputFormatConfig.catalogTypeConfigKey(catalogName));
+ if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME)) {
Collaborator:

Do we need to update the javadocs of this method according to this change?

jackye1995 (author):

updated

jackye1995 (author):

I moved this method's doc to the Catalogs class level, because docs on private methods are not generated in the HTML javadoc.

 * @return Hadoop config key of catalog type for the catalog name
 */
public static String catalogTypeConfigKey(String catalogName) {
  return String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName);
}
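For context, the helper simply interpolates the catalog name into a printf-style template. A minimal, self-contained sketch, assuming a template of the form "iceberg.catalog.%s.type" — the real constant is InputFormatConfig.CATALOG_TYPE_TEMPLATE, and its exact value may differ:

```java
public class ConfigKeySketch {
  // Assumed template; stands in for InputFormatConfig.CATALOG_TYPE_TEMPLATE.
  static final String CATALOG_TYPE_TEMPLATE = "iceberg.catalog.%s.type";

  // Builds the Hadoop config key holding the catalog type for a named catalog.
  static String catalogTypeConfigKey(String catalogName) {
    return String.format(CATALOG_TYPE_TEMPLATE, catalogName);
  }

  public static void main(String[] args) {
    System.out.println(catalogTypeConfigKey("barCatalog")); // iceberg.catalog.barCatalog.type
  }
}
```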
Collaborator:

Shall we add type to CatalogProperties as well, so we can reuse catalogPropertyConfigKey here?

jackye1995 (author):

It was not added because type is not consistent across engines: Flink has to use catalog-type instead of type, because type is a reserved config key.

That said, we can probably add it and specify that Flink has to use catalog-type. Let me do that.

Collaborator:

I see. Which one does Spark use? Would it make sense to standardize with the other engines now and use catalog-type here too?

jackye1995 (author):

Yeah, that's actually one option we can consider. @rdblue @aokolnychyi, would it be acceptable to unify all engines to use catalog-type instead, with type being just an alias for catalog-type in Spark and Hive?

jackye1995 (author):

Since Ryan and Anton have not replied, I will keep it like this; we can always add catalog-type as a new catalog property in a follow-up PR.

jackye1995 (author):

In addition, I am treating CatalogUtil.ICEBERG_CATALOG_TYPE as the catalog property to use, and removed references to InputFormatConfig.CATALOG_TYPE_TEMPLATE. I also added documentation inside CatalogUtil to make things clear.

I don't personally like that CatalogUtil.ICEBERG_CATALOG_TYPE lives in CatalogUtil instead of CatalogProperties, but it was already released in 0.11 as a public variable, so I decided to keep it this way.

AssertHelpers.assertThrows(
"should complain about catalog not supported", UnsupportedOperationException.class,
"Unknown catalog type", () -> Catalogs.loadCatalog(conf, null));
}
Collaborator:

Don't we need a test case for legacy_custom too?

jackye1995 (author):

Yeah, there was no test for that; I will add it.

jackye1995 (author):

Actually, because CATALOG_LOADER_CLASS (iceberg.mr.catalog.loader.class) is no longer used, we cannot load a custom catalog in legacy mode, so there is no test for it.

Looking at https://github.com/apache/iceberg/pull/2129/files#diff-c183ea3aa154c2a5012f87d7a06dba3cff3f27975384e9fb4040fe6850a98bd6L192-L193, this seems like a backwards-incompatible change. @lcspinter is this intentional, or should we add that part of the logic back?

jackye1995 (author):

For now, I have updated the code to allow this legacy behavior and added the missing test.

marton-bod (Collaborator):

Thanks @jackye1995 for this; a few comments. Also, shouldn't we include the Hive docs change (re: catalog loading) in this PR too, to keep them in lockstep?

* @param catalogName catalog name
* @return Hadoop config key of catalog warehouse location for the catalog name
*/
public static String catalogWarehouseConfigKey(String catalogName) {
Contributor:

Do we use this in any other part of the code than the ones that we will deprecate when we deprecate the old configs? If so, we might want to add @deprecated so nobody starts to use it.

jackye1995 (author):

No. I added this method because the warehouse property is commonly used. Technically, we could just use the catalogPropertyConfigKey(...) method for everything.
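To illustrate the point, a generic per-catalog property key builder can cover the warehouse case (and any other property). This is a hypothetical sketch assuming a key scheme of iceberg.catalog.&lt;name&gt;.&lt;property&gt;; it is not the actual InputFormatConfig code:

```java
public class PropertyKeySketch {
  // Assumed key scheme; stands in for InputFormatConfig.catalogPropertyConfigKey.
  static String catalogPropertyConfigKey(String catalogName, String property) {
    return String.format("iceberg.catalog.%s.%s", catalogName, property);
  }

  public static void main(String[] args) {
    // The dedicated warehouse helper becomes just one instance of the generic one.
    System.out.println(catalogPropertyConfigKey("barCatalog", "warehouse"));
    System.out.println(catalogPropertyConfigKey("barCatalog", "catalog-impl"));
  }
}
```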

pvary (Contributor):

The warehouse should be an internal thing for the HadoopCatalog; I think no other catalog should use it. Am I missing something?

jackye1995 (author):

@pvary it is used in all catalogs except HiveCatalog, because Hive uses HiveConf.ConfVars.METASTOREWAREHOUSE instead.

pvary (Contributor):

I might be missing something, but we use catalogWarehouseConfigKey only in tests. The reason is that the tests need to set/get the warehouse for the HadoopCatalog and the CustomCatalog (which is itself only a test class). Production code does not need this key, so I think we should move it from the live code to a test class.

jackye1995 (author):

Sure, I can remove that method.

jackye1995 (author):

> shouldn't we include the hive docs change (re. catalog loading) in this PR too to keep them in lockstep?

@marton-bod Depending on which one gets merged first, I will make the change in the other PR accordingly.

jackye1995 (author):

We have merged the Hive doc PR, so I have added updates based on that.

jackye1995 (author):

@pvary @marton-bod Updated and fixed all tests; should be good for another review, thanks!

pvary (Contributor), May 18, 2021:

@lcspinter: could you please take a look, as you are the one most familiar with this code now?

Thanks,
Peter

lcspinter (Contributor) left a comment:

Thanks @jackye1995 for the PR! LGTM

marton-bod (Collaborator) left a comment:

LGTM, thanks for fixing this!

pvary merged commit 324b11a into apache:master on May 18, 2021.
pvary (Contributor), May 18, 2021:

Thanks @jackye1995 for the patch and @marton-bod , @lcspinter for the reviews!

cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
- Add blurbs for apache#2565, apache#2583, and apache#2547.
- Make formatting consistent.
cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
cwsteinbach added a commit that referenced this pull request Aug 19, 2021
* Add 0.12.0 release notes pt 2

* Add more blurbs and fix formatting.

- Add blurbs for #2565, #2583, and #2547.
- Make formatting consistent.

* Add blurb for #2613 Hive Vectorized Reader

* Reword blurbs for #2565 and #2365

* More changes based on review comments

* More updates to the 0.12.0 release notes

* Add blurb for #2232 fix parquet row group filters

* Add blurb for #2308