
Hive: unify catalog experience across engines #2565

Merged (9 commits) on May 18, 2021

Conversation

jackye1995 (Contributor):

As discussed in #2544, this makes the Hive catalog experience consistent with Spark and Flink. Specifically:

  1. When type is null, look at catalog-impl; if that is also null, use HiveCatalog as the default.
  2. The CATALOG_WAREHOUSE_TEMPLATE and CATALOG_CLASS_TEMPLATE templates are no longer needed, because they are unified in CatalogProperties.

I also rewrote the tests so that each case is a separate test instead of one giant test; please take a look in case I missed any edge case.

@pvary @marton-bod @lcspinter @aokolnychyi @rdblue
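For illustration, the resolution order in point 1 could be sketched as follows. This is a simplified, hypothetical sketch — the class, method name, and return values are illustrative, not the actual Catalogs implementation:

```java
// Simplified, hypothetical sketch of the catalog resolution order described
// above; not the actual org.apache.iceberg.mr.Catalogs implementation.
public class CatalogResolutionSketch {

  // Returns the catalog implementation to load for the given config values.
  static String resolveCatalogImpl(String type, String catalogImpl) {
    if (type != null) {
      // an explicit type (e.g. "hive" or "hadoop") wins
      return type;
    }
    if (catalogImpl != null) {
      // fall back to a custom catalog class when type is unset
      return catalogImpl;
    }
    // default to HiveCatalog when both type and catalog-impl are null
    return "hive";
  }

  public static void main(String[] args) {
    System.out.println(resolveCatalogImpl(null, null));                    // hive
    System.out.println(resolveCatalogImpl("hadoop", null));                // hadoop
    System.out.println(resolveCatalogImpl(null, "com.example.MyCatalog")); // com.example.MyCatalog
  }
}
```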

Assert.assertTrue(hadoopCatalog.isPresent());
Assert.assertTrue(hadoopCatalog.get() instanceof HadoopCatalog);
Assert.assertEquals("HadoopCatalog{name=barCatalog, location=/tmp/mylocation}", hadoopCatalog.get().toString());
Collaborator:

I think hardcoding the toString in the assert can become brittle and difficult to maintain over time. Shouldn't we add a warehouseLocation() method to HadoopCatalog, and then assert on catalog.name() and ((HadoopCatalog) catalog).warehouseLocation() for the same effect?

jackye1995 (author):

This was in the original test, so sorry, I did not look too closely into it. I agree it would be better to test with those methods, but unfortunately HadoopCatalog.warehouseLocation() does not exist, so we cannot use it. I suppose that is why the test obtains the warehouse location this way.

- String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
- if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
+ String catalogType = conf.get(InputFormatConfig.catalogTypeConfigKey(catalogName));
+ if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME)) {
Collaborator:

Do we need to update the javadocs of this method according to this change?

jackye1995 (author):

updated

jackye1995 (author):

I moved this method's doc to the Catalogs class level, because docs on private methods are not generated in the HTML javadoc.

 * @return Hadoop config key of catalog type for the catalog name
 */
public static String catalogTypeConfigKey(String catalogName) {
  return String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName);
}
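For context, the helper simply interpolates the catalog name into a printf-style template. A minimal, self-contained sketch, assuming a template of the form "iceberg.catalog.%s.type" — the real constant is InputFormatConfig.CATALOG_TYPE_TEMPLATE, and its exact value may differ:

```java
public class ConfigKeySketch {
  // Assumed template; stands in for InputFormatConfig.CATALOG_TYPE_TEMPLATE.
  static final String CATALOG_TYPE_TEMPLATE = "iceberg.catalog.%s.type";

  // Builds the Hadoop config key holding the catalog type for a named catalog.
  static String catalogTypeConfigKey(String catalogName) {
    return String.format(CATALOG_TYPE_TEMPLATE, catalogName);
  }

  public static void main(String[] args) {
    System.out.println(catalogTypeConfigKey("barCatalog")); // iceberg.catalog.barCatalog.type
  }
}
```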
Collaborator:

Shall we add type to CatalogProperties as well, so we can reuse catalogPropertyConfigKey here?

jackye1995 (author):

It was not added because type is not consistent across engines: Flink has to use catalog-type instead of type, because type is a reserved config key.

That said, we can probably add it and specify that Flink has to use catalog-type. Let me do that.

Collaborator:

I see. Which one does Spark use? Would it make sense to standardize with the other engines now and use catalog-type here too?

jackye1995 (author):

Yeah, that's actually one option we can consider. @rdblue @aokolnychyi, would it be acceptable to unify all engines to use catalog-type instead, with type being just an alias for catalog-type in Spark and Hive?

jackye1995 (author):

Since Ryan and Anton have not replied, I will keep it like this; we can always add catalog-type as a new catalog property in a follow-up PR.

jackye1995 (author):

In addition, I am treating CatalogUtil.ICEBERG_CATALOG_TYPE as the catalog property to use, and removed references to InputFormatConfig.CATALOG_TYPE_TEMPLATE. I also added documentation inside CatalogUtil to make things clear.

I don't personally like that CatalogUtil.ICEBERG_CATALOG_TYPE lives in CatalogUtil instead of CatalogProperties, but it was already released in 0.11 as a public variable, so I decided to keep it this way.

AssertHelpers.assertThrows(
"should complain about catalog not supported", UnsupportedOperationException.class,
"Unknown catalog type", () -> Catalogs.loadCatalog(conf, null));
}
Collaborator:

Don't we need a test case for legacy_custom too?

jackye1995 (author):

Yeah, there was no test for that; I will add it.

jackye1995 (author):

Actually, because CATALOG_LOADER_CLASS (iceberg.mr.catalog.loader.class) is no longer used, we cannot load a custom catalog in legacy mode, so there is no test for it.

Looking at https://github.com/apache/iceberg/pull/2129/files#diff-c183ea3aa154c2a5012f87d7a06dba3cff3f27975384e9fb4040fe6850a98bd6L192-L193, this seems like a backwards-incompatible change. @lcspinter is this intentional, or should we add that part of the logic back?

jackye1995 (author):

For now, I have updated the code to allow this legacy behavior and added the missing test.

marton-bod (Collaborator):

Thanks @jackye1995 for this; a few comments. Also, shouldn't we include the Hive docs change (re: catalog loading) in this PR too, to keep them in lockstep?

* @param catalogName catalog name
* @return Hadoop config key of catalog warehouse location for the catalog name
*/
public static String catalogWarehouseConfigKey(String catalogName) {
Contributor:

Do we use this in any other part of the code than the ones that we will deprecate when we deprecate the old configs? If so, we might want to add @deprecated so nobody starts to use it.

jackye1995 (author):

No. I added this method because the warehouse property is commonly used. Technically, we could just use the catalogPropertyConfigKey(...) method for everything.
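To illustrate the point, a generic per-catalog property key builder can cover the warehouse case (and any other property). This is a hypothetical sketch assuming a key scheme of iceberg.catalog.&lt;name&gt;.&lt;property&gt;; it is not the actual InputFormatConfig code:

```java
public class PropertyKeySketch {
  // Assumed key scheme; stands in for InputFormatConfig.catalogPropertyConfigKey.
  static String catalogPropertyConfigKey(String catalogName, String property) {
    return String.format("iceberg.catalog.%s.%s", catalogName, property);
  }

  public static void main(String[] args) {
    // The dedicated warehouse helper becomes just one instance of the generic one.
    System.out.println(catalogPropertyConfigKey("barCatalog", "warehouse"));
    System.out.println(catalogPropertyConfigKey("barCatalog", "catalog-impl"));
  }
}
```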

pvary (Contributor):

The warehouse should be an internal thing for the HadoopCatalog; I think no other catalog should use it. Am I missing something?

jackye1995 (author):

@pvary it is used in all catalogs except HiveCatalog, because Hive uses HiveConf.ConfVars.METASTOREWAREHOUSE instead.

pvary (Contributor):

I might be missing something, but we use catalogWarehouseConfigKey only in tests. The reason is that the tests need to set/get the warehouse for the HadoopCatalog and the CustomCatalog (which is itself only a test class). Production code does not need this key, so I think we should move it from the live code to a test class.

jackye1995 (author):

Sure, I can remove that method.

jackye1995 (author):

> shouldn't we include the hive docs change (re. catalog loading) in this PR too to keep them in lockstep?

@marton-bod Depending on which one gets merged first, I will make the change in the other PR accordingly.

jackye1995 (author):

We have merged the Hive doc PR, so I have added updates based on that.

jackye1995 (author):

@pvary @marton-bod Updated and fixed all tests; should be good for another review, thanks!

pvary (Contributor), May 18, 2021:

@lcspinter: could you please take a look, as you are the one most familiar with this code now?

Thanks,
Peter

lcspinter (Contributor) left a comment:

Thanks @jackye1995 for the PR! LGTM

marton-bod (Collaborator) left a comment:

LGTM, thanks for fixing this!

pvary merged commit 324b11a into apache:master on May 18, 2021.
pvary (Contributor), May 18, 2021:

Thanks @jackye1995 for the patch and @marton-bod , @lcspinter for the reviews!

cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
- Add blurbs for apache#2565, apache#2583, and apache#2547.
- Make formatting consistent.
cwsteinbach added a commit to cwsteinbach/apache-iceberg that referenced this pull request Aug 17, 2021
cwsteinbach added a commit that referenced this pull request Aug 19, 2021
* Add 0.12.0 release notes pt 2

* Add more blurbs and fix formatting.

- Add blurbs for #2565, #2583, and #2547.
- Make formatting consistent.

* Add blurb for #2613 Hive Vectorized Reader

* Reword blurbs for #2565 and #2365

* More changes based on review comments

* More updates to the 0.12.0 release notes

* Add blurb for #2232 fix parquet row group filters

* Add blurb for #2308