Hive: unify catalog experience across engines #2565
Conversation
Assert.assertTrue(hadoopCatalog.isPresent());
Assert.assertTrue(hadoopCatalog.get() instanceof HadoopCatalog);
Assert.assertEquals("HadoopCatalog{name=barCatalog, location=/tmp/mylocation}", hadoopCatalog.get().toString());
I think hardcoding the toString in the assert can become brittle and difficult to maintain over time. Shouldn't we add a warehouseLocation() method to HadoopCatalog, and then assert for catalog.name() and ((HadoopCatalog) catalog).warehouseLocation() for the same effect?
This was in the original test, so sorry, I did not look too much into it. I agree that it's better to test with those methods, but unfortunately HadoopCatalog.warehouseLocation() does not exist, so we cannot really use that. I suppose that is why this test is written this way to get the warehouse location.
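To illustrate the trade-off being discussed, here is a minimal sketch using a hypothetical stand-in class (the real HadoopCatalog lives in the Iceberg codebase and, as noted above, does not have a warehouseLocation() accessor):

```java
// FakeHadoopCatalog is a hypothetical stand-in, only to contrast the two
// assertion styles from the review thread.
class FakeHadoopCatalog {
  private final String name;
  private final String warehouseLocation;

  FakeHadoopCatalog(String name, String warehouseLocation) {
    this.name = name;
    this.warehouseLocation = warehouseLocation;
  }

  String name() {
    return name;
  }

  // The accessor proposed in the review; the real class lacked it.
  String warehouseLocation() {
    return warehouseLocation;
  }

  @Override
  public String toString() {
    return "HadoopCatalog{name=" + name + ", location=" + warehouseLocation + "}";
  }
}

public class AssertionStyles {
  public static void main(String[] args) {
    FakeHadoopCatalog catalog = new FakeHadoopCatalog("barCatalog", "/tmp/mylocation");

    // Brittle style: breaks whenever toString() changes its format.
    String expected = "HadoopCatalog{name=barCatalog, location=/tmp/mylocation}";
    if (!catalog.toString().equals(expected)) {
      throw new AssertionError("toString mismatch");
    }

    // Proposed style: asserts only the fields of interest.
    if (!catalog.name().equals("barCatalog")
        || !catalog.warehouseLocation().equals("/tmp/mylocation")) {
      throw new AssertionError("field mismatch");
    }
    System.out.println("ok");
  }
}
```

The second style survives formatting changes to toString(), which is the maintainability point raised above.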
String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
String catalogType = conf.get(InputFormatConfig.catalogTypeConfigKey(catalogName));
if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME)) {
Do we need to update the javadocs of this method according to this change?
updated
I moved the doc of this method to the Catalogs class level, because private method docs are not generated in HTML.
 * @return Hadoop config key of catalog type for the catalog name
 */
public static String catalogTypeConfigKey(String catalogName) {
  return String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName);
Shall we add type to the CatalogProperties as well, so we can reuse catalogPropertyConfigKey here?
It was not added because type is not consistent across engines. Flink has to use catalog-type instead of type, because type is a reserved config key. With that being said, we can probably add it and specify that Flink has to use catalog-type. Let me do that.
I see. Which one does Spark use? Would it make sense to standardize with the other engines now and use catalog-type here too?
Yeah, that's actually one way we can consider. @rdblue @aokolnychyi, would it be acceptable if we unify all engines to use catalog-type instead, and type is just an alias for catalog-type in Spark and Hive?
While Ryan and Anton have not replied, I will keep it like this; we can always add catalog-type as a new catalog property in a new PR.
In addition, I am treating CatalogUtil.ICEBERG_CATALOG_TYPE as the catalog property to use, and removed the reference to InputFormatConfig.CATALOG_TYPE_TEMPLATE. I also added documentation inside CatalogUtil to make things clear.

I don't personally like the fact that CatalogUtil.ICEBERG_CATALOG_TYPE is in CatalogUtil instead of CatalogProperties, but we already released it in 0.11 as a public variable, so I decided to just keep it this way.
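The per-catalog key templating discussed in this thread can be sketched as follows. The template string here is an assumption for illustration; the real constants live in InputFormatConfig and CatalogProperties:

```java
// Sketch of per-catalog config key templating. The template
// "iceberg.catalog.%s.%s" is assumed for illustration and may differ
// from the actual constant in the Iceberg codebase.
public class CatalogConfigKeys {
  static final String CATALOG_PROPERTY_TEMPLATE = "iceberg.catalog.%s.%s";

  // One generic helper covers type, warehouse, and any other property,
  // which is the reuse suggested in the review.
  static String catalogPropertyConfigKey(String catalogName, String property) {
    return String.format(CATALOG_PROPERTY_TEMPLATE, catalogName, property);
  }

  public static void main(String[] args) {
    // Hive/Spark would read "type"; Flink has to read "catalog-type"
    // because "type" is a reserved key there.
    System.out.println(catalogPropertyConfigKey("barCatalog", "type"));
    System.out.println(catalogPropertyConfigKey("barCatalog", "catalog-type"));
    System.out.println(catalogPropertyConfigKey("barCatalog", "warehouse"));
  }
}
```

With a single generic helper, dedicated methods such as catalogTypeConfigKey become thin wrappers over one template.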
AssertHelpers.assertThrows(
    "should complain about catalog not supported", UnsupportedOperationException.class,
    "Unknown catalog type", () -> Catalogs.loadCatalog(conf, null));
}
Don't we need a test case for legacy_custom too?
Yeah there was no test for that, I will add it.
Actually, because the CATALOG_LOADER_CLASS (iceberg.mr.catalog.loader.class) is no longer used, we cannot load a custom catalog in the legacy mode, so there is no test for it.

Looking at https://github.com/apache/iceberg/pull/2129/files#diff-c183ea3aa154c2a5012f87d7a06dba3cff3f27975384e9fb4040fe6850a98bd6L192-L193, this seems like a backwards-incompatible change. @lcspinter is this intentional, or should we add that part of the logic back?
For now, I have updated the code to allow this legacy behavior, and added the missing test.
Thanks @jackye1995 for this, a few comments.
 * @param catalogName catalog name
 * @return Hadoop config key of catalog warehouse location for the catalog name
 */
public static String catalogWarehouseConfigKey(String catalogName) {
Do we use this in other parts of the code than the ones we will deprecate when we deprecate the old configs? If not, we might want to add @deprecated so nobody will start to use it.
No. I added this method because the warehouse property is commonly used. Technically we can just use the catalogPropertyConfigKey(...) method for everything.
The warehouse should be an internal thing for the HadoopCatalog. I think no other catalog should use that. Am I missing something?
@pvary it is used in all catalogs except HiveCatalog, because Hive uses HiveConf.ConfVars.METASTOREWAREHOUSE instead.
I might be missing something, but we use this catalogWarehouseConfigKey only in tests. The reason is that in the tests we would like to set/get the warehouse for the HadoopCatalog and the CustomCatalog (which is again only a test class). In production code we do not need this key, so I think we should move it from the live code to a test class.
Sure, I can remove that method.
@marton-bod depending on which one gets merged first, I will make the change in the other PR accordingly.
We have merged the Hive doc PR, so I have added updates based on that.
@pvary @marton-bod updated and fixed all tests, should be good for another review, thanks!
@lcspinter: could you please take a look, as you are the one most familiar with this code now? Thanks!
Thanks, @jackye1995 for the PR! lgtm
LGTM, thanks for fixing this!
Thanks @jackye1995 for the patch and @marton-bod, @lcspinter for the reviews!
* Add 0.12.0 release notes pt 2
* Add more blurbs and fix formatting.
  - Add blurbs for #2565, #2583, and #2547.
  - Make formatting consistent.
* Add blurb for #2613 Hive Vectorized Reader
* Reword blurbs for #2565 and #2365
* More changes based on review comments
* More updates to the 0.12.0 release notes
* Add blurb for #2232 fix parquet row group filters
* Add blurb for #2308
As discussed in #2544, make the Hive catalog experience consistent with Spark and Flink, specifically:
- If type is null, look at catalog-impl; if that is also null, then use HiveCatalog as the default.
- Removed CATALOG_WAREHOUSE_TEMPLATE and CATALOG_CLASS_TEMPLATE, because they are unified in CatalogProperties.

I also rewrote all the tests to make each test case clear in separate tests instead of a single giant test; please take a look if I missed any edge case in testing.
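The resolution order described in the PR can be sketched as follows. This is an assumption based on the description above, not the actual Catalogs implementation, and the string labels are purely illustrative:

```java
// Sketch of the catalog resolution precedence described in the PR:
// explicit type wins, then catalog-impl, then HiveCatalog as the default.
public class CatalogResolution {
  static String resolveCatalog(String type, String catalogImpl) {
    if (type != null) {
      return type;                    // e.g. "hadoop" or "hive", set explicitly
    }
    if (catalogImpl != null) {
      return "custom:" + catalogImpl; // load the configured custom Catalog class
    }
    return "hive";                    // neither set: default to HiveCatalog
  }

  public static void main(String[] args) {
    System.out.println(resolveCatalog("hadoop", null));
    System.out.println(resolveCatalog(null, "com.example.MyCatalog"));
    System.out.println(resolveCatalog(null, null)); // hive
  }
}
```

This matches how Spark and Flink resolve their catalogs, which is the consistency goal of the PR.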
@pvary @marton-bod @lcspinter @aokolnychyi @rdblue