Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hive: Configure catalog type on table level. #2129

Merged
merged 13 commits into from Apr 12, 2021
Merged

Conversation

lcspinter
Copy link
Contributor

@lcspinter lcspinter commented Jan 21, 2021

The current catalog configuration is stored in the main hive configuration, by setting the iceberg.mr.catalog. This works perfectly when the user is working on a dataset when all the tables are coming from the same catalog.
In case of operations involving multiple tables from different catalogs, this implementation fails to serve the need.

This PR provides a solution for this, by implementing Spark-like catalog configuration. The catalog configuration is stored in the hive main configuration, the same way as it handled in Spark, and on table level, the name of the catalog and the table identifier is stored. If catalog name is not defined on the table, a default catalog is used.

Here is an example of how to configure a Hadoop-catalog

In the main hive configuration we store the following properties:

  • iceberg.catalog.<catalog_name>.type = hadoop
  • iceberg.catalog.<catalog_name>.warehouse = somelocation

On the table level we have the following properties:

  • iceberg.mr.table.catalog = <catalog_name>
  • iceberg.mr.table.identifier = <database.table_name>

If property iceberg.mr.table.catalog is missing from the table, it starts looking for a catalogue definition with the name "default". If that is also missing, the original implementation is used, where the propertyiceberg.mr.catalog stores the catalog information.

Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lcspinter, I was planning a similar change but you beat me to it :-D

My only feedback is to use the same idiom as the rest of iceberg for constructing catalogs, I will test this today with the NessieCatalog and post here if successful.

mr/src/main/java/org/apache/iceberg/mr/Catalogs.java Outdated Show resolved Hide resolved
@lcspinter lcspinter force-pushed the CDPD-20021 branch 5 times, most recently from 5869fa5 to 27f91dc Compare January 25, 2021 19:56
@lcspinter lcspinter force-pushed the CDPD-20021 branch 2 times, most recently from f274a3c to 251598f Compare April 8, 2021 08:14
@pvary
Copy link
Contributor

pvary commented Apr 9, 2021

I am comfortable with the current solution, but I see that you still have some open comments there.

@rymurr, @marton-bod: Could we please finalize the review of this patch?
@rymurr: As a new committer you can even +1 the patch too 😄 (I would prefer not to merge PRs where someone requested a change)

@rdblue: Any more comments before merge?

Thanks,
Peter

Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @lcspinter. very exciting!

Copy link
Collaborator

@marton-bod marton-bod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too!

@lcspinter
Copy link
Contributor Author

Thanks, @rymurr @pvary @marton-bod @rdblue for the reviews!

@pvary pvary merged commit db8248c into apache:master Apr 12, 2021
@pvary
Copy link
Contributor

pvary commented Apr 12, 2021

Merged the PR to master.
Thanks for everyone involved!

@lcspinter: Could we please update the docs about the hive catalog configurations?

Thanks,
Peter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants