
Hive: Preparation to enable Hive writes with Tez engine #2163

Merged
rdblue merged 3 commits into apache:master from marton-bod:tez_write on Feb 4, 2021

@marton-bod
Collaborator

To enable Hive writes with the Tez engine, we need to make a few modifications to the OutputCommitter because of how Tez works internally. There are two main reasons for the changes:

  1. Tez: inconsistent inclusion of vertexId in the TaskAttemptID. Tez does not construct TaskAttemptIDs the same way everywhere: in some places the vertexId is included, in others it is not. The resulting IDs differ, which causes problems when retrieving the Writer from the cache based on this ID.
  2. Tez: taskType (mapper/reducer) is part of TaskAttemptID's equals/hashCode. This prevents, for example, a reducer from retrieving a Writer object that was previously cached by a mapper.
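To make the two points above concrete, here is a minimal, self-contained sketch of the lookup problem. Everything in it is hypothetical and for illustration only: `AttemptId` is a stand-in rather than the Hadoop/Tez `TaskAttemptID` class, `AttemptKey` is an invented wrapper, and the `String` value stands in for a Writer. It does not claim to mirror the actual changes in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class AttemptKeySketch {

  public enum TaskType { MAP, REDUCE }

  /** Hypothetical stand-in for a task attempt ID (not the Hadoop/Tez class). */
  public static final class AttemptId {
    final String jobId;
    final Integer vertexId; // null when the ID was built without a vertex
    final TaskType taskType;
    final int taskId;
    final int attempt;

    public AttemptId(String jobId, Integer vertexId, TaskType taskType, int taskId, int attempt) {
      this.jobId = jobId;
      this.vertexId = vertexId;
      this.taskType = taskType;
      this.taskId = taskId;
      this.attempt = attempt;
    }

    // Equality covers every field, including vertexId and taskType.
    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof AttemptId)) return false;
      AttemptId that = (AttemptId) o;
      return jobId.equals(that.jobId)
          && Objects.equals(vertexId, that.vertexId)
          && taskType == that.taskType
          && taskId == that.taskId
          && attempt == that.attempt;
    }

    @Override
    public int hashCode() {
      return Objects.hash(jobId, vertexId, taskType, taskId, attempt);
    }
  }

  /** Hypothetical cache key that compares only jobId, taskId and attempt. */
  public static final class AttemptKey {
    private final AttemptId id;

    public AttemptKey(AttemptId id) {
      this.id = id;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof AttemptKey)) return false;
      AttemptId that = ((AttemptKey) o).id;
      return id.jobId.equals(that.jobId) && id.taskId == that.taskId && id.attempt == that.attempt;
    }

    @Override
    public int hashCode() {
      return Objects.hash(id.jobId, id.taskId, id.attempt);
    }
  }

  public static void main(String[] args) {
    // Same logical attempt, but built with/without a vertexId and with different task types.
    AttemptId cachedWith = new AttemptId("job_1", 1, TaskType.MAP, 0, 0);
    AttemptId lookedUpWith = new AttemptId("job_1", null, TaskType.REDUCE, 0, 0);

    Map<AttemptId, String> rawCache = new HashMap<>();
    rawCache.put(cachedWith, "writer");
    System.out.println(rawCache.get(lookedUpWith)); // prints "null": the lookup misses

    Map<AttemptKey, String> keyedCache = new HashMap<>();
    keyedCache.put(new AttemptKey(cachedWith), "writer");
    System.out.println(keyedCache.get(new AttemptKey(lookedUpWith))); // prints "writer"
  }
}
```

Note the trade-off in this sketch: a key that ignores vertexId and taskType treats attempts from different vertices of the same job as equal whenever their task and attempt numbers match, so whether such a key is safe depends on how the cache is actually used.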

Enabling the unit tests to run on Tez will be done in a future PR. For that work, we'll need to release a new version of Hive and Tez containing the necessary patches (mainly HIVE-24629 and TEZ-4264) and update the dependencies here.

@marton-bod
Collaborator Author

@pvary @lcspinter could you please review when you get the chance? Thanks!

@pvary
Contributor

pvary commented Feb 1, 2021

+1 from my side, but I would like to hear @rdblue's opinion about pushing code that we cannot create tests for.

@rdblue: A bit of context: internally we were able to run write tests with Tez, but multiple unreleased fixes are needed in both Hive and Tez. The fixes are available in the Apache repos, but they are not present in any release at the moment. Releasing both components would be a slow process because of the dependencies, and would greatly delay adding code here. Shall we wait for those releases, or can we add code here that can only be used by patched versions of Hive and Tez?

Thanks,
Peter

@rdblue
Contributor

rdblue commented Feb 1, 2021

@pvary, @marton-bod, as long as we are passing the existing tests, I think it is fine to add code that will be more thoroughly tested later. Better to get it in sooner.

@rdblue
Contributor

rdblue commented Feb 3, 2021

Will merge when tests pass. Thanks for updating this, @marton-bod!

@marton-bod marton-bod force-pushed the tez_write branch 2 times, most recently from 8b57ca1 to 5155f5a Compare February 4, 2021 14:37
@marton-bod
Collaborator Author

Thanks a lot for your review, @rdblue!

@rdblue rdblue merged commit 3ab15bf into apache:master Feb 4, 2021
coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021
Co-authored-by: Marton Bod <mbod@cloudera.com>

3 participants