-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVRO-3266: Work with hadoop 3.x PathOutputComitters #1618
AVRO-3266: Work with hadoop 3.x PathOutputComitters #1618
Conversation
In hadoop 3.x the abstract class PathOutputCommitter defines the method `getWorkPath()`, but in hadoop 2.x it only defined on FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x and support committers that does only implements PathOutputComitter and not FileOutputCommitter we make the call to getWorkPath using reflection.
lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroOutputFormatBase.java
Outdated
Show resolved
Hide resolved
@martin-g You approved this a week ago. Is there something more that should happend before this can be merged? |
I'm not using Avro Java SDK since a while, so I prefer if an active Java user/developer approves it too. |
Understandable. Should I ask for one of the mailing lists or somewhere else? Or just wait and in due time some one will give a review? |
1.11.1 is on the horizon. I guess @RyanSkraba will take a look soon! |
@RyanSkraba Any update on when or if you will have time to have a look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like this change; code LGTM. as i am not a committer to this project i can't give a final vote on it
Hello! My apologies, I'm getting back to Avro after a pretty constrained month. This looks important for 1.11.1, which should be coming out within the month! I've set the fix version in the JIRA, so you can be sure that this PR will get some attention. |
Looks reasonable good to me as well, but personally I prefer PR #1431 because it doesn't use reflection (there's no need; the On the other hand: I don't use Hadoop, so maybe I'm missing something. |
As discussed in #1431 that approach does not work for hadoop 2.x and since hadoop 2 is still not EOL (Actually made a release last week: https://hadoop.apache.org/release/2.10.2.html) this approach was suggest as a way to be compatible with both versions. |
it pretty much is actually. that is just a CVE fix. it is otherwise frozen for all bug reports, and we don't do anything about transient CVEs from dependent libraries we can't upgrade. as it is required to be java7 compatible, all its dependencies need to be java7, which means they are all pretty frozen. doesn't work on java > 8 either. whereas hadoop 3 is java 8 guaranteed and on branch 3.3.x, java 11 compatible. trunk we are moving to java 11 only, first in builds, then source. please, give up on hadoop 2. |
I'll just pass this by the mailing list to give anyone a change to object strongly! But I think your argument is convincing. If we stop supporting Hadoop 2, then the other PR is the one to pick! But thanks so much for doing the work for this implementation... |
Thanks. i will have some even better news for performance when working with hadoop 3.2 and object storage soon; expect some draft PRs what would be good there is any pointer to some realistic micro benchmark code for listing/reading/writing avro data, even if it is only against the local FS...I could take that and make it retargetable then do some before/after performance measurements Anything I can look at? |
Here is the discussion on the mailing list.
A good starting point for perf (in Java) is https://github.com/apache/avro/blob/master/lang/java/perf/README.md -- if you think the retargeted code that you write could be reusable, please feel free to create a JIRA! |
thanks for the update. moving up the later the hadoop version the better, at least for me. i am also starting work on a shim library to make it easier for libraries to compile against hadoop 3.2.x but get access to features in 3.3.x (openFile()) and soon to come (vectored IO; relevant for parquet, orc and maybe hbase). |
OK -- I'm having some trouble getting all the PRs merged so we can make a release candidate! I don't think many people saw the discussion on the mailing list (and I appreciate the feedback we did get). I think the conservative thing to do is merge this for Thanks so much for your patience, let's get this in 1.11.1! |
* AVRO-3266: Work with hadoop 3.x PathOutputComitters In hadoop 3.x the abstract class PathOutputCommitter defines the method `getWorkPath()`, but in hadoop 2.x it only defined on FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x and support committers that does only implements PathOutputComitter and not FileOutputCommitter we make the call to getWorkPath using reflection. * AVRO-3266: Add committer class name to exception * Add comment on better fix when hadoop2 is dropped Co-authored-by: Ryan Skraba <ryan@skraba.com>
In hadoop 3.x the abstract class PathOutputCommitter defines the method
getWorkPath()
, but in hadoop 2.x it only defined on FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x and support committers that only implements PathOutputComitter and not FileOutputCommitter we make the call to getWorkPath using reflection.The implementation is based on the suggestions in #1431 but as that PR has no updates from the author in 3 months I create a new PR.
This fixes AVRO-3266
Tests
Is covered by existing tests.
Documentation
No new public methods.