Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3266: Work with hadoop 3.x PathOutputComitters #1618

Merged
merged 3 commits into from
Jul 1, 2022

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Mar 22, 2022

In hadoop 3.x the abstract class PathOutputCommitter defines the method getWorkPath(), but in hadoop 2.x it only defined on FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x and support committers that only implements PathOutputComitter and not FileOutputCommitter we make the call to getWorkPath using reflection.

The implementation is based on the suggestions in #1431 but as that PR has no updates from the author in 3 months I create a new PR.

This fixes AVRO-3266

Tests

Is covered by existing tests.

Documentation

No new public methods.

In hadoop 3.x the abstract class PathOutputCommitter defines the method
`getWorkPath()`, but in hadoop 2.x it only defined on
FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x
and support committers that does only implements PathOutputComitter and
not FileOutputCommitter we make the call to getWorkPath using
reflection.
@github-actions github-actions bot added the Java Pull Requests for Java binding label Mar 22, 2022
@eejbyfeldt
Copy link
Contributor Author

@martin-g You approved this a week ago. Is there something more that should happend before this can be merged?

@martin-g
Copy link
Member

I'm not using Avro Java SDK since a while, so I prefer if an active Java user/developer approves it too.

@eejbyfeldt
Copy link
Contributor Author

I'm not using Avro Java SDK since a while, so I prefer if an active Java user/developer approves it too.

Understandable. Should I ask for one of the mailing lists or somewhere else? Or just wait and in due time some one will give a review?

@martin-g
Copy link
Member

1.11.1 is on the horizon. I guess @RyanSkraba will take a look soon!

@eejbyfeldt
Copy link
Contributor Author

eejbyfeldt commented Apr 26, 2022

1.11.1 is on the horizon. I guess @RyanSkraba will take a look soon!

@RyanSkraba Any update on when or if you will have time to have a look at this?

Copy link

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this change; code LGTM. as i am not a committer to this project i can't give a final vote on it

@RyanSkraba
Copy link
Contributor

Hello! My apologies, I'm getting back to Avro after a pretty constrained month.

This looks important for 1.11.1, which should be coming out within the month! I've set the fix version in the JIRA, so you can be sure that this PR will get some attention.

@opwvhk
Copy link
Contributor

opwvhk commented Jun 3, 2022

Looks reasonable good to me as well, but personally I prefer PR #1431 because it doesn't use reflection (there's no need; the PathOutputComitter interface is generic enough).

On the other hand: I don't use Hadoop, so maybe I'm missing something.

@eejbyfeldt
Copy link
Contributor Author

Looks reasonable good to me as well, but personally I prefer PR #1431 because it doesn't use reflection (there's no need; the PathOutputComitter interface is generic enough).

On the other hand: I don't use Hadoop, so maybe I'm missing something.

As discussed in #1431 that approach does not work for hadoop 2.x and since hadoop 2 is still not EOL (Actually made a release last week: https://hadoop.apache.org/release/2.10.2.html) this approach was suggest as a way to be compatible with both versions. PathOutputComitter was is only introduced into hadoop 3 in apache/hadoop@11390c2 But if the need to support hadoop 2 has changed since this was originally suggested, then I also agree that the approach in #1431 is better.

@steveloughran
Copy link

hadoop 2 is still not EOL

it pretty much is actually. that is just a CVE fix. it is otherwise frozen for all bug reports, and we don't do anything about transient CVEs from dependent libraries we can't upgrade. as it is required to be java7 compatible, all its dependencies need to be java7, which means they are all pretty frozen.

doesn't work on java > 8 either. whereas hadoop 3 is java 8 guaranteed and on branch 3.3.x, java 11 compatible. trunk we are moving to java 11 only, first in builds, then source.

please, give up on hadoop 2.

@RyanSkraba
Copy link
Contributor

I'll just pass this by the mailing list to give anyone a change to object strongly! But I think your argument is convincing.

If we stop supporting Hadoop 2, then the other PR is the one to pick! But thanks so much for doing the work for this implementation...

@steveloughran
Copy link

Thanks. i will have some even better news for performance when working with hadoop 3.2 and object storage soon; expect some draft PRs

what would be good there is any pointer to some realistic micro benchmark code for listing/reading/writing avro data, even if it is only against the local FS...I could take that and make it retargetable then do some before/after performance measurements

Anything I can look at?

@RyanSkraba
Copy link
Contributor

Here is the discussion on the mailing list.

Anything I can look at?

A good starting point for perf (in Java) is https://github.com/apache/avro/blob/master/lang/java/perf/README.md -- if you think the retargeted code that you write could be reusable, please feel free to create a JIRA!

@steveloughran
Copy link

thanks for the update. moving up the later the hadoop version the better, at least for me. i am also starting work on a shim library to make it easier for libraries to compile against hadoop 3.2.x but get access to features in 3.3.x (openFile()) and soon to come (vectored IO; relevant for parquet, orc and maybe hbase).

@RyanSkraba
Copy link
Contributor

OK -- I'm having some trouble getting all the PRs merged so we can make a release candidate! I don't think many people saw the discussion on the mailing list (and I appreciate the feedback we did get).

I think the conservative thing to do is merge this for 1.11.1 so it's no longer blocking anyone, and remove hadoop2 in the next major version. That way, both the work to remove the hadoop2 tests and profile AND the better simpler solution in #1431 can be applied independently.

Thanks so much for your patience, let's get this in 1.11.1!

@RyanSkraba RyanSkraba merged commit 797cec6 into apache:master Jul 1, 2022
RyanSkraba added a commit that referenced this pull request Jul 1, 2022
* AVRO-3266: Work with hadoop 3.x PathOutputComitters

In hadoop 3.x the abstract class PathOutputCommitter defines the method
`getWorkPath()`, but in hadoop 2.x it only defined on
FileOutputCommitter. So to be compatible with both hadoop 2.x and 3.x
and support committers that does only implements PathOutputComitter and
not FileOutputCommitter we make the call to getWorkPath using
reflection.

* AVRO-3266: Add committer class name to exception

* Add comment on better fix when hadoop2 is dropped

Co-authored-by: Ryan Skraba <ryan@skraba.com>
@eejbyfeldt eejbyfeldt deleted the AVRO-3266-using-relection branch July 4, 2022 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
5 participants