Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3266: Cast to PathOutputCommitter #1431

Closed
wants to merge 2 commits into from
Closed

Conversation

mjwiq
Copy link

@mjwiq mjwiq commented Dec 14, 2021

Change the cast of the outputCommitter to a PathOutputCommitter.
This allows the system to use a different outputCommitter than just the
FileOutputCommitter, for example the MagicS3GuardCommitter.

Fixes Output stream incompatible with MagicS3GuardCommitter

Change the cast of the outputCommitter to a PathOutputCommitter.
This allows the system to use a different outputCommitter than just the
FileOutputCommitter, for example the MagicS3GuardCommitter.
@github-actions github-actions bot added the Java Pull Requests for Java binding label Dec 14, 2021
@martin-g
Copy link
Member

LGTM!

@RyanSkraba
Copy link
Contributor

Hello! It looks like PathOutputCommitter is a Hadoop 3.0.0 addition. Is there an EOL for Hadoop 2.x ?

Copy link
Contributor

@RyanSkraba RyanSkraba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! This does end up failing when compiled with the -Phadoop2 profile. I don't think Hadoop 2.x is EOL yet, so we may need to use a different tactic.

I imagine that we could either generate two different artifact profiles for Apache Hadoop 2.x and 3.x, but it might be simpler to make the call via reflection, or behind an "instanceof". If I understand correctly, this method isn't called frequently (once per split, not once per record).

If you can think of a way to support both versions in the same artifact, in a maintainable and readable way, I'll be suitably impressed!

Copy link

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. you will also need it for the fast and safe abfs/gcs committer coming in MAPREDUCE-7341

@steveloughran
Copy link

are you using the new methods in PathOutputCommitter above what OutputCommitter uses? if so, you could just use reflection to invoke them. this will work with old releases and FileOutputCommitter. as well as new code with custom committers for s3a, abfs and gcs

@eejbyfeldt
Copy link
Contributor

I also interested in fixing this, so I create #1618 using reflection to invoke getWorkPath as suggested here. I create a new PR as it seems like this PR has gone stale.

@steveloughran
Copy link

avro has just upgraded its hadoop dependency to 3.3.3 via dependabot. not sure if that was intentional. if java 11 is wanted, it'd have to be 3.2.0+ anyway

@martin-g
Copy link
Member

avro has just upgraded its hadoop dependency to 3.3.3

@steveloughran Do you mean #1697 ?
This is just a bump from 3.3.2 to 3.3.3. I hope Hadoop didn't introduce anything concerning in a minor version.
Avro uses Hadoop 3.2.0 since at least 2020 (Avro 1.9.x, I believe).

But there is also hadoop-2 profile -

<id>hadoop2</id>

I guess this one could be bumped to 2.10.2 before releasing 1.11.1.

@steveloughran
Copy link

Sorry, I've got confused and am now confusing everyone else; been looking at the build and here is my understanding now

  1. Avro builds against hadoop 3.3.3 since Bump hadoop-client from 3.3.2 to 3.3.3 in /lang/java #1697
  2. and it uses java 8 everywhere.
  3. but it also has a branch-2 profile for "tests to validate backwards compatibility with Hadoop 2", and it is that which is keeping things from using the hadoop 3.x apis

I think that it is time to turn off the hadoop 2 support. Because if avro builds are java8+ anyway, they aren't that likely to run on those hadoop 2.x clusters still in production.

I know this as whenever I have to produce something to work on any cluster of that era -I have to use a jdk7 JDK to make the release. Do that and I will provide the shim library to get at the maximum performance IO operations available against HDFS and cloud storage, when present.

@RyanSkraba
Copy link
Contributor

See AVRO-6535 -- we should revisit this implementation as we drop Hadoop 2 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
5 participants