Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHOENIX-6721 CSV bulkload tool fails with FileNotFoundException if --… #1450

Closed
wants to merge 1 commit into from

Conversation

ss77892
Copy link
Contributor

@ss77892 ss77892 commented May 31, 2022

…output points to the S3 location

The problem is that in our code we explicitly set the output committer to FileOutputCommitter which doesn't work well with the S3. HBase had a similar problem covered by HBASE-18885. I've tried to duplicate the approach used there, but it comes that S3 committer doesn't extend FileOuputCommitter, so I have to use the base class for both of those - PathOutputCommitter and use getOutputPath instead of getWorkPath. It works in my manual tests with and without AWS use.

@gjacoby126
Copy link
Contributor

@ss77892 - are there valid use cases for using an outputcommitter that's not a PathOutputCommitter? If not, is there somewhere we can put some validation in, rather than getting a ClassCastException?

@steveloughran
Copy link

actually, FileOutputCommitter.getOutputPath should work if all you want is to know the dest path of a job. you shouldn't use it for a real mr/spark job on s3a (performance, correctness) or gs (correctness as even v1 isn't resilient to failures in task commit), but to work out the final destination, it's good.

@steveloughran
Copy link

you got a stack of the FNFE?

throws IOException {
// Get the path of the temporary output file
final Path outputPath = FileOutputFormat.getOutputPath(context);
final Path outputdir = new FileOutputCommitter(outputPath, context).getWorkPath();
final Path outputdir = ((PathOutputCommitter) committer).getOutputPath();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer a h path, it is the final path of the work, the same value as L128 in the original code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ss77892

This indeed looks incorrect.
We should use .getWorkPath() here.

This works for the Magic S3a committer where the work and output path are the same, but for FileCommitter (i.e. HDFS) and others this breaks the commit mechanism by writing into the output directory directory directly.

@steveloughran
Copy link

now, what exactly are you doing with the output committer? just looking for a temporary directory or actually executing task attempts generating different blocks of data and then committing the output at the end?

@stoty
Copy link
Contributor

stoty commented Jan 2, 2024

Fixed version of this committed from #1765

@stoty stoty closed this Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants