-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHOENIX-6721 CSV bulkload tool fails with FileNotFoundException if --… #1450
Conversation
…output points to the S3 location
@ss77892 - are there valid use cases for using an outputcommitter that's not a PathOutputCommitter? If not, is there somewhere we can put some validation in, rather than getting a ClassCastException? |
actually, FileOutputCommitter.getOutputPath should work if all you want is to know the dest path of a job. you shouldn't use it for a real mr/spark job on s3a (performance, correctness) or gs (correctness as even v1 isn't resilient to failures in task commit), but to work out the final destination, it's good. |
you got a stack of the FNFE? |
throws IOException { | ||
// Get the path of the temporary output file | ||
final Path outputPath = FileOutputFormat.getOutputPath(context); | ||
final Path outputdir = new FileOutputCommitter(outputPath, context).getWorkPath(); | ||
final Path outputdir = ((PathOutputCommitter) committer).getOutputPath(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer a h path, it is the final path of the work, the same value as L128 in the original code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indeed looks incorrect.
We should use .getWorkPath() here.
This works for the Magic S3a committer where the work and output path are the same, but for FileCommitter (i.e. HDFS) and others this breaks the commit mechanism by writing into the output directory directory directly.
now, what exactly are you doing with the output committer? just looking for a temporary directory or actually executing task attempts generating different blocks of data and then committing the output at the end? |
Fixed version of this committed from #1765 |
…output points to the S3 location
The problem is that in our code we explicitly set the output committer to FileOutputCommitter which doesn't work well with the S3. HBase had a similar problem covered by HBASE-18885. I've tried to duplicate the approach used there, but it comes that S3 committer doesn't extend FileOuputCommitter, so I have to use the base class for both of those - PathOutputCommitter and use getOutputPath instead of getWorkPath. It works in my manual tests with and without AWS use.