New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-18130][hive][fs-connector] File name conflict for different jobs in filesystem/hive sink #12485
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 7fe7332 (Thu Jun 04 13:07:13 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
…bs in filesystem/hive sink
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @JingsongLi , I reviewed the usage of OutputFileConfig
in FileSystemTableSink
and HiveTableSink
, and I think it is OK from my side.
destPath = new Path(destDir, name); | ||
count++; | ||
} | ||
fs.rename(srcPath, destPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here seems inconsistent with this method's comment: rename does not delete the dest path if it exists on HDFS. Instead, it will keep the srcPath and dest Path unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is out of date, the overwrite is occurred in overwrite(Path)
method instead of here.
Now it is consistent with HadoopRecoverableFsDataOutputStream.commit
.
I will update comments.
Thanks @gaoyunhaii for review, merging... |
…bs in filesystem/hive sink () This closes #12485
…bs in filesystem/hive sink () This closes apache#12485
What is the purpose of the change
The sink of different jobs (or the same SQL runs multiple times) will produce the same file name, the rename will fail, and different file systems will produce different results:
Brief change log
We can add UUID to prefix, make sure there is no conflict between jobs.
Verifying this change
HiveTableSinkITCase.testBatchAppend
HiveTableSinkITCase.testStreamingAppend
FileSystemITCaseBase.testInsertAppend
FileSystemITCaseBase.testInsertOverwrite
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation