[FLINK-25311][core] Fix deal with delimited compressed file not correctly #18299
base: master
Hi @Airblader, I have reworded the commit to fix the issue from #18273.
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit bc77691 (Fri Jan 07 11:30:32 UTC 2022)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
@AHeise could you take a look? It is failing, but on things not related to this PR.
Hi @zentol, could you please help me review the PR?
Hi @fapaul, could you please help me review the PR?
Thank you very much for your patience and finding the bug. Please have a look at my proposed fix.
// compressed format should use splitLength specially
this.splitLength = -1;
This fix looks odd to me. First, we are modifying a parameter, which is always a sign that the change belongs at the call site. Second, at this point I cannot see any guarantee that we do not have two splits on the same file, which would read duplicate data if we simply change splitLength here. Third, this should probably use READ_WHOLE_SPLIT_FLAG.
All in all, the proper place to fix it is in createInputSplits.
The actual bug is in
if (unsplittable) { // should be testForUnsplittable(file)
    int splitNum = 0;
    for (final FileStatus file : files) {
        final FileSystem fs = file.getPath().getFileSystem();
        final BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
        Set<String> hosts = new HashSet<String>();
        for (BlockLocation block : blocks) {
            hosts.addAll(Arrays.asList(block.getHosts()));
        }
        long len = file.getLen();
        if (testForUnsplittable(file)) { // this doesn't make any sense at this point
            len = READ_WHOLE_SPLIT_FLAG;
        }
        FileInputSplit fis =
                new FileInputSplit(
                        splitNum++,
                        file.getPath(),
                        0,
                        len,
                        hosts.toArray(new String[hosts.size()]));
        inputSplits.add(fis);
    }
    return inputSplits.toArray(new FileInputSplit[inputSplits.size()]);
}
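To make the review point concrete, here is a minimal, stand-alone sketch of deciding splittability per file while planning splits. It is not Flink's code: `SplitPlanningSketch`, `Split`, `isUnsplittable`, `planSplits`, and the extension list are hypothetical names chosen for illustration, and the real logic lives in `createInputSplits` and `testForUnsplittable`. The only detail taken from this PR is that a length of -1 means "read the whole split".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of per-file split planning; not Flink's implementation. */
final class SplitPlanningSketch {

    /** Stand-in for READ_WHOLE_SPLIT_FLAG; the value -1 is taken from the PR description. */
    static final long READ_WHOLE_SPLIT_FLAG = -1L;

    /** Minimal stand-in for a file input split: path, start offset, length. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Assumption: a file counts as unsplittable if it has a known compression extension. */
    static boolean isUnsplittable(String path) {
        return path.endsWith(".gz") || path.endsWith(".deflate");
    }

    /** Plans the splits for one file, checking splittability for that file only. */
    static List<Split> planSplits(String path, long fileLen, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        if (isUnsplittable(path)) {
            // Exactly one split per compressed file, so no data is read twice.
            splits.add(new Split(path, 0, READ_WHOLE_SPLIT_FLAG));
            return splits;
        }
        // Splittable file: cut it into chunks of at most maxSplitSize bytes.
        for (long start = 0; start < fileLen; start += maxSplitSize) {
            splits.add(new Split(path, start, Math.min(maxSplitSize, fileLen - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(planSplits("data.csv", 250, 100).size());           // 3
        System.out.println(planSplits("data.csv.gz", 250, 100).get(0).length); // -1
    }
}
```

Because the decision is made where the splits are created, the reader never has to overwrite its own splitLength afterwards, and a compressed file can only ever produce a single split.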
What is the purpose of the change
This pull request fixes the issue where DelimitedInputFormat cannot handle compressed files correctly.
Brief change log
For compressed files, splitLength is set to -1 so that the whole split is read.
Verifying this change
This change added tests and can be verified as follows:
Added a test verifying that compressed files are handled correctly by org.apache.flink.api.common.io.DelimitedInputFormat
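The kind of check described above can be illustrated without Flink. The sketch below is a hypothetical stand-alone example, not the test added in this PR: it uses only `java.util.zip`, and `CompressedDelimitedSketch`, `gzip`, and `readRecords` are names invented here. It shows that delimited records in a compressed file are recovered by reading the decompressed stream to its end, rather than by stopping at a byte count derived from the compressed file's size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration; does not use Flink's DelimitedInputFormat. */
final class CompressedDelimitedSketch {

    /** Gzip-compresses the given text in memory. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses until end of stream, then splits the text on the delimiter. */
    static List<String> readRecords(byte[] compressed, char delimiter) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            // Read to end of stream: the compressed length says nothing about
            // how many decompressed bytes there are.
            while ((n = in.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = new String(plain.toByteArray(), StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == delimiter) {
                records.add(text.substring(start, i));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            records.add(text.substring(start)); // trailing record without a delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("first\nsecond\nthird\n");
        System.out.println(readRecords(compressed, '\n')); // [first, second, third]
    }
}
```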
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation