Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fix skipped file counts in ManifestReader with deleted entries #8432

Merged
merged 1 commit into from
Sep 1, 2023

Conversation

aokolnychyi
Copy link
Contributor

This PR fixes skipped file counts in ManifestReader with deleted entries as discussed here.

@github-actions github-actions bot added the core label Aug 30, 2023
}

private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add hasRowFilter() and hasPartitionFilter() to avoid cyclomatic complexity warnings. I also think it makes it more readable.


return CloseableIterable.filter(
content == FileType.DATA_FILES
? scanMetrics.skippedDataFiles()
: scanMetrics.skippedDeleteFiles(),
open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
onlyLive ? filterLiveEntries(entries) : entries,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deleted entries must be discarded prior to evaluating stats primarily to have correct skipped counts. It should also help the performance.

@@ -215,19 +215,24 @@ public void scanningWithSkippedDataFiles() throws IOException {
Table table =
TestTables.create(
tableDir, tableName, SCHEMA, SPEC, SortOrder.unsorted(), formatVersion, reporter);
table.newAppend().appendFile(FILE_A).appendFile(FILE_D).commit();
table.newAppend().appendFile(FILE_B).appendFile(FILE_C).commit();
table.newAppend().appendFile(FILE_A).appendFile(FILE_B).appendFile(FILE_D).commit();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add FILE_B here so that we have a manifest with one skipped data file as the original manifest will be rewritten with FILE_A as deleted during the overwrite below.

@@ -236,9 +241,9 @@ public void scanningWithSkippedDataFiles() throws IOException {
assertThat(result.resultDeleteFiles().value()).isEqualTo(0);
assertThat(result.scannedDataManifests().value()).isEqualTo(1);
assertThat(result.scannedDeleteManifests().value()).isEqualTo(0);
assertThat(result.skippedDataManifests().value()).isEqualTo(1);
assertThat(result.skippedDataManifests().value()).isEqualTo(2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overwrite rewrites the original manifest and adds a new one for the new data file.

@aokolnychyi
Copy link
Contributor Author

cc @rdblue @nastra

}

private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
Copy link
Contributor

@jerqi jerqi Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to consider the condition?

partitionSet.isEmpty()

If the partitionSet is empty, we seems to filter out all entries. Maybe we can return empty entries directly when the partitionSet is empty. I'm not sure whether we have guaranteed the partitionSet is not empty. Correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe an empty partitionSet indicates there is no match and acts similar to alwaysFalse. We could add a branch for cases when the filters are always false or the matching partition set is empty, but we would have to analyze the usages first to make sure those are possible scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

return entries(true /* only live entries */);
}

private CloseableIterable<ManifestEntry<F>> filterLiveEntries(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: these last two methods could be static.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way so that I don't have to add F extends ContentFile<F> to these static methods and reuse the bound already defined in the class instead. I could use ? in the second one but not here.

@aokolnychyi aokolnychyi merged commit 5996b3d into apache:master Sep 1, 2023
41 checks passed
@aokolnychyi
Copy link
Contributor Author

Thanks, @rdblue @jerqi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants