Skip to content

Add file stats range optimizations for DeleteFileIndex#1338

Merged
aokolnychyi merged 11 commits intoapache:masterfrom
rdblue:delete-index-match-path-stats
Aug 18, 2020
Merged

Add file stats range optimizations for DeleteFileIndex#1338
aokolnychyi merged 11 commits intoapache:masterfrom
rdblue:delete-index-match-path-stats

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Aug 13, 2020

This adds two optimizations for DeleteFileIndex:

  • For position deletes, if the data file path is not in the range of values for a delete file's file_path column, ignore the delete file
  • For equality deletes, if the data file and delete file value ranges for any equality column do not overlap, ignore the delete file

@rdblue rdblue force-pushed the delete-index-match-path-stats branch from e9d0806 to 85128bb Compare August 13, 2020 22:03
@rdblue rdblue marked this pull request as draft August 13, 2020 22:04
@rdblue rdblue marked this pull request as ready for review August 17, 2020 22:19

boolean dropStats = ManifestReader.dropStats(dataFilter, columns);
if (!deleteFiles.isEmpty()) {
select(Streams.concat(columns.stream(), ManifestReader.STATS_COLUMNS.stream()).collect(Collectors.toList()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed to ensure the stats columns are projected for data files when there are delete files, even if the stats columns were not requested by the caller.

private final Iterator<T> items;
private boolean closed;
private boolean hasNext;
private boolean nextReady;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the filter with reused containers, like GenericRecord. The advance call in next would replace values in a reused row, which would in effect return the next matching row.

Preconditions.checkState(createWriterFunc != null,
"Cannot create delete file with deletes rows unless createWriterFunc is set");

if (rowSchema != null && createWriterFunc != null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for createWriterFunc here is needed because forTable sets the row schema. So rows can be included in position deletes when the writer func is added and when there is a row schema.

Copy link
Member

@openinx openinx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch looks good to me overall, just left several comments.

this.keyMetadata = toCopy.keyMetadata == null ? null : Arrays.copyOf(toCopy.keyMetadata, toCopy.keyMetadata.length);
this.splitOffsets = toCopy.splitOffsets == null ? null :
Arrays.copyOf(toCopy.splitOffsets, toCopy.splitOffsets.length);
this.equalityIds = toCopy.equalityIds != null ? Arrays.copyOf(toCopy.equalityIds, toCopy.equalityIds.length) : null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

Preconditions.checkState(createWriterFunc != null,
"Cannot create delete file with deletes rows unless createWriterFunc is set");

if (rowSchema != null && createWriterFunc != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this ? IMO, if rowSchema is not null and createWriteFunc is null, we should throw exception, rather than going to the delete files without row path ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using forTable sets the row schema automatically, so we can't determine that the user intended to write rows that way.

@aokolnychyi
Copy link
Contributor

Looks good to me too, some minor comments.

@aokolnychyi aokolnychyi merged commit c8d9c85 into apache:master Aug 18, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants