Skip to content

feat(update): add MergingSnapshotUpdate#682

Open
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:merging-snapshot-update
Open

feat(update): add MergingSnapshotUpdate#682
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:merging-snapshot-update

Conversation

@gty404
Copy link
Copy Markdown
Contributor

@gty404 gty404 commented May 26, 2026

Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.

@gty404 gty404 force-pushed the merging-snapshot-update branch 6 times, most recently from e950bed to 0eebdbf Compare May 26, 2026 12:39
Abstract base for merge-based snapshot operations (MergeAppend,
OverwriteFiles, RowDelta, etc.), implementing the filter → write →
merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a
use-after-free bug in SnapshotUpdate::DeleteFile.
@gty404 gty404 force-pushed the merging-snapshot-update branch from 0eebdbf to 3085e61 Compare May 26, 2026 15:40
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java parity review notes.

return InvalidArgument("Cannot append manifest with deleted files: {}",
manifest.manifest_path);
}
if (manifest.added_snapshot_id != kInvalidSnapshotId) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java copies append manifests that already have a snapshot id (or assigned first_row_id) instead of rejecting them. This should go through CopyManifest for parity.


// Step 1: Filter data manifests.
ICEBERG_ASSIGN_OR_RAISE(auto filtered_data,
data_filter_manager_.FilterManifests(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java uses SnapshotUtil.schemaFor(base, targetBranch()) here. Using metadata.Schema() underneath can evaluate filters with the wrong schema when committing to a branch behind current.

metadata_to_update, snapshot, tracked_factory));

// Track deleted data files in the summary builder.
for (const auto& file : data_filter_manager_.FilesToBeDeleted()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This counts requested deletes, not actual filtered deletes. Java builds the summary from deleted entries in filtered manifests, so missing optional deletes should not affect totals.

}

ICEBERG_ASSIGN_OR_RAISE(auto schema, metadata.Schema());
ICEBERG_ASSIGN_OR_RAISE(auto ancestors,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java validationHistory fails if it cannot reach startingSnapshotId. These helpers should not silently validate a truncated ancestor range.

return !delete_paths_.empty();
// Also open delete manifests when a minimum sequence number is set for cleanup.
return !delete_paths_.empty() || !removed_data_file_paths_.empty() ||
(manifest_content_ == ManifestContent::kDeletes && min_sequence_number_ > 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java does not open delete manifests solely because minSequenceNumber is set; old deletes are dropped only when a delete manifest is already being filtered.

ICEBERG_ASSIGN_OR_RAISE(
auto merged, FlushBin(bin, snapshot_id, metadata, file_io, writer_factory));
// Each manifest consumed into the merged output (beyond the 1 output) is replaced.
replaced_manifests_count_ += static_cast<int32_t>(bin.size()) - 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java counts each previous-snapshot manifest consumed by the merge as replaced, not bin.size() - 1. This can undercount old-only bins and overcount current-snapshot inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants