DELETED manifest entries retain original snapshot_id instead of the deleting snapshot's ID #3236

@lawofcycles

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Description

When PyIceberg performs a Copy-on-Write delete (table.delete()) or overwrite, the DELETED manifest entries written to the new manifest retain the snapshot_id of the snapshot that originally added the file, instead of the snapshot that is deleting it.

According to the Iceberg spec (Manifest Entry Fields):

snapshot_id: Snapshot id where the file was added, or deleted if status is 2 (deleted).

For entries with status=2 (DELETED), the snapshot_id should be set to the current (deleting) snapshot's ID.
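
To make the rule concrete, here is a minimal sketch of which snapshot_id a writer would be expected to stamp on an entry, based on my reading of the spec text quoted above. The names `Status` and `snapshot_id_to_write` are hypothetical and only for illustration; this is not PyIceberg's actual code path.

```python
from enum import IntEnum


class Status(IntEnum):
    """Manifest entry status values as defined in the Iceberg spec."""
    EXISTING = 0
    ADDED = 1
    DELETED = 2


def snapshot_id_to_write(status: Status, original_snapshot_id: int,
                         committing_snapshot_id: int) -> int:
    """Return the snapshot_id an entry should carry in a newly written manifest.

    Hypothetical helper: EXISTING entries keep the ID of the snapshot that
    added the file; ADDED and DELETED entries take the committing snapshot's ID.
    """
    if status == Status.EXISTING:
        return original_snapshot_id
    return committing_snapshot_id


# IDs taken from the manifest dump in this report.
insert_snapshot_id = 2824688255802688948
delete_snapshot_id = 1582763037918157070
expected = snapshot_id_to_write(Status.DELETED, insert_snapshot_id, delete_snapshot_id)
```

With the IDs from this report, `expected` is the deleting snapshot's ID, whereas the manifest PyIceberg wrote carries `insert_snapshot_id`.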

Impact

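The snapshot_id field on manifest entries is part of the Iceberg spec, so any implementation that relies on it for DELETED entries will produce incorrect results when reading manifests written by PyIceberg. For example, Iceberg Java's IncrementalChangelogScan keeps only the manifest entries whose snapshot_id is in the changelog snapshot set, so a DELETED entry that still carries the adding snapshot's ID is dropped whenever that snapshot lies outside the scanned range: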

// BaseIncrementalChangelogScan.java
.filterManifestEntries(entry -> changelogSnapshotIds.contains(entry.snapshotId()))
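
As a rough illustration of the effect, the same membership filter can be modeled in a few lines of Python. This is a simplified model of the Java predicate above, not the actual scan implementation; the entry dicts are hand-written from the PyIceberg manifest dump in this report, and the scanned range is assumed to contain only the delete snapshot.

```python
# Changelog scan over the delete snapshot only (assumed range).
changelog_snapshot_ids = {1582763037918157070}

# Entries as PyIceberg wrote them for the delete snapshot.
entries = [
    {"status": "ADDED", "snapshot_id": 1582763037918157070, "file": "<new>.parquet"},
    {"status": "DELETED", "snapshot_id": 2824688255802688948, "file": "<old>.parquet"},
]

# Same predicate as the Java filter: keep entries whose snapshot_id
# belongs to the changelog snapshot set.
visible = [e for e in entries if e["snapshot_id"] in changelog_snapshot_ids]
visible_statuses = [e["status"] for e in visible]
```

Only the ADDED entry survives the filter. The DELETED entry is dropped because it points at the insert snapshot, so in this model the removal of the old file would not be reported.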

Reproduction

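import pyarrow as pa
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# `catalog` is assumed to be an already-loaded PyIceberg catalog
# with a `default` namespace.
schema = Schema(
    NestedField(1, "id", StringType(), required=True),
    NestedField(2, "value", LongType()),
)
table = catalog.create_table("default.bug_repro", schema=schema)

# Step 1: INSERT (writes one data file with 10 rows)
df = pa.table({"id": [f"row-{i}" for i in range(10)], "value": list(range(10))},
              schema=table.schema().as_arrow())
table.append(df)
insert_snap = table.current_snapshot()

# Step 2: DELETE (CoW overwrite): row-5..row-9 match, so the data file
# is rewritten with the 5 remaining rows and the old file is marked DELETED
table.delete(delete_filter="id >= 'row-5'")
delete_snap = table.current_snapshot()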

Manifest comparison: PyIceberg vs Spark

To confirm this is a PyIceberg-specific issue, I compared the raw manifest Avro files produced by PyIceberg and Spark for equivalent overwrite operations using fastavro.

PyIceberg (DELETE snapshot_id=1582763037918157070, deleting a file originally added by snapshot 2824688255802688948):

# manifest <uuid>-m0.avro (new file)
[0] status=ADDED   snapshot_id=1582763037918157070  records=5   file=<new>.parquet

# manifest <uuid>-m1.avro (deleted file)
[0] status=DELETED  snapshot_id=2824688255802688948  records=10  file=<old>.parquet
                               ^^^^^^^^^^^^^^^^^^
                               Points to the original INSERT snapshot.
                               Should be 1582763037918157070 (the deleting snapshot).

Spark (overwrite snapshot_id=3532553658031297701):

# manifest <uuid>-m0.avro (deleted files)
[0] status=DELETED  snapshot_id=3532553658031297701  records=70495  file=...-00001.parquet
[1] status=DELETED  snapshot_id=3532553658031297701  records=73361  file=...-00002.parquet
[2] status=DELETED  snapshot_id=3532553658031297701  records=80466  file=...-00003.parquet
...

# manifest <uuid>-m2.avro (new files)
[0] status=ADDED   snapshot_id=3532553658031297701  records=81013  file=...-00001.parquet
...
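
A check along these lines could be scripted over the raw manifest records. The sketch below assumes records shaped the way the manifest Avro schema defines them (integer `status`, `snapshot_id`, and a nested `data_file` with `file_path`); `read_manifest` and `stale_deleted_entries` are hypothetical helper names, and the sample records are hand-written from the PyIceberg dump above rather than read from a real file.

```python
DELETED = 2  # manifest entry status value for deleted files


def read_manifest(path):
    """Load all records from a local manifest Avro file (requires fastavro)."""
    import fastavro  # third-party; imported lazily so the check below has no dependency

    with open(path, "rb") as fo:
        return list(fastavro.reader(fo))


def stale_deleted_entries(records, deleting_snapshot_id):
    """Return DELETED records whose snapshot_id is not the deleting snapshot's ID."""
    return [
        r for r in records
        if r["status"] == DELETED and r["snapshot_id"] != deleting_snapshot_id
    ]


# Hand-written records mirroring the PyIceberg manifest shown above.
sample = [
    {"status": 1, "snapshot_id": 1582763037918157070,
     "data_file": {"file_path": "<new>.parquet", "record_count": 5}},
    {"status": 2, "snapshot_id": 2824688255802688948,
     "data_file": {"file_path": "<old>.parquet", "record_count": 10}},
]
stale = stale_deleted_entries(sample, deleting_snapshot_id=1582763037918157070)
stale_paths = [r["data_file"]["file_path"] for r in stale]
```

For the sample above, `stale_paths` contains only the old data file. Applied to the Spark manifests, where every DELETED entry carries the overwrite snapshot's ID, the same check would return an empty list.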

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
