Remove deleted data files with expire_snapshots #2604

@Anton-Tarazi

Feature Request / Improvement

Running an expire snapshots operation only rewrites the metadata file without the expired snapshots (and their refs/statistics). It does not delete data files that are marked as deleted and referenced only by the expired snapshots. This can be observed by deleting all rows in a table and calling expire_snapshots: the data files still exist. Trino and Spark both clean up deleted data files once all snapshots referencing them are expired.

From the spec:

When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID 
in which the file was deleted and status 2 (deleted). The file may be deleted from the file system 
when the snapshot in which it was deleted is garbage collected, assuming that older snapshots 
have also been garbage collected [1].
...
1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is 
garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It 
is easier to track what files are deleted in a snapshot and delete them when that snapshot expires. 
It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge 
cases where incremental deletes can break table snapshots.
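
To make the rule quoted above concrete, here is a minimal sketch of the selection logic on a simplified in-memory model. It does not use the PyIceberg API: `Entry`, `Status`, and `files_safe_to_delete` are hypothetical names introduced only for illustration, and each snapshot is assumed to expose a flat list of its manifest entries. Only the status codes (0 existing, 1 added, 2 deleted) come from the spec.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Manifest entry status codes as defined in the Iceberg spec.
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:
    # Hypothetical, simplified stand-in for a manifest entry.
    status: Status
    snapshot_id: int  # snapshot in which the file was added or deleted
    file_path: str


def files_safe_to_delete(
    entries_by_snapshot: dict[int, list[Entry]], expired: set[int]
) -> set[str]:
    """Return paths marked deleted in expired snapshots and not live in any retained one."""
    retained = entries_by_snapshot.keys() - expired
    # Files that a retained snapshot still treats as live data must be kept.
    still_live = {
        e.file_path
        for sid in retained
        for e in entries_by_snapshot[sid]
        if e.status != Status.DELETED
    }
    # Candidates are files recorded with status 2 (deleted) in an expired snapshot.
    candidates = {
        e.file_path
        for sid in expired
        for e in entries_by_snapshot[sid]
        if e.status == Status.DELETED
    }
    return candidates - still_live


# Snapshot 1 adds a and b, snapshot 2 deletes b, snapshot 3 deletes a and adds c.
history = {
    1: [Entry(Status.ADDED, 1, "a.parquet"), Entry(Status.ADDED, 1, "b.parquet")],
    2: [Entry(Status.EXISTING, 1, "a.parquet"), Entry(Status.DELETED, 2, "b.parquet")],
    3: [Entry(Status.DELETED, 3, "a.parquet"), Entry(Status.ADDED, 3, "c.parquet")],
}

print(files_safe_to_delete(history, expired={1, 2}))
print(files_safe_to_delete(history, expired={2}))
```

With snapshots 1 and 2 expired, only `b.parquet` qualifies, because `a.parquet` was deleted in snapshot 3, which is still retained. Expiring snapshot 2 alone frees nothing, because the older snapshot 1 still holds `b.parquet` as live data, which matches the "assuming that older snapshots have also been garbage collected" condition in the spec. A real implementation would have to read manifest lists and manifests from the table's storage rather than a dictionary.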

Happy to work on this if others agree that it should be added :)
