Open
Description
Feature Request / Improvement
Running an expire snapshots operation only rewrites the metadata file without the expired snapshots (and their refs/statistics). It does not delete data files that were removed from the table and are referenced only by the expired snapshots. This can be observed by deleting an entire table's contents and then calling expire_snapshots: the data files still exist. Trino and Spark both clean up deleted data files once all snapshots referencing them are expired.
From the spec:
When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID
in which the file was deleted and status 2 (deleted). The file may be deleted from the file system
when the snapshot in which it was deleted is garbage collected, assuming that older snapshots
have also been garbage collected [1].
...
1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is
garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It
is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge
cases where incremental deletes can break table snapshots.
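To make the rule from the spec concrete, here is a rough sketch of how the cleanup candidates could be computed. It is not this project's API: `Status`, `ManifestEntry`, and `files_safe_to_delete` are hypothetical names, and manifests are modelled as plain in-memory lists keyed by snapshot ID. It follows the simpler approach from the footnote (collect files marked deleted in an expired snapshot), plus a conservative check that no retained snapshot still lists the file as live.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Manifest entry status codes as described in the spec excerpt above.
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical, simplified stand-in for a real manifest entry.
    status: Status
    snapshot_id: int  # snapshot in which the file was added or deleted
    file_path: str


def files_safe_to_delete(expired, retained):
    """Return paths of data files that could be physically removed.

    `expired` and `retained` map snapshot ID -> list of ManifestEntry
    reachable from that snapshot (an assumption of this sketch).
    """
    expired_ids = set(expired)

    # Files whose deletion was recorded by a snapshot that is being expired.
    candidates = {
        entry.file_path
        for entries in expired.values()
        for entry in entries
        if entry.status == Status.DELETED and entry.snapshot_id in expired_ids
    }

    # Conservative guard: never remove a file a retained snapshot still reads.
    still_live = {
        entry.file_path
        for entries in retained.values()
        for entry in entries
        if entry.status != Status.DELETED
    }
    return candidates - still_live


# Snapshot 1 adds a and b, snapshot 2 deletes a, snapshot 3 deletes b.
s1 = [ManifestEntry(Status.ADDED, 1, "a.parquet"),
      ManifestEntry(Status.ADDED, 1, "b.parquet")]
s2 = [ManifestEntry(Status.DELETED, 2, "a.parquet"),
      ManifestEntry(Status.EXISTING, 1, "b.parquet")]
s3 = [ManifestEntry(Status.DELETED, 3, "b.parquet")]

# Expiring snapshots 1 and 2 while keeping 3 frees only a.parquet.
removable = files_safe_to_delete({1: s1, 2: s2}, {3: s3})
```

In this toy example `b.parquet` is kept because the snapshot that deleted it (3) is still retained. A real implementation would read the entries from manifest files and delete through the table's file IO, which this sketch leaves out.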
Happy to work on this if others agree that it should be added :)