
FEATURE: Deduplicate S3 uploads in backups using hardlinks #37261

Merged: ducks merged 1 commit into main from feature/backup-dedup-hardlinks on Jan 22, 2026


ducks commented Jan 22, 2026

When generating a backup for a site whose uploads are stored on S3, files with identical content (same original_sha1) are now downloaded once and hardlinked into each duplicate path, instead of being downloaded separately for every path.

This addresses the backup size explosion caused by secure uploads duplicating files across different security contexts. For example, a site with 432 GB of uploads but only 26 GB of unique content will now generate a ~26 GB backup instead of a ~432 GB one.

How it works:

  • Group uploads by original_sha1 (the real content hash)
  • Download only the first file for each unique hash
  • Create hardlinks for remaining paths with the same content
  • tar preserves hardlinks, storing data only once
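The four steps above can be sketched outside of the actual backup code. The following is a minimal Python illustration, not the implementation in this PR: the function names, the `(relative_path, sha1)` input shape, and the `download` callback are all hypothetical, and Python's `tarfile` module stands in for the `tar` command to show that a hardlinked path is archived as a link entry rather than as a second copy of the data.

```python
import os
import tarfile
import tempfile
from collections import defaultdict
from typing import Callable, Iterable


def materialize_deduplicated(
    uploads: Iterable[tuple[str, str]],
    download: Callable[[str, str], None],
    dest_root: str,
) -> int:
    """Write every (relative_path, sha1) pair under dest_root.

    Hypothetical helper: returns how many downloads were performed.
    """
    # Step 1: group paths by content hash.
    paths_by_sha1: dict[str, list[str]] = defaultdict(list)
    for rel_path, sha1 in uploads:
        paths_by_sha1[sha1].append(rel_path)

    downloads = 0
    for rel_paths in paths_by_sha1.values():
        first, *duplicates = (os.path.join(dest_root, p) for p in rel_paths)
        # Step 2: download only the first path for this hash.
        os.makedirs(os.path.dirname(first), exist_ok=True)
        download(rel_paths[0], first)
        downloads += 1
        # Step 3: hardlink every other path to the downloaded file.
        for dup in duplicates:
            os.makedirs(os.path.dirname(dup), exist_ok=True)
            os.link(first, dup)
    return downloads


def demo() -> tuple[int, list[str]]:
    """Build a tiny tree, archive it, and report which members are hardlinks."""

    def fake_download(rel_path: str, dest: str) -> None:
        # Stand-in for an S3 download (assumption for this sketch).
        with open(dest, "wb") as f:
            f.write(b"x" * 1024)

    uploads = [("a/one.png", "abc"), ("b/two.png", "abc"), ("c/three.png", "def")]
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "uploads")
        downloads = materialize_deduplicated(uploads, fake_download, root)
        archive = os.path.join(tmp, "backup.tar")
        # Step 4: the archiver stores the shared data once and records a link.
        with tarfile.open(archive, "w") as tar:
            tar.add(root, arcname="uploads")
        with tarfile.open(archive) as tar:
            links = [m.name for m in tar.getmembers() if m.islnk()]
    return downloads, links
```

In this sketch, three paths with two distinct hashes lead to two downloads, and the second path sharing a hash shows up in the archive as a hardlink member.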

Fallback: If hardlinking fails (for example, because the paths are on different filesystems or permissions forbid it), the file is downloaded normally so the backup still completes.
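The fallback can be pictured as a try-link-then-download wrapper. This is a hedged Python sketch under assumed names (`link_or_download` and the `download` callback are hypothetical, not taken from the PR); it relies only on the fact that `os.link` raises `OSError` (including `PermissionError` and the cross-device error) when a hardlink cannot be created.

```python
import os
import tempfile
from typing import Callable


def link_or_download(
    source_path: str,
    dest_path: str,
    download: Callable[[str], None],
) -> str:
    """Try to hardlink dest_path to source_path; download on any OS error."""
    try:
        os.link(source_path, dest_path)
        return "hardlinked"
    except OSError:
        # Cross-filesystem links, permission problems, or a missing source
        # all land here; fetch the file normally so the backup completes.
        download(dest_path)
        return "downloaded"


def demo_fallback() -> tuple[str, str]:
    """Exercise both the hardlink path and the fallback path."""

    def fake_download(dest: str) -> None:
        # Stand-in for an S3 download (assumption for this sketch).
        with open(dest, "wb") as f:
            f.write(b"data")

    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.bin")
        fake_download(source)
        ok = link_or_download(source, os.path.join(tmp, "copy1.bin"), fake_download)
        # A source that does not exist forces the fallback branch.
        missing = os.path.join(tmp, "missing.bin")
        fell_back = link_or_download(missing, os.path.join(tmp, "copy2.bin"), fake_download)
    return ok, fell_back
```

Catching the broad `OSError` here is a deliberate choice for the sketch: any failure to link degrades to the pre-existing download behavior rather than aborting the backup.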

@ducks ducks force-pushed the feature/backup-dedup-hardlinks branch from 587950f to 083fb7e Compare January 22, 2026 05:40
@ducks ducks merged commit 2bfeb87 into main Jan 22, 2026
15 checks passed
@ducks ducks deleted the feature/backup-dedup-hardlinks branch January 22, 2026 17:56
