I've been successfully using bees on my backup server for a long time now. Recently, however, my data grew far enough to make the hash table size suboptimal. I attempted to increase it, and now it seems like I can't use bees anymore, as it is stuck deep in snapshots and seems keen to grow my metadata to uncomfortable amounts.
I turned the service off for now. I don't think any of this is bees's fault, as I brought it all upon myself. But I am wondering whether there's a way to recover (and whether I even should, as bees might simply be a bad fit for my usage scenario).
Background
Some years ago, I built myself a backup server with a single 8 TB drive (I am only mentioning the storage drives; the OS lives elsewhere and is not relevant, as it's not even on btrfs). I figured that, if I rsync my other machines to subvolumes on this server and then create read-only snapshots of those volumes, I'd make great use of btrfs's CoW features. To maximize CoW, I used the `--inplace --no-whole-file -M--no-whole-file` rsync flags, to make sure that, if a file is only partially changed, it will be only partially overwritten.
This seemed to work well enough. I wanted to optimize further, so I enabled compression (`compress-force=zstd:12`) and set up bees. (I also did other things, like using `bcache` with redundancy and a custom kernel with kakra/linux#36, but they don't seem relevant, so I'm not mentioning them here.) When setting up bees, I did some research and accounted for a couple of things:
- While bees can be used with however many snapshots one wants, the snapshots should be created after dedup. Because of this, I made sure to first do the backup rsync, then let bees run, and only then snapshot once bees is done.
- The optimal hash table size can be calculated as `unique_data_bytes / (128 * 1024) * 16` (the number of unique hashed 128 KiB extents times 16 bytes per entry). I figured that 8 TiB of unique data is a reasonable expectation for an 8 TB filesystem (that is expected to grow later on), so I went with a 1 GiB hash table.
- Some files (e.g. VM storage) are pointless to dedup (and, indeed, compress), and doing so would cause insane fragmentation, so I made sure to mark such files `nodatacow`.
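As a sanity check, the sizing rule above can be evaluated directly in the shell (the 8 TiB figure is my expectation, not a measurement):

```shell
# Hash table size = (unique data / 128 KiB extent size) * 16 bytes per entry.
unique_bytes=$((8 * 1024 ** 4))   # 8 TiB of unique data (my assumption)
extent_size=$((128 * 1024))       # bees hashes data in 128 KiB blocks
entry_size=16                     # bytes per hash table entry
hash_table_bytes=$((unique_bytes / extent_size * entry_size))
echo "$hash_table_bytes"          # 1073741824, i.e. exactly 1 GiB
```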
It all went fine and performed about as I expected. Then I grew the filesystem, adding two more 20 TB drives (for a total of 48 TB raw storage and 24 TB usable space in a raid1 configuration) and ran a full balance, as is recommended. Then my amount of data grew steadily as I made more backups, at some point reaching the current 10.5 TiB, and I wondered whether the hash table size was future-proof.
Failure of cognitive function
Despite 10.5 TiB being still perfectly fine and within tolerance for a 1 GiB hash table, I, being the smart and insightful professional that I am, considered the possibility that it will be too costly to grow the hash table once the data reaches, say, 40 TiB. So I decided to grow the hash table to 4 GiB preemptively, as I have the RAM to spare. I looked up how to do it, found issue #70 (and some others, which said basically the same things, I believe), and learned that I have to delete `beeshash.dat` and `beescrawl.dat` and change the DB size in the config.
Of course, I completely failed to understand or consider the implications. Deleting those files basically reset bees and made it crawl and dedup my whole storage from scratch — that is, 100+ snapshots of 3-4 TiB each. I had also completely missed (or forgotten, since I researched this so long ago) that bees will create its own metadata for every snapshot as it dedups.
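For reference, the destructive resize procedure boils down to something like this. This is only a sketch: the `BEESHOME` location and the service name are assumptions that depend on how bees was set up.

```shell
# Wipes ALL bees state and recreates the hash table at a new size; bees will
# then re-crawl the entire filesystem from scratch (which is what bit me).
resize_bees_hash() {
    local beeshome=$1 new_size=$2
    rm -f "$beeshome/beeshash.dat" "$beeshome/beescrawl.dat"
    truncate -s "$new_size" "$beeshome/beeshash.dat"  # beesd normally creates this from the configured DB size
}

# Usage (stop bees first, e.g. systemctl stop beesd@<fs-uuid>.service):
# resize_bees_hash /mnt/backup/.beeshome "$((4 * 1024 ** 3))"   # 4 GiB
```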
I got alarmed after my metadata tripled in size, did some better research, stopped bees, and here we are.
What next?
Accepting the damage (extra metadata) and moving on without bees is the most obvious default choice. My usage scenario probably makes bees less useful than it could be, anyway, as the in-place writes and incremental backups deduplicate things quite a lot as is.
But what if I wanted to keep using bees? I thought a bit and came up with a "replay step-by-step" plan:
1. Making use of LVM features, shrink the existing btrfs filesystem and create a new one beside it on the same set of drives, with the same settings.
2. Replicate the backup subvolume structure (1 subvol per machine, separate nodatacow directories for troublesome files, etc.).
3. For each backed-up machine, rsync the oldest history snapshot into the new filesystem. This should fully defragment the data and minimize metadata.
4. Run bees on the new filesystem. Wait for it to finish dedup.
5. Create snapshots of the deduped subvolumes, replicating the oldest snapshots within the new filesystem.
6. Delete these oldest snapshots from the old filesystem.
7. GOTO 3 until there are no snapshots left in the old filesystem, balancing and shrinking it on the way, as needed, to grow the new one.
8. Delete the old filesystem and replace it with the new one, fully deduped and with a proper new hash table size.
So, basically, I would replicate the steps I took to create each backup in the first place, one by one. This is obviously very involved and will take a very long time, but it seems like the best option I've got (aside from deleting the snapshots and starting anew from the current state).
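To make the loop concrete, here's a dry-run sketch of one iteration for one machine. It only prints the commands it would run; all paths and the subvolume layout are assumptions about my own setup:

```shell
# Replays the oldest remaining snapshot of one machine into the new filesystem.
# Dry run: echoes each command instead of executing it.
replay_oldest() {
    local old=$1 new=$2 machine=$3
    local snap
    snap=$(ls "$old/$machine/snapshots" | sort | head -n 1)  # oldest by name
    echo rsync -aHAX "$old/$machine/snapshots/$snap/" "$new/$machine/live/"
    echo "# ...run bees on $new and wait for it to finish dedup..."
    echo btrfs subvolume snapshot -r "$new/$machine/live" "$new/$machine/snapshots/$snap"
    echo btrfs subvolume delete "$old/$machine/snapshots/$snap"
}
```

Repeating this until the old filesystem is empty, with balance/shrink steps interleaved as needed.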
This can be made potentially faster and more efficient if I segregate the backups. As I don't really need most of this data in versioned history, I could shrink the snapshots quite a lot by storing the non-versioned data separately, in subvolumes without snapshots. Finally, I could even keep the non-versioned data on a separate btrfs filesystem, and run bees only there. Which brings me to...
What's the right way to handle a dynamic (adding/removing/replacing drives) filesystem, with snapshots, with bees?
Going back to #70 and stuff mentioned there (once again, there were other issues I looked through, but the info was more or less the same), it seems like:
- The only current way to resize the hash table is to nuke it. Along with it, one must nuke the crawl state, because otherwise dedup won't be efficient, as it won't be aware of any older data (at least that's how I understand it).
So, the only way to resize the hash table is to basically start bees from scratch, which should not be done if you have a lot (or even a few, really) of huge snapshots.
- Full balance (like one must do when adding/removing drives or replacing faulty ones) invalidates the hash table. Once again, one must drop the crawl state along with it to avoid the same issue (dedup being inefficient because it can't find the older data to reference).
So, a full balance for any reason also basically forces one to start bees from scratch.
Even if the hash table sizing issue can be circumvented (predicting the expected data size and picking a hash table size accordingly is reasonable, especially since there's no hard requirement to keep the optimal 128 KiB dedup extent size), the balance issue seems completely unavoidable with a multi-drive storage system: even if one never adds or removes drives, a drive will fail at some point, requiring a replacement and a full balance.
So does that mean bees shouldn't be used on multi-drive storage with many (and/or huge) snapshots? A balance will be required at some point, which will invalidate the hash table, which will in turn either cripple bees or require restarting from scratch (taking ages and exploding metadata due to snapshots).
Or is there a right way that I'm missing? Maybe this info is just stale and balancing is somehow handled without hash invalidation by now? Or maybe there's some way one can run a one-off "hash table fill" that will read through all extents (rather than following snapshots and transactions) and recover the hash table to a usable state in reasonable time?