Integrated blob garbage collection: relocate blobs #7694

Closed
ltamasi wants to merge 19 commits

Conversation

ltamasi (Contributor) commented Nov 18, 2020

Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in #7661.
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_garbage_collection_age_cutoff` set to `1.0`.
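
For illustration, here is a minimal sketch of what such a re-inlining pass could look like with the public API. The helper name `ReinlineAllBlobs` is mine, and the sketch assumes these blob options can be changed at runtime via `SetOptions` (they can alternatively be set when opening the DB):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: force all blob values back into the LSM tree via a full manual
// compaction. `db` and `cf` are assumed to be an open DB and the target
// column family; error handling is minimal for brevity.
rocksdb::Status ReinlineAllBlobs(rocksdb::DB* db,
                                 rocksdb::ColumnFamilyHandle* cf) {
  // Stop writing new blob files, but keep garbage collection enabled and
  // make every existing blob file eligible for relocation.
  rocksdb::Status s =
      db->SetOptions(cf, {{"enable_blob_files", "false"},
                          {"enable_blob_garbage_collection", "true"},
                          {"blob_garbage_collection_age_cutoff", "1.0"}});
  if (!s.ok()) {
    return s;
  }

  // Full manual compaction: relocated blobs no longer qualify for extraction
  // (blob files are disabled), so their values are written back into the SSTs.
  rocksdb::CompactRangeOptions cro;
  return db->CompactRange(cro, cf, /*begin=*/nullptr, /*end=*/nullptr);
}
```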

Some TODOs that I plan to address in separate PRs:

  1. We'll have to measure the amount of new garbage in each blob file and log
     `BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
     (For the time being, blob files are cleaned up solely based on the
     `oldest_blob_file_number` relationships.)
  2. When compression is used for blobs, the compression type hasn't changed,
     and the blob still qualifies for being written to a blob file, we can simply copy
     the compressed blob to the new file instead of going through decompression
     and compression.
  3. We need to update the formula for computing write amplification to account
     for the amount of data read from blob files as part of GC.

Test Plan:
make check

facebook-github-bot (Contributor) left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

riversand963 (Contributor) left a comment

LGTM with a few minor comments. Thanks @ltamasi for the PR.

void PrepareOutput();

bool ExtractLargeValueImpl();

riversand963 (Contributor):
Maybe it's better to put some comment on these methods?
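
For illustration only, the kind of documentation being asked for might read roughly as follows; the wording (and the assumed return-value semantics) are mine, not the comments that ended up in the code:

```cpp
// Finalizes the value of the key-value pair about to be emitted, e.g. by
// deciding whether a large value should be extracted to a blob file or an
// existing blob should be relocated as part of garbage collection.
void PrepareOutput();

// Writes the current large value to a blob file and replaces it with a blob
// index; the return value is assumed to indicate whether processing of the
// current key-value pair can continue.
bool ExtractLargeValueImpl();
```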

std::advance(
it, compaction->blob_garbage_collection_age_cutoff() * blob_files.size());

return it != blob_files.end() ? it->first

riversand963 (Contributor):
Just curious: if !blob_files.empty() and it == blob_files.end(), will returning blob_files.back()->first + 1 also be correct?

ltamasi (Author):
Yes, that would also work.
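
To make this exchange easier to follow, here is a simplified, self-contained stand-in for the cutoff computation being discussed (not the actual RocksDB code; in particular, the sentinel returned when the cutoff covers every file is my choice for illustration):

```cpp
#include <cstdint>
#include <iterator>
#include <limits>
#include <map>

// Given the blob files of a version keyed by file number, and the configured
// age cutoff in [0.0, 1.0], return the first blob file number that is NOT
// eligible for garbage collection. Blobs living in files with a smaller
// number get relocated when the compaction encounters them.
uint64_t ComputeGcCutoffFileNumber(
    const std::map<uint64_t, uint64_t>& blob_files /* number -> size */,
    double age_cutoff) {
  auto it = blob_files.begin();
  // The oldest age_cutoff fraction of the files falls below the cutoff.
  std::advance(it, static_cast<size_t>(age_cutoff * blob_files.size()));

  // If the cutoff covers all files (e.g. age_cutoff == 1.0), any value larger
  // than every existing file number works; as noted above, "largest existing
  // file number + 1" would be equally valid.
  return it != blob_files.end() ? it->first
                                : std::numeric_limits<uint64_t>::max();
}
```

For example, with `age_cutoff == 0.25` and four blob files numbered {4, 7, 9, 12}, the iterator is advanced by one and the function returns 7, so only blobs residing in file 4 are relocated.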

void PrepareOutput();

bool ExtractLargeValueImpl();
void ExtractLargeValue();

riversand963 (Contributor):
Neither ExtractLargeValue() nor ExtractLargeValueImpl() guarantees to extract a large value. Maybe it would be clearer to name them MaybeExtractLargeValueIfNeeded() and MaybeExtractLargeValueIfNeededImpl()?

ltamasi (Author) commented Nov 24, 2020

Thanks so much for the review, @riversand963!

facebook-github-bot (Contributor):
@ltamasi has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot (Contributor) left a comment

@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor):
@ltamasi merged this pull request in 51a8dc6.

codingrhythm pushed a commit to SafetyCulture/rocksdb that referenced this pull request Mar 5, 2021