
os/bluestore: implement object content recompression/defragmentation when scrubbing #57631

Open · wants to merge 5 commits into base: main

Conversation


@gardran gardran commented May 22, 2024

This PR adds the ability for BlueStore to perform object content recompression and defragmentation when handling read requests during deep-scrubbing.
After reading an object's chunk, BlueStore assesses whether it can improve space savings by recompressing the object and/or optimize the object's physical layout.
A per-pool option 'data_layout_reformatting' has been introduced to control the behavior. It supports the following values:

  • recompress - enables object content recompression
  • defragment - enables object defragmentation
  • both - enables both recompression and defragmentation
  • (empty) - original behavior, no object reformatting
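For illustration only, a minimal sketch (struct and function names hypothetical, not the PR's actual parser) of how such an option string could map to the two capabilities:

```cpp
#include <string>

// Hypothetical flags derived from the per-pool option value.
// For illustration only; not the PR's actual code.
struct ReformatFlags {
  bool recompress = false;
  bool defragment = false;
};

// Map the option string to flags; an empty value keeps the
// original behavior (no object reformatting).
ReformatFlags parse_layout_reformatting(const std::string& v) {
  ReformatFlags f;
  if (v == "recompress" || v == "both") f.recompress = true;
  if (v == "defragment" || v == "both") f.defragment = true;
  return f;
}
```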


@@ -11776,11 +11841,131 @@ int BlueStore::read(
if (offset == length && offset == 0)
length = o->onode.size;

r = _do_read(c, o, offset, length, bl, op_flags);
r = _do_read(c, o, offset, length, bl, op_flags, 0,
might_need_reformatting ? &span_stat : nullptr);
Member commented:

If we can potentially defrag on any read, it could be interesting once the extents are read off disk to make a decision about whether or not an object is fragmented enough to warrant defragmentation on-the-fly.

Contributor replied:

@markhpc I am excited to see it. Opens interesting paths.


dvanders commented May 22, 2024

I'm not sure "data_layout_reformatting" makes it very clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ...
Or perhaps better: a separate boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

src/osd/osd_types.cc (outdated review thread, resolved)
@markhpc markhpc removed the bluestore label May 23, 2024
@@ -6156,7 +6157,8 @@ bool OSDMonitor::preprocess_command(MonOpRequestRef op)
{"dedup_chunk_algorithm", DEDUP_CHUNK_ALGORITHM},
{"dedup_cdc_chunk_size", DEDUP_CDC_CHUNK_SIZE},
{"bulk", BULK},
{"read_ratio", READ_RATIO}
{"read_ratio", READ_RATIO},
{"data_layout_reformatting", DATA_LAYOUT_REFORMATTING},
markhpc (Member) commented May 23, 2024:

This is missing corresponding "data_reformatting" get and set entries in MonCommands.h (easy to miss!)


markhpc commented May 23, 2024

I'm not sure "data_layout_reformatting" makes it very clear to users what it is doing.

Maybe instead use "optimize_during_scrub" with options "defrag", "recompress", "all", ... Or perhaps better: a separate boolean pool flag for each: defrag_during_scrub, recompress_during_scrub, ...

The second option is flexible if we want to allow additional optimizations or modes in the future.

Very much dislike "optimize_during_scrub". I'd favor the boolean flag approach with something like "scrub_defragmentation", and "scrub_recompression". The number of pool options we have and the way we display them in the command help has become pretty unwieldy though. This doesn't do anything to help it. Having said that, I agree that this is best configurable at the pool level, especially if we want to allow lazy compression of objects on scrub.


markhpc commented May 23, 2024

I did a very minimal initial test where I created 4 128GB RBD images, pre-filled them with 4MB writes, then did 5 minutes of 4K random writes to them in a pool with 512 PGs. This is on NVMe, so it's pretty quick. I then looked at a certain PG 2.1ff:

PG_STAT OBJECTS BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG
2.1ff 273 1145044992 0 0 618 3000 618

and ran deep-scrub on it 3 times:

Start Time End Time Duration
2024-05-23T01:48:36.667+0000 2024-05-23T01:48:43.199+0000 6.532
2024-05-23T01:55:48.326+0000 2024-05-23T01:55:54.804+0000 6.478
2024-05-23T01:59:05.596+0000 2024-05-23T01:59:12.075+0000 6.479

And the fragmentation score and reformat counters at this point are:

"fragmentation_rating": 0.01635269932531749
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 0,
"reformat_defragment_omitted": 0,
"reformat_issued": 0

Next set the pool data_reformatting to defragment and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:01:18.327+0000 2024-05-23T04:01:24.845+0000 6.518
"fragmentation_rating": 0.016560792018378661
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool data_reformatting to both and rerun deep-scrub on 2.1ff:

Start Time End Time Duration
2024-05-23T04:15:42.800+0000 2024-05-23T04:15:49.279+0000 6.479
"fragmentation_rating": 0.016560788158529967
"reformat_compress_attempted": 0,
"reformat_compress_omitted": 0,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2184

Next set the pool compression_mode to passive and rerun deep-scrub on 2.1ff. Results are similar to above.
Next set the pool compression_mode to aggressive and rerun deep-scrub on 2.1ff. (forgot to check results)
Next set the pool compression_mode to force and rerun deep-scrub on 2.1ff.

Start Time End Time Duration
2024-05-23T04:26:35.157+0000 2024-05-23T04:26:41.637+0000 6.48
"fragmentation_rating": 0.016575220146748098
"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2478

Running another deep scrub on that PG results in an attempt to recompress the objects again:

"reformat_compress_attempted": 6552,
"reformat_compress_omitted": 6111,
"reformat_defragment_attempted": 2184,
"reformat_defragment_omitted": 0,
"reformat_issued": 2625

A couple of thoughts:

  1. It would be nice if we had additional compression modes like "passive_lazy, aggressive_lazy, and force_lazy". The idea would be to allow objects to be written to the pool in an uncompressed state and only compress them on deep-scrub. Not sure how that would work with the global OSD level settings if we don't have the compress on scrub stuff set for the pool though.
  2. It looks like most of the re-compression attempts here were omitted. After re-reading the code, it looks like this should pretty much only happen if need > allocated or need == allocated.
  3. It's not clear to me why the fragmentation_rating goes up slightly after the first defrag pass. I probably should have been looking at onode_extents though. I'll try to do some better testing tomorrow.

Morning update: I tried deep-scrubbing another PG this morning with data_reformatting set to both and watched onode_extents:

PG      onode_extents (before)  onode_extents (after)
2.1fe   1259816                 1276370

It looks like the number of extents actually went up?


markhpc commented May 23, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb).
osd.2.log.gz


gardran commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!)
The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is.
The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.


gardran commented May 24, 2024

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

Looks good to me; the majority of reads have log output like:
75+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=28}
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 HybridAllocator allocate 0x80000/1000,80000,0
2024-05-23T12:58:06.176+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read preallocated: 0x80000 new frags:1 defragment: 1
Meaning there had been 28 physical fragments originally and a single one to be written out.
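For readers decoding these log lines, a small sketch of the span statistics involved. The field meanings (s=stored bytes, a=allocated bytes, sc/ac=their compressed counterparts, f=physical fragments) are inferred from this thread; the struct and field names are illustrative, not the PR's actual span_stat_t:

```cpp
#include <cstdint>

// Illustrative span statistics; field meanings inferred from the
// thread: s=stored bytes, a=allocated bytes, sc=stored compressed,
// ac=allocated compressed, f=physical fragments of the span.
struct SpanStat {
  uint64_t stored = 0;
  uint64_t allocated = 0;
  uint64_t stored_compressed = 0;
  uint64_t allocated_compressed = 0;
  uint64_t fragments = 0;
};

// Defragmentation pays off when the rewritten span would occupy
// fewer physical fragments than it does now (28 -> 1 in the log).
bool worth_defragmenting(const SpanStat& st, uint64_t new_fragments) {
  return new_fragments < st.fragments;
}
```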


gardran commented May 24, 2024

"reformat_compress_attempted": 4368,
"reformat_compress_omitted": 4074,

Here's a debug_bluestore = 10 log while deep-scrubbing one of the other PGs (2.1fb). osd.2.log.gz

I could see no initial compression, e.g.
2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0
ac (aka allocated compressed) = 0
mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible. Or something wrong with compression settings...


markhpc commented May 24, 2024

extents actually went up?

Are you referring to the onode_extents perf counter here? IMO that's the wrong metric, as it counts logical (not physical) extents in the cache(!) The proper way would be to check the report from "ceph tell osd.0 bluestore allocator dump block" and see how fragmented it is. The output of "bluestore allocator fragmentation histogram block" is another means for that analysis.

Yikes! That's confusing. I don't know if I knew that at one point, but clearly I didn't remember it. :) We should rename it. I'll try to take a look with the histogram.


markhpc commented May 24, 2024

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}

sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:

2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000

the above line shows no saving due to recompression - recompressed data needs the same 512K of space.

So I presume your input data flow isn't compressible. Or something wrong with compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

@dvanders commented:

I could see no initial compression, e.g. 2024-05-23T12:58:04.863+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read layout reformatting = both span stat {s=524288 ch=0 a=524288 sc=0 ac=0 sh=0 e=8 f=11}
sc (aka storage compressed) = 0 ac (aka allocated compressed) = 0 mean that no data has been compressed before scrubbing. And no recompression is applied:
2024-05-23T12:58:04.864+0000 7fed59ea4700 10 bluestore(/tmp/cbt/mnt/osd-device-2-data) read recompress info, 0x80000 vs. 0x80000
the above line shows no saving due to recompression - recompressed data needs the same 512K of space.
So I presume your input data flow isn't compressible. Or something wrong with compression settings...

There was no initial compression. I wrote the data out uncompressed with data_reformatting off, then set it to "defragment", then "both". Next I set the compression mode to passive, then aggressive, then force. Each time, I ran a deep scrub on different PGs to see what would happen.

This is a great point. I think the intuitive user expectation is that this feature could be used to compress old objects written uncompressed before this feature was available.

Also earlier we spoke about a feature like:

  • pool compression is disabled, so initial writes are written uncompressed
  • scrub recompress works to compress the objects in the background

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
@@ -11741,28 +11787,47 @@ int BlueStore::set_collection_opts(
}

int BlueStore::read(
Contributor commented:

Using ObjectStore::read() as the interface is problematic. For a proof of concept / quality testing it is clearly ok.

The current algorithm can be extended in at least the following ways:

  1. Compressed data possibly does not need to be recompressed; it could simply be moved. As a result, we would not need to decode it. But having this tied to read() forces us to decode in order to provide the output bufferlist.
  2. During scrub we could create lists of objects that need recompression and defragmentation. The actual action could be a separate step, applied afterwards.
  3. I have a dream to implement an "optimize-snaps" method. For head+snaps it re-encodes them so the head gets a continuous encoding and the snaps become diffs against the head. This possibly requires passing head+snaps in one go.

My general proposal is to expand the ObjectStore interface to allow for object optimization.

gardran (Author) replied:

Using ObjectStore::read() as interface is problematic. For proof of concept / quality testing is clearly ok.

I would rather disagree - having object reformatting coupled with reading allows it to be optimized: one can omit the object's content reading/decompression, which would be required if an alternative/standalone API were utilized.

The current algorithm can be extended, at least in following way:

  1. Data compressed possibly does not need to be recompressed. It could be moved. As a result, we do not need to get it decoded. But having it tied to read() forces us to decode to provide output bufferlist

That's a valid point if object reformatting is triggered independently. But when we piggyback on a read we need decompression anyway - so this doesn't create any additional overhead.

  2. During scrub we could create lists of objects that need recompression and defragmentation. Actual action to do
    so could be separate step, applied after.

Theoretically this looks good, but in practice it would complicate the code (as there will be an additional entity to track and apply deferred reformatting) while the benefits are unclear.

  3. I have a dream to implement "optimize-snaps" method. For head+snaps in reencodes them so head gets continuous encoding and snaps became diffs to head. This possibly requires passing head+snaps in one go.

My general proposal is to expand ObjectStore interface to allow for object optimization.

This sounds like you're suggesting an additional standalone procedure (to a large degree similar to deep-scrub) to perform object layout optimization. IMO this is nice as a design concept, but it would put an additional (and likely severe) burden on the cluster - existing deep-scrubbing tends to be a pretty heavy-weight process itself, and having another background process (comparable in terms of performed operations) doesn't look like a good option to me.
With the proposed solution we're trying to decrease that burden by piggybacking on the deep scrub instead.

@@ -11776,11 +11841,134 @@ int BlueStore::read(
if (offset == length && offset == 0)
length = o->onode.size;

r = _do_read(c, o, offset, length, bl, op_flags);
r = _do_read(c, o, offset, length, bl, op_flags, 0,
Contributor commented:

The implementation is very strongly tied to existing BlueStore execution paths.
This is a problem, because it gives very little room to actually work on optimization algorithms.

gardran (Author) replied:

That's true, but we really want to save some cycles by not doing an independent standalone optimization.

IOContext* ioc)
{
IOContext* ioc,
span_stat_t* res_span_stat)
Contributor commented:

_prepare_read_ioc is modified to provide scanning for optimization.
We should have a separate function that will scan the object and decide on an action.
For example, it might happen that an object has 3 chunks for a total of 4MB,
and in an attempt to defragment it we allocate 3 chunks.

gardran (Author) replied:

_prepare_read_ioc is modified to provide scanning for optimization. We should have a separate function that will scan object and decide action.

That's again an attempt to design things for an independent optimization process, which still sounds questionable to me.

For example it might happen that in object we have 3 chunks for total of 4MB, and in attempt to defragment it we allocated 3 chunks.

Not sure I understand this example. Speaking of defragmentation only - it would happen if the number of new fragments is less than the original number for a specific data span (512K when triggered from deep scrub).
If recompression is enabled as well, reformatting unconditionally occurs when the new required space for a data span is less than the original one. Fragmentation is taken into consideration only when recompression is not beneficial. This recompression+defragmentation prioritization might look a bit inflexible - one might prefer defragmentation over recompression - but for now that looks like overkill to me.
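The prioritization described above can be sketched as follows; this is an illustrative reading of the comment, not the PR's actual code, and all names are hypothetical:

```cpp
#include <cstdint>

// Illustrative reading of the prioritization described above, not
// the PR's actual code: a space saving from recompression wins
// unconditionally; otherwise defragmentation applies when it
// reduces the fragment count of the data span.
enum class Action { None, Recompress, Defragment };

Action choose_reformat(bool recompress_enabled, bool defragment_enabled,
                       uint64_t need, uint64_t allocated,
                       uint64_t new_frags, uint64_t old_frags) {
  if (recompress_enabled && need < allocated)
    return Action::Recompress;   // new required space is smaller
  if (defragment_enabled && new_frags < old_frags)
    return Action::Defragment;   // fewer physical fragments
  return Action::None;
}
```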

markhpc (Member) commented Jun 3, 2024:

The big reason I think we want to piggyback on deep-scrub is because we're already reading the content of the object to do the scrub, so it's the perfect time to make decisions about de-fragmentation and compression as the most expensive part of this can easily be reading the fragmented extents. I don't think we should scan objects as a separate loop unless we have a very good reason. It's much better in my mind to opportunistically defrag/compress when we've already done the expensive fragmented read.

aclamk (Contributor) commented Jun 12, 2024:

@markhpc
I (now) agree that performing optimization on an object as an extra step during deep-scrub is good.
"I don't think we should scan objects as a separate loop unless we have a very good reason."
The code here is scanning the object - it's just piggybacking on _read_cache to do it.
There is no change to the behavior of _read_cache.
The filling of span_stat_t could be done in a separate loop, at the cost of an extra loop.
Similarly, we now load shared blobs in _prepare_read_ioc - useless for the read process.
That would make sense as an optimization if a significant part of the processing were reused - but it's only the loop.

@gardran My opinion is that whenever we make a new feature we should keep it as detached from existing code as possible, to minimize accidental bug injection. Within reason, of course...

// object operations
std::unique_lock l(c->lock);
ceph_assert(o->exists);
_assign_nid(txc, o);
Contributor commented:

I guess we already have nid?

gardran (Author) replied:

right

l_bluestore_allocator_lat,
mono_clock::now() - start,
cct->_conf->bluestore_log_op_age);
might_need_defragment = preallocated >= (int64_t)need && prealloc.size() < span_stat.frags;
Contributor commented:

We need to create diligent testing against all the allocators we have.
My doubt is that we compare fragments in the object with fragments in free space.
My thinking is: if I optimized an object from 5 chunks to 2 chunks, then I probably created 5 small free chunks and removed 2 larger free chunks. There might be an emergent pattern that ultimately keeps both objects and free space more compact, but it's impossible to infer that from the above.

gardran (Author) replied:

One should realize that this PR is intended primarily for object optimization, not whole-free-space optimization. So yes, hopefully object defragmentation could result in free space defragmentation as well, but generally speaking that's a different task.

Plus a few more PR comment resolutions from Adam.

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
if (r == -EIO) {
logger->inc(l_bluestore_read_eio);
}
}
if (might_need_reformatting) {
Contributor commented:

This should be under the same lock that _do_read() is executed under.
Taking the collection lock later is not ok, since some transaction could be injected in between.
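The locking concern can be illustrated with a toy sketch (types and names hypothetical, not BlueStore's actual locking code): the read and the reformat decision run under one critical section, so no transaction can slip in between them.

```cpp
#include <mutex>

// Toy model of the locking point (types hypothetical): the read
// and the reformat decision share one critical section, so no
// transaction can be injected between them.
struct Collection {
  std::mutex lock;
};

template <typename ReadFn, typename ReformatFn>
void read_and_maybe_reformat(Collection& c, bool might_reformat,
                             ReadFn do_read, ReformatFn do_reformat) {
  std::unique_lock<std::mutex> l(c.lock);  // held across both steps
  do_read();
  if (might_reformat) {
    do_reformat();  // still under the same collection lock
  }
}
```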

_choose_write_options(c, o, op_flags, &wctx);
}
// do reformat if
// - object isn't cached (meaning it's not being written at the moment),
Contributor commented:

When we write to an object, we keep the data in Buffers at least until the data is readable from disk.
This is to make sure that anyone reading the object will receive the proper data, leaving BlueStore free to delay storing data to disk (aka deferred writes).
As a result, one can always read cached data and it will be the proper data.

Hence - there is no need to exclude objects that have unstored data.

The closest scenario to a problem I could imagine would be:

  1. Transaction X to object A is submitted
  2. Buffers updated, Onode metadata updated
  3. Data X is not yet written
  4. Deep scrub reformats object A
  5. Onode metadata updated (Buffers not changed)
  6. Data from reformatting gets to the disk before the data from the transaction
  7. Power-off or similar

In this case we would never signal that transaction X completed, yet it will be there.
But that's not a problem; it is normal.

Could you elaborate on what prompted you to exclude objects that are being written to?

@@ -16681,6 +16964,41 @@ void BlueStore::_choose_write_options(
wctx->target_blob_size < min_alloc_size * 2) {
wctx->target_blob_size = min_alloc_size * 2;
}
if (wctx->compress) {
Contributor commented:

+1 for moving selection of compressor to _choose_write_options.

{
// build blob-wise list of stuff read (that isn't cached)
unsigned left = length;
uint64_t pos = offset;
span_stat_t dummy_span_stat;
span_stat_t& span_stat = res_span_stat ? *res_span_stat : dummy_span_stat;
span_stat.stored = length;
auto lp = o->extent_map.seek_lextent(offset);
while (left > 0 && lp != o->extent_map.extent_map.end()) {
Contributor commented:

Need span_stat.stored -= left; after we exit the loop, to factor in the case where we reached end() early.
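The accounting fix being requested can be shown with a toy model (not the PR's actual loop): stored is preset to the full request length, so an early exit at end() must be discounted afterwards.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy model of the accounting issue (not the PR's actual loop):
// `stored` is preset to the full request length, so when the
// extent walk reaches the end early, the unread remainder must
// be subtracted afterwards.
uint64_t stored_bytes(const std::vector<uint64_t>& extent_lens,
                      uint64_t length) {
  uint64_t stored = length;  // optimistic preset
  uint64_t left = length;
  auto it = extent_lens.begin();
  while (left > 0 && it != extent_lens.end()) {
    left -= std::min(*it, left);
    ++it;
  }
  stored -= left;  // the requested fix: discount the remainder
  return stored;
}
```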

case COMP_PASSIVE_LAZY: return "passive_lazy";
case COMP_AGGRESSIVE_LAZY: return "aggressive_lazy";
case COMP_FORCE_LAZY: return "force_lazy";

default: return "???";
aclamk (Contributor) commented Jun 12, 2024:

I think we need the following combinations of compression for regular operation and for scrub:

regular-compression   scrub-compression
none                  none
none                  passive
none                  aggressive
none                  force
passive               passive
passive               aggressive
passive               force
aggressive            aggressive
aggressive            force
force                 force
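The constraint implied by the table - the scrub-time mode is at least as aggressive as the write-time mode - can be sketched as a validity check (ranking and helper names are illustrative, not Ceph API):

```cpp
#include <map>
#include <string>

// Sketch of the constraint implied by the table above: the
// scrub-time compression mode must be at least as aggressive as
// the regular write-time mode. Ranking and helper names are
// illustrative, not Ceph API.
int compression_rank(const std::string& mode) {
  static const std::map<std::string, int> ranks{
      {"none", 0}, {"passive", 1}, {"aggressive", 2}, {"force", 3}};
  auto it = ranks.find(mode);
  return it == ranks.end() ? -1 : it->second;
}

bool valid_combination(const std::string& regular, const std::string& scrub) {
  int r = compression_rank(regular);
  int s = compression_rank(scrub);
  return r >= 0 && s >= 0 && s >= r;
}
```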

case COMP_PASSIVE_LAZY: return "passive_lazy";
case COMP_AGGRESSIVE_LAZY: return "aggressive_lazy";
case COMP_FORCE_LAZY: return "force_lazy";

default: return "???";
}
}
Contributor commented:

A separate discussion should be whether we should allow a different conf.bluestore_compression_required_ratio for scrub recompression.

if (r == -EIO) {
logger->inc(l_bluestore_read_eio);
}
}
if (might_need_reformatting) {
_maybe_reformat_object(c, o, offset, length, bl, op_flags, span_stat);
}
Contributor commented:

Now we will add the latency of the reformat to l_bluestore_read_lat. I am not sure whether we want that or not.

4 participants