os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 #56927
Conversation
src/os/bluestore/BlueFS.cc
@@ -1518,6 +1518,13 @@ int BlueFS::_replay(bool noop, bool to_stdout)
      vselector->get_hint_by_dir(dirname);
    vselector->add_usage(file->vselector_hint, file->fnode);

    if (boost::algorithm::ends_with(filename, ".log")) {
ends_with() is now part of standard C++ (since C++20, which we are using).
src/os/bluestore/BlueFS.cc
Outdated
bufferlist t;
t.substr_of(buf.bl, flush_offset - buf.bl_off, sizeof(File::WALFlush::WALLength));
dout(30) << "length dump\n";
t.hexdump(*_dout);
are we OK with log lines like this one, without the preamble?
  }
}

int64_t BlueFS::_read_wal(
a style/pref comment: this is a pretty long function. Can it be broken down into logical, named, sub-functions?
I might abstract some parts that are reused in different places, but IMHO I prefer reading top to bottom what this function does instead of jumping to definitions of sub-functions.
@pereman2 I was very excited about your PR, and did some librbd fio testing on mako over the weekend. These are relatively fast NVMe drives, so I'm not sure how limited they are by fsync (which should be a no-op for the drive, but still requires the syscall). I suspect we may see better numbers as other bottlenecks are eliminated. It would be very interesting, however, to see how this performs on consumer-grade flash and HDDs. Here are the results:

[chart: Single NVMe OSD 4K Random Write (1X)]
[chart: 30 NVMe OSD 4K Random Write (3X)]
Oh cool! I wonder what the difference between the NVMes we used is. If you are up for it, I will update the code to remove some obvious inefficiencies that might have an effect on CPU, and you could run it again to see if that fixed something. Nevertheless, I will attach my results again after that fix.
@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster?

    iterations = 8
    bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
    for run in range(iterations):
        for t in ['randrw', 'randwrite', 'randread', 'rw']:
            # do fio test with "t"
@markhpc I got new results! Looks like randwrites are not getting any better somehow, but randrw does get better:
Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time: 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests, though, if we want.
Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.
Yes!
Naw, this was a quick test. I was curious what kind of syscall-overhead reduction we might see here versus other previous tests where I can see some overhead for 4k random writes. We could certainly do tests at larger fill values, though.
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
The real offset of the data of a WAL file is offset by extra flush envelope data. Currently two uint64_t values are added to each flush, so we keep track of the number of flushes in a BlueStore::File with wal_flush_count, and the offset can be translated simply by doing: `offset += 2 * sizeof(uint64_t) * wal_flush_count` Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
  size_t len,          ///< [in] this many bytes
  bufferlist *outbl,   ///< [out] optional: reference the result here
  char *out)           ///< [out] optional: or copy it here
{
I think we should consider making a simplified logic for _wal_read, since it is only accessed in sequential mode:
2024-05-10T12:11:30.922+0000 7fe5bea18b40 10 bluefs open_for_read db.wal/000271.log (sequential)
The simplification would be that regular _read would be used.
We would keep:
- wal offset (as rocksdb sees)
- bluefs offset of wal file
- data read from wal file
- location of next envelope within data already read
And on call to _wal_read() we will prefetch from bluefs file to data buffer if needed,
crop beginning of buffer to output *outbl, and update above pointers.
I fail to understand the simplification.
- wal offset is there, named wal_data_logical_offset
- bluefs offset is flush_offset. I agree this could be another name
- data read from wal file is kept under FileReader; since it's sequential, _read already does the heavy lifting of prefetching
- We use bufferlists to gather the output, so the crop we do is barely expensive. I know this is some extra work that could be optimized away.
_read_wal is merely a wrapper of _read
use 2-2 version for wal_v2 and 1-1 for normal nodes Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of using 0 as EOF, we signal the file is dirty but don't force a flush. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Introducing two new fields: * wal_size -> real WAL data size * wal_limit -> upper limit to read and recover flush extents. wal_limit is used to read past fnode.size, up to wal_limit, trying to find flush extents that might be missing. On truncate we decrease the limit to the offset. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Decoding depends on the version: if it's version >= 2 we may decode the WAL fields. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
The new fnode doesn't really work as advertised because there are no assertions when trying to read with a wrong version via the `DENC_START...` API. Thankfully (or not), bluefs_transaction does have the good API, `DECODE_START`. This way we ensure that, when downgrading, previous versions don't try to read these new WAL files. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
    __func__, file->fnode.ino, flush_offset, increment, file->fnode.wal_limit)
    << dendl;
  ceph_assert(file->is_wal_read_loaded);
  ceph_assert((flush_offset == 0 && file->wal_flushes.empty()) || (flush_offset == file->wal_flushes.back().end_offset()));
I can see just a single location where _wal_update_size is called from for now, and there is an explicit reset of file->wal_flushes before that call, so the assertion looks redundant.
The same applies to is_wal_read_loaded assertion above and flush_offset parameter.
I moved the assert on file->is_wal_read_loaded to read_wal, as it makes more sense there now.
The second one is redundant too, but it documents how this method is supposed to be called?
But why not move both wal_flush.clear() and the is_wal_read_loaded assignment into the _wal_update_size() function, and get rid of the flush_offset parameter too?
Do you really expect more usage of this function?
src/os/bluestore/BlueFS.cc
      break;
    }
    File::WALFlush::WALLength flush_length_le(0);
    flush_length_le = *((File::WALFlush::WALLength*)bl.c_str());
I would definitely love to see our regular encode/decode mechanics to be applied to flush_length here instead
More generally, wouldn't it be better to use a more general approach and make the flush envelope the way we do for other persistent data structures? I.e. it should include:
- header (struct_v + others)
- bufferlist content
- postfix member(s)
Done, with the exception of the first encode of flush_length.
src/os/bluestore/BlueFS.cc
    }
    marker_le = *((File::WALFlush::WALMarker*)bl.c_str());
    uint64_t marker = marker_le;
    if (marker != file->fnode.ino) {
IMO this is a pretty weak condition to determine whether the next envelope is present. E.g. repetitive OSD deployments could relatively easily hit legacy envelopes and erroneously reuse them.
The idea of having a more advanced header comes to my mind once again here. It can also help to get rid of the marker part. Here is an overview:
The header would be our regular encode/decode structure. Apart from the flush length, it should include the OSD uuid, the file ino, and a flush seq number. Inability to decode this header, as well as unexpected uuid/ino/seqno values, results in scan termination.
This will cause some additional write amplification, but I actually expect the WAL flush envelope to be hundreds/thousands of bytes long most of the time, so that's not a very big deal.
WALv2 relies on data it did not write to extract information. With any method like this, we need a sufficiently low probability of false-positive detection. A false positive here means that we:
- read data we did not write
- decided that the data is legit
I will try to estimate how reliable "ino + uuid" is as a marker check.
- UUID: How many uuid sequences are on disk? Let's assume the uuid is fully random, meaning the probability of random bits forming the uuid is 0.5^(8*8). With that, we would not expect even one uuid to appear on the device. But the uuid is written once per bluefs_transaction_t, so we will have many uuids from this source. Let's assume the bluefs log and rocksdb sst files are the only contenders for disk space. In a system running a long time, all of the disk has been used. The balance between (conf.bytes_written_wal + conf.bytes_written_sst) and (conf.logged_bytes) determines the number of uuids on the device. So it will be many, like many thousands. (numbers needed!)
- INO: How many ino values might be on the disk? As inos are small integer values, we should expect plenty of other, unrelated, small 32-bit values. I think it's a lot. In bluefs_transaction_t, right before the uuid is len, and right after it is seq. Both len and seq we expect to be small integers.
Ino + uuid are a highly unreliable pair for a unique marker.
Proposal
Let's use hash(ino + uuid) as the marker.
If our hash function is good, it will be basically random. There will be no risk of contamination from previous deployments, and we will have no negative side effect from ino being a small value.
We can tune the probability of false positives simply by taking a longer hash function.
For a 32-bit hash, we expect 1 pattern per 4 GB of disk.
For a 64-bit hash, there should be none, except those written specifically by WALv2.
@ifed01 I've implemented a version with a hash-generated marker based on Adam's comments. Let me know what you think :)
src/os/bluestore/BlueFS.cc
    uint64_t increase = flush_length + (sizeof(File::WALFlush::WALLength) + sizeof(File::WALFlush::WALMarker));
    dout(20) << fmt::format("{} adding flush {:#x}~{:#x}", __func__, flush_offset, flush_length) << dendl;
    file->wal_flushes.push_back({flush_offset, flush_length});
flush_offset is the offset of the envelope header, and flush_length is the actual envelope payload length here, right?
IMO that's a bit confusing; you'd better preserve the payload offset here as well. I presume one doesn't need the envelope header beyond this point...
flush_offset and flush_length are the minimum amount of information needed to extrapolate all the other information. I think I can reuse get_payload_offset for the comment below.
src/os/bluestore/BlueFS.cc
    }

    flush_offset += sizeof(File::WALFlush::WALLength);
You wouldn't need this increment if wal_flushes kept the actual payload offset.
I'll reuse get_payload_offset. I might have used += here so that it was easier to understand the flow.
The plan is simple. 1. Find all WAL v2 files. 2. Get their total size. 3. Copy data from the v1 copy of that file. 4. If all went well, unlink the previous files and rename the new files accordingly. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
  // 4. truncate -> fnode.size
  // 5. unlink
  ceph_assert(h->file->fnode.size == offset || offset == 0);
  h->file->fnode.wal_limit = offset;
I have a feeling that wal_limit is an "alias" of f->fnode.get_allocated(), so it should not shrink without the pextent list being shrunk. At best it looks like we don't need to persist wal_limit at all; a run-time instance should be sufficient: we recalculate it from get_allocated() on space allocation or bluefs log replay.
@pereman2 - what do you think? Do we really need to shrink wal_limit here?
I think you may be right.
Well, not really. wal_limit signals the range of data that might exist. When we truncate the file we don't remove extents, but we do decrease the upper limit represented by wal_limit.
src/os/bluestore/BlueFS.cc
@@ -3775,7 +4155,7 @@ int BlueFS::fsync(FileWriter *h)/*_WF_WD_WLD_WLNF_WNF*/
      }
    }
  }
-  if (old_dirty_seq) {
+  if (old_dirty_seq && !h->file->is_new_wal()) { // don't force flush on WAL
Maybe we don't need fsync-ing after regular appends to the WAL, but what about the cases when we've [pre]allocated some space for it? That should get into the log and be fsynced.
done with last commit, please take a look
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
A plain ino as a marker is just bad business, with a high probability of collisions. With this commit we hash the uuid and ino together to form a good enough marker capable of withstanding false positives. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
New encoder and decoder functions for a WAL flush header. By design they have no bounds checks or version check, because we expect wrong data to be decoded when reading past end of file. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
@ifed01 With the latest commit 493dc63 I've added an ad-hoc encoder/decoder for the header. This comes at the price of not being able to really check compatibility and struct_len, because we expect to read the WAL out of bounds on recovery; therefore we will see bad values when decoding, which, by design, will for sure cause issues.
IMO using try..catch eliminates this concern, doesn't it?
  bl.hexdump(*_dout);
  *_dout << dendl;
  auto buffer_iterator = bl.cbegin();
  decode(header, buffer_iterator);
You can wrap decode() call with try..catch and break reading if decode fails. We've been successfully using the same technique in a couple other locations in BlueFS....
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Problem
BlueFS writes are expensive due to every BlueFS::fsync invoking disk->flush twice: once for the file data and another for the BlueFS log metadata. We can avoid this duality by merging metadata and data in the same envelope. This PR delivers on that front.
New format
Previously, a bluefs log transaction would hold a file_update_inc that includes the increase in file size; that way we know the length of data the file holds in its own extents. Therefore, every write would perform a fnode->size += delta and consequently mark the fnode as dirty. This new format is basically an envelope that holds both data and delta metadata plus some error detection stuff:
- Flush length (u64) -> the length of the data in the envelope
- Payload (flush length) -> data of the WAL write asked for (size is flush length)
- Marker (u64) -> id of the file, used for error detection (this is in talks to change to a crc or something else)
With this new format, for every fsync we do, we create this envelope and flush it without marking the file as dirty, therefore not generating the log disk flush. This conferred huge benefits in performance that we will look at next.
EOF tricks
A "huge" problem is: how do we know we cannot read more data from the file? Either we reach the end of the allocated extents or... in this case we append some 0s after the envelope, so the next flush_length reads as 0 until a later flush overwrites it, and therefore we can tell that the next flush has not yet been written to disk. This basically works like a null-terminated string.
Preliminary results:
I ran multiple fio jobs covering different workloads: randrw, random writes, random reads, etc.
Counting flushes with a vector of flush extents, versus a simple counter, showed a significant performance degradation, so it might be worth using the vector only during replay and not storing flush extents during the run: