
os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 #56927

Open · wants to merge 42 commits into base: main

Conversation

pereman2
Contributor

@pereman2 pereman2 commented Apr 16, 2024

problem

BlueFS writes are expensive because every BlueFS::fsync invokes disk->flush twice: once for the file data and once for the BlueFS log metadata. We can avoid this duplication by merging metadata and data into the same envelope. This PR delivers on that front.

New format

Previously, a BlueFS log transaction would hold a file_update_inc recording the increase in file size; that is how we know how much data the file holds in its own extents. Therefore, every write would perform fnode->size += delta and consequently mark the file as dirty.

The new format is essentially an envelope that holds both the data and the metadata delta, plus some error-detection information:

  • Flush length (u64) -> the length of the data in the envelope
  • Payload (flush length bytes) -> the WAL data being written
  • Marker (u64) -> id of the file, used for error detection (there is ongoing discussion about changing this to a crc or something else)

With this new format, every fsync creates this envelope and flushes it without marking the file as dirty, so no log disk flush is generated. This yields large performance benefits, which we will look at next.
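As a rough illustration, serializing one fsync's worth of data into the envelope described above could look like the following sketch. The struct-free helper and its name are assumptions for illustration, not the PR's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Layout: [flush_length: u64][payload: flush_length bytes][marker: u64]
std::vector<uint8_t> encode_envelope(const std::string& payload, uint64_t marker) {
  std::vector<uint8_t> out(2 * sizeof(uint64_t) + payload.size());
  uint64_t flush_length = payload.size();
  // flush length (u64) -> how much payload follows
  std::memcpy(out.data(), &flush_length, sizeof(flush_length));
  // payload -> the WAL data being written
  std::memcpy(out.data() + sizeof(flush_length), payload.data(), payload.size());
  // marker (u64) -> per-file id used for error detection
  std::memcpy(out.data() + sizeof(flush_length) + payload.size(),
              &marker, sizeof(marker));
  return out;
}
```

The key point is that the whole envelope is written in one data flush, with no separate BlueFS log transaction needed for the size delta.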

EOF tricks

A "huge" problem is: how do we know when no more data can be read from the file? Either we reach the end of the allocated extents, or we hit the trick used here: we append some zeros after the envelope so that the next flush_length slot is zeroed, which lets us detect that the next flush has not yet reached disk. This basically works like a null-terminated string.
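A minimal sketch of that zero-terminated scan, assuming the envelope layout above (all names here are illustrative, and the test construction assumes a little-endian host):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Walk envelopes until the next flush_length slot reads back as zero,
// i.e. until we hit data that was never written by a flush.
size_t count_flushes(const std::vector<uint8_t>& disk) {
  size_t off = 0, n = 0;
  while (off + sizeof(uint64_t) <= disk.size()) {
    uint64_t flush_len = 0;
    std::memcpy(&flush_len, disk.data() + off, sizeof(flush_len));
    if (flush_len == 0)  // zeroed slot: nothing flushed past this point yet
      break;
    // skip length word + payload + marker word to reach the next length slot
    off += sizeof(uint64_t) + flush_len + sizeof(uint64_t);
    ++n;
  }
  return n;
}
```

Like strlen on a C string, the scan never needs to know the logical file size up front; the zeroed slot is the terminator.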

Preliminary results:

I ran multiple fio jobs including different workloads: randrw, random writes, random reads, etc...

[fio benchmark result images]

Comparing a simple flush counter against a vector of flush extents showed a significant performance degradation, so it may be worth using the vector only during replay and not storing flush extents during the run:
[benchmark comparison image]


@pereman2 pereman2 requested review from ifed01 and aclamk April 16, 2024 16:34
@pereman2 pereman2 changed the title os/bluestore: new BlueFS WAL disk format os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 Apr 17, 2024
@@ -1518,6 +1518,13 @@ int BlueFS::_replay(bool noop, bool to_stdout)
vselector->get_hint_by_dir(dirname);
vselector->add_usage(file->vselector_hint, file->fnode);

if (boost::algorithm::ends_with(filename, ".log")) {
Contributor

ends_with() is now part of C++ (since C++20, which we are using)

bufferlist t;
t.substr_of(buf.bl, flush_offset - buf.bl_off, sizeof(File::WALFlush::WALLength));
dout(30) << "length dump\n";
t.hexdump(*_dout);
Contributor

are we OK with log lines like this one, without the preamble?

}
}

int64_t BlueFS::_read_wal(
Contributor

a style/pref comment: this is a pretty long function. Can it be broken down into logical, named, sub-functions?

Contributor Author

I might abstract some parts that are reused in different places, but imho I prefer reading top to bottom what this function does instead of jumping to the definition of each sub-function.

@markhpc
Member

markhpc commented Apr 22, 2024

@pereman2 I was very excited about your PR, and did some librbd fio testing on mako over the weekend. These are relatively fast NVMe drives so I'm not sure how limited they are by fsync (which should be a no-op for the drive, but still require the syscall). I suspect we may see better numbers as other bottlenecks are eliminated. It would be very interesting however to see how this performs on consumer grade flash and HDDs.

Here are the results:

Single NVMe OSD 4K Random Write (1X)

| Ceph Version | IOPS | Avg Latency (ms) | 99.9% Latency (ms) | ceph-osd CPU Usage % |
|---|---|---|---|---|
| v18.2.2 | 94030 | 16.49 | 301.73 | 1508 |
| main | 94445 | 16.58 | 333.71 | 1538 |
| main + #56927 | 97048 | 16.02 | 321.56 | 1561 |

30 NVMe OSD 4K Random Write (3X)

| Ceph Version | IOPS | Avg Latency (ms) | 99.9% Latency (ms) | ceph-osd CPU Usage % (mako06) |
|---|---|---|---|---|
| v18.2.2 | 476575 | 13.46 | 385.54 | 1051 |
| main | 465904 | 13.76 | 322.02 | 1068 |
| main + #56927 | 469424 | 13.88 | 410.33 | 1098 |

@pereman2
Contributor Author

Here are the results:

Oh cool! I wonder what the difference between the NVMes we used is.

If you are up for it, I will update the code to remove some obvious inefficiencies that might affect CPU, and you could run it again to see whether that fixed anything. Nevertheless, I will attach my results again after that fix.

@pereman2
Contributor Author

pereman2 commented Apr 22, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster?
My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

@pereman2
Contributor Author

@markhpc I got new results! Looks like randwrite is somehow not getting any better, but randrw does improve:

randwrite 1 host 3 osds bssplit
randrw 1 host 3 osds bssplit

@markhpc
Member

markhpc commented Apr 22, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre fill the cluster to simulated some real data or was it ran in a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

@mheler

mheler commented Apr 23, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre fill the cluster to simulated some real data or was it ran in a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

@pereman2
Contributor Author

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre fill the cluster to simulated some real data or was it ran in a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

Yes!

@pereman2 pereman2 marked this pull request as ready for review April 29, 2024 14:03
@pereman2 pereman2 requested a review from a team as a code owner April 29, 2024 14:03
@markhpc
Member

markhpc commented May 1, 2024

@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre fill the cluster to simulated some real data or was it ran in a real cluster? My benchmark basically does:

iterations = 8
bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
for run in range(iterations):
  for t in ['randrw', 'randwrite', 'randread', 'rw']:
     # do fio test with "t"

Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time. 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests though if we want.

Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.

Naw, this was a quick test. I was curious what kind of syscall overhead reduction we might see here versus other previous tests where I can see some overhead for 4k random writes. We could certainly do tests at larger fill values though.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
The real offset of the data in a WAL file is offset by the extra flush
envelope data. Currently two uint64_t values are added to each flush, so we
keep track of the number of flushes in a BlueStore::File with wal_flush_count,
and the offset is translated simply by doing:
`offset += 2 * sizeof(uint64_t) * wal_flush_count`
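The translation above can be sketched as a tiny helper; the function name is an assumption for illustration, not the commit's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Each flush adds two uint64_t words of envelope overhead (length + marker),
// so the raw on-disk offset is the logical WAL offset plus that per-flush cost.
inline uint64_t wal_logical_to_raw(uint64_t offset, uint64_t wal_flush_count) {
  return offset + 2 * sizeof(uint64_t) * wal_flush_count;
}
```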

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
size_t len, ///< [in] this many bytes
bufferlist *outbl, ///< [out] optional: reference the result here
char *out) ///< [out] optional: or copy it here
{
Contributor

I think we should consider making a simplified logic for _wal_read, since it is only accessed in sequential mode:
2024-05-10T12:11:30.922+0000 7fe5bea18b40 10 bluefs open_for_read db.wal/000271.log (sequential)

The simplification would be that regular _read would be used.
We would keep:

  • wal offset (as rocksdb sees it)
  • bluefs offset of the wal file
  • data read from the wal file
  • location of the next envelope within the data already read

On a call to _wal_read() we would prefetch from the bluefs file into the data buffer if needed, crop the beginning of the buffer into the output *outbl, and update the above pointers.
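The bookkeeping proposed here could look roughly like the sketch below; every name is hypothetical, and the prefetch step is elided:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Sequential-read state: one logical/raw offset pair plus a prefetch buffer.
struct WalReadCursor {
  uint64_t wal_offset = 0;     // WAL stream offset as RocksDB sees it
  uint64_t bluefs_offset = 0;  // raw offset within the BlueFS wal file
  std::string buffer;          // data already prefetched from the file
  size_t next_envelope = 0;    // start of the next envelope within `buffer`

  // Crop up to `len` bytes from the front of the buffered data, advancing
  // the logical cursor; a real version would prefetch when the buffer runs dry.
  std::string crop(size_t len) {
    size_t avail = buffer.size() - next_envelope;
    size_t take = len < avail ? len : avail;
    std::string out = buffer.substr(next_envelope, take);
    next_envelope += take;
    wal_offset += take;
    return out;
  }
};
```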

Contributor Author

I think we should consider making a simplified logic for _wal_read, since it is only accessed in sequential mode: 2024-05-10T12:11:30.922+0000 7fe5bea18b40 10 bluefs open_for_read db.wal/000271.log (sequential)

The simplification would be that regular _read would be used. We would keep:

  • wal offset (as rocksdb sees it)
  • bluefs offset of the wal file
  • data read from the wal file
  • location of the next envelope within the data already read

On a call to _wal_read() we would prefetch from the bluefs file into the data buffer if needed, crop the beginning of the buffer into the output *outbl, and update the above pointers.

I fail to understand the simplification.

  • the wal offset is there, named wal_data_logical_offset
  • the bluefs offset is flush_offset; I agree this could use another name
  • data read from the wal file is kept under FileReader; since it's sequential, _read already does the heavy lifting of prefetching
  • we use bufferlists to gather the output, so the crop is barely expensive; I know this is some extra work that could be optimized away

Contributor Author

_read_wal is merely a wrapper of _read

src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
Use the 2-2 version for wal_v2 and 1-1 for normal nodes.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of using 0 as EOF, we signal that the file is dirty but don't force a
flush.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Introducing two new fields:
* wal_size -> real wal data size
* wal_limit -> upper limit for reading and recovering flush extents

wal_limit is used to read past fnode.size, up to wal_limit, trying to find
flush extents that might be missing. On truncate we decrease the limit to the
offset.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Decoding depends on the version: if it's version >= 2 we may
decode the WAL fields.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/bluefs_types.h Outdated Show resolved Hide resolved
src/os/bluestore/bluefs_types.h Outdated Show resolved Hide resolved
src/os/bluestore/bluefs_types.h Outdated Show resolved Hide resolved
The new fnode doesn't really work as advertised because there are no
assertions when reading with a wrong version via the `DENC_START...` API.
Thankfully (or not), bluefs_transaction does have the better `DECODE_START`
API. This way we ensure that, on downgrade, previous versions don't try to
read these new WAL files.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
src/os/bluestore/bluefs_types.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.h Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.h Outdated Show resolved Hide resolved
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
__func__, file->fnode.ino, flush_offset, increment, file->fnode.wal_limit)
<< dendl;
ceph_assert(file->is_wal_read_loaded);
ceph_assert((flush_offset == 0 && file->wal_flushes.empty()) || (flush_offset == file->wal_flushes.back().end_offset()));
Contributor

I can see just a single location that _wal_update_size is called from for now, and there is an explicit reset of file->wal_flushes before that call. So the assertion looks redundant.

Contributor

The same applies to is_wal_read_loaded assertion above and flush_offset parameter.

Contributor Author

I moved the assert on file->is_wal_read_loaded to read_wal, as it makes more sense there now.
The second one is redundant too, but it documents how this method is supposed to be called?

Contributor

But why not move both the wal_flush.clear() and the is_wal_read_loaded assignment into the _wal_update_size() function,
and get rid of the flush_offset parameter too?
Do you really expect more usage of this function?

break;
}
File::WALFlush::WALLength flush_length_le(0);
flush_length_le = *((File::WALFlush::WALLength*)bl.c_str());
Contributor

I would definitely love to see our regular encode/decode mechanics applied to flush_length here instead.

Contributor

More generally, wouldn't it be better to take a more general approach and build the flush envelope the way we do other persistent data structures? I.e. it should include:
a header (struct_v + others)
the bufferlist content
postfix member(s)

Contributor Author

Done, with the exception of the first encode of flush_length.

}
marker_le = *((File::WALFlush::WALMarker*)bl.c_str());
uint64_t marker = marker_le;
if (marker != file->fnode.ino) {
Contributor

IMO this is a pretty weak condition for determining whether the next envelope is present. E.g. repeated OSD deployments could relatively easily hit legacy envelopes and erroneously reuse them.
The idea of a more advanced header comes to my mind once again here. It can also help get rid of the marker part. Here is an overview:
The header would be our regular encode/decode structure.
Apart from the flush length, it should include the OSD uuid, the file ino, and a flush seq number.
Inability to decode this header, as well as unexpected uuid/ino/seqno values, results in scan termination.
This will cause some additional write amplification, but I actually expect the WAL flush envelope to be hundreds or thousands of bytes long most of the time, so that's not a very big deal.

Contributor

WALv2 is relying on data it did not write to extract information.
With any method like this, we need a sufficiently low probability of false positive detection.
A false positive here means that we:

  • read data we did not write
  • decided that the data is legit

I will try to estimate how reliable "ino + uuid" is as a marker check.

  1. UUID
    How many uuid sequences are on disk?
    Let's assume the uuid is fully random, meaning the probability of random bits forming a uuid is 0.5^(8*8). With that, we would not expect even one uuid to appear on the device.
    But a uuid is written once per bluefs_transaction_t. We will have many uuids from this source.
    Let's assume the bluefs log and rocksdb sst files are the only contenders for disk space.
    In a system running a long time, all of the disk has been used. The balance between (conf.bytes_written_wal + conf.bytes_written_sst) and (conf.logged_bytes) determines the number of uuids on the device. So it will be many, like many thousands. (numbers needed!)
  2. INO
    How many ino values might be on the disk?
    As ino values are small integers, we should expect plenty of other, unrelated, small 32-bit values. I think it's a lot.

In bluefs_transaction_t, right before the uuid is len, and right after it is seq. Both len and seq are expected to be small integers.
Ino + uuid is a highly unreliable pair for a unique marker.

Proposal
Let's use hash(ino + uuid) as the marker.
If our hash function is good, it will be basically random. There will be no risk of contamination from previous deployments, and no negative side effect from ino being a small value.
We can tune the probability of false positives simply by taking a longer hash.
For a 32-bit hash, we expect 1 pattern per 4 GB of disk.
For a 64-bit hash, there should be none, except those written specifically to the WALv2.
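The proposal can be sketched as below, with std::hash standing in for whatever hash Ceph would actually use; the function name and the uuid's string form are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hash uuid and ino together so the on-disk marker is effectively random per
// (deployment, file): no contamination from earlier deployments, and no
// weakness from ino being a small integer.
inline uint64_t wal_marker(const std::string& osd_uuid, uint64_t ino) {
  return std::hash<std::string>{}(osd_uuid + ':' + std::to_string(ino));
}
```

Distinct files of one OSD, and same-ino files of different OSDs, then get distinct markers with overwhelming probability.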

Contributor Author

@ifed01 I've implemented a version with a hash-generated marker based on Adam's comments. Let me know what you think :)

src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved

uint64_t increase = flush_length+(sizeof(File::WALFlush::WALLength)+sizeof(File::WALFlush::WALMarker));
dout(20) << fmt::format("{} adding flush {:#x}~{:#x}", __func__, flush_offset, flush_length) << dendl;
file->wal_flushes.push_back({flush_offset, flush_length});
Contributor

flush_offset is the offset of the envelope header, and flush_length is the actual envelope payload length here, right?
IMO that's a bit confusing; you'd better preserve the payload offset here as well. I presume one doesn't need the envelope header beyond this point...

Contributor Author

flush_offset and flush_length are the minimum amount of information needed to derive all the other information. I think I can reuse get_payload_offset for the comment below.


}

flush_offset += sizeof(File::WALFlush::WALLength);
Contributor

You wouldn't need this increment if wal_flushes kept the actual payload offset.

Contributor Author

I'll reuse get_payload_offset; I might have used += here so that the flow was easier to understand.

src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Show resolved Hide resolved
The plan is simple.
1. Find all wal v2 files.
2. Get their total size.
3. Copy data from the v1 copy of that file.
4. If all went well, unlink the previous files and rename the new ones accordingly.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
// 4. truncate -> fnode.size
// 5. unlink
ceph_assert(h->file->fnode.size == offset || offset == 0);
h->file->fnode.wal_limit = offset;
Contributor

@ifed01 ifed01 May 23, 2024

I have a feeling that wal_limit is an "alias" of f->fnode.get_allocated(), so it should not shrink without the pextent list being shrunk. At best it looks like we don't need to persist wal_limit at all; a run-time instance should be sufficient: we recalculate it from get_allocated() on space allocation or bluefs log replay.

Contributor

@pereman2 - what do you think? Do we really need to shrink wal_limit here?

Contributor Author

I think you may be right.

Contributor Author

Well, not really. wal_limit signals the range of data that might exist. When we truncate the file we don't remove extents, but we do decrease the upper limit represented by wal_limit.

@@ -3775,7 +4155,7 @@ int BlueFS::fsync(FileWriter *h)/*_WF_WD_WLD_WLNF_WNF*/
}
}
}
if (old_dirty_seq) {
if (old_dirty_seq && !h->file->is_new_wal()) { // don't force flush on WAL
Contributor

Maybe we don't need fsync-ing after regular appends to the WAL, but what about the cases where we've [pre]allocated some space for it? That should get into the log and be fsynced.

Contributor Author

done with last commit, please take a look

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
A plain ino as a marker is just bad business, with a high probability of
collisions. With this commit we hash the uuid and ino together to form a
marker good enough to withstand false positives.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
New encoder and decoder functions for the WAL flush header. By design it has
no bounds checks or version check, because we expect to decode bad data when
reading past the end of file.

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
@pereman2
Contributor Author

@ifed01 With the latest commit 493dc63 I've added an ad-hoc encoder/decoder for the header. This comes at the price of not really being able to check compatibility and struct_len, because we expect to read the WAL out of bounds on recovery; we will therefore see bad values while decoding, which strict checks would, by design, turn into failures.

@ifed01
Contributor

ifed01 commented May 30, 2024

@ifed01 With the latest commit 493dc63 I've added an ad-hoc encoder/decoder for the header. This comes at the price of not really being able to check compatibility and struct_len, because we expect to read the WAL out of bounds on recovery; we will therefore see bad values while decoding, which strict checks would, by design, turn into failures.

IMO using try..catch eliminates this concern, doesn't it?

src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved

bl.hexdump(*_dout);
*_dout << dendl;
auto buffer_iterator = bl.cbegin();
decode(header, buffer_iterator);
Contributor

You can wrap decode() call with try..catch and break reading if decode fails. We've been successfully using the same technique in a couple other locations in BlueFS....

src/os/bluestore/BlueFS.h Show resolved Hide resolved
src/os/bluestore/BlueFS.h Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved
src/os/bluestore/BlueFS.cc Outdated Show resolved Hide resolved

Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>