os/bluestore: Blazingly fast new BlueFS WAL disk format 🚀 #56927
Conversation
src/os/bluestore/BlueFS.cc
@@ -1518,6 +1518,13 @@ int BlueFS::_replay(bool noop, bool to_stdout)
      vselector->get_hint_by_dir(dirname);
    vselector->add_usage(file->vselector_hint, file->fnode);

    if (boost::algorithm::ends_with(filename, ".log")) {
ends_with() is now part of standard C++ (since C++20, which we are using).
src/os/bluestore/BlueFS.cc
Outdated
bufferlist t;
t.substr_of(buf.bl, flush_offset - buf.bl_off, sizeof(File::WALFlush::WALLength));
dout(30) << "length dump\n";
t.hexdump(*_dout);
are we OK with log lines like this one, without the preamble?
  }
}

int64_t BlueFS::_read_wal(
a style/pref comment: this is a pretty long function. Can it be broken down into logical, named, sub-functions?
I might abstract some parts that are reused in different places, but IMHO I prefer reading top to bottom what this function does instead of jumping to definitions of sub-functions.
@pereman2 I was very excited about your PR, and did some librbd fio testing on mako over the weekend. These are relatively fast NVMe drives, so I'm not sure how limited they are by fsync (which should be a no-op for the drive, but still requires the syscall). I suspect we may see better numbers as other bottlenecks are eliminated. It would be very interesting, however, to see how this performs on consumer-grade flash and HDDs. Here are the results:

[chart: Single NVMe OSD 4K Random Write (1X)]
[chart: 30 NVMe OSD 4K Random Write (3X)]
Oh cool! I wonder what the difference between the NVMes we used is. If you are up for it, I will update the code to remove some obvious inefficiencies that might have an effect on CPU, and you could run it again to see if that fixed something. Nevertheless, I will attach my results again after that fix.
@markhpc I'm curious. How many times did you run the 4k randwrite benchmark? Did you pre-fill the cluster to simulate some real data, or was it run on a real cluster?

    iterations = 8
    bssplit = "--bssplit=4k/16:8k/10:12k/9:16k/8:20k/7:24k/7:28k/6:32k/6:36k/5:40k/5:44k/4:48k/4:52k/4:56k/3:60k/3:64k/3"
    for run in range(iterations):
        for t in ['randrw', 'randwrite', 'randread', 'rw']:
            # do fio test with "t"
@markhpc I got new results! Looks like randwrites are not getting any better somehow, but randrw does get better:
Yep, these are prefilled rbd volumes. Only ran the set of tests for one iteration this time: 4k and 4m randreads and randwrites for 5 minutes. It's pretty easy to run repeated tests, though, if we want.
Was the NVMe drive pre-filled? That's actually going to matter more than the rbd volumes.
Yes!
Naw, this was a quick test. I was curious what kind of syscall-overhead reduction we might see here versus other previous tests where I can see some overhead for 4k random writes. We could certainly do tests at larger fill values, though.
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
The real offset of the data of a WAL file is offset by extra flush envelope data. Currently two uint64_t values are added to each flush, so we keep track of the number of flushes in a BlueStore::File with wal_flush_count, and the offset can be translated simply by doing: `offset += 2 * sizeof(uint64_t) * wal_flush_count` Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
  size_t len,          ///< [in] this many bytes
  bufferlist *outbl,   ///< [out] optional: reference the result here
  char *out)           ///< [out] optional: or copy it here
{
I think we should consider making a simplified logic for _wal_read, since it is only accessed in sequential mode:
2024-05-10T12:11:30.922+0000 7fe5bea18b40 10 bluefs open_for_read db.wal/000271.log (sequential)
The simplification would be that regular _read would be used.
We would keep:
- wal offset (as rocksdb sees)
- bluefs offset of wal file
- data read from wal file
- location of next envelope within data already read
And on call to _wal_read() we will prefetch from bluefs file to data buffer if needed,
crop beginning of buffer to output *outbl, and update above pointers.
I fail to understand the simplification.
- wal offset is there, named wal_data_logical_offset
- bluefs offset is flush_offset. I agree this could be another name
- data read from wal file is kept under FileReader; since it's sequential, _read already does the heavy lifting of prefetching
- We use bufferlists to gather the output, so the crop we do is barely expensive. I know this is some extra work that could be optimized away.
_read_wal is merely a wrapper of _read
use 2-2 version for wal_v2 and 1-1 for normal nodes Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Instead of using 0 as EOF, we signal the file is dirty but don't force a flush. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Introducing two new fields: * wal_size -> real WAL data size * wal_limit -> upper limit to read and recover flush extents. wal_limit is used to read past fnode.size, up to wal_limit, trying to find flush extents that might be missing. On truncate we decrease the limit to the offset. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Decoding depends on the version: if it's version >= 2 we may decode the WAL fields. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
The new fnode doesn't really work as advertised because there are no assertions when trying to read with a wrong version via the `DENC_START...` API. Thankfully (or not), bluefs_transaction does have the good API, `DECODE_START`. This way we ensure that, when downgrading, previous versions don't try to read these new WAL files. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
    __func__, file->fnode.ino, flush_offset, increment, file->fnode.wal_limit)
    << dendl;
  ceph_assert(file->is_wal_read_loaded);
  ceph_assert((flush_offset == 0 && file->wal_flushes.empty()) || (flush_offset == file->wal_flushes.back().end_offset()));
I can see just a single location where _wal_update_size is called from for now, and there is an explicit reset of file->wal_flushes before that call, so the assertion looks redundant.
The same applies to is_wal_read_loaded assertion above and flush_offset parameter.
I moved the assert on file->is_wal_read_loaded to read_wal, as it makes more sense there now.
The second one is redundant too, but it documents how this method is supposed to be called?
But why not move both wal_flush.clear() and the is_wal_read_loaded assignment into the _wal_update_size() function, and get rid of the flush_offset parameter too?
Do you really expect more usage of this function?
src/os/bluestore/BlueFS.cc
      break;
    }
    File::WALFlush::WALLength flush_length_le(0);
    flush_length_le = *((File::WALFlush::WALLength*)bl.c_str());
I would definitely love to see our regular encode/decode mechanics to be applied to flush_length here instead
More generally, wouldn't it be better to use a more general approach and make the flush envelope the way we do for other persistent data structures? I.e. it should include:
- header (struct_v + others)
- bufferlist content
- postfix member(s)
Done, with the exception of the first encode of flush_length.
src/os/bluestore/BlueFS.cc
    }
    marker_le = *((File::WALFlush::WALMarker*)bl.c_str());
    uint64_t marker = marker_le;
    if (marker != file->fnode.ino) {
IMO this is a pretty weak condition to determine whether the next envelope is present. E.g. repetitive OSD deployments could relatively easily hit legacy envelopes and erroneously reuse them.
The idea of having a more advanced header comes to my mind once again here. It can also help to get rid of the marker part. Here is an overview:
The header would be our regular encode/decode structure. Apart from the flush length, it should include the OSD uuid, the file ino, and a flush seq number. Inability to decode this header, as well as unexpected uuid/ino/seqno values, results in scan termination.
This will cause some additional write amplification, but I actually expect the WAL flush envelope to be hundreds/thousands of bytes long most of the time, so that's not a very big deal.
WALv2 relies on data it did not write to extract information. With any method like this, we need a sufficiently low probability of false-positive detection. A false positive here means that we:
- read data we did not write
- decided that the data is legit
I will try to estimate how reliable "ino + uuid" is as a marker check.
- UUID: How many uuid sequences are on disk? Let's assume the uuid is fully random, meaning the probability of random bits forming the uuid is 0.5^(8*8). With that, we would not expect even one uuid to appear on the device. But the uuid is written once per bluefs_transaction_t, so we will have many uuids from this source. Let's assume the bluefs log and rocksdb sst files are the only contenders for disk space. In a system running a long time, all of the disk has been used. The balance between (conf.bytes_written_wal + conf.bytes_written_sst) and (conf.logged_bytes) determines the number of uuids on the device. So it will be many, like many thousands. (numbers needed!)
- INO: How many ino values might be on the disk? As inos are small integer values, we should expect plenty of other, unrelated, small 32-bit values. I think it's a lot. In bluefs_transaction_t, right before the uuid is len, and right after it is seq. Both len and seq we expect to be small integers.
Ino + uuid are a highly unreliable pair for a unique marker.
Proposal
Let's use hash(ino + uuid) as the marker.
If our hash function is good, it will be basically random. There will be no risk of contamination from previous deployments, and we will have no negative side effect from ino being a small value.
We can tune the probability of false positives simply by taking a longer hash function.
For a 32-bit hash, we expect 1 pattern per 4 GB of disk.
For a 64-bit hash, there should be none, except those written specifically by WALv2.
@ifed01 I've implemented a version with a hash-generated marker based on Adam's comments. Let me know what you think :)
src/os/bluestore/BlueFS.cc
    uint64_t increase = flush_length + (sizeof(File::WALFlush::WALLength) + sizeof(File::WALFlush::WALMarker));
    dout(20) << fmt::format("{} adding flush {:#x}~{:#x}", __func__, flush_offset, flush_length) << dendl;
    file->wal_flushes.push_back({flush_offset, flush_length});
flush_offset is the offset of the envelope header, and flush_length is the actual envelope payload length here, right?
IMO that's a bit confusing; you'd better preserve the payload offset here as well. I presume one doesn't need the envelope header beyond this point...
flush_offset and flush_length are the minimum amount of information needed to extrapolate all the other information. I think I can reuse get_payload_offset for the comment below.
src/os/bluestore/BlueFS.cc
    }

    flush_offset += sizeof(File::WALFlush::WALLength);
You wouldn't need this increment if wal_flushes kept the actual payload offset.
I'll reuse get_payload_offset. I might have used += here so that it was easier to understand the flow.
The plan is simple. 1. Find all WAL v2 files. 2. Get their total size. 3. Copy data from the v1 copy of that file. 4. If all went well, unlink the previous files and rename the new files accordingly. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
  // 4. truncate -> fnode.size
  // 5. unlink
  ceph_assert(h->file->fnode.size == offset || offset == 0);
  h->file->fnode.wal_limit = offset;
I have a feeling that wal_limit is an "alias" of f->fnode.get_allocated(), so it should not shrink without the pextent list being shrunk. At best it looks like we don't need to persist wal_limit at all; a run-time instance should be sufficient: we recalculate it from get_allocated() on space allocation or bluefs log replay.
@pereman2 - what do you think? Do we really need to shrink wal_limit here?
I think you may be right.
Well, not really. wal_limit signals the range of data that might exist. When we truncate the file we don't remove extents, but we do decrease the upper limit represented by wal_limit.
src/os/bluestore/BlueFS.cc
@@ -3775,7 +4155,7 @@ int BlueFS::fsync(FileWriter *h)/*_WF_WD_WLD_WLNF_WNF*/
      }
    }
  }
-  if (old_dirty_seq) {
+  if (old_dirty_seq && !h->file->is_new_wal()) { // don't force flush on WAL
Maybe we don't need fsync-ing after regular appends to the WAL, but what about the cases when we've [pre]allocated some space for it? That should get into the log and be fsynced.
done with last commit, please take a look
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
A plain ino as a marker is just bad business, with a high probability of collisions. With this commit we hash the uuid and ino together to form a good enough marker capable of withstanding false positives. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
New encoder and decoder functions for a WAL flush header. By design they have no bounds checks or version check, because we expect wrong data to be decoded when reading past end of file. Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
@ifed01 With the latest commit 493dc63 I've added an ad-hoc encoder/decoder for the header. This comes at the price of not being able to really check compatibility and struct_len, because we expect to read the WAL out of bounds on recovery; therefore we will see bad values when decoding, which, by design, will for sure cause issues.
IMO using try..catch eliminates this concern, doesn't it?
  bl.hexdump(*_dout);
  *_dout << dendl;
  auto buffer_iterator = bl.cbegin();
  decode(header, buffer_iterator);
You can wrap decode() call with try..catch and break reading if decode fails. We've been successfully using the same technique in a couple other locations in BlueFS....
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
Problem
BlueFS writes are expensive due to every BlueFS::fsync invoking disk->flush twice: once for the file data and another for the BlueFS log metadata. We can avoid this duality by merging metadata and data in the same envelope. This PR delivers on that front.
New format
Previously, a bluefs log transaction would hold a file_update_inc that includes the increase in file size; that way we know the length of data the file holds in its own extents. Therefore, every write would perform a fnode->size += delta and consequently mark the fnode as dirty. This new format is basically an envelope that holds both data and delta metadata plus some error detection stuff:
- Flush length (u64) -> the length of the data in the envelope
- Payload (flush length) -> data of the WAL write asked for (size is flush length)
- Marker (u64) -> id of the file, used for error detection (this is in talks to change to a crc or something else)
With this new format, for every fsync we do, we create this envelope and flush it without marking the file as dirty, therefore not generating the log disk flush. This conferred huge benefits in performance that we will look at next.
EOF tricks
A "huge" problem is: how do we know we cannot read more data from the file? Either we reach the end of the allocated extents or... in this case we append some 0s after the envelope, so the next flush_length reads as 0 until a later flush overwrites it, and therefore we can tell that the next flush has not yet been written to disk. This basically works like a null-terminated string.
Preliminary results:
I ran multiple fio jobs covering different workloads: randrw, random writes, random reads, etc.
Counting flushes with a vector of flush extents, versus a simple counter, showed a significant performance degradation, so it might be worth using the vector only during replay and not storing flush extents during the run: