New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/bluestore/BlueFS: only flush device which writed by file when do f… #14118
Conversation
…sync(FileWriter). For bluefs, we can set three differnet device for db,db.wal,db.slow.Now every fsync(FileWrite) will flush all existing device. Although, BlockDevice::flush() check whether need to do flush. But there is a chance to flush other device. Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
The KernelDevice already makes flush() a no-op if no io has completed since the last call (see the very top of flush()), so I don't think this is needed. (NVMEDevice coudl do the same thing; not sure if it does now.) |
@liewegas . For this case: wal and db have different device. This happens as long as there is such a concurrency。Is it ok? |
You mean kv_sync_thread doing a rocksdb sync commit vs another thread in rocksdb/bluefs also doing writes? This should be safe (although please review the code to make sure my logic is sound!). The key points are in the io completion thread, _aio_thread() at KernelDevice::274 https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L274, and in flush() at https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L185. If you have two racing threads that call flush(), only one will switch the atomic_bool from true->false but since the winner takes the mutex, the race loser will also block until the flush completes. Does that make sense? Also, one other oddity here that's worth noting: BLueStore and BlueFS have different instances of the BlockDevice class, which means they have different instance of that bool. That means that if bluefs does a write via its instance and then BlueStore calls flush() on its instance, the BlueStore flush() won't block because it thinks now IOs have been done. This shouldn't affect the single-device optimization (we know that submit_transaction_sync will do a write and a flush on the device), but please double-check my reasoning! :) |
@liewegas . For the flush() I didn't doubt it. The different between bluestore and bluefs is bluestore has one device but bluefs can have three device at most. For fsync(FileWrite), why flush all blockdeices? my PR can't omit flush which cause io unsafe. This PR don't make submit_transaction_sync wait db-device to flush. |
flush() already returns immediately if no writes have been done, though, without doing an actual flush. There's no need to track this in bluefs since KernelDevice itself is doing it. See https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L274 and https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L185 |
From the point of bluefs, it don't care who call flush_bdev(thread of submit-transaction or flush thread or compaction thread). Those threads make data safe. But from the thread of submit-transaction, if it flush other device, it will cause latency become longer. Of course, other threads can flush wal-device which make latency of submit-transaction become shorter. My main point is to make submit-transaction more stable which effect by other device especially by slow-device. |
Do you mean a scenario like:
? In that case I don't think we can make submit_transaction not sync the other device because we don't know that some bluefs metadata referencing the other device isn't in the bluefs journal. Not currently, at least. (Maybe we could make bluefs not queue any journal events for new files until they are completely synced to disk.. that would need to be fixed first tho.) |
If we call "submit_transaction_sync" and compaction threads are triggered at the same time, we don't need wait for db/slow device to be flushed into block device. That will lead to unnecessary high latency at some points. Right, we can't guarantee bluefs journal to be flushed into db device. How about to write bluefs journal and data into same device? db device->db data/db journal
wal device->wal data/wal journal
slow device-> slow data/journal |
…sync(FileWriter).
For bluefs, we can set three differnet device for
db,db.wal,db.slow.Now every fsync(FileWrite) will flush all existing device. Although,
BlockDevice::flush() check whether need to do flush. But there is a
chance to flush other device.
Signed-off-by: Jianpeng Ma jianpeng.ma@intel.com