Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/bluestore/BlueFS: only flush device which writed by file when do f… #14118

Closed
wants to merge 1 commit into from

Conversation

majianpeng
Copy link
Member

…sync(FileWriter).

For bluefs, we can set three differnet device for
db,db.wal,db.slow.Now every fsync(FileWrite) will flush all existing device. Although,
BlockDevice::flush() check whether need to do flush. But there is a
chance to flush other device.

Signed-off-by: Jianpeng Ma jianpeng.ma@intel.com

…sync(FileWriter).

For bluefs, we can set three differnet device for
db,db.wal,db.slow.Now every fsync(FileWrite) will flush all existing device. Although,
BlockDevice::flush() check whether need to do flush. But there is a
chance to flush other device.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
@liewegas
Copy link
Member

The KernelDevice already makes flush() a no-op if no io has completed since the last call (see the very top of flush()), so I don't think this is needed. (NVMEDevice coudl do the same thing; not sure if it does now.)

@liewegas liewegas closed this Mar 24, 2017
@majianpeng
Copy link
Member Author

@liewegas . For this case: wal and db have different device.
submit_transaction_sync flush thread or compact
write to wal-device write to db-device
fsync() fsync()
flush(wal-device & db-device) flush(wal-device & db-device)

This happens as long as there is such a concurrency。Is it ok?

@liewegas
Copy link
Member

You mean kv_sync_thread doing a rocksdb sync commit vs another thread in rocksdb/bluefs also doing writes? This should be safe (although please review the code to make sure my logic is sound!). The key points are in the io completion thread, _aio_thread() at KernelDevice::274 https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L274, and in flush() at https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L185. If you have two racing threads that call flush(), only one will switch the atomic_bool from true->false but since the winner takes the mutex, the race loser will also block until the flush completes. Does that make sense?

Also, one other oddity here that's worth noting: BLueStore and BlueFS have different instances of the BlockDevice class, which means they have different instance of that bool. That means that if bluefs does a write via its instance and then BlueStore calls flush() on its instance, the BlueStore flush() won't block because it thinks now IOs have been done. This shouldn't affect the single-device optimization (we know that submit_transaction_sync will do a write and a flush on the device), but please double-check my reasoning! :)

@majianpeng
Copy link
Member Author

@liewegas . For the flush() I didn't doubt it. The different between bluestore and bluefs is bluestore has one device but bluefs can have three device at most. For fsync(FileWrite), why flush all blockdeices? my PR can't omit flush which cause io unsafe. This PR don't make submit_transaction_sync wait db-device to flush.
Is it ok?

@liewegas
Copy link
Member

flush() already returns immediately if no writes have been done, though, without doing an actual flush. There's no need to track this in bluefs since KernelDevice itself is doing it. See https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L274 and https://github.com/ceph/ceph/blob/master/src/os/bluestore/KernelDevice.cc#L185

@majianpeng
Copy link
Member Author

From the point of bluefs, it don't care who call flush_bdev(thread of submit-transaction or flush thread or compaction thread). Those threads make data safe. But from the thread of submit-transaction, if it flush other device, it will cause latency become longer. Of course, other threads can flush wal-device which make latency of submit-transaction become shorter. My main point is to make submit-transaction more stable which effect by other device especially by slow-device.

@liewegas
Copy link
Member

Do you mean a scenario like:

  • rocksdb compaction writes some data (to db or slow device), but hasn't synced/flushed it yet
  • submit_transaction writes to wal, we call into flush_bdev, and we flush both wal device and slow device

? In that case I don't think we can make submit_transaction not sync the other device because we don't know that some bluefs metadata referencing the other device isn't in the bluefs journal. Not currently, at least. (Maybe we could make bluefs not queue any journal events for new files until they are completely synced to disk.. that would need to be fixed first tho.)

@tanghaodong25
Copy link
Contributor

If we call "submit_transaction_sync" and compaction threads are triggered at the same time, we don't need wait for db/slow device to be flushed into block device. That will lead to unnecessary high latency at some points.

Right, we can't guarantee bluefs journal to be flushed into db device. How about to write bluefs journal and data into same device?

db device->db data/db journal
wal device->wal data/wal journal
slow device-> slow data/journal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants