
os/bluestore: introduce new io_uring IO engine #27392

Open: rouming wants to merge 2 commits into base: master from rouming:bluestore-iouring

Conversation

@rouming (Contributor) commented Apr 5, 2019

This implements a low-level IO engine which utilizes the brand-new
io_uring IO interface: https://lwn.net/Articles/776428/

The following basic config was used for fio_ceph_objectstore:

  rw=randwrite
  iodepth=16
  nr_files=1
  numjobs=1
  size=256m

  bluestore_min_alloc_size = 4096
  bluestore_max_blob_size  = 65536

  bluestore_block_path     = /dev/ram0
  bluestore_block_db_path  = /dev/ram1
  bluestore_block_wal_path = /dev/ram2

bluestore_ioring=false

   4k  IOPS=25.5k, BW=99.8MiB/s, Lat=0.374ms
   8k  IOPS=21.5k, BW=168MiB/s,  Lat=0.441ms
  16k  IOPS=17.2k, BW=268MiB/s,  Lat=0.544ms
  32k  IOPS=12.3k, BW=383MiB/s,  Lat=0.753ms
  64k  IOPS=8358,  BW=522MiB/s,  Lat=1.083ms
 128k  IOPS=4724,  BW=591MiB/s,  Lat=1.906ms

bluestore_ioring=true

   4k  IOPS=29.2k, BW=114MiB/s,  Lat=0.331ms
   8k  IOPS=30.7k, BW=240MiB/s,  Lat=0.319ms
  16k  IOPS=27.4k, BW=428MiB/s,  Lat=0.368ms
  32k  IOPS=22.7k, BW=709MiB/s,  Lat=0.475ms
  64k  IOPS=15.6k, BW=978MiB/s,  Lat=0.754ms
 128k  IOPS=9572,  BW=1197MiB/s, Lat=1.223ms

Overall IOPS increase is the following:

   4k  +14%
   8k  +42%
  16k  +59%
  32k  +89%
  64k  +85%
 128k  +102%

By default libaio is used. If bluestore_ioring=true is set but the kernel
does not support io_uring or the architecture is not x86-64, libaio will be
used instead.
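
(For illustration only: a minimal sketch of the fallback idea, using plain liburing calls; the helper name is made up and this is not the PR's actual detection code. The probe creates a tiny ring and tears it down again; any error, e.g. -ENOSYS on an older kernel, means staying on libaio.)

  #include <liburing.h>

  // Hypothetical probe: if the running kernel rejects io_uring setup,
  // the caller keeps using the libaio backend.
  static bool kernel_supports_io_uring()
  {
    struct io_uring ring;
    int r = io_uring_queue_init(16, &ring, 0);
    if (r < 0)
      return false;            // e.g. -ENOSYS without io_uring support
    io_uring_queue_exit(&ring);
    return true;
  }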

Cc: @ifed01

@batrick (Member) commented Apr 5, 2019

needs rebase

rouming force-pushed the rouming:bluestore-iouring branch from 323016c to d761727 on Apr 8, 2019
tchaikov added the performance label on Apr 9, 2019
rouming force-pushed the rouming:bluestore-iouring branch 3 times, most recently from e058747 to 4c89f13 on Apr 9, 2019

@markhpc (Member) commented Apr 11, 2019

Super excited about this! Any chance you can test with the default min_alloc and blob sizes on a typical device?

@rouming (Contributor, Author) commented Apr 11, 2019

Super excited about this! Any chance you can test with the default min_alloc and blob sizes on a typical device?

@markhpc Yes, sure, I will return to this topic. It would be great if you could provide the desired options, because it is not always clear to me how to set up bluestore in order to squeeze everything out of the performance.

@liewegas (Member) commented Apr 11, 2019

From @JeffMoyer: "I noticed that the pull request supports IO_URING_REGISTER_FILES, but not _BUFFERS. If there's a way to add the latter, that should provide another good speedup."

Have you done any tests on NVMe devices (instead of ramdisks)?

@rouming (Contributor, Author) commented Apr 12, 2019

@liewegas

From @JeffMoyer: "I noticed that the pull request supports IO_URING_REGISTER_FILES, but not _BUFFERS. If there's a way to add the latter, that should provide another good speedup."

The low-level IO engine does not control the incoming buffers, so I could not find a way to register buffers in advance.
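
(A rough sketch of the distinction, assuming plain liburing calls and a made-up helper name, not the PR's code: file registration is a one-time call on a known set of descriptors, whereas io_uring_register_buffers() would need fixed, preallocated buffers that the layers above this engine do not hand down.)

  #include <liburing.h>

  // Register the block device fds once; later SQEs can then refer to them
  // by index via IOSQE_FIXED_FILE.  Buffer registration would additionally
  // require a stable iovec array of preallocated buffers.
  static int register_device_fds(struct io_uring *ring, int *fds, unsigned nr)
  {
    return io_uring_register_files(ring, fds, nr);
  }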

Have you done any tests on NVMe devices (instead of ramdisks)?

Not yet, but I plan to do that. I decided to publish the RFC ASAP in order to get some early feedback or help with testing.

@markhpc (Member) commented Apr 18, 2019

@rouming I think to start out just default options on NVMe would be fine. Mostly we want to see how the behavior changes. This is a big enough change that it may impact what things we need to tune in the future.

@markhpc (Member) commented May 6, 2019

retest this please

@blacktear23 commented May 7, 2019

@rouming after doing some testing in our test environment, we found 2 problems:

  1. If the cluster is in recovery mode, the OSD may crash because iocb.aio_lio_opcode is set to IO_CMD_PREAD (0).
  2. If we start a rados benchmark, some OSDs may fail to respond to heartbeats to the monitor, so the OSD is marked down.

@rouming (Contributor, Author) commented May 7, 2019

@blacktear23

  1. If the cluster is in recovery mode, the OSD may crash because iocb.aio_lio_opcode is set to IO_CMD_PREAD (0).

You mean the sqe->opcode = IORING_OP_READV write access? Do you have an explicit backtrace of the crash?

  2. If we start a rados benchmark, some OSDs may fail to respond to heartbeats to the monitor, so the OSD is marked down.

Hm, the messenger loop is separate from the IO loop. Do you experience an IO hang? Is it possible to get backtraces from all threads of the OSD which does not respond?

@markhpc (Member) commented May 7, 2019

Did some testing of this on Kernel 5.1. First I hit what looked like it may have been a deadlock. The second attempt got farther, but as soon as reads started it hit the following assert. I'm debugging now.

2019-05-07 14:28:54.719 7efdd7db5700 -1 /home/ubuntu/src/markhpc/ceph/src/os/bluestore/io_uring.cc: In function 'void init_io(ioring_data*, unsigned int, aio_t*)' thread 7efddddc1700 time 2019-05-07 14:28:54.716314
/home/ubuntu/src/markhpc/ceph/src/os/bluestore/io_uring.cc: 201: FAILED ceph_assert(0)

 ceph version 10.0.4-49321-g1e9907b (1e9907bcfff6a9ba011a299354e061d5dbdcb7a5) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55951d2416dc]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55951d2418aa]
 3: (()+0xb681cc) [0x55951d9531cc]
 4: (KernelDevice::aio_submit(IOContext*)+0x16b) [0x55951d94a67b]
 5: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x14bd) [0x55951d7f99bd]
 6: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x342) [0x55951d7fc7e2]
 7: (ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long, unsigned long, unsigned int, ceph::buffer::v14_2_0::list*)+0x9d) [0x55951d68624d]
 8: (PrimaryLogPG::do_sparse_read(PrimaryLogPG::OpContext*, OSDOp&)+0x444) [0x55951d4c5624]
 9: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x68c1) [0x55951d4d6801]
 10: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x97) [0x55951d4e4257]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x2fd) [0x55951d4e4a1d]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x391a) [0x55951d4e998a]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc15) [0x55951d4eb475]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ac) [0x55951d381a0c]
 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55951d5c5262]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9fb) [0x55951d39e93b]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55951d9af643]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55951d9b26f0]
 19: (()+0x7e25) [0x7efdfd771e25]
 20: (clone()+0x6d) [0x7efdfc85334d]

@rouming (Contributor, Author) commented May 7, 2019

Did some testing of this on Kernel 5.1. First I hit what looked like it may have been a deadlock. The second attempt got farther, but as soon as reads started it hit the following assert. I'm debugging now.

It seems that io->iocb.aio_lio_opcode is equal to IO_CMD_PREAD, not the vector variant. At least I do not see a preadv() variant in ceph_aio.h, only pread(). In my tests I did only random writes, so it may be that I missed that.

The other thing which worries me is that you are not allowed to call ioring_getevents() or ioring_queue() from different threads. I could not prove to myself that requests are submitted from different threads (actually they are, but locks are taken by the caller, at least my shallow debugging showed that). So this question is still open: is the caller of aio (and of io_uring as well) responsible for all locking?

@markhpc (Member) commented May 7, 2019

@rouming yep, it's the same as @blacktear23 saw, IO_CMD_PREAD is not currently handled in init_io so we assert. I added a switch statement for the other stuff. I'll see if I can poke at KernelDevice.cc some more and make progress.

@blacktear23 commented May 8, 2019

@rouming for the first issue, below is the backtrace from a debug build, which should give you more information on what happened.

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00005589efec2d47 in reraise_fatal (signum=6) at /data/ceph/src/global/signal_handler.cc:81
#2  0x00005589efec3e14 in handle_fatal_signal (signum=6) at /data/ceph/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f982011d801 in __GI_abort () at abort.c:79
#6  0x00005589eff2db43 in ceph::__ceph_assert_fail (assertion=0x5589f0b921ba "0",
    file=0x5589f0b921c0 "/data/ceph/src/os/bluestore/io_uring.cc", line=201,
    func=0x5589f0b92260 <init_io(ioring_data*, unsigned int, aio_t*)::__PRETTY_FUNCTION__> "void init_io(ioring_data*, unsigned int, aio_t*)") at /data/ceph/src/common/assert.cc:73
#7  0x00005589eff2dbf6 in ceph::__ceph_assert_fail (ctx=...) at /data/ceph/src/common/assert.cc:78
#8  0x00005589efea85e7 in init_io (d=0x5589fc15a908, index=12, io=0x5589fc1841d0)
    at /data/ceph/src/os/bluestore/io_uring.cc:201
#9  0x00005589efea87ba in ioring_queue (d=0x5589fc15a908, priv=0x7f9805340a80, beg=
        {iocb = {data = 0x0, key = 0, __pad2 = 0, aio_lio_opcode = 0, aio_reqprio = 0, aio_fildes = 30, u = {c = {buf = 0x5589fe978000, nbytes = 524288, offset = 8611233792, __pad3 = 0, flags = 0, resfd = 0}, v = {vec = 0x5589fe978000, nr = 524288, offset = 8611233792}, poll = {events = -23625728, __pad1 = 21897}, saddr = {addr = 0x5589fe978000, len = 524288}}}, priv = 0x7f9805340a80, fd = 30, iov = {<boost::container::small_vector_base<iovec, boost::container::new_allocator<iovec> >> = {<boost::container::vector<iovec, boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >, void>> = {m_holder = {<boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >> = {<boost::container::new_allocator<iovec>> = {<No data fields>}, <No data fields>}, m_start = 0x5589fc184238, m_size = 0, m_capacity = 4}}, m_storage_start = {aligner = {data = "\001\000\000\000\377\001\001\027\000\000\000\342\000\000", <incomplete sequence \343>}, data = "\001\000\000\000\377\001\001\027\000\000\000\342\000\000", <incomplete sequence \343>}}, <boost::container::small_vector_storage<boost::move_detail::aligned_struct_wrapper<16, 8>, 3>> = {m_rest_of_storage = {{aligner = {data = "\000\000\000\001\000\000\000\001\001\005\000\000\000\005\000"}, data = "\000\000\000\001\000\000\000\001\001\005\000\000\000\005\000"}, {aligner = {data = "\000\377\000\000\000\000\001\021P\000\000\000\000\000\000"}, data = "\000\377\000\000\000\000\001\021P\000\000\000\000\000\000"}, {aligner = {data = "6\025._epoch\004\346\000\000\000\001\020"}, data = "6\025._epoch\004\346\000\000\000\001\020"}}}, static needed_extra_storages = <optimized out>, static needed_bytes = <optimized out>, static header_bytes = <optimized out>, static s_start = <optimized out>, static static_capacity = <optimized out>}, offset = 8611233792, length = 524288, rval = -1000, bl = {_buffers = {_root = {next = 0x5589fe88e2a0}, _tail = 0x5589fe88e2a0, _size = 1}, _carriage = 0x5589f9a835b0 <ceph::buffer::v14_2_0::list::always_empty_bptr>, _len = 524288, _memcopy_count = 0, last_p = {<ceph::buffer::v14_2_0::list::iterator_impl<false>> = {bl = 0x5589fc184290, ls = 0x5589fc184290, p = {cur = 0x5589fc184290}, off = 0, p_off = 0}, <No data fields>}, static always_empty_bptr = {_raw = 0x0, _off = 0, _len = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, queue_item = {<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>> = {<boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <boost::intrusive::hook_tags_definer<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>, 0>> = {<No data fields>}, <No data fields>}, <No data fields>}}, end=
        {iocb = {data = 0x8, key = 0, __pad2 = 8, aio_lio_opcode = 2817, aio_reqprio = 1332, aio_fildes = 32664, u = {c = {buf = 0xb2a816f7ee63a400, nbytes = 94050922021624, offset = 94050922023750, __pad3 = 94051144048640, flags = 32, resfd = 0}, v = {vec = 0xb2a816f7ee63a400, nr = -271826184, offset = 94050922023750}, poll = {events = -295459840, __pad1 = -1297606921}, saddr = {addr = 0xb2a816f7ee63a400, len = -271826184}}}, priv = 0x7f9805340d00, fd = -271822627, iov = {<boost::container::small_vector_base<iovec, boost::container::new_allocator<iovec> >> = {<boost::container::vector<iovec, boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >, void>> = {m_holder = {<boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >> = {<boost::container::new_allocator<iovec>> = {<No data fields>}, <No data fields>}, m_start = 0x20, m_size = 0, m_capacity = 140290899054240}}, m_storage_start = {aligner = {data = "\000\000@", '\000' <repeats 12 times>}, data = "\000\000@", '\000' <repeats 12 times>}}, <boost::container::small_vector_storage<boost::move_detail::aligned_struct_wrapper<16, 8>, 3>> = {m_rest_of_storage = {{aligner = {data = "@\017\064\005\230\177\000\000xt\273\375\211U\000"}, data = "@\017\064\005\230\177\000\000xt\273\375\211U\000"}, {aligner = {data = "\000 \b\375\211U\000\000\000`!\376\211U\000"}, data = "\000 \b\375\211U\000\000\000`!\376\211U\000"}, {aligner = {data = "@\017\064\005\230\177\000\000\343l\204\370\251K\000"}, data = "@\017\064\005\230\177\000\000\343l\204\370\251K\000"}}}, static needed_extra_storages = <optimized out>, static needed_bytes = <optimized out>, static header_bytes = <optimized out>, static s_start = <optimized out>, static static_capacity = <optimized out>}, offset = 83193390985925, length = 94051162606592, rval = 3204, bl = {_buffers = {_root = {next = 0x5589fcf42880}, _tail = 0x5589fcf42898, _size = 140290899053696}, _carriage = 0x5589ef474c07 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*)+49>, _len = 4263600128, _memcopy_count =---Type <return> to continue, or q <return> to quit---
 21897, last_p = {<ceph::buffer::v14_2_0::list::iterator_impl<false>> = {bl = 0x5589ef4748d2 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::_Alloc_hider(char*, std::allocator<char>&&)+50>, ls = 0x7f9805340c80, p = {cur = 0x5589fe3cb2f0}, off = 4265390832, p_off = 21897}, <No data fields>}, static always_empty_bptr = {_raw = 0x0, _off = 0, _len = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, queue_item = {<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>> = {<boost::intrusive::list_node<void*>> = {next_ = 0x7f9805340fa0, prev_ = 0x7f9805340cc0}, <boost::intrusive::hook_tags_definer<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>, 0>> = {<No data fields>}, <No data fields>}, <No data fields>}}) at /data/ceph/src/os/bluestore/io_uring.cc:249
#10 0x00005589efea8bd2 in ioring_queue_t::submit_batch (this=0x5589fc15a900, beg=
        {iocb = {data = 0x0, key = 0, __pad2 = 0, aio_lio_opcode = 0, aio_reqprio = 0, aio_fildes = 30, u = {c = {buf = 0x5589fe978000, nbytes = 524288, offset = 8611233792, __pad3 = 0, flags = 0, resfd = 0}, v = {vec = 0x5589fe978000, nr = 524288, offset = 8611233792}, poll = {events = -23625728, __pad1 = 21897}, saddr = {addr = 0x5589fe978000, len = 524288}}}, priv = 0x7f9805340a80, fd = 30, iov = {<boost::container::small_vector_base<iovec, boost::container::new_allocator<iovec> >> = {<boost::container::vector<iovec, boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >, void>> = {m_holder = {<boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >> = {<boost::container::new_allocator<iovec>> = {<No data fields>}, <No data fields>}, m_start = 0x5589fc184238, m_size = 0, m_capacity = 4}}, m_storage_start = {aligner = {data = "\001\000\000\000\377\001\001\027\000\000\000\342\000\000", <incomplete sequence \343>}, data = "\001\000\000\000\377\001\001\027\000\000\000\342\000\000", <incomplete sequence \343>}}, <boost::container::small_vector_storage<boost::move_detail::aligned_struct_wrapper<16, 8>, 3>> = {m_rest_of_storage = {{aligner = {data = "\000\000\000\001\000\000\000\001\001\005\000\000\000\005\000"}, data = "\000\000\000\001\000\000\000\001\001\005\000\000\000\005\000"}, {aligner = {data = "\000\377\000\000\000\000\001\021P\000\000\000\000\000\000"}, data = "\000\377\000\000\000\000\001\021P\000\000\000\000\000\000"}, {aligner = {data = "6\025._epoch\004\346\000\000\000\001\020"}, data = "6\025._epoch\004\346\000\000\000\001\020"}}}, static needed_extra_storages = <optimized out>, static needed_bytes = <optimized out>, static header_bytes = <optimized out>, static s_start = <optimized out>, static static_capacity = <optimized out>}, offset = 8611233792, length = 524288, rval = -1000, bl = {_buffers = {_root = {next = 0x5589fe88e2a0}, _tail = 0x5589fe88e2a0, _size = 1}, _carriage = 0x5589f9a835b0 <ceph::buffer::v14_2_0::list::always_empty_bptr>, _len = 524288, _memcopy_count = 0, last_p = {<ceph::buffer::v14_2_0::list::iterator_impl<false>> = {bl = 0x5589fc184290, ls = 0x5589fc184290, p = {cur = 0x5589fc184290}, off = 0, p_off = 0}, <No data fields>}, static always_empty_bptr = {_raw = 0x0, _off = 0, _len = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, queue_item = {<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>> = {<boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <boost::intrusive::hook_tags_definer<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>, 0>> = {<No data fields>}, <No data fields>}, <No data fields>}}, end=
        {iocb = {data = 0x8, key = 0, __pad2 = 8, aio_lio_opcode = 2817, aio_reqprio = 1332, aio_fildes = 32664, u = {c = {buf = 0xb2a816f7ee63a400, nbytes = 94050922021624, offset = 94050922023750, __pad3 = 94051144048640, flags = 32, resfd = 0}, v = {vec = 0xb2a816f7ee63a400, nr = -271826184, offset = 94050922023750}, poll = {events = -295459840, __pad1 = -1297606921}, saddr = {addr = 0xb2a816f7ee63a400, len = -271826184}}}, priv = 0x7f9805340d00, fd = -271822627, iov = {<boost::container::small_vector_base<iovec, boost::container::new_allocator<iovec> >> = {<boost::container::vector<iovec, boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >, void>> = {m_holder = {<boost::container::small_vector_allocator<boost::container::new_allocator<iovec> >> = {<boost::container::new_allocator<iovec>> = {<No data fields>}, <No data fields>}, m_start = 0x20, m_size = 0, m_capacity = 140290899054240}}, m_storage_start = {aligner = {data = "\000\000@", '\000' <repeats 12 times>}, data = "\000\000@", '\000' <repeats 12 times>}}, <boost::container::small_vector_storage<boost::move_detail::aligned_struct_wrapper<16, 8>, 3>> = {m_rest_of_storage = {{aligner = {data = "@\017\064\005\230\177\000\000xt\273\375\211U\000"}, data = "@\017\064\005\230\177\000\000xt\273\375\211U\000"}, {aligner = {data = "\000 \b\375\211U\000\000\000`!\376\211U\000"}, data = "\000 \b\375\211U\000\000\000`!\376\211U\000"}, {aligner = {data = "@\017\064\005\230\177\000\000\343l\204\370\251K\000"}, data = "@\017\064\005\230\177\000\000\343l\204\370\251K\000"}}}, static needed_extra_storages = <optimized out>, static needed_bytes = <optimized out>, static header_bytes = <optimized out>, static s_start = <optimized out>, static static_capacity = <optimized out>}, offset = 83193390985925, length = 94051162606592, rval = 3204, bl = {_buffers = {_root = {next = 0x5589fcf42880}, _tail = 0x5589fcf42898, _size = 140290899053696}, _carriage = 0x5589ef474c07 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*)+49>, _len = 4263600128, _memcopy_count = 21897, last_p = {<ceph::buffer::v14_2_0::list::iterator_impl<false>> = {bl = 0x5589ef4748d2 <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::_Alloc_hider(char*, std::allocator<char>&&)+50>, ls = 0x7f9805340c80, p = {cur = 0x5589fe3cb2f0}, off = 4265390832, p_off = 21897}, <No data fields>}, static always_---Type <return> to continue, or q <return> to quit---
empty_bptr = {_raw = 0x0, _off = 0, _len = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, queue_item = {<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>> = {<boost::intrusive::list_node<void*>> = {next_ = 0x7f9805340fa0, prev_ = 0x7f9805340cc0}, <boost::intrusive::hook_tags_definer<boost::intrusive::generic_hook<(boost::intrusive::algo_types)0, boost::intrusive::list_node_traits<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, (boost::intrusive::base_hook_type)0>, 0>> = {<No data fields>}, <No data fields>}, <No data fields>}}, aios_size=8, priv=0x7f9805340a80, retries=0x7f9805340588)
    at /data/ceph/src/os/bluestore/io_uring.cc:340
#11 0x00005589efe9c597 in KernelDevice::aio_submit (this=0x5589fd04c400, ioc=0x7f9805340a80)
    at /data/ceph/src/os/bluestore/KernelDevice.cc:760
#12 0x00005589efcc806e in BlueStore::_do_read (this=0x5589fd082000, c=0x5589fcf42880, o=..., offset=0,
    length=4194304, bl=..., op_flags=32, retry_count=0) at /data/ceph/src/os/bluestore/BlueStore.cc:8696
#13 0x00005589efcc50dd in BlueStore::read (this=0x5589fd082000, c_=..., oid=..., offset=0, length=4194304, bl=...,
    op_flags=32) at /data/ceph/src/os/bluestore/BlueStore.cc:8398
#14 0x00005589efadfe5b in ReplicatedBackend::build_push_op (this=0x5589fdbc2000, recovery_info=..., progress=...,
    out_progress=0x7f9805340f10, out_op=0x5589fec30000, stat=0x0, cache_dont_need=true)
    at /data/ceph/src/osd/ReplicatedBackend.cc:1979
#15 0x00005589efae1bf7 in ReplicatedBackend::handle_pull (this=0x5589fdbc2000, peer=..., op=...,
    reply=0x5589fec30000) at /data/ceph/src/osd/ReplicatedBackend.cc:2143
#16 0x00005589efad4acc in ReplicatedBackend::do_pull (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/ReplicatedBackend.cc:866
#17 0x00005589efacf4ed in ReplicatedBackend::_handle_message (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/ReplicatedBackend.cc:196
#18 0x00005589ef907732 in PGBackend::handle_message (this=0x5589fdbc2000, op=...)
    at /data/ceph/src/osd/PGBackend.cc:114
#19 0x00005589ef79d7c5 in PrimaryLogPG::do_request (this=0x5589fdbb7400, op=..., handle=...)
    at /data/ceph/src/osd/PrimaryLogPG.cc:1848
#20 0x00005589ef4f0e0b in OSD::dequeue_op (this=0x5589fcf7c000, pg=..., op=..., handle=...)
    at /data/ceph/src/osd/OSD.cc:9763
#21 0x00005589efa49724 in PGOpItem::run (this=0x5589fe1be6f0, osd=0x5589fcf7c000, sdata=0x5589fc17c700, pg=...,
    handle=...) at /data/ceph/src/osd/OpQueueItem.cc:24
#22 0x00005589ef51f6ed in OpQueueItem::run (this=0x7f9805342380, osd=0x5589fcf7c000, sdata=0x5589fc17c700, pg=...,
    handle=...) at /data/ceph/src/osd/OpQueueItem.h:134
#23 0x00005589ef4fe45d in OSD::ShardedOpWQ::_process (this=0x5589fcf7d2e8, thread_index=0, hb=0x5589fe1b50c0)
    at /data/ceph/src/osd/OSD.cc:10947
#24 0x00005589eff1d230 in ShardedThreadPool::shardedthreadpool_worker (this=0x5589fcf7caf8, thread_index=0)
    at /data/ceph/src/common/WorkQueue.cc:311
#25 0x00005589eff1ec49 in ShardedThreadPool::WorkThreadSharded::entry (this=0x5589fdf61530)
    at /data/ceph/src/common/WorkQueue.h:704
#26 0x00005589eff09634 in Thread::entry_wrapper (this=0x5589fdf61530) at /data/ceph/src/common/Thread.cc:84
#27 0x00005589eff095b2 in Thread::_entry_func (arg=0x5589fdf61530) at /data/ceph/src/common/Thread.cc:71
#28 0x00007f982145e6db in start_thread (arg=0x7f9805345700) at pthread_create.c:463
#29 0x00007f98201fe88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

For the second issue, yes, we are experiencing an IO hang in testing.

By the way, I also did some code examination and am also wondering where IO_CMD_PREAD comes from.

@markhpc (Member) commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

@rouming (Contributor, Author) commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

As a quick fix for io_uring we can pretend we have an iov for the read path as well:

diff --git a/src/os/bluestore/ceph_aio.h b/src/os/bluestore/ceph_aio.h
index 24fd72e359be..d8e67668cf6f 100644
--- a/src/os/bluestore/ceph_aio.h
+++ b/src/os/bluestore/ceph_aio.h
@@ -63,8 +63,10 @@ struct aio_t {
     offset = _offset;
     length = len;
     bufferptr p = buffer::create_small_page_aligned(length);
+    bl.append(std::move(p));
 #if defined(HAVE_LIBAIO)
-    io_prep_pread(&iocb, fd, p.c_str(), length, offset);
+    bl.prepare_iov(&iov);
+    io_prep_preadv(&iocb, fd, &iov[0], 1, offset);
 #elif defined(HAVE_POSIXAIO)
     n_aiocb = 1;
     aio.aiocb.aio_fildes = fd;
@@ -72,7 +74,6 @@ struct aio_t {
     aio.aiocb.aio_nbytes = length;
     aio.aiocb.aio_offset = offset;
 #endif
-    bl.append(std::move(p));
   }
 

@markhpc, what do you think?

rouming force-pushed the rouming:bluestore-iouring branch from 4c89f13 to 43d028d on May 8, 2019

@rouming (Contributor, Author) commented May 8, 2019

I updated the current PR with minor changes in pread() in order to pass an iov for both read and write paths.
Strangely enough, I can't test the read path in the fio_ceph_objectstore fio engine: even if I do only reads, io_uring does not receive them at all. So it will take time to run this in a real environment.

@markhpc (Member) commented May 8, 2019

I'm starting to wonder a bit if some of the KernelDevice and ceph_aio code could use a bigger refactor. FWIW, I tested the io_uring read path with my preadv change last night and the OSD was hanging again. Will need to dig in more to figure that out.

@markhpc (Member) commented May 8, 2019

FWIW I'm working on another PR to convert KernelDevice/ceph_aio to use preadv. It's a bit broken now with random reads but works for sequential. Will work on it more tomorrow.

As a quick fix for io_uring we can pretend we have an iov for the read path as well:

diff --git a/src/os/bluestore/ceph_aio.h b/src/os/bluestore/ceph_aio.h
index 24fd72e359be..d8e67668cf6f 100644
--- a/src/os/bluestore/ceph_aio.h
+++ b/src/os/bluestore/ceph_aio.h
@@ -63,8 +63,10 @@ struct aio_t {
     offset = _offset;
     length = len;
     bufferptr p = buffer::create_small_page_aligned(length);
+    bl.append(std::move(p));
 #if defined(HAVE_LIBAIO)
-    io_prep_pread(&iocb, fd, p.c_str(), length, offset);
+    bl.prepare_iov(&iov);
+    io_prep_preadv(&iocb, fd, &iov[0], 1, offset);
 #elif defined(HAVE_POSIXAIO)
     n_aiocb = 1;
     aio.aiocb.aio_fildes = fd;
@@ -72,7 +74,6 @@ struct aio_t {
     aio.aiocb.aio_nbytes = length;
     aio.aiocb.aio_offset = offset;
 #endif
-    bl.append(std::move(p));
   }
 

@markhpc, what do you think?

That's similar to what I did, though I made it more like pwritev by moving some of that logic out into KernelDevice.

@rouming (Contributor, Author) commented May 8, 2019

I'm starting to wonder a bit if some of the KernelDevice and ceph_aio code could use a bigger refactor. FWIW, I tested the io_uring read path with my preadv change last night and the OSD was hanging again. Will need to dig in more to figure that out.

Indeed, this part looks very hairy.

Regarding the hang: should the low-level IO engine be thread-safe? Since we access the ring buffer from userspace and do that as fast as possible, I did not do any locking there and rely on the level above. So the question is: do we need to protect the submit_batch() and get_next_completed() calls?
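
(To illustrate the question, a hypothetical wrapper, not something the PR implements: if the layer above did not already serialize the callers, the engine itself would need roughly this, since an io_uring SQ/CQ pair is not safe to touch from several threads at once.)

  #include <liburing.h>
  #include <mutex>

  struct locked_ring {
    struct io_uring ring;
    std::mutex lock;                  // guards both submission and reaping

    int submit() {
      std::lock_guard<std::mutex> g(lock);
      return io_uring_submit(&ring);  // submissions ultimately funnel into here
    }
    // get_next_completed() would take the same lock before walking the CQ.
  };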

@markhpc (Member) commented May 8, 2019

I updated the current PR with minor changes in pread() in order to pass an iov for both read and write paths.
Strangely enough, I can't test the read path in the fio_ceph_objectstore fio engine: even if I do only reads, io_uring does not receive them at all. So it will take time to run this in a real environment.

I made a separate PR for my changes. We'll want to test this independently of the io_uring work to make sure we don't introduce any regressions in the aio path.

    r = ioring_cqring_reap(d, events, max, paio);
    if (r) {
      events += r;
      continue;

@markhpc (Member) commented May 8, 2019

Need to incorporate this fix from the fio code:

		if (actual_min != 0)
			actual_min -= r;

axboe/fio@f7cbbbf#diff-f23fe137e0491aa4a3958612529b0334

@rouming (Contributor, Author) commented May 8, 2019

Ha, what will happen if 'r > actual_min'? Not nice.
It seems it should be actual_min -= min(r, actual_min);
Also, for our case actual_min is always zero.
But yes, the idea is clear. Thanks for pointing that out.
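
(The arithmetic concern spelled out as a minimal standalone example; this is neither the fio nor the PR code:)

  #include <algorithm>

  // 'actual_min' is unsigned: if a single reap returns more events than are
  // still needed (r > actual_min), a plain 'actual_min -= r' wraps around,
  // so the subtraction has to be clamped.
  static void account_reaped(unsigned &actual_min, unsigned r)
  {
    actual_min -= std::min(r, actual_min);
  }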

@markhpc (Member) commented May 9, 2019

Yep, I agree. Might want to point that out to the fio guys too.

{
  struct io_uring_sqe *sqe = &d->sqes[index];

  if (io->iocb.aio_lio_opcode == IO_CMD_PWRITEV)

@markhpc (Member) commented May 8, 2019

To debug the pread issue I changed this if/else block into a switch statement to handle the other cases:

  switch (io->iocb.aio_lio_opcode) {
  case IO_CMD_PREAD:
    ceph_abort_msg("Don't know how to handle IO_CMD_PREAD with io_uring");
    break;
  case IO_CMD_PWRITE:
    ceph_abort_msg("Don't know how to handle IO_CMD_PWRITE with io_uring");
    break;
  case IO_CMD_FSYNC:
    sqe->opcode = IORING_OP_FSYNC;
    break;
  case IO_CMD_FDSYNC:
    ceph_abort_msg("Don't know how to handle IO_CMD_FDSYNC yet");
    break;
  case IO_CMD_POLL: /* Never implemented in mainline, see io_prep_poll */
    sqe->opcode = IORING_OP_NOP;
    break;
  case IO_CMD_NOOP:
    sqe->opcode = IORING_OP_NOP;
    break;
  case IO_CMD_PREADV:
    sqe->opcode = IORING_OP_READV;
    break;
  case IO_CMD_PWRITEV:
    sqe->opcode = IORING_OP_WRITEV;
    break;
  default:
    ceph_assert(0);
  }

@rouming (Contributor, Author) commented May 8, 2019

Why don't you want to print and crash in all the cases which are not readv/writev? Have you found places where something else can be issued?

@markhpc (Member) commented May 9, 2019

Mostly I was just trying to clean up the non-readv/writev handling using ceph_abort_msg instead of assert, but it looks like we should be able to handle noop cleanly at least. POLL apparently was never implemented, but you are right that we should probably abort there rather than nop. FDSYNC I'm aborting on, and fsync is passing through. I think if POLL is changed to abort, the only question then is fsync?

@rouming (Contributor, Author) commented May 9, 2019

Since nobody sends anything other than write/read, I would always print & abort. If someone introduces other requests, the io_uring engine has to be updated as well. This seems like the more explicit behavior, which forces the developer to test if something new is introduced.
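
(A sketch of the stricter mapping described above, written as a hypothetical standalone helper rather than the PR's init_io(): only the two opcodes BlueStore actually issues are translated, everything else prints and aborts.)

  #include <libaio.h>
  #include <liburing.h>
  #include <cstdio>
  #include <cstdlib>

  static unsigned to_uring_opcode(short aio_lio_opcode)
  {
    switch (aio_lio_opcode) {
    case IO_CMD_PREADV:
      return IORING_OP_READV;
    case IO_CMD_PWRITEV:
      return IORING_OP_WRITEV;
    default:
      fprintf(stderr, "unexpected aio opcode %d in io_uring engine\n",
              aio_lio_opcode);
      abort();
    }
  }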

@rouming (Contributor, Author) commented Oct 14, 2019

@markhpc awesome, thanks.

@mykaul (Contributor) commented Oct 14, 2019

Since this PR and the intro of io_uring, there have been some additional updates to it. Are they relevant?

@rouming (Contributor, Author) commented Oct 14, 2019

@mykaul

Since this PR and the intro of io_uring, there have been some additional updates to it. Are they relevant?

Is your question about user-kernel ABI / API changes? These changes depend on the latest liburing API, so it should not be an issue.

tchaikov added the needs-qa label on Oct 15, 2019

@tchaikov (Contributor) commented Oct 15, 2019

@markhpc i tested with 5.4.0-050400rc3 kernel with a 3-OSD vstart cluster. the 4k sequential write of io uring backend is almost the same as aio, while the 4k random read is much better.

backend    read bw (MiB/s)   read IOPS   read lat (us)   write bw (MiB/s)   write IOPS   write lat (s)
io uring   143.204           36660       433.094         0.22226            56           0.277773
aio        13.6573           3496        4572.4          0.226194           57           0.275743

so i guess it's good to merge once it passes the rados qa suite. what do you think? probably we can remove the DNM label now?

@rouming (Contributor, Author) commented Oct 15, 2019

@tchaikov Could you please share your testing load with me? Is that an rbd fio job?

@tchaikov (Contributor) commented Oct 15, 2019

@rouming i was using

MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh -n -x --without-dashboard -o 'bluestore_ioring=true'
../src/script/run-cbt.sh --classical --use-existing --cbt ~/cbt \
   -a /tmp/io-uring ../src/test/crimson/cbt/radosbench_4K_read.yaml

it was not driven by fio. instead, it's using two instances of rados bench issuing 4k write/read with io depth of 16.

@rouming (Contributor, Author) commented Oct 15, 2019

@tchaikov Ok, good to know, thanks.

@markhpc (Member) commented Oct 17, 2019

@markhpc i tested with 5.4.0-050400rc3 kernel with a 3-OSD vstart cluster. the 4k sequential write of io uring backend is almost the same as aio, while the 4k random read is much better.
backend    read bw (MiB/s)   read IOPS   read lat (us)   write bw (MiB/s)   write IOPS   write lat (s)
io uring   143.204           36660       433.094         0.22226            56           0.277773
aio        13.6573           3496        4572.4          0.226194           57           0.275743

so i guess it's good to merge once it passes the rados qa suite. what do you think? probably we can remove the DNM label now?

I suspect there is something odd going on with those aio read numbers. I can't imagine that aio was hurting that badly (4-5K us latency? How much concurrency?). Anyway, if io_uring looks good otherwise, I think the only concern I have is our ability to test/support it given that it needs a new kernel. I wonder if we should consider the experimental flag?

@JeffMoyer commented Oct 17, 2019 (comment minimized)

@rouming (Contributor, Author) commented Oct 17, 2019

Buffered I/O, perhaps?

Hm, that should not be the case, since the descriptors from the "fd_directs" array (opened with O_DIRECT) are registered in io_uring. But the results are indeed strange; at least rbd.fio does not show a significant difference between the aio and io_uring engines for 4k reads.

@tchaikov (Contributor) commented Oct 21, 2019

@markhpc anything is pending to remove the DNM label?

@wjwithagen i think you requested change. but i still don't get what exactly the change was.

@@ -4948,6 +4948,9 @@ std::vector<Option> get_global_options() {
.set_description("Enforces specific hw profile settings")
.set_long_description("'hdd' enforces settings intended for BlueStore above a rotational drive. 'ssd' enforces settings intended for BlueStore above a solid drive. 'default' - using settings for the actual hardware."),

Option("bluestore_ioring", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)

@tchaikov (Contributor) commented Oct 21, 2019

@markhpc we don't have an experimental flag anymore. but this setting is marked with "Option::LEVEL_ADVANCED", and is disabled by default. i think it's good enough. or do you have a specific idea how to mark it experimental?

@wjwithagen (Contributor) commented Oct 23, 2019

@wjwithagen i think you requested change. but i still don't get what exactly the change was.

I'm going to be practical about this... I'll take the FreeBSD posix-aio fallout as it comes, and work from there. So feel free to commit.

wjwithagen self-requested a review on Oct 23, 2019

@mykaul (Contributor) commented Oct 28, 2019

@tchaikov - I wonder what the CPU usage is - if there's any difference per IOP or MBps.

@markhpc (Member) commented Oct 31, 2019

@markhpc anything is pending to remove the DNM label?

@wjwithagen i think you requested change. but i still don't get what exactly the change was.

Nothing on my end. I think we may want to refactor some day, but that day isn't today! 👍

tchaikov removed the DNM label on Oct 31, 2019

@tchaikov (Contributor) commented Oct 31, 2019

@markhpc thanks. will rebase the change. and rerun the perf test using fio.

@rouming (Contributor, Author) commented Nov 13, 2019

@tchaikov Do you plan to return to this pr?

@tchaikov (Contributor) commented Nov 13, 2019

@rouming definitely yes. i am still debugging my own vstart cluster, stuck on another issue.

rouming added 2 commits Mar 26, 2019
In the next patch the new io_uring API will be used instead of libaio,
so this prepares the abstract interface.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>

This implements a low-level IO engine which utilizes the brand-new
io_uring IO interface: https://lwn.net/Articles/776428/

By default libaio is used. If bluestore_ioring=true is set but the kernel
does not support io_uring or the architecture is not x86-64, libaio will be
used instead.

In the current patch the liburing library is used in order not to
open-code everything.

In order to compile with liburing, WITH_LIBURING=ON should be specified.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>

tchaikov force-pushed the rouming:bluestore-iouring branch from 48838d2 to c5bcb59 on Nov 14, 2019