
rbd: parallelize "rbd ls -l" #15579

Merged
merged 1 commit into from Aug 8, 2017

Conversation

branch-predictor
Contributor

When a cluster contains a large number of images, "rbd ls -l" takes a long time to finish. In my particular case, it took about 58s to process 3000 images.
"rbd ls -l" opens each image, and that accounts for the majority of the time, so improve this by using aio_open() and aio_close() to do it asynchronously. This reduced the total processing time to around 15 seconds with the default of 8 concurrently opened images.
A new option "-t" / "--threads" lets the user pick a concurrency level that better suits their environment. Since async I/O is used, there is a point beyond which increasing concurrency doesn't help (or even hurts); the best thread count depends mostly on the single-thread processing speed of the CPU on which "rbd ls -l" is run.

Signed-off-by: Piotr Dałek piotr.dalek@corp.ovh.com
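
For reference, a minimal sketch of the aio_open()/aio_close() pattern described above, built on the public librbd C++ API. This is an illustrative approximation rather than the PR's code: the pool name ("rbd"), client id ("admin"), the simple wave-based scheduling, and the error handling are all assumptions made for the example.

// Illustrative sketch (not the PR's implementation): list image names, then
// open, inspect, and close them in bounded "waves" of asynchronous operations.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
  librados::Rados rados;
  rados.init("admin");                        // assumption: default admin client
  rados.conf_read_file(nullptr);
  if (rados.connect() < 0)
    return 1;

  librados::IoCtx io_ctx;
  if (rados.ioctx_create("rbd", io_ctx) < 0)  // assumption: pool name "rbd"
    return 1;

  librbd::RBD rbd;
  std::vector<std::string> names;
  if (rbd.list(io_ctx, names) < 0)
    return 1;

  const size_t window = 8;                    // default concurrency from the PR description
  for (size_t base = 0; base < names.size(); base += window) {
    const size_t n = std::min(window, names.size() - base);
    librbd::Image images[window];
    librbd::RBD::AioCompletion* open_comps[window] = {nullptr};
    librbd::RBD::AioCompletion* close_comps[window] = {nullptr};

    // kick off up to `window` asynchronous opens
    for (size_t i = 0; i < n; ++i) {
      open_comps[i] = new librbd::RBD::AioCompletion(nullptr, nullptr);
      rbd.aio_open(io_ctx, images[i], names[base + i].c_str(), nullptr, open_comps[i]);
    }

    // as each open completes, query the image and start an asynchronous close
    for (size_t i = 0; i < n; ++i) {
      open_comps[i]->wait_for_complete();
      int r = open_comps[i]->get_return_value();
      open_comps[i]->release();
      if (r < 0)
        continue;

      uint64_t size = 0;
      images[i].size(&size);
      std::cout << names[base + i] << " " << size << std::endl;

      close_comps[i] = new librbd::RBD::AioCompletion(nullptr, nullptr);
      images[i].aio_close(close_comps[i]);
    }

    // drain the asynchronous closes before starting the next wave
    for (size_t i = 0; i < n; ++i) {
      if (close_comps[i] == nullptr)
        continue;
      close_comps[i]->wait_for_complete();
      close_comps[i]->release();
    }
  }
  return 0;
}

The merged code drives a small set of worker entries through open/process/close states rather than strict waves, but the basic aio_open()/aio_close() idea is the same.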

@branch-predictor branch-predictor force-pushed the bp-parallel-rbd-lsl branch 4 times, most recently from 93b0ab2 to 59b2766 Compare June 9, 2017 10:52
@branch-predictor
Contributor Author

retest this please

@branch-predictor
Contributor Author

@dillaman ping?

@dillaman dillaman self-requested a review June 19, 2017 12:07
@dillaman dillaman self-assigned this Jun 19, 2017

@dillaman dillaman left a comment


You might want to take a look at how the "rbd mirror pool" commands handle parallelized image actions (i.e. open the image asynchronously, do something, close the image) [1][2]. You could copy the basic structure from that implementation, ignoring the complications added to generically handle multiple commands with a single state machine.

[1] https://github.com/ceph/ceph/blob/master/src/tools/rbd/action/MirrorPool.cc#L454
[2] https://github.com/ceph/ceph/blob/master/src/tools/rbd/action/MirrorPool.cc#L131

  return r < 0 ? r : 0;
}

void get_arguments(po::options_description *positional,
                   po::options_description *options) {
  options->add_options()
    ("long,l", po::bool_switch(), "long listing format");
  options->add_options()


I would prefer to re-use the existing rbd_concurrent_management_ops instead of introducing a new (equivalent) setting for "rbd ls -l" since it's already used by numerous other CLI actions.
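
For context, a sketch of how a tool could read that existing option through the public librados configuration API; the real rbd CLI reads it from its internal global config, so the conf_get() round-trip and the fallback value here are illustrative assumptions.

// Sketch: read the existing rbd_concurrent_management_ops option instead of
// introducing a new --threads flag. Assumes an already-connected librados::Rados.
#include <rados/librados.hpp>
#include <cstdlib>
#include <string>

int concurrent_management_ops(librados::Rados& rados) {
  std::string val;
  if (rados.conf_get("rbd_concurrent_management_ops", val) < 0)
    return 10;  // assumption: fall back to the default of 10 noted in the final commit message
  return std::atoi(val.c_str());
}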

Contributor Author


I'll look into it. Thanks!


int init_worker(std::string pool_name, worker_entry &worker, librados::Rados& rados)
{
worker.io_ctx = new librados::IoCtx();


librados::IoCtx and librbd::RBD can safely be re-used between threads of execution, so there's no need to allocate multiple instances on the heap.

{
worker.io_ctx = new librados::IoCtx();
worker.rbd = new librbd::RBD();
worker.img = new librbd::Image();


librbd::Image uses the pimpl pattern (pointer to implementation), so allocating a librbd::Image on the heap really just allocates space for a pointer to the real implementation.
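
Putting both comments together, a hypothetical worker entry would hold the librbd::Image by value and borrow the shared librados::IoCtx / librbd::RBD owned by the caller; the names below are illustrative, not the PR's final code.

// Hypothetical worker entry: the Image is held by value (internally it is only
// a pimpl handle), and the shared IoCtx/RBD objects are not duplicated per worker.
#include <rbd/librbd.hpp>
#include <string>

struct WorkerEntry {
  librbd::Image img;                              // pointer-sized pimpl handle
  librbd::RBD::AioCompletion* completion = nullptr;
  std::string name;
};
// A single librados::IoCtx and librbd::RBD, owned by the caller, are passed to
// every worker instead of being allocated per entry.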

comp.opened = LIST_IDLE;
continue;
}
r = list_process_image(&rados, comp, lflag, f, tbl);


Will the images now be displayed out of order, since nothing ensures the results stay in sync with the original image list order?

Contributor Author


Yes, for sure. But I don't think it's a real problem; most users grep or sort the output anyway, especially with the machine-readable formats.

@branch-predictor branch-predictor force-pushed the bp-parallel-rbd-lsl branch 2 times, most recently from 069de0c to 543ec64 Compare June 26, 2017 07:47
@branch-predictor
Contributor Author

@dillaman addressed all your comments, except for the templatized, generator-based state machine, as I feel it's somewhat overkill for this purpose.

@dillaman

@branch-predictor Yeah -- agree w/ not re-creating a templated generator, I was just showing an example.


@dillaman dillaman left a comment


In general it looks ok, but it would be great to switch to the use of OrderedThrottle so that (1) it remains consistently ordered between "rbd ls" and "rbd ls -l" (it's the little things) and (2) it would allow you to convert WorkerThread to a C_ImageList state machine that directly invokes the next state method to eliminate the need for a state enum.
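
For illustration only -- this is not Ceph's OrderedThrottle class, just a minimal sketch of the ordering idea it provides: completions may arrive in any order, but output is flushed strictly in submission order.

#include <cstddef>
#include <functional>
#include <set>

// Minimal sketch of ordered completion handling: complete() may be called for
// indices in any order, but the flush callback runs strictly in index order.
class OrderedResults {
public:
  explicit OrderedResults(std::function<void(size_t)> flush)
    : flush_(std::move(flush)) {}

  void complete(size_t index) {
    done_.insert(index);
    while (!done_.empty() && *done_.begin() == next_) {
      flush_(next_);                // emit output for the next expected index
      done_.erase(done_.begin());
      ++next_;
    }
  }

private:
  std::function<void(size_t)> flush_;
  std::set<size_t> done_;           // completed but not yet flushed indices
  size_t next_ = 0;                 // next index whose output may be printed
};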

@@ -203,15 +326,7 @@ int execute(const po::variables_map &vm) {
return r;
}

librados::Rados rados;


You should be able to leave all of this logic for initializing the rados connection in place instead of creating your own solution.

if (threads < 1) {
threads = 1;
}
if (threads > 32) {


Nit: if someone has a large enough cluster, why limit them to 32? I would just remove this guard.

Contributor Author


I tried running it with 100 parallel jobs and it ate almost 140MB of memory, which is quite high, yet that much concurrency didn't help in any way. Anything above 150 caused it to crash in lockdep on a vstart cluster, and I expect absurd numbers of parallel jobs (think of a "make -j 322" kind of typo) to cause plenty of other problems, so IMHO it's better to cap it at some reasonable value now than to collect crash reports later.

string name;
};

enum {


Nit: use a named enum a la State and prefix values with STATE_ instead of LIST_
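
That is, something along these lines (the exact state names in the merged code may differ):

enum State {
  STATE_IDLE,
  STATE_OPENED,
  STATE_DONE
};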

librbd::RBD* rbd;
librbd::Image img;
librbd::RBD::AioCompletion* completion;
int opened;


Nit: use named enum here instead of generic int and rename to state

namespace action {
namespace list {

namespace at = argument_types;
namespace po = boost::program_options;

int do_list(librbd::RBD &rbd, librados::IoCtx& io_ctx, bool lflag,
            Formatter *f) {
  struct worker_entry {


Nit: perhaps rename to WorkerEntry

return r < 0 ? r : 0;
}

void init_worker(worker_entry* worker, librbd::RBD* rbd, librados::IoCtx* ioctx)


Nit: just use a constructor in the struct instead of an external helper function to "construct" it.
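
Roughly like this (illustrative, reusing the State enum sketched earlier; not the merged code):

#include <rbd/librbd.hpp>
#include <string>

enum State { STATE_IDLE, STATE_OPENED, STATE_DONE };  // illustrative states

struct WorkerEntry {
  librbd::Image img;
  librbd::RBD::AioCompletion* completion;
  State state;
  std::string name;

  // constructing in place removes the need for an external init_worker() helper
  WorkerEntry() : completion(nullptr), state(STATE_IDLE) {}
};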

return r;
}

worker_entry* main = new worker_entry();


Nit: this worker entry doesn't serve much value -- just directly use the librbd::RBD object, etc when retrieving the image list.

int do_list(librbd::RBD &rbd, librados::IoCtx& io_ctx, bool lflag,
            Formatter *f) {
  struct worker_entry {
    librados::IoCtx* io_ctx;


Nit: io_ctx and rbd are unnecessary since they are shared

if (!lockers.empty()) {
lockstr = (exclusive) ? "excl" : "shr";
}
for (int left = 1; left < std::min(threads, (int)names.size()); left++) {


Nit: once you eliminate the unnecessary list worker, this loop could be zero-based and you could emplace_back-construct the workers instead of heap allocating them.

@branch-predictor
Contributor Author

branch-predictor commented Jun 29, 2017

@dillaman Addressed most of your comments. As for the rest, I don't think this is going to be extended in any reasonable manner; everything regarding data collection is done in list_process_image, so I don't believe it deserves an OOP-style state machine. That would make perfect sense if each worker were an actual, physical thread of execution, but that calls for larger changes and more potential for bugs.
If users need a reliable output order, they can sort it on their own - that will cost less than the performance hit inflicted by keeping the output in the original order.


@dillaman dillaman left a comment


lgtm

  auto i = names.begin();
  bool wait_needed = false; // true if no aio finished during previous iteration, so we may
                            // wait for first aio to finish as well instead of reiterating
  while (true) {


@branch-predictor Sorry for the delay -- I was playing with this today and it cannot run under valgrind since this loop turns into an effective 100% busy-loop which eats up all of valgrind's single CPU. If you used a SimpleThrottle (or equivalent), you could at least block until you have work to do.
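
For reference, the general shape of such a blocking throttle -- a generic condition-variable sketch, not Ceph's SimpleThrottle implementation: start_op() blocks once the in-flight limit is reached and end_op() wakes the submitter, so the main loop sleeps instead of spinning.

#include <condition_variable>
#include <mutex>

// Generic sketch of a blocking throttle: callers block in start_op() when the
// concurrency limit is reached and are woken by end_op() from completion
// handling, avoiding a busy-wait in the submission loop.
class BlockingThrottle {
public:
  explicit BlockingThrottle(unsigned max) : max_(max) {}

  void start_op() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return in_flight_ < max_; });
    ++in_flight_;
  }

  void end_op() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --in_flight_;
    }
    cond_.notify_one();
  }

private:
  std::mutex mutex_;
  std::condition_variable cond_;
  unsigned max_;
  unsigned in_flight_ = 0;
};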

Contributor Author


@dillaman on 239f3f7#diff-d060c6e7d7d39b5bff9393edac1f2531R237 there's a wait_for_complete() which does just that. More precisely, it starts all workers in one iteration, then goes over all of them during the second iteration, and if it finds that no worker has finished, it waits on wait_for_complete() for the first busy worker during the third iteration.


@dillaman dillaman Jul 18, 2017


The STATE_DONE state is missing that wait logic, so if I run with 1 concurrent operation, I can see it busy-spin while it's waiting for the image to close.

Contributor Author


TBH, I didn't expect aio_close to be that slow. Added a new commit which changes the logic - now it should be better.

@dillaman

@branch-predictor Thanks -- I'll run this through the test suite today

@dillaman

When a cluster contains a large number of images, "rbd ls -l" takes a
long time to finish. In my particular case, it took about 58s to
process 3000 images.
"rbd ls -l" opens each image and that takes the majority of the time, so
improve this by using aio_open() and aio_close() to do it
asynchronously. This reduced total processing time to around 15
seconds when using the default of 10 concurrently opened images.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@branch-predictor
Contributor Author

@dillaman try now, ripped out the previous logic and now processing images in order.

@branch-predictor
Contributor Author

@dillaman any news?

@dillaman

dillaman commented Aug 8, 2017

@branch-predictor lgtm -- just need to run it through the integration tests again (hopefully this afternoon after the build completes).

@dillaman dillaman merged commit 988c300 into ceph:master Aug 8, 2017
@branch-predictor branch-predictor deleted the bp-parallel-rbd-lsl branch January 24, 2018 11:28