
rbd: parallelize "rbd ls -l" #15579

Merged
merged 1 commit into from Aug 8, 2017

Conversation

branch-predictor
Contributor

When a cluster contains a large number of images, "rbd ls -l" takes a long time to finish. In my particular case, it took about 58s to process 3000 images.
"rbd ls -l" opens each image, and that accounts for the majority of the time, so improve this by using aio_open() and aio_close() to do it asynchronously. This reduced the total processing time to around 15 seconds with the default of 8 concurrently opened images.
A new option "-t" / "--threads" lets the user pick a concurrency level that better suits their environment. Since async I/O is used, there is a point beyond which increasing concurrency doesn't help (or even hurts); the best thread count depends mostly on the single-thread processing speed of the CPU on which "rbd ls -l" is run.

Signed-off-by: Piotr Dałek piotr.dalek@corp.ovh.com
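
For reference, a minimal sketch of the aio_open()/aio_close() pattern described above, built on the public librbd C++ API. This is an illustrative approximation rather than the PR's code: the pool name ("rbd"), client id ("admin"), the simple wave-based scheduling, and the error handling are all assumptions made for the example.

// Illustrative sketch (not the PR's implementation): list image names, then
// open, inspect, and close them in bounded "waves" of asynchronous operations.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
  librados::Rados rados;
  rados.init("admin");                        // assumption: default admin client
  rados.conf_read_file(nullptr);
  if (rados.connect() < 0)
    return 1;

  librados::IoCtx io_ctx;
  if (rados.ioctx_create("rbd", io_ctx) < 0)  // assumption: pool name "rbd"
    return 1;

  librbd::RBD rbd;
  std::vector<std::string> names;
  if (rbd.list(io_ctx, names) < 0)
    return 1;

  const size_t window = 8;                    // default concurrency from the PR description
  for (size_t base = 0; base < names.size(); base += window) {
    const size_t n = std::min(window, names.size() - base);
    librbd::Image images[window];
    librbd::RBD::AioCompletion* open_comps[window] = {nullptr};
    librbd::RBD::AioCompletion* close_comps[window] = {nullptr};

    // kick off up to `window` asynchronous opens
    for (size_t i = 0; i < n; ++i) {
      open_comps[i] = new librbd::RBD::AioCompletion(nullptr, nullptr);
      rbd.aio_open(io_ctx, images[i], names[base + i].c_str(), nullptr, open_comps[i]);
    }

    // as each open completes, query the image and start an asynchronous close
    for (size_t i = 0; i < n; ++i) {
      open_comps[i]->wait_for_complete();
      int r = open_comps[i]->get_return_value();
      open_comps[i]->release();
      if (r < 0)
        continue;

      uint64_t size = 0;
      images[i].size(&size);
      std::cout << names[base + i] << " " << size << std::endl;

      close_comps[i] = new librbd::RBD::AioCompletion(nullptr, nullptr);
      images[i].aio_close(close_comps[i]);
    }

    // drain the asynchronous closes before starting the next wave
    for (size_t i = 0; i < n; ++i) {
      if (close_comps[i] == nullptr)
        continue;
      close_comps[i]->wait_for_complete();
      close_comps[i]->release();
    }
  }
  return 0;
}

The merged code drives a small set of worker entries through open/process/close states rather than strict waves, but the basic aio_open()/aio_close() idea is the same.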

@branch-predictor branch-predictor force-pushed the bp-parallel-rbd-lsl branch 4 times, most recently from 93b0ab2 to 59b2766 Compare June 9, 2017 10:52
@branch-predictor
Contributor Author

retest this please

@branch-predictor
Contributor Author

@dillaman ping?

@dillaman dillaman self-requested a review June 19, 2017 12:07
@dillaman dillaman self-assigned this Jun 19, 2017

@dillaman dillaman left a comment


You might want to take a look at how the "rbd mirror pool" commands handle parallelized image actions (i.e. open the image asynchronously, do something, close the image) [1][2]. You could copy the basic structure from that implementation, ignoring the complications added to generically handle multiple commands with a single state machine.

[1] https://github.com/ceph/ceph/blob/master/src/tools/rbd/action/MirrorPool.cc#L454
[2] https://github.com/ceph/ceph/blob/master/src/tools/rbd/action/MirrorPool.cc#L131

  return r < 0 ? r : 0;
}

void get_arguments(po::options_description *positional,
                   po::options_description *options) {
  options->add_options()
    ("long,l", po::bool_switch(), "long listing format");
  options->add_options()


I would prefer to re-use the existing rbd_concurrent_management_ops instead of introducing a new (equivalent) setting for "rbd ls -l" since it's already used by numerous other CLI actions.
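
For context, a sketch of how a tool could read that existing option through the public librados configuration API; the real rbd CLI reads it from its internal global config, so the conf_get() round-trip and the fallback value here are illustrative assumptions.

// Sketch: read the existing rbd_concurrent_management_ops option instead of
// introducing a new --threads flag. Assumes an already-connected librados::Rados.
#include <rados/librados.hpp>
#include <cstdlib>
#include <string>

int concurrent_management_ops(librados::Rados& rados) {
  std::string val;
  if (rados.conf_get("rbd_concurrent_management_ops", val) < 0)
    return 10;  // assumption: fall back to the default of 10 noted in the final commit message
  return std::atoi(val.c_str());
}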

Contributor Author


I'll look into it. Thanks!


int init_worker(std::string pool_name, worker_entry &worker, librados::Rados& rados)
{
worker.io_ctx = new librados::IoCtx();


librados::IoCtx and librbd::RBD can safely be re-used between threads of execution, so there's no need to allocate multiple instances on the heap.

{
worker.io_ctx = new librados::IoCtx();
worker.rbd = new librbd::RBD();
worker.img = new librbd::Image();


librbd::Image uses the pimpl pattern (pointer to implementation), so allocating a librbd::Image on the heap really just allocates space for a pointer to the real implementation.
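
Putting both comments together, a hypothetical worker entry would hold the librbd::Image by value and borrow the shared librados::IoCtx / librbd::RBD owned by the caller; the names below are illustrative, not the PR's final code.

// Hypothetical worker entry: the Image is held by value (internally it is only
// a pimpl handle), and the shared IoCtx/RBD objects are not duplicated per worker.
#include <rbd/librbd.hpp>
#include <string>

struct WorkerEntry {
  librbd::Image img;                              // pointer-sized pimpl handle
  librbd::RBD::AioCompletion* completion = nullptr;
  std::string name;
};
// A single librados::IoCtx and librbd::RBD, owned by the caller, are passed to
// every worker instead of being allocated per entry.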

comp.opened = LIST_IDLE;
continue;
}
r = list_process_image(&rados, comp, lflag, f, tbl);


Will the images now be displayed out of order, since nothing ensures the results stay in sync with the original image list order?

Contributor Author


Yes, for sure. But I don't think it's a real problem; most users grep or sort the output anyway, especially with the machine-readable formats.

@branch-predictor branch-predictor force-pushed the bp-parallel-rbd-lsl branch 2 times, most recently from 069de0c to 543ec64 Compare June 26, 2017 07:47
@branch-predictor
Contributor Author

@dillaman addressed all your comments, except for the templatized, generator-based state machine, as I feel it's somewhat overkill for this purpose.

@dillaman

@branch-predictor Yeah -- agree w/ not re-creating a templated generator, I was just showing an example.


@dillaman dillaman left a comment


In general it looks ok, but it would be great to switch to the use of OrderedThrottle so that (1) it remains consistently ordered between "rbd ls" and "rbd ls -l" (it's the little things) and (2) it would allow you to convert WorkerThread to a C_ImageList state machine that directly invokes the next state method to eliminate the need for a state enum.
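
For illustration only -- this is not Ceph's OrderedThrottle class, just a minimal sketch of the ordering idea it provides: completions may arrive in any order, but output is flushed strictly in submission order.

#include <cstddef>
#include <functional>
#include <set>

// Minimal sketch of ordered completion handling: complete() may be called for
// indices in any order, but the flush callback runs strictly in index order.
class OrderedResults {
public:
  explicit OrderedResults(std::function<void(size_t)> flush)
    : flush_(std::move(flush)) {}

  void complete(size_t index) {
    done_.insert(index);
    while (!done_.empty() && *done_.begin() == next_) {
      flush_(next_);                // emit output for the next expected index
      done_.erase(done_.begin());
      ++next_;
    }
  }

private:
  std::function<void(size_t)> flush_;
  std::set<size_t> done_;           // completed but not yet flushed indices
  size_t next_ = 0;                 // next index whose output may be printed
};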

@@ -203,15 +326,7 @@ int execute(const po::variables_map &vm) {
return r;
}

librados::Rados rados;


You should be able to leave all of this logic for initializing the rados connection in place instead of creating your own solution.

if (threads < 1) {
threads = 1;
}
if (threads > 32) {


Nit: if someone has a large enough cluster, why limit them to 32? I would just remove this guard.

Contributor Author


I tried running it with 100 parallel jobs and it ate almost 140MB of memory, which is quite high, yet that much concurrency didn't help in any way. Anything above 150 caused it to crash in lockdep on a vstart cluster, and I expect absurd numbers of parallel jobs (think of a "make -j 322" kind of typo) to cause plenty of other problems, so IMHO it's better to cap it at some reasonable value now than to collect crash reports later.

string name;
};

enum {


Nit: use a named enum a la State and prefix values with STATE_ instead of LIST_
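
That is, something along these lines (the exact state names in the merged code may differ):

enum State {
  STATE_IDLE,
  STATE_OPENED,
  STATE_DONE
};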

librbd::RBD* rbd;
librbd::Image img;
librbd::RBD::AioCompletion* completion;
int opened;


Nit: use named enum here instead of generic int and rename to state

namespace action {
namespace list {

namespace at = argument_types;
namespace po = boost::program_options;

int do_list(librbd::RBD &rbd, librados::IoCtx& io_ctx, bool lflag,
            Formatter *f) {
  struct worker_entry {


Nit: perhaps rename to WorkerEntry

return r < 0 ? r : 0;
}

void init_worker(worker_entry* worker, librbd::RBD* rbd, librados::IoCtx* ioctx)


Nit: just use a constructor in the struct instead of an external helper function to "construct" it.
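
Roughly like this (illustrative, reusing the State enum sketched earlier; not the merged code):

#include <rbd/librbd.hpp>
#include <string>

enum State { STATE_IDLE, STATE_OPENED, STATE_DONE };  // illustrative states

struct WorkerEntry {
  librbd::Image img;
  librbd::RBD::AioCompletion* completion;
  State state;
  std::string name;

  // constructing in place removes the need for an external init_worker() helper
  WorkerEntry() : completion(nullptr), state(STATE_IDLE) {}
};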

return r;
}

worker_entry* main = new worker_entry();


Nit: this worker entry doesn't serve much value -- just directly use the librbd::RBD object, etc when retrieving the image list.

int do_list(librbd::RBD &rbd, librados::IoCtx& io_ctx, bool lflag,
            Formatter *f) {
  struct worker_entry {
    librados::IoCtx* io_ctx;


Nit: io_ctx and rbd are unnecessary since they are shared

if (!lockers.empty()) {
lockstr = (exclusive) ? "excl" : "shr";
}
for (int left = 1; left < std::min(threads, (int)names.size()); left++) {


Nit: once you eliminate the unnecessary list worker, this loop could be zero-based and you could emplace_back-construct the workers instead of heap allocating them.

@branch-predictor
Contributor Author

branch-predictor commented Jun 29, 2017

@dillaman Addressed most of your comments. As for the rest, I don't think this is going to be extended in any reasonable manner; everything regarding data collection is done in list_process_image, so I don't believe it deserves an OOP-style state machine. That would make perfect sense if each worker were an actual, physical thread of execution, but that calls for larger changes and more potential for bugs.
If users need a reliable output order, they can sort it on their own - that will cost less than the performance hit inflicted by keeping the output in the original order.


@dillaman dillaman left a comment


lgtm

  auto i = names.begin();
  bool wait_needed = false; // true if no aio finished during previous iteration, so we may
                            // wait for first aio to finish as well instead of reiterating
  while (true) {


@branch-predictor Sorry for the delay -- I was playing with this today and it cannot run under valgrind since this loop turns into an effective 100% busy-loop which eats up all of valgrind's single CPU. If you used a SimpleThrottle (or equivalent), you could at least block until you have work to do.
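
For reference, the general shape of such a blocking throttle -- a generic condition-variable sketch, not Ceph's SimpleThrottle implementation: start_op() blocks once the in-flight limit is reached and end_op() wakes the submitter, so the main loop sleeps instead of spinning.

#include <condition_variable>
#include <mutex>

// Generic sketch of a blocking throttle: callers block in start_op() when the
// concurrency limit is reached and are woken by end_op() from completion
// handling, avoiding a busy-wait in the submission loop.
class BlockingThrottle {
public:
  explicit BlockingThrottle(unsigned max) : max_(max) {}

  void start_op() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return in_flight_ < max_; });
    ++in_flight_;
  }

  void end_op() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --in_flight_;
    }
    cond_.notify_one();
  }

private:
  std::mutex mutex_;
  std::condition_variable cond_;
  unsigned max_;
  unsigned in_flight_ = 0;
};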

Contributor Author


@dillaman on 239f3f7#diff-d060c6e7d7d39b5bff9393edac1f2531R237 there's a wait_for_complete() which does just that. More precisely, it starts all workers in one iteration, then goes over all of them during the second iteration, and if it finds that no worker has finished, it waits on wait_for_complete() for the first busy worker during the third iteration.


@dillaman dillaman Jul 18, 2017


The STATE_DONE state is missing that wait logic, so if I run with 1 concurrent operation, I can see it busy-spin while it's waiting for the image to close.

Contributor Author


TBH, I didn't expect aio_close to be that slow. Added a new commit which changes the logic - now it should be better.

@dillaman

@branch-predictor Thanks -- I'll run this through the test suite today

@dillaman

When a cluster contains a large number of images, "rbd ls -l" takes a
long time to finish. In my particular case, it took about 58s to
process 3000 images.
"rbd ls -l" opens each image and that takes the majority of the time, so
improve this by using aio_open() and aio_close() to do it
asynchronously. This reduced total processing time to around 15
seconds when using the default of 10 concurrently opened images.

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
@branch-predictor
Contributor Author

@dillaman try now, ripped out the previous logic and now processing images in order.

@branch-predictor
Contributor Author

@dillaman any news?

@dillaman

dillaman commented Aug 8, 2017

@branch-predictor lgtm -- just need to run it through the integration tests again (hopefully this afternoon after the build completes).

@dillaman dillaman merged commit 988c300 into ceph:master Aug 8, 2017
@branch-predictor branch-predictor deleted the bp-parallel-rbd-lsl branch January 24, 2018 11:28