osdc: self-managed snapshot helper should catch decode exception #21804

dillaman · 2018-05-03T19:27:20Z

Fixes: http://tracker.ceph.com/issues/24000
Signed-off-by: Jason Dillaman <dillaman@redhat.com>

gregsfortytwo · 2018-05-03T21:00:14Z

Can you use more words to describe the issue here?
Why isn't the monitor setting some kind of error code? Can we do more careful detection than a try-catch on the client side?

dillaman · 2018-05-03T21:29:13Z

@gregsfortytwo The tracker includes the crash and the issue description. Basically, [1] returns success if the pool doesn't exist, but for a self-managed snap create the client expects the snap id in the response. I don't have a good sense of which mon commands should and should not return an error, given the expectation that you should be able to run the same command multiple times. Either way, I'd say it's good to sanitize your inputs to prevent crashes.

[1] https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L11718
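
For illustration, the client-side guard under discussion might look like the sketch below. It is not the actual Objecter code: `decode_u64`, `finish_snap_create`, the byte-vector payload, and the choice of `-EIO` are all assumptions made for the example. The point is only that a "success" reply with an empty payload should turn into an error code rather than an uncaught exception.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a buffer decode that throws on short input.
uint64_t decode_u64(const std::vector<unsigned char>& buf) {
  if (buf.size() < sizeof(uint64_t)) {
    throw std::out_of_range("buffer too short to hold a snap id");
  }
  uint64_t value = 0;
  std::memcpy(&value, buf.data(), sizeof(value));
  return value;
}

// Hypothetical completion helper: returns 0 and fills *snapid on success,
// or a negative errno if the reply failed or its payload cannot be decoded.
int finish_snap_create(int reply_code,
                       const std::vector<unsigned char>& payload,
                       uint64_t* snapid) {
  assert(snapid != nullptr);
  if (reply_code < 0) {
    return reply_code;  // the monitor already reported an error
  }
  try {
    *snapid = decode_u64(payload);
  } catch (const std::exception&) {
    // Success reply without a snap id (assumed mapping to -EIO).
    return -EIO;
  }
  return 0;
}
```

With this shape, an empty payload on a zero reply code surfaces to the caller as an error instead of crashing the process.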

gregsfortytwo · 2018-05-04T18:20:47Z

I think this is a monitor bug too; we need to reply success on already-deleted snapshots, but if you're racing snapshot deletes and deleting pools it's okay to fail. Maybe need to use a special error code to indicate it's a missing pool and not a missing snapshot...not sure?

But in any case we need the client to function correctly as well, not merely now but also when the monitor or network screws up, so...

gregsfortytwo

Reviewed-by: Greg Farnum <gfarnum@redhat.com>

dillaman · 2018-05-04T18:26:33Z

@gregsfortytwo Perhaps I can append a monitor commit to return -ENOENT in the case it wasn't a pool delete op and then silently succeed if the pool doesn't exist and it was a delete op?

gregsfortytwo · 2018-05-04T18:27:50Z

I haven't looked at what's causing the monitor to do this, but that sounds like a plausible solution.

dillaman · 2018-05-09T15:47:59Z

@gregsfortytwo appended a commit to have the mon return an -ENOENT error if the pool doesn't exist and it's not a create/delete op.
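
A rough sketch of the monitor-side rule described in this thread, assuming a hypothetical `PoolOp` enum and `preprocess_missing_pool` helper rather than the real `OSDMonitor` code: a delete against a missing pool silently succeeds, a create is passed through, and any other op is answered with `-ENOENT`.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <set>

// Hypothetical subset of pool operations, for illustration only.
enum class PoolOp { Create, Delete, CreateSelfManagedSnap, DeleteSelfManagedSnap };

// Returns true if the op was answered here (with the reply in *reply_code),
// false if it should continue to normal processing (*reply_code untouched).
bool preprocess_missing_pool(const std::set<int64_t>& existing_pools,
                             int64_t pool_id,
                             PoolOp op,
                             int* reply_code) {
  assert(reply_code != nullptr);
  if (op == PoolOp::Create || existing_pools.count(pool_id) > 0) {
    return false;  // nothing special to do at this stage
  }
  if (op == PoolOp::Delete) {
    *reply_code = 0;  // deleting an already-deleted pool stays idempotent
    return true;
  }
  *reply_code = -ENOENT;  // e.g. snap create/delete against a missing pool
  return true;
}
```

Under these assumptions, a self-managed snap create that races with a pool delete gets an explicit `-ENOENT` rather than a success reply with no snap id.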

gregsfortytwo

Reviewed-by: Greg Farnum <gfarnum@redhat.com>

This change to the preprocess_pool_op() behavior mimics what already exists in prepare_pool_op, so it looks good to me!

gregsfortytwo · 2018-05-09T20:05:47Z

Should this get a quick test in the librados api tests somewhere?

dillaman · 2018-05-09T23:14:28Z

@gregsfortytwo Perhaps, but do you have a suggestion for a test case? I don't think attempting to recreate a race condition by creating a pool, creating a self-managed snap, and deleting the pool would be of much help, and librados doesn't allow you to send direct pool ops. I've been running this PR under a different branch and through the original test case that hit the bug, and I haven't hit it again after 100+ runs.

gregsfortytwo · 2018-05-10T00:08:32Z

I was thinking of something more direct: checking that the monitor returns the correct error code when the pool doesn't exist, for both a delete and a snap delete op; not trying to catch the race and validate that the client behaved correctly.

dillaman · 2018-05-10T01:09:04Z

I was thinking that would be a race condition test since I would have to create an IoCtx against an existing pool, delete the pool and send the snap create before the map update is received (otherwise the op would have been aborted before it's sent since the pool does not exist).

gregsfortytwo · 2018-05-11T19:05:25Z

Ah right. I can’t think of any good tricks here after all and I wouldn’t block merging it on anything except a suite run.

tchaikov · 2018-05-12T11:20:01Z

http://pulpito.ceph.com/kchai-2018-05-12_08:15:46-rados-wip-kefu-testing-2018-05-12-1405-distro-basic-smithi/

tchaikov · 2018-05-12T11:22:39Z

@dillaman @gregsfortytwo shall we backport this change to mimic? my guess is "yes", but needs your confirmation. please remove the "mimic" in backport field in http://tracker.ceph.com/issues/24000 if i am wrong.

dillaman · 2018-05-12T11:51:44Z

@tchaikov Yeah, I'll need it there for rbd-mirror thrasher tests

dillaman added bug-fix core labels May 3, 2018

gregsfortytwo approved these changes May 4, 2018

Jason Dillaman added 2 commits May 9, 2018 11:31

osdc: self-managed snapshot helper should catch decode exception

43f7b17

Fixes: http://tracker.ceph.com/issues/24000 Signed-off-by: Jason Dillaman <dillaman@redhat.com>

mon: pool-ops against non-existent pools should return error

43dbdb9

Fixes: http://tracker.ceph.com/issues/24000 Signed-off-by: Jason Dillaman <dillaman@redhat.com>

dillaman force-pushed the wip-24000 branch from de45e26 to 43dbdb9 Compare May 9, 2018 15:46

dillaman mentioned this pull request May 9, 2018

rbd-mirror: optionally support active/active replication #21915

Merged

gregsfortytwo approved these changes May 9, 2018

gregsfortytwo added the needs-qa label May 9, 2018

tchaikov added the wip-kefu-testing label May 12, 2018

tchaikov merged commit ade3bce into ceph:master May 12, 2018

dillaman deleted the wip-24000 branch May 12, 2018 11:51

tchaikov mentioned this pull request May 12, 2018

mimic: osdc: self-managed snapshot helper should catch decode exception #21958

Merged
