osdc: self-managed snapshot helper should catch decode exception #21804
Conversation
Can you use more words to describe the issue here?
@gregsfortytwo The tracker includes the crash and the issue description. Basically, [1] returns success if the pool doesn't exist, but for self-managed snap create the client expects the snap id in the response. I'm never sure which mon commands should and shouldn't return an error, given the expectation that you should be able to run the same command multiple times. Either way, it's good to sanitize your inputs to prevent crashes. [1] https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L11718
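For illustration, a minimal self-contained sketch of the defensive pattern being proposed (these are made-up stand-in names, not Ceph's actual Objecter/bufferlist code): wrap the reply decode in a try/catch so a "success" reply with an empty payload becomes an error code instead of an uncaught exception crashing the client:

```cpp
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Stand-in for a wire-format decoder: throws when the reply payload
// is shorter than the value we expect to decode (analogous to
// buffer::end_of_buffer in Ceph).
struct decode_error : std::runtime_error {
  decode_error() : std::runtime_error("end of buffer") {}
};

static uint64_t decode_u64(const std::vector<uint8_t>& payload, size_t& off) {
  if (payload.size() - off < sizeof(uint64_t))
    throw decode_error();
  uint64_t v;
  std::memcpy(&v, payload.data() + off, sizeof(v));
  off += sizeof(v);
  return v;
}

// Guarded snap-create completion: a success reply whose payload is
// missing the snap id is converted into an error instead of a crash.
int finish_snap_create(const std::vector<uint8_t>& payload, uint64_t* psnapid) {
  size_t off = 0;
  try {
    *psnapid = decode_u64(payload, off);
  } catch (const decode_error&) {
    return -EIO;  // malformed/empty reply from the monitor
  }
  return 0;
}
```

The point is only the shape of the fix: the decode of an untrusted reply is treated as fallible, and the failure is surfaced as an ordinary return code the caller already knows how to handle.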
I think this is a monitor bug too; we need to reply success on already-deleted snapshots, but if you're racing snapshot deletes and deleting pools it's okay to fail. Maybe we need to use a special error code to indicate it's a missing pool and not a missing snapshot...not sure? But in any case we need the client to function correctly as well, not merely now but when the monitor or network screws up, so...
Reviewed-by: Greg Farnum gfarnum@redhat.com
@gregsfortytwo Perhaps I can append a monitor commit to return
I haven't looked at what's causing the monitor to do this, but that sounds like a plausible solution.
Fixes: http://tracker.ceph.com/issues/24000
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@gregsfortytwo appended a commit to have the mon return an
Reviewed-by: Greg Farnum gfarnum@redhat.com
This change to the preprocess_pool_op() behavior mimics what already exists in prepare_pool_op, so it looks good to me!
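Conceptually, the check amounts to answering ENOENT up front when the pool id is no longer in the map, rather than falling through to a "success" reply with an empty payload. A toy model of that shape (hypothetical names, not the real OSDMonitor/OSDMap types):

```cpp
#include <cerrno>
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for the OSD map's pool table.
struct ToyOSDMap {
  std::map<int64_t, std::string> pools;
  bool have_pg_pool(int64_t id) const { return pools.count(id) != 0; }
};

// Sketch of the preprocess-stage guard: a pool op against a deleted
// pool is rejected explicitly instead of being "processed" into an
// empty success reply.
int preprocess_pool_op_sketch(const ToyOSDMap& osdmap, int64_t pool_id) {
  if (!osdmap.have_pg_pool(pool_id))
    return -ENOENT;  // pool already deleted: tell the client so
  return 0;          // pool exists: continue with normal processing
}
```

With both sides fixed, the client's decode guard is a belt-and-suspenders fallback and the monitor gives a well-defined answer in the race.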
Should this get a quick test in the librados api tests somewhere?
@gregsfortytwo Perhaps, but do you have a suggestion for a test case? I don't think attempting to recreate a race condition by creating a pool, creating a self-managed snap, and deleting the pool would be of much help, and librados doesn't allow you to send direct pool ops. I've been running this PR under a different branch through the original test case that hit the bug, and I haven't hit it again after 100+ runs.
I was thinking more directly that the monitor returns the correct error code on pool-doesn't-exist for both a delete and a snap delete op; not trying to catch the race and validate the client behaved correctly.
I was thinking that would be a race condition test, since I would have to create an IoCtx against an existing pool, delete the pool, and send the snap create before the map update is received (otherwise the op would have been aborted before it's sent, since the pool does not exist).
Ah right. I can’t think of any good tricks here after all and I wouldn’t block merging it on anything except a suite run. |
@dillaman @gregsfortytwo Shall we backport this change to mimic? My guess is "yes", but it needs your confirmation. Please remove "mimic" from the backport field in http://tracker.ceph.com/issues/24000 if I am wrong.
@tchaikov Yeah, I'll need it there for rbd-mirror thrasher tests |
Fixes: http://tracker.ceph.com/issues/24000
Signed-off-by: Jason Dillaman <dillaman@redhat.com>