Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osdc: self-managed snapshot helper should catch decode exception #21804

Merged
merged 2 commits into from May 12, 2018

Conversation

dillaman
Copy link

@dillaman dillaman commented May 3, 2018

Fixes: http://tracker.ceph.com/issues/24000
Signed-off-by: Jason Dillaman dillaman@redhat.com

@gregsfortytwo
Copy link
Member

Can you use more words to describe the issue here?
Why isn't the monitor setting some kind of error code? Can we do more careful detection than a try-catch on the client side?

@dillaman
Copy link
Author

dillaman commented May 3, 2018

@gregsfortytwo The tracker includes the crash and the issue description. Basically, [1] returns success if the pool doesn't exist, but for self-managed snap create it expects the snap id in the response. I tend not to understand which mon command should and should not throw an error given the expectation that you should be able to run the same command multiple times. However, I do say it's good to sanitize your inputs to prevent crashes.

[1] https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L11718

@gregsfortytwo
Copy link
Member

I think this is a monitor bug too; we need to reply success on already-deleted snapshots, but if you're racing snapshot deletes and deleting pools it's okay to fail. Maybe need to use a special error code to indicate it's a missing pool and not a missing snapshot...not sure?

But in any case we need the client to function correctly as well not merely now but when the monitor or network screws up, so...

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed-by: Greg Farnum gfarnum@redhat.com

@dillaman
Copy link
Author

dillaman commented May 4, 2018

@gregsfortytwo Perhaps I can append a monitor commit to return -ENOENT in the case it wasn't a pool delete op and then silently succeed if the pool doesn't exist and it was a delete op?

@gregsfortytwo
Copy link
Member

I haven't looked at what's causing the monitor to do this, but that sounds like a plausible solution.

Jason Dillaman added 2 commits May 9, 2018 11:31
@dillaman
Copy link
Author

dillaman commented May 9, 2018

@gregsfortytwo appended a commit to have the mon return an -ENOENT error if the pool doesn't exist and it's not a create/delete op.

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed-by: Greg Farnum gfarnum@redhat.com

This change to the preprocess_pool_op() behavior mimics what already exists in prepare_pool_op, so it looks good to me!

@gregsfortytwo
Copy link
Member

Should this get a quick test in the librados api tests somewhere?

@dillaman
Copy link
Author

dillaman commented May 9, 2018

@gregsfortytwo Perhaps, but do you have a suggestion for a test case? I don't think attempting to recreate a race condition by creating a pool, self-managed snap creating, and deleting a pool would be of much help and librados doesn't allow you to send direct pool ops. I've been running this PR under a different branch and through the original test case that hit the bug and I haven't hit it again after 100+ runs.

@gregsfortytwo
Copy link
Member

I was thinking more directly that the monitor returns the correct error code on pool doesn't exist for both a delete and a snap delete op; not trying to catch the race and validate the client behaved correctly.

@dillaman
Copy link
Author

I was thinking that would be a race condition test since I would have to create an IoCtx against an existing pool, delete the pool and send the snap create before the map update is received (otherwise the op would have been aborted before it's sent since the pool does not exist).

@gregsfortytwo
Copy link
Member

Ah right. I can’t think of any good tricks here after all and I wouldn’t block merging it on anything except a suite run.

@tchaikov tchaikov merged commit ade3bce into ceph:master May 12, 2018
@tchaikov
Copy link
Contributor

tchaikov commented May 12, 2018

@dillaman @gregsfortytwo shall we backport this change to mimic? my guess is "yes", but needs your confirmation. please remove the "mimic" in backport field in http://tracker.ceph.com/issues/24000 if i am wrong.

@dillaman dillaman deleted the wip-24000 branch May 12, 2018 11:51
@dillaman
Copy link
Author

@tchaikov Yeah, I'll need it there for rbd-mirror thrasher tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants