
librbd: fix rbd close race with rewatch #21141

Merged
merged 1 commit into from Apr 17, 2018

Conversation

@shun-s (Contributor) commented Mar 30, 2018

fix rbd close race with rewatch

Signed-off-by: Song Shun song.shun3@zte.com.cn

@shun-s (Contributor, Author) commented Mar 30, 2018

@dillaman I'm not sure about promoting m_lock from private to public, but I don't have another solution at hand.

@dillaman (Contributor) commented Mar 30, 2018

@shun-s Can you provide more background on the issue you are seeing? I would expect that when the image/CloseRequest state machine invokes Watcher::unregister_watch while the watcher is in the re-registering state, it waits for the re-watch to complete and then proceeds with the unregister.

@shun-s (Contributor, Author) commented Mar 31, 2018

@dillaman
The background is that we want to stop one librbd context (A) when an IO hang exceeds rados_osd_op_timeout, and then switch the IO over to another node (another librbd context, B).
At that point A has not been added to the blacklist yet, so it may rewatch again and again; the close then hangs and the IO cannot be switched over immediately.

> I would expect that in the image/CloseRequest state machine ...

done

@shun-s force-pushed the shun-s:fix-rbd-close-race-with-rewatch branch from dca59ca to b255a00 Mar 31, 2018

@dillaman (Contributor) commented Mar 31, 2018

@shun-s I am still confused why you need all these changes. Like I was saying, I would have expected the existing unregister call in the close image state machine to stop the re-watch. Do you have a log of what you are seeing?

@shun-s (Contributor, Author) commented Mar 31, 2018

@dillaman
That is what I want in the end, but I don't think the previous code achieves it.
When Watcher::unregister_watch is called from the closing-image state while the watch is registering or rewatching,
the unregister is delayed until registration completes [1], which means
unregister_watch has to wait until the rewatch succeeds or fails explicitly (e.g. -EBLACKLISTED).

void Watcher::unregister_watch(Context *on_finish) {
  if (m_watch_state == WATCH_STATE_REGISTERING ||
      m_watch_state == WATCH_STATE_REWATCHING) {
    ldout(m_cct, 10) << "delaying unregister until register completed"
                     << dendl;
    // the unregister is deferred (via m_unregister_watch_ctx) until the
    // in-flight (re)watch completes
    return;
  }
  // ...
}

But the rewatching logic is implemented in RewatchRequest [2], which simply rewatches again if the last rewatch failed, so unregister_watch in the closing-image state cannot stop the rewatch immediately; it has to wait for RewatchRequest to complete.
With rados_osd_op_timeout = 10, for example, and the public network down, the rewatch may keep failing and being retried again and again, which ultimately hangs the close process.
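For illustration only, a rough sketch of the retry pattern described above (the method shape and error handling are simplified and hypothetical; this is not the actual RewatchRequest code):

// illustrative sketch of the described retry loop, not the real librbd code
void RewatchRequest::handle_rewatch(int r) {
  if (r < 0 && r != -EBLACKLISTED) {
    // transient failure (e.g. rados_osd_op_timeout expired while the
    // public network is down): immediately issue another rewatch,
    // without checking whether a close/unregister is waiting
    rewatch();
    return;
  }
  // success or a fatal error such as blacklisting: complete the request
  finish(r);
}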

If the above is not clear or I am missing something, please let me know and I'll try again. Thanks.
[1] code link: Watcher::unregister_watch ("m_watch_state == WATCH_STATE_REWATCHING")

[2] code link: RewatchRequest
@shun-s force-pushed the shun-s:fix-rbd-close-race-with-rewatch branch from b255a00 to 8090051 Mar 31, 2018

@shun-s (Contributor, Author) commented Apr 3, 2018

@dillaman ping, does what I said make sense to you?

@dillaman (Contributor) commented Apr 3, 2018

@shun-s In that case, I think the solution is to have the RewatchRequest state machine bubble the watch error back up to Watcher so that it can decide to re-watch or proceed w/ the shut down.
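As a rough sketch of that direction (the method shape here is hypothetical, not the merged code), RewatchRequest would stop looping on failure and simply report its result, so the Watcher's completion handler receives the error and can choose between re-watching and shutting down:

// sketch: RewatchRequest reports its result instead of retrying internally
void RewatchRequest::handle_rewatch(int r) {
  // bubble success or failure up to the caller (the Watcher), which
  // decides whether to schedule another rewatch or proceed with shut down
  finish(r);
}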

@shun-s (Contributor, Author) commented Apr 4, 2018

retest this please

@shun-s (Contributor, Author) commented Apr 8, 2018

@dillaman Sorry for the late response. I have actually done what you suggested, but now I am having some trouble with the local unit tests on my laptop, like http://tracker.ceph.com/issues/23517 and others.
Once these are all sorted out, I'll let you know. Thanks again.

@shun-s force-pushed the shun-s:fix-rbd-close-race-with-rewatch branch 2 times, most recently from 0905230 to dd9557d Apr 9, 2018

@shun-s (Contributor, Author) commented Apr 9, 2018

@ceph-jenkins retest this please

@shun-s (Contributor, Author) commented Apr 9, 2018

The failed unit test seems unrelated to this PR:
ERROR: InvocationError: '/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/py27/bin/py.test --cov=. --cov-report= --junitxml=junit.py27.xml --doctest-modules controllers/rbd.py tests/

@dillaman ping, please review.

@dillaman (Contributor) commented Apr 9, 2018

retest this please

@@ -49,7 +49,7 @@ class ImageState {
int unregister_update_watcher(uint64_t handle);
void flush_update_watchers(Context *on_finish);
void shut_down_update_watchers(Context *on_finish);

@dillaman (Contributor) commented Apr 9, 2018

Nit: remove added spaces

bool is_unregistered() const {
  RWLock::RLocker locker(m_watch_lock);
  return m_watch_state == WATCH_STATE_UNREGISTERED;
}

void set_image_closing_state(bool state) {

@dillaman (Contributor) commented Apr 9, 2018

Shouldn't be needed

@@ -79,6 +84,7 @@ class Watcher {
   watcher::Notifier m_notifier;
   WatchState m_watch_state;
   AsyncOpTracker m_async_op_tracker;
+  bool m_image_closing = false;

@dillaman (Contributor) commented Apr 9, 2018

Same comment here

@dillaman (Contributor) commented Apr 10, 2018

... this is still not needed. You can detect if you have a pending unregister since m_unregister_watch_ctx would be non-null.

@shun-s (Contributor, Author) commented Apr 11, 2018

Very nice recommendation; the variable m_image_closing has been removed.
Please review.
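A minimal sketch of that check (simplified, with a hypothetical handler shape rather than the exact merged code): when a rewatch attempt completes, a non-null m_unregister_watch_ctx means an unregister is already waiting, so the watcher completes it instead of scheduling another rewatch.

// sketch: finish a pending unregister instead of re-watching forever
void Watcher::handle_rewatch_complete(int r) {
  Context *unregister_watch_ctx = nullptr;
  {
    RWLock::WLocker watch_locker(m_watch_lock);
    if (m_unregister_watch_ctx != nullptr) {
      // the close path already asked to unregister: stop re-watching
      unregister_watch_ctx = m_unregister_watch_ctx;
      m_unregister_watch_ctx = nullptr;
      m_watch_state = WATCH_STATE_UNREGISTERED;
    } else if (r < 0) {
      // rewatch failed and no unregister is pending: try again
      rewatch();
      return;
    } else {
      m_watch_state = WATCH_STATE_REGISTERED;
    }
  }
  if (unregister_watch_ctx != nullptr) {
    unregister_watch_ctx->complete(0);
  }
}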

@@ -193,7 +193,8 @@ void CloseRequest<I>::send_unregister_image_watcher() {

   CephContext *cct = m_image_ctx->cct;
   ldout(cct, 10) << this << " " << __func__ << dendl;

+  m_image_ctx->image_watcher->set_image_closing_state(true);

@dillaman (Contributor) commented Apr 9, 2018

Same comment here

@shun-s force-pushed the shun-s:fix-rbd-close-race-with-rewatch branch 2 times, most recently from 648252d to 96371e1 Apr 10, 2018

@shun-s (Contributor, Author) commented Apr 10, 2018

@dillaman all empty lines are removed, please review again.

@shun-s force-pushed the shun-s:fix-rbd-close-race-with-rewatch branch from 96371e1 to 8b833a2 Apr 11, 2018

librbd: fix rbd close race with rewatch
  fix rbd close race with rewatch

Signed-off-by: Song Shun <song.shun3@zte.com.cn>
@dillaman (Contributor) left a comment

lgtm

@dillaman merged commit c320e88 into ceph:master Apr 17, 2018

4 of 5 checks passed

make check (arm64): make check failed
Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded