New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd-mirror: track images via global image id #13416

Merged
merged 7 commits into from Feb 16, 2017

Conversation

Projects
None yet
2 participants
@dillaman
Contributor

dillaman commented Feb 14, 2017

No description provided.

@dillaman dillaman requested a review from trociny Feb 14, 2017

@trociny trociny self-assigned this Feb 14, 2017

@trociny

This comment has been minimized.

Contributor

trociny commented Feb 14, 2017

@dillaman Don't you think we can observe the false alarms you fixed in "qa/workunits/rbd" commits when testing our stable branches? I mean that it might make sense to backport these and a separate PR would make it easier?

Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Feb 14, 2017

@trociny

Also, the commit log message "handle teuthology-specific race rbd-mirror race" looks too racy.

lol -- indeed. I'll put both of those fixes into a backport PR.

};
std::set<InitImageInfo> m_init_images;
std::set<ImageId> m_init_images;

This comment has been minimized.

@trociny

trociny Feb 14, 2017

Contributor

ImageIds?

std::swap(unregister_watch_ctx, m_unregister_watch_ctx);
if (r < 0) {
lderr(m_cct) << ": failed to register watch: " << cpp_strerror(r)

This comment has been minimized.

@trociny

trociny Feb 14, 2017

Contributor

While here it would be nice to fix the error message prefix (add this and remove ':').

template <typename I>
void RefreshImagesRequest<I>::handle_mirror_image_list(int r) {
dout(10) << ": r=" << r << dendl;

This comment has been minimized.

@trociny

trociny Feb 14, 2017

Contributor

redundant ": " here and in other messages below.

dillaman added some commits Nov 28, 2016

test: added missing IoCtx copy/assignment methods in librados_test_stub
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
rbd-mirror: utilize global image id as internal unique key
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
rbd-mirror: preliminary support to track multiple remote peer image s…
…ources

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
librbd: improved state tracking within object watcher
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
cls/rbd: async versions for dir_list and mirror_image_list
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
rbd-mirror: async mirror image refresh state machine
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
test: fix unused function warnings due to explicit template instantia…
…tion

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
@dillaman

This comment has been minimized.

Contributor

dillaman commented Feb 15, 2017

@trociny updates pushed

@trociny

@dillaman LGTM
Though I observed one test failure [1]. It failed in "create many images" stress test, because the replayers for many images had not been started.

I have failed to reproduce this rerunning the test many times and don't have an idea if the cause is your changes. It looks like an already existent issue.

[1] http://qa-proxy.ceph.com/teuthology/trociny-2017-02-14_22:14:05-rbd-wip-mgolub-testing---basic-smithi/815813/teuthology.log

@dillaman

This comment has been minimized.

Contributor

dillaman commented Feb 15, 2017

@trociny The ImageReplayer class instances for the down images are complaining that "start: already running" every minute so I don't think it's related to any changes in this PR. I'll see if I can locally reproduce to figure out how it got stuck in a pseudo-started state for the primary images. Most likely some odd ImageReplayer shut down (due to the image being primary) race.

@trociny

This comment has been minimized.

Contributor

trociny commented Feb 15, 2017

@dillaman I saw these "already started" messages but for me they looked like were for "normally started" replayers? Here are couple of examples:

2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990570 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:00:41.990575 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running
...
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997405 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8019700 [2/9c5abc40-0734-4d0e-a50a-9b471a103884] start: already running
2017-02-14 23:01:11.997410 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c8022130 [2/9f912876-1044-47bc-bbb1-9f357a136781] start: already running

and the expected reported status for these images up+stopped on cluster1:

2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_2:
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   619e3619-260d-48be-9712-3785f0c49e44
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.836 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.842 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_6:
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9c5abc40-0734-4d0e-a50a-9b471a103884
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.843 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:image_7:
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  global_id:   9f912876-1044-47bc-bbb1-9f357a136781
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  state:       up+stopped
2017-02-14T23:17:48.844 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  description: remote image is non-primary or local image is primary
2017-02-14T23:17:48.845 INFO:tasks.workunit.cluster1.client.mirror.smithi104.stderr:  last_update: 2017-02-14 23:17:42

I interpreted "already started" errors as that the daemon was so slow by some reason that previous "try to start, detect it is primary" iteration was still in progress when the new one was started.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Feb 15, 2017

@trociny In that case -- I'd hold off on merging this until I can repeat and ensure this isn't a regression.

@dillaman

This comment has been minimized.

Contributor

dillaman commented Feb 16, 2017

@trociny I've had the stress test running in a constant loop on my machine all day w/o fail until just now where the rbd CLI hung attempting to acquire the image status. It looks like it was a low-level objecter issue since it had a hung request that required restarting the OSD to recover. If the rbd-mirror daemon experienced a similar osdc / osd bug like I hit, that could possibly explain the error messages about "already started" if it was attempting to shut-down but librados was effectively non-responsive.

TL;DR: I think this PR is functional and I need to see if I can track down the osd / osdc issue that caused a permanently hung request.

@trociny

This comment has been minimized.

Contributor

trociny commented Feb 16, 2017

@dillaman Was that hung permanent in your case? Because in that teuthology test failure it looked like rather a slowdown, not a permanent hung:

log$ zcat *mirr*gz |grep '619e3619-260d-48be-9712-3785f0c49e44.*start: already running' |sort
2017-02-14 22:56:11.928297 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:00:41.990556 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:11.997393 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:01:42.003361 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:12.018426 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:03:42.023971 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:12.029526 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:04:42.033766 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running
2017-02-14 23:15:12.340505 7f2465865700 -1 rbd::mirror::ImageReplayer: 0x7f23c80024f0 [2/619e3619-260d-48be-9712-3785f0c49e44] start: already running

Note, the errors were not observed permanently. For me it looked like there were periods of time when the image replayer was too slow in its periodic run (start -> check the image -> detect it is primary -> stop), which lasted for up to 1.5 minutes and during these periods 'already running' errors observed from the next (30 sec interval) attempts. I thought the cause of this slowdown might have been a mess introduced by the injected socket failures.

@trociny trociny merged commit 321dc61 into ceph:master Feb 16, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details

@dillaman dillaman deleted the dillaman:wip-rbd-mirror-cleanup branch Feb 16, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment