
ceph-disk: set the default systemd unit timeout to 3h #15585

Merged
merged 1 commit into from Jul 1, 2017

Conversation

@ghost ghost commented Jun 8, 2017

There needs to be a timeout to prevent ceph-disk from hanging
forever. But there is no good reason to set it to a value that is less
than a few hours.

Each OSD activation needs to happen in sequence and not in parallel,
which is why there is a global activation lock.

It would be possible, when an OSD is using a device that is not
otherwise used by another OSD (i.e. they do not share an SSD journal
device etc.), to run all activations in parallel. It would, however,
require a more extensive modification of ceph-disk to avoid any chance
of races.

Fixes: http://tracker.ceph.com/issues/20229

Signed-off-by: Loic Dachary <loic@dachary.org>
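
For illustration only, here is a minimal sketch of how such a timeout can be expressed in a systemd unit, assuming an activation unit named ceph-disk@.service; the standard TimeoutSec= setting is used as a stand-in for whatever mechanism the actual unit in this PR uses:

```ini
# Sketch only: a hypothetical drop-in for ceph-disk@.service giving the
# activation a generous timeout instead of systemd's default.
[Service]
Type=oneshot
# 3 hours expressed in seconds; systemd also accepts time spans like "3h".
TimeoutSec=10800
```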

@ghost ghost requested a review from asheplyakov June 8, 2017 20:34
@ghost ghost added core bug-fix labels Jun 8, 2017
@ghost
Author

ghost commented Jun 9, 2017

rebase & run the ceph-disk suite once #15576 is merged

@ghost ghost requested a review from ktdreyer June 9, 2017 17:43
@alfredodeza
Contributor

@dachary can you explain why this needs to be in sequence (vs. parallel)? I understand you mention the global activation lock, but I am not clear on why we are preventing activation in parallel.

I'm also a bit confused by "there is no good reason to set it to a value that is less than a few hours".

What is "a few hours"? And why is 3 hours a good number? Do you see a problem if this waits for 1 hour?

@ghost
Author

ghost commented Jun 12, 2017

If two activations run in parallel and they both use the same SSD disk, they will race against each other. I picked 3 hours because I've seen a 60-disk activation take 15 minutes. I suppose it's not impossible for an activation to take one or two hours. That's where it comes from: a gut feeling, because we don't have any actual numbers to support it.
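
To make the serialization point concrete, here is a sketch of how a global activation lock and the timeout interact; the lock path, unit layout, and command line are illustrative assumptions, not the shipped unit. Every activation queues behind the same lock, so the last OSD on a large box waits for all the others, and the timeout has to cover that whole wait, not just a single activation:

```ini
# Sketch only: all activations serialize on one lock file, so the
# timeout must cover the wait in the queue plus the activation itself.
# In a templated unit, %f expands to the device path of the instance.
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'timeout 10800 flock /var/lock/ceph-disk.activate ceph-disk trigger --sync %f'
```

With the figure above (roughly 15 minutes for 60 disks), 3 hours leaves an order-of-magnitude margin for slower hardware or bigger boxes.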

@alfredodeza
Contributor

@dachary can you clarify a bit further what the problem is with using the same SSD disk and racing? Is this related to udev, or to Ceph itself?

@ghost
Author

ghost commented Jun 13, 2017

jenkins test this please

@alfredodeza
Contributor

@dachary would you expand here on the problem? It isn't clear to me what races, why, and what the side effect is. I am happy to go read anything that you think can explain this condition as well. Anything really :)

Thanks!

@alfredodeza
Contributor

@dachary ping?

@ghost
Author

ghost commented Jun 15, 2017

@alfredodeza I'm not sure what this has to do with the topic of this pull request? I'm happy to discuss that in another context if you want; it's an interesting subject ;-)

@alfredodeza
Contributor

@dachary this pull request is about timing out after 3 hours, and you mention in the description that activation cannot happen in parallel, and later you said that this is needed to avoid a race condition.

That seems to be on topic for this PR: helping understand the context for timing out after 3 hours instead of running activations in parallel. If this is documented somewhere I would be happy to look at it and add it to this PR for reference too; it would help others understand why this tool needs to time out after 3 hours, and not 300 seconds.

@ghost
Author

ghost commented Jun 15, 2017

What is off topic is articulating the detailed reasons for these races; let's move that discussion elsewhere. There is no reason to block this PR on the grounds that ceph-disk might be implemented differently and that a global lock is not the way to go, for whatever reason.

@alfredodeza
Contributor

@dachary I didn't ask for a detailed (?) explanation. This PR is very unclear as to why we need to wait 3 hours for a timeout.

I have not said that ceph-disk needs to be implemented differently, or that a global lock is the way to go.

All I am asking for is something a bit more than "because of races". What kind of "races"? The tracker ticket linked in the subject does not mention any kind of races, while this PR does. I am happy to see more documentation anywhere, so I am +1 if you want to pursue that.

So for the sake of clarifying this PR, and for future reference, could you please explain which races? Is that something caused by udev? Or is it Ceph that has issues with multiple OSDs getting activated?

Member

@jdurgin jdurgin left a comment

seems like this is a reasonable workaround
