ceph-disk: set the default systemd unit timeout to 3h #15585

Merged
merged 1 commit into ceph:master Jul 1, 2017

5 participants
ghost commented Jun 8, 2017

There needs to be a timeout to prevent ceph-disk from hanging
forever. But there is no good reason to set it to a value that is less
than a few hours.

Each OSD activation needs to happen in sequence, not in parallel,
which is why there is a global activation lock.

It would be possible, when an OSD uses a device that is not
otherwise used by another OSD (i.e. they do not share an SSD journal
device, etc.), to run all activations in parallel. It would, however,
require a more extensive modification of ceph-disk to avoid any chance
of races.
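The pattern described above (serialize activations behind one global lock, but bound the wait so a hung activation cannot block systemd forever) can be sketched as follows. This is an illustrative sketch, not the actual ceph-disk code: the lock path and the `activate_osd` helper are made up for the example, and only the 10800-second (3 h) value comes from this PR.

```shell
#!/bin/sh
# Hypothetical sketch of the serialization-with-timeout idea in this PR.
# flock(1) provides the global activation lock: a second caller blocks
# until the first releases it, so activations run strictly in sequence.
# timeout(1) bounds the whole operation at 10800 s (3 h), the default
# proposed here, so a hung activation eventually fails instead of
# blocking forever. Lock path and function name are assumptions.
LOCK="${TMPDIR:-/tmp}/ceph-disk-activate.lock"

activate_osd() {
    # Stand-in for: ceph-disk activate "$1"
    timeout 10800 flock "$LOCK" echo "activating $1"
}

activate_osd /dev/sdb1
```

If the lock holder never finishes, `timeout` kills the waiter after 3 hours and the unit fails instead of hanging indefinitely.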

Fixes: http://tracker.ceph.com/issues/20229

Signed-off-by: Loic Dachary <loic@dachary.org>

ceph-disk: set the default systemd unit timeout to 3h

@ghost ghost requested a review from asheplyakov Jun 8, 2017

@ghost ghost added core bug fix labels Jun 8, 2017

ghost commented Jun 9, 2017

rebase & run the ceph-disk suite once #15576 is merged

@ghost ghost requested a review from ktdreyer Jun 9, 2017

@ktdreyer ktdreyer requested a review from alfredodeza Jun 12, 2017

alfredodeza commented Jun 12, 2017

@dachary can you explain why this needs to be in sequence (vs. parallel)? I understand you mention the global activation lock, but I am not clear why we are preventing activation in parallel.

I'm also a bit confused by "there is no good reason to set it to a value that is less than a few hours".

What is "a few hours"? And why is 3 hours a good number? Do you see a problem if this waits for 1 hour?

ghost commented Jun 12, 2017

If two activations run in parallel and both use the same SSD disk, they will race against each other. I picked 3 hours because I've seen a 60-disk activation take 15 minutes, and I suppose it's not impossible for an activation to take one or two hours. That's where it comes from: a gut feeling, because we don't have any actual numbers to support it.
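Since the right bound varies by deployment (the 60-disk example above took 15 minutes), an administrator may want a different value than the 3-hour default. The snippet below is a hypothetical systemd drop-in override, assuming the unit exposes the timeout as a `CEPH_DISK_TIMEOUT` environment variable; the variable name and drop-in path are assumptions, not confirmed by this thread.

```ini
# /etc/systemd/system/ceph-disk@.service.d/timeout.conf  (hypothetical path)
[Service]
# Assumes ceph-disk@.service consumes CEPH_DISK_TIMEOUT (in seconds);
# 10800 s = 3 h, the default proposed in this PR.
Environment=CEPH_DISK_TIMEOUT=10800
```

After adding a drop-in, `systemctl daemon-reload` is needed for it to take effect.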

alfredodeza commented Jun 12, 2017

@dachary can you clarify further what the problem is with using the same SSD disk and racing? Is this related to udev, or to Ceph itself?

ghost commented Jun 13, 2017

jenkins test this please

alfredodeza commented Jun 13, 2017

@dachary would you expand here on the problem? It isn't clear to me what races, why, and what the side effect is. I am happy to read anything you think can explain this condition as well. Anything, really :)

Thanks!

alfredodeza commented Jun 15, 2017

@dachary ping?

ghost commented Jun 15, 2017

@alfredodeza I'm not sure what this has to do with the topic of this pull request? I'm happy to discuss it in another context if you want; it's an interesting subject ;-)

alfredodeza commented Jun 15, 2017

@dachary this pull request is about timing out after 3 hours; you mention in the description that activation cannot happen in parallel, and later you said that this is needed to avoid a race condition.

That seems to be on topic for this PR: it is about understanding the context for timing out after 3 hours versus doing it in parallel. If this is documented somewhere I would be happy to look at it and add it to this PR for reference too; it will help others understand why this tool needs to time out after 3 hours, and not 300 seconds.

ghost commented Jun 15, 2017

What is off topic is articulating the detailed reasons for these races; let's move that discussion elsewhere. There is no reason to block this PR on the grounds that ceph-disk might be implemented differently and that a global lock is not the way to go, for whatever reason.

alfredodeza commented Jun 15, 2017

@dachary I didn't ask for a detailed explanation. This PR is very unclear as to why we need to wait 3 hours for a timeout.

I have not said that ceph-disk needs to be implemented differently, or that a global lock is the way to go.

All I am asking for is something a bit more than "because of races". What kind of "races"? The tracker ticket linked in the subject does not mention any kind of races, while this PR does. I am happy to see more documentation anywhere, so I am +1 if you want to pursue that.

So, for the sake of clarifying this PR and for future reference, could you please explain what races? Is that something caused by udev? Or is it Ceph that has issues with multiple OSDs getting activated?

@jdurgin

seems like this is a reasonable workaround

@jdurgin jdurgin added the needs-qa label Jun 28, 2017

@liewegas liewegas merged commit 70a9907 into ceph:master Jul 1, 2017

6 checks passed

- Signed-off-by: all commits in this PR are signed
- Unmodifed Submodules: submodules for project are unmodified
- Unmodified Submodules: submodules for project are unmodified
- arm64 make check: arm64 make check succeeded
- default: Build finished.
- make check: make check succeeded