mon/PGMap: call blocked requests ERR not WARN #15501

Merged
liewegas merged 1 commit into ceph:master from liewegas:wip-blocked-is-err on Jun 7, 2017

4 participants
@liewegas
Member

liewegas commented Jun 5, 2017

Signed-off-by: Sage Weil sage@redhat.com

@liewegas liewegas added core mon labels Jun 5, 2017

@gregsfortytwo

Member

gregsfortytwo commented Jun 5, 2017

I'm concerned that it's pretty common to get slow requests during peering and recovery processes on overloaded clusters. This is going to significantly turn up the amount of "errors" sys admins see and have to react to. Could we at least only make it an error when they stick around for a threshold higher than the one where we mark them slow?

@liewegas

Member

liewegas commented Jun 5, 2017

@gregsfortytwo

Member

gregsfortytwo commented Jun 5, 2017

Or maybe just make the err_op_age a multiplier on top of the warn age?

@liewegas liewegas requested a review from gregsfortytwo Jun 7, 2017

@tchaikov

aside from the typo, lgtm.

* The ``mon_osd_max_op_age`` option has been renamed to ``mon_osd_warn_op_age``,
to indicate we generate a warning at this age. There is also a new
``mon_osd_err_op_age_multiple`` that is expressed as a multiple of

@tchaikov

tchaikov Jun 7, 2017

Contributor

s/mon_osd_err_op_age_multiple/mon_osd_err_op_age_ratio/

@@ -288,7 +288,8 @@ OPTION(mon_osd_down_out_interval, OPT_INT, 600) // seconds
 OPTION(mon_osd_down_out_subtree_limit, OPT_STR, "rack") // smallest crush unit/type that we will not automatically mark out
 OPTION(mon_osd_min_up_ratio, OPT_DOUBLE, .3) // min osds required to be up to mark things down
 OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .75) // min osds required to be in to mark things out
-OPTION(mon_osd_max_op_age, OPT_DOUBLE, 32) // max op age before we get concerned (make it a power of 2)
+OPTION(mon_osd_warn_op_age, OPT_DOUBLE, 32) // max op age before we generate a warning (make it a power of 2)
+OPTION(mon_osd_err_op_age_ratio, OPT_DOUBLE, 2) // when to generate an error, as multiple of mon_osd_warn_op_age

@jdurgin

jdurgin Jun 7, 2017

Member

should add to the config ref docs

@gregsfortytwo

As discussed verbally, we want to wait a loooong time before promoting blocked ops to an error since they're very likely to be caused by general slowness or else other cluster state issues.
The commit message now incorrectly refers to 2x, but otherwise:

Reviewed-by: Greg Farnum gfarnum@redhat.com

mon/PGMap: call requests blocked for 128x as long ERR not WARN

- rename the option (max -> warn)
- add an err_..._ratio multiplier
- switch to HEALTH_ERR once requests are blocked long enough
- make the error ratio high (default is 32*128s -> about an hour) so that
  we don't trigger on a heavily loaded cluster.

Signed-off-by: Sage Weil <sage@redhat.com>
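
For readers following the thread, the threshold arithmetic in the commit message above can be sketched as below. This is only an illustration, not the code merged in this PR: the `Severity` enum and the `classify_blocked_op` function are hypothetical names, the use of `>=` at each boundary is an assumption, and the default ratio of 128 is taken from the final commit message (the diff hunk quoted earlier in the review still shows 2).

```cpp
#include <cassert>

// Hypothetical severity levels for this sketch; Ceph's real health
// status type is not reproduced here.
enum class Severity { Ok, Warn, Err };

// Classify the age (in seconds) of the oldest blocked request.
// warn_age mirrors mon_osd_warn_op_age and err_ratio mirrors
// mon_osd_err_op_age_ratio; the function itself is illustrative only.
Severity classify_blocked_op(double op_age,
                             double warn_age = 32.0,
                             double err_ratio = 128.0) {
  assert(warn_age > 0 && err_ratio >= 1);
  // The error threshold is a multiple of the warning threshold:
  // 32 s * 128 = 4096 s, a little over an hour, with these defaults.
  const double err_age = warn_age * err_ratio;
  if (op_age >= err_age) return Severity::Err;   // assumed inclusive bound
  if (op_age >= warn_age) return Severity::Warn; // assumed inclusive bound
  return Severity::Ok;
}
```

With these defaults a request blocked for 60 seconds would be reported as a warning, and it would only be promoted to an error after 4096 seconds, which matches the intent of not raising errors on a cluster that is merely heavily loaded.
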
@liewegas

Member

liewegas commented Jun 7, 2017

fixed!

@liewegas liewegas merged commit 2cbcec1 into ceph:master Jun 7, 2017

1 of 3 checks passed

Unmodifed Submodules: checking if PR has modified submodules
default: Build triggered. sha1 is merged.
Signed-off-by: all commits in this PR are signed

@liewegas liewegas deleted the liewegas:wip-blocked-is-err branch Jun 7, 2017
