
mon: increase mon_lease to avoid timeouts #17169

Closed

Conversation

neha-ojha
Member

This PR increases the value of mon_lease with the aim of avoiding timeouts during the pool-creation smoke test.

Fixes: http://tracker.ceph.com/issues/20909
Signed-off-by: Neha Ojha nojha@redhat.com

Signed-off-by: Neha Ojha <nojha@redhat.com>
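For reference, mon_lease is normally overridden under the [mon] section of ceph.conf; a rough sketch of the kind of override this PR targets is shown below (the value 30 is taken from the discussion further down, and the default is 5 seconds; this is not the actual diff):

```ini
[mon]
    # illustrative value only; jecluis below refers to raising mon_lease to 30
    mon lease = 30
```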
@neha-ojha neha-ojha requested a review from jdurgin August 22, 2017 23:22
@dzafman
Contributor

dzafman commented Aug 22, 2017

@neha-ojha It isn't clear to me why a pool creation would stall the monitor enough that you need to increase mon_lease.

@xiexingguo
Member

Pool creation triggers a crush smoke test by default. The test takes longer when there aren't many OSDs, hence the TIMEOUT error.
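For context, the monitor's crush smoke test exercises the crush map much like a standalone crushtool run; a rough equivalent is sketched below (the map file name, mapping range, and replica count are illustrative assumptions, not the exact parameters the monitor uses):

```sh
# Grab the current crush map and replay the kind of mapping test the
# monitor's smoke test performs; a larger input range means a longer run.
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --min-x 0 --max-x 1023 --num-rep 3 --show-bad-mappings
```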

@xiexingguo
Member

I don't think this is an ideal fix; we should instead reduce the crush-testing inputs to shorten the overall testing time...

@jdurgin
Member

jdurgin commented Aug 23, 2017

I was thinking we could increase mon_lease just in the qa suite - e.g. qa/cluster/*.yaml. If folks have other ideas for avoiding this, that'd be great.

These are just coming from pool creation with normal test clusters, though - I'm not sure how else to prevent a loaded host or bad crush luck from sometimes causing it to take extra time.
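For illustration, a qa-only override of the kind jdurgin describes might look like the fragment below (the value and the exact file placement are assumptions, not the actual change):

```yaml
# hypothetical teuthology-style fragment, e.g. under qa/cluster/*.yaml
overrides:
  ceph:
    conf:
      mon:
        mon lease: 10
```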

@jecluis
Member

jecluis commented Aug 23, 2017

Yeah, I don't think it's a good idea to blindly increase mon_lease to 30. We rely on this to detect if monitors are dead, and until such time as we have a better mechanism, we shouldn't be increasing this value unless we are certain that it is indeed necessary.

@jdurgin even in the qa suite, I would first make sure increasing this value didn't break anything else. I can't recall whether we have tests that rely on monitors being marked down (and, if so, what their timeouts are), but if we do, increasing this value to 30 will delay said monitors being marked down by a long time.

Wouldn't it be better to account for this possibility in the smoke test instead? E.g., if it timed out, rerun the command a few times? Pool creation should be idempotent, if done right, and if the pool already exists the second time around I would presume the crush tool test would not be invoked -- and the command would simply return. Is this, for some reason, not feasible?
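A minimal sketch of that retry idea, assuming pool creation returns success when the pool already exists (pool name, pg count, and retry count are arbitrary here):

```sh
# Retry pool creation a few times; on a later attempt, an already-existing
# pool should short-circuit the crush smoke test and return immediately.
for attempt in 1 2 3; do
    ceph osd pool create mypool 64 && break
    sleep 5
done
```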

@liewegas
Member

Maybe a more modest increase (say, double it to 10s) just for qa?

It's hard to account for in the test because it's random CLI commands that trigger the check. It would be easier to have a hacky mon option that runs the check twice and only fails if the check fails twice (and enable that for qa only).

@xiexingguo
Member

#17179 might help, or at least reduce the chance of hitting this.

@liewegas
Member

Closing this one. We could (1) increase this just in qa, (2) try to do the smoke test multiple times, or (3) make the smoke test lighter-weight (as the other PR tries to do).

@liewegas liewegas closed this Aug 25, 2017