
mon: increase mon_lease to avoid timeouts #17169

Closed

Conversation

neha-ojha
Member

This PR increases the value of mon_lease with the aim of avoiding timeouts during the pool-creation smoke test.

Fixes: http://tracker.ceph.com/issues/20909
Signed-off-by: Neha Ojha nojha@redhat.com

Signed-off-by: Neha Ojha <nojha@redhat.com>
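For reference, mon_lease is normally overridden under the [mon] section of ceph.conf; a rough sketch of the kind of override this PR targets is shown below (the value 30 is taken from the discussion further down, and the default is 5 seconds; this is not the actual diff):

```ini
[mon]
    # illustrative value only; jecluis below refers to raising mon_lease to 30
    mon lease = 30
```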
@neha-ojha neha-ojha requested a review from jdurgin August 22, 2017 23:22
@dzafman
Contributor

dzafman commented Aug 22, 2017

@neha-ojha It isn't clear to me why a pool creation would stall the monitor enough that you need to increase mon_lease.

@xiexingguo
Member

Pool creation triggers a crush smoke test by default. The test takes longer when there aren't many OSDs, hence the TIMEOUT error.
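For context, the monitor's crush smoke test exercises the crush map much like a standalone crushtool run; a rough equivalent is sketched below (the map file name, mapping range, and replica count are illustrative assumptions, not the exact parameters the monitor uses):

```sh
# Grab the current crush map and replay the kind of mapping test the
# monitor's smoke test performs; a larger input range means a longer run.
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --min-x 0 --max-x 1023 --num-rep 3 --show-bad-mappings
```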

@xiexingguo
Member

I don't think this is an ideal fix; we should instead reduce the crush-testing inputs to shorten the overall testing time...

@jdurgin
Member

jdurgin commented Aug 23, 2017

I was thinking we could increase mon_lease just in the qa suite - e.g. qa/cluster/*.yaml. If folks have other ideas for avoiding this, that'd be great.

These are just coming from pool creation with normal test clusters, though - I'm not sure how else to prevent a loaded host or bad crush luck from sometimes causing it to take extra time.
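For illustration, a qa-only override of the kind jdurgin describes might look like the fragment below (the value and the exact file placement are assumptions, not the actual change):

```yaml
# hypothetical teuthology-style fragment, e.g. under qa/cluster/*.yaml
overrides:
  ceph:
    conf:
      mon:
        mon lease: 10
```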

@jecluis
Member

jecluis commented Aug 23, 2017

Yeah, I don't think it's a good idea to blindly increase mon_lease to 30. We rely on this to detect if monitors are dead, and until such time as we have a better mechanism, we shouldn't be increasing this value unless we are certain that it is indeed necessary.

@jdurgin even in the qa suite, I would first make sure increasing this value didn't break anything else. I can't recall whether we have tests that rely on monitors being marked down (and, if so, what their timeouts are), but if we do, increasing this value to 30 will delay said monitors being marked down by a long time.

Wouldn't it be better to account for this possibility in the smoke test instead? E.g., if it timed out, rerun the command a few times? Pool creation should be idempotent, if done right, and if the pool already exists the second time around I would presume the crush tool test would not be invoked -- and the command would simply return. Is this, for some reason, not feasible?
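A minimal sketch of that retry idea, assuming pool creation returns success when the pool already exists (pool name, pg count, and retry count are arbitrary here):

```sh
# Retry pool creation a few times; on a later attempt, an already-existing
# pool should short-circuit the crush smoke test and return immediately.
for attempt in 1 2 3; do
    ceph osd pool create mypool 64 && break
    sleep 5
done
```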

@liewegas
Member

Maybe a more modest increase (say, double it to 10s) just for qa?

It's hard to account for in the test because it's random CLI commands that trigger the check. It would be easier to have a hacky mon option that runs the check twice and only fails if the check fails twice (and enable that for qa only).

@xiexingguo
Member

#17179 might help, or at least reduce the chance of hitting this.

@liewegas
Member

Closing this one. We could (1) increase this just in qa, (2) try to do the smoke test multiple times, or (3) make the smoke test lighter-weight (as the other PR tries to do).

@liewegas liewegas closed this Aug 25, 2017