New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: osd-scrub-repair.sh disable scrub backoff in test #13334

Merged
merged 1 commit into from Feb 10, 2017

Conversation

Projects
None yet
2 participants
@tchaikov
Contributor

tchaikov commented Feb 9, 2017

there is good chance that the newly scheduled scrub fail to pass the
flip-coin check: the probability of scheduling a new scrub 1/3 every
time we flip the coin. so, let's be more patient.

Signed-off-by: Kefu Chai kchai@redhat.com

@tchaikov tchaikov requested review from and dzafman Feb 9, 2017

@dzafman

This comment has been minimized.

Member

dzafman commented Feb 9, 2017

@tchaikov Instead maybe we should add a configuration to the random scrub backoff and then disable it for this test instead of sleeping.

@dzafman

Really I think we should add the configuration variable per my comment.

grep -q "Regular scrub request, losing deep-scrub details" $dir/osd.${primary}.log || return 1
# the probability of flipping tail (not being scheduled) for 10 times in a
# raw is (2/3)^60 = 2.7e-11, which is approximately 0.
local -a delays=($(get_timeout_delays 60 1))

This comment has been minimized.

@dzafman

dzafman Feb 9, 2017

Member

In the spirit of keeping "make check" as fast as possible it seems like starting with 60 second sleep is too much.

This comment has been minimized.

@tchaikov

tchaikov Feb 9, 2017

Contributor

60 is seconds to timeout, 1 is the size of the first step.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 9, 2017

@dzafman

Instead maybe we should add a configuration to the random scrub backoff and then disable it for this test instead of sleeping.

  bool coin_flip = (rand() % 3) == whoami % 3;
  return !coin_flip; // false to schedule the scrub immediately

how about we having something like?

OPTION(osd_scrub_backoff_ratio, OPT_DOUBLE, .33)   // the probability to back off the scheduled scrub
// ....
   bool coin_flip = rand() % (double)RAND_MAX >= _conf->osd_scrub_backoff_ratio;
   return !coin_flip;

and we can this the ratio to 0.0 before the test, so the scrub is always scheduled right after the pg_scrub call? but we need to wait for at least 1 second to make sure that the tick() is called.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 9, 2017

@dzafman fixed and repushed.

changelog

  • see the commit message .

@tchaikov tchaikov changed the title from tests: osd-scrub-repair.sh be more patient waiting for the scrub to tests: osd-scrub-repair.sh disable scrub backoff in test Feb 9, 2017

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 9, 2017

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/erasure-code/test-erasure-eio.sh: line 73:  9789 Terminated              rados --pool $poolname put $objname $dir/ORIGINAL
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/erasure-code/test-erasure-eio.sh:84: rados_put:  return 1

retest this please.

@@ -6582,7 +6582,8 @@ void OSD::handle_scrub(MOSDScrub *m)
bool OSD::scrub_random_backoff()
{
bool coin_flip = (rand() % 3) == whoami % 3;
bool coin_flip = (rand() / (double)RAND_MAX >=
cct->_conf->osd_scrub_backoff_ratio);

This comment has been minimized.

@dzafman

dzafman Feb 9, 2017

Member

The purpose of using whoami before must have been that if all the OSDs booted at the same time and the values from rand() produced the same sequence on all OSDs (seeding off the clock?), then all OSDs would backoff at the same time. I'm confused about rand() and thread safety. I found that srand() is being called multiple times on OSD start but not sure what that means. I filed http://tracker.ceph.com/issues/18873

This comment has been minimized.

@tchaikov

tchaikov Feb 9, 2017

Contributor

The purpose of using whoami before must have been that if all the OSDs booted at the same time and the values from rand() produced the same sequence on all OSDs (seeding off the clock?), then all OSDs would backoff at the same time.

probably, if this is the expected behavior, then it's broken anyway. as we are using WeightedPriorityQueue in OSD, and it calls srand(time(0)). and i don't think there is good chance that all OSDs boot up at the same time. and what's the merit of backing off at the same time? i doubt it.

@@ -6582,7 +6582,8 @@ void OSD::handle_scrub(MOSDScrub *m)
bool OSD::scrub_random_backoff()
{
bool coin_flip = (rand() % 3) == whoami % 3;
bool coin_flip = (rand() / (double)RAND_MAX >=
cct->_conf->osd_scrub_backoff_ratio);
if (!coin_flip) {

This comment has been minimized.

@dzafman

dzafman Feb 9, 2017

Member

Actually the old code was backing off 2/3 of the time because this says !coin_flip. So maybe change >= above to <.

pg_scrub $pg
sleep 1
set_config osd ${primary} osd_scrub_backoff_ratio $scrub_backoff_ratio

This comment has been minimized.

@dzafman

dzafman Feb 9, 2017

Member

I don't see where you are setting osd_scrub_backoff_ratio to 0.0

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Feb 9, 2017

@dzafman fixed and repushed

tests: osd-scrub-repair.sh disable scrub backoff in test
there is good chance that the newly scheduled scrub fail to pass the
flip-coin check: the probability of scheduling a new scrub 1/3 every
time we flip the coin. so, instead hardwiring the probability to 1/3,
a new option named "osd_scrub_backoff_ratio" is introduced. so we can
override it when performing test.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@ghost

ghost approved these changes Feb 9, 2017

@dzafman

dzafman approved these changes Feb 9, 2017

@tchaikov tchaikov merged commit 7caa2d7 into ceph:master Feb 10, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details

@tchaikov tchaikov deleted the tchaikov:wip-osd-scrub-repair-more-patient branch Feb 10, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment