New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

luminous: osd: scrub livelock #24396

Merged
merged 2 commits into from Oct 10, 2018

Conversation

@smithfarm
Contributor

smithfarm commented Oct 3, 2018

@smithfarm smithfarm self-assigned this Oct 3, 2018

@smithfarm smithfarm added this to the luminous milestone Oct 3, 2018

@smithfarm smithfarm requested review from tchaikov, xiexingguo and liewegas Oct 3, 2018

@smithfarm smithfarm changed the title from luminous: scrub livelock to luminous: osd: scrub livelock Oct 3, 2018

@@ -34,6 +34,7 @@
#include "include/types.h"
#include "include/compat.h"
#include "include/random.h"

This comment has been minimized.

@smithfarm

smithfarm Oct 3, 2018

Contributor
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc:37:28: fatal error: include/random.h: No such file or directory
compilation terminated.
src/osd/CMakeFiles/osd.dir/build.make:62: recipe for target 'src/osd/CMakeFiles/osd.dir/OSD.cc.o' failed
make[3]: *** [src/osd/CMakeFiles/osd.dir/OSD.cc.o] Error 1

This comment has been minimized.

@tchaikov

tchaikov Oct 3, 2018

Contributor
#include <random>
@smithfarm

This comment has been minimized.

Contributor

smithfarm commented Oct 3, 2018

Depends on 8ac5472 - how to work around that?

@smithfarm smithfarm added the DNM label Oct 3, 2018

// vary +/- 5% to avoid scrub scheduling livelocks
constexpr auto delta = 0.05;
return (OSD_TICK_INTERVAL *
ceph::util::generate_random_number(1.0 - delta, 1.0 + delta));

This comment has been minimized.

@tchaikov

tchaikov Oct 3, 2018

Contributor
  std::default_random_engine rng;
  return (OSD_TICK_INTERVAL *
          std::uniform_real_distribution{1.0 - delta, 1.0 + delta}(rng));
@@ -34,6 +34,7 @@
#include "include/types.h"
#include "include/compat.h"
#include "include/random.h"

This comment has been minimized.

@tchaikov

tchaikov Oct 3, 2018

Contributor
#include <random>

@smithfarm smithfarm force-pushed the smithfarm:wip-26932-luminous branch 2 times, most recently from 1d630fd to e76d74a Oct 3, 2018

@smithfarm smithfarm removed the DNM label Oct 3, 2018

@smithfarm smithfarm force-pushed the smithfarm:wip-26932-luminous branch from e76d74a to 5128171 Oct 3, 2018

constexpr auto delta = 0.05;
std::default_random_engine rng;
return (OSD_TICK_INTERVAL *
std::uniform_real_distribution<>{1.0 - delta, 1.0 + delta}(rng));

This comment has been minimized.

@smithfarm

smithfarm Oct 3, 2018

Contributor

@chardan Not sure of the syntax here . . . can you take a look?

This comment has been minimized.

@chardan

chardan Oct 3, 2018

Contributor

The syntax looks like it should be ok to me. Is there an error you're seeing? If so, what compiler flags are active (and what version is the compiler)?

Note that the random engine isn't seeded, so it probably will produce the same values each time you run it, which may not be what's desired (or may be fine, depending on what's expected). If not, add a random_device and use the default_random_engine's seed() member function. (In the Ceph/random library, the engine is a thread_local optional, seeding on the first call and then avoiding that hit on subsequent ones, at the expense of redirection via optional<>.)

This comment has been minimized.

@smithfarm

smithfarm Oct 3, 2018

Contributor

The error was because the <> was missing. I wasn't sure if it could be empty. Looks like it's OK now.

This comment has been minimized.

@tchaikov

tchaikov Oct 4, 2018

Contributor

i tested

#include <random>
#include <iostream>

using namespace std;

int main()
{
  constexpr auto delta = 0.05;
  std::default_random_engine rng;
  for (int i = 0; i < 10; i++) {
    cout << std::uniform_real_distribution{1.0 - delta, 1.0 + delta}(rng) << endl;
  }
}

on https://godbolt.org it compiles just fine. could you share with us what exactly the error is?

so it probably will produce the same values each time you run it

yeah. OSD::get_tick_interval() is called multiple times. the tick interval of OSDs would be the same, as all of them share the same seed 1u. so probably we could use whoami as the seed here. like

  std::default_random_engine rng{whoami};
  return (OSD_TICK_INTERVAL *
          std::uniform_real_distribution{1.0 - delta, 1.0 + delta}(rng));

This comment has been minimized.

@smithfarm

smithfarm Oct 5, 2018

Contributor

could you share with us what exactly the error is?

Now that it builds properly I'm reluctant to remove the <> and recompile, but afaicr the error was something like "Missing template argument".

@smithfarm

This comment has been minimized.

Contributor

smithfarm commented Oct 3, 2018

@tchaikov @chardan Thanks for the consultation. Looks like it's good to go.

@yuriw

This comment has been minimized.

Contributor

yuriw commented Oct 3, 2018

@tchaikov

the same seed used by all OSD instances is against the idea of 5128171

@yuriw

This comment has been minimized.

Contributor

yuriw commented Oct 4, 2018

@smithfarm @tchaikov this passed tests, so merge if all is done

@yuriw yuriw removed the wip-yuri-testing label Oct 4, 2018

osd: tick at OSD_TICK_INTERVAL, not heartbeat interval
Heartbeat inveral is not related!

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 7d76458)

@smithfarm smithfarm force-pushed the smithfarm:wip-26932-luminous branch from 5128171 to 7d5d55d Oct 5, 2018

@smithfarm

This comment has been minimized.

Contributor

smithfarm commented Oct 5, 2018

@tchaikov Added a commit - can you take a look?

@yuriw Once it is approved, this will need another round of testing.

ideally we should squash 7d5d55d into 63ac56a

@liewegas

This comment has been minimized.

Member

liewegas commented Oct 5, 2018

@tchaikov is the seed issue fixed in master?

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented Oct 5, 2018

@liewegas yes, it's fixed in master. because the generate_random_number() helpers use a thread local storage seed initialized by the first call for each specialization of EngineT template argument.

@yuriw

This comment has been minimized.

Contributor

yuriw commented Oct 5, 2018

osd: vary tick interval +/- 5% to avoid scrub livelocks
If you have two pgs that need to scrub on two OSDs, each the primary
for one pg and the replica for the other, you can end up in a livelock:

- both osds locally reserve a scrub slot
- both osds send a scrub schedule request
- both scrub requests are rejected
- both osds wait exactly 1 second
- repeat

Seems a bit unlikely, but I've seen test cases where it goes on more an
hour.

Fixes: http://tracker.ceph.com/issues/26890
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2011377)

Conflicts:
	src/osd/OSD.cc
- luminous does not have src/include/random.h; use #include <random>
  instead, seeding with whoami so each OSD gets a different series
  of pseudo-random numbers

@smithfarm smithfarm force-pushed the smithfarm:wip-26932-luminous branch from 7d5d55d to 8975880 Oct 5, 2018

@smithfarm

This comment has been minimized.

Contributor

smithfarm commented Oct 5, 2018

@tchaikov Squashed

@yuriw The squash operation should not have changed anything, so your test run can go forward.

@yuriw yuriw merged commit ccd69ef into ceph:luminous Oct 10, 2018

4 checks passed

Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details

@smithfarm smithfarm deleted the smithfarm:wip-26932-luminous branch Oct 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment