
mon/OSDMonitor: do not send_pg_creates with stale info #17065

Merged
merged 1 commit into from Aug 23, 2017


tchaikov
Contributor

We reset creating_pgs with the newly accepted Paxos proposal, but
creating_pgs_by_osd_epoch is then out of sync with the new creating_pgs,
so we risk using a stale creating_pgs_by_osd_epoch together with the
new creating_pgs.pgs. To avoid this race, check creating_pgs_epoch
before sending pg-creates based on creating_pgs_by_osd_epoch.

Fixes: http://tracker.ceph.com/issues/20785
Signed-off-by: Kefu Chai <kchai@redhat.com>
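A minimal C++ sketch of the race and the guard described in the commit message, assuming a toy monitor. Only creating_pgs, creating_pgs_epoch and creating_pgs_by_osd_epoch are names taken from this PR; ToyOsdMonitor, committed_epoch, on_proposal_accepted(), finish_mapping() and the `pg % num_osds` placement are hypothetical simplifications, not Ceph's actual code or CRUSH mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Hypothetical stand-ins; the real monitor uses Ceph's epoch_t and pg types.
using epoch_t = std::uint32_t;
using pg_id = int;

// Toy model of the state named in the commit message (not Ceph's code).
struct ToyOsdMonitor {
  // Committed set of PGs to create, reset from each accepted proposal.
  std::set<pg_id> creating_pgs;
  epoch_t committed_epoch = 0;  // hypothetical: epoch of that proposal

  // Derived index osd -> (epoch -> PGs), rebuilt asynchronously.
  std::map<int, std::map<epoch_t, std::set<pg_id>>> creating_pgs_by_osd_epoch;
  epoch_t creating_pgs_epoch = 0;  // epoch the index was built for

  void on_proposal_accepted(std::set<pg_id> pgs, epoch_t e) {
    creating_pgs = std::move(pgs);
    committed_epoch = e;
    // The index still reflects the previous creating_pgs at this point.
  }

  // Rebuild the index; PG placement (pg % num_osds) is purely illustrative.
  void finish_mapping(int num_osds) {
    assert(num_osds > 0);
    creating_pgs_by_osd_epoch.clear();
    for (pg_id pg : creating_pgs) {
      creating_pgs_by_osd_epoch[pg % num_osds][committed_epoch].insert(pg);
    }
    creating_pgs_epoch = committed_epoch;
  }

  // Number of PG creates that would be sent to osd; 0 if the index is stale.
  std::size_t send_pg_creates(int osd) const {
    if (creating_pgs_epoch != committed_epoch) {
      return 0;  // do not mix the new creating_pgs with a stale index
    }
    std::size_t n = 0;
    auto it = creating_pgs_by_osd_epoch.find(osd);
    if (it != creating_pgs_by_osd_epoch.end()) {
      for (const auto& entry : it->second) {
        n += entry.second.size();
      }
    }
    return n;
  }
};
```

In this model a send requested between on_proposal_accepted() and finish_mapping() is skipped instead of pairing the freshly reset creating_pgs with the old per-OSD index.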

<< " not using stale creating_pgs@" << creating_pgs_epoch << dendl;
// the subscribers will be updated when the mapping is completed anyway
return next;
}
Member


Hmm, we don't send creates on a timer or anything that I can see. Is it really safe to assume we'll get asked to send them again if we just skip it here?

Contributor Author


@gregsfortytwo C_UpdateCreatingPGs calls update_creating_pgs() and check_pg_creates_subs() once it is done with the mapping. So I think that once creating_pgs_by_osd_epoch is synced with creating_pgs.pgs, we will send the PG creates to the OSDs.

And we always try to start the mapping when the osdmap is committed (see OSDMonitor::on_active()), so I believe it is safe to assume the skipped creates will be sent later.
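The argument above, that a skipped send is retried once the mapping completes, can be sketched as a toy dispatcher. This is an illustration under assumptions, not Ceph's code: the class, its methods and the synchronous run_mapping_jobs() stand-in for the asynchronous mapper are hypothetical. on_map_committed() loosely plays the role of OSDMonitor::on_active() starting the mapping, and on_mapping_done() the role of C_UpdateCreatingPGs calling check_pg_creates_subs().

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <utility>
#include <vector>

using epoch_t = std::uint32_t;  // stand-in for Ceph's epoch_t

// Hypothetical toy dispatcher: a pg-create request that arrives while the
// per-OSD index is stale is skipped, and the mapping-completion callback
// re-checks every subscriber, so the skipped request is served later.
class ToyCreateDispatcher {
 public:
  void subscribe(int osd) { subscribers_.insert(osd); }

  // A new map was committed: the index is now stale, so queue a remap job.
  void on_map_committed(epoch_t e) {
    committed_epoch_ = e;
    jobs_.push_back([this, e] { on_mapping_done(e); });
  }

  // Synchronous stand-in for the asynchronous mapper finishing its work.
  void run_mapping_jobs() {
    std::vector<std::function<void()>> jobs;
    jobs.swap(jobs_);
    for (auto& job : jobs) job();
  }

  // Serve one subscriber; returns false (skips) if the index is stale.
  bool check_sub(int osd) {
    if (index_epoch_ != committed_epoch_) return false;
    sent_.emplace_back(osd, index_epoch_);
    return true;
  }

  const std::vector<std::pair<int, epoch_t>>& sent() const { return sent_; }

 private:
  // Completion callback: publish the rebuilt index, then re-check subscribers.
  void on_mapping_done(epoch_t e) {
    assert(e <= committed_epoch_);  // jobs never run ahead of commits
    index_epoch_ = e;
    for (int osd : subscribers_) check_sub(osd);
  }

  std::set<int> subscribers_;
  std::vector<std::function<void()>> jobs_;
  std::vector<std::pair<int, epoch_t>> sent_;
  epoch_t committed_epoch_ = 0;
  epoch_t index_epoch_ = 0;
};
```

If two maps are committed before the mapper runs, the first completion still sees a stale index and skips; only the job for the latest epoch actually sends, which matches the "do not send with stale info" intent.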

Member

gregsfortytwo left a comment


LGTM

@gregsfortytwo
Member

@tchaikov, should we run this through testing before merging?

@tchaikov
Contributor Author

@gregsfortytwo Yes, we should run it through the rados QA suite.

@tchaikov
Contributor Author

<error>
  <unique>0x0</unique>
  <tid>14</tid>
  <threadname>ms_dispatch</threadname>
  <kind>UninitCondition</kind>
  <what>Conditional jump or move depends on uninitialised value(s)</what>
  <stack>
    <frame>
      <ip>0x5EB725</ip>
      <obj>/usr/bin/ceph-mon</obj>
      <fn>OSDMonitor::send_pg_creates(int, Connection*, unsigned int) const</fn>
      <dir>/usr/src/debug/ceph-12.1.2-888-gb1de3d4/src/mon</dir>
      <file>OSDMonitor.cc</file>
      <line>3264</line>
    </frame>
    <frame>
      <ip>0x5EC7C4</ip>
      <obj>/usr/bin/ceph-mon</obj>
      <fn>OSDMonitor::check_pg_creates_sub(Subscription*)</fn>
      <dir>/usr/src/debug/ceph-12.1.2-888-gb1de3d4/src/mon</dir>
      <file>OSDMonitor.cc</file>
      <line>3152</line>
    </frame>

@tchaikov
Contributor Author

changelog

  • send PG creates only if creating_pgs_epoch is set
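Read together with the Valgrind report above (a conditional jump in OSDMonitor::send_pg_creates() depending on an uninitialised value), this changelog entry plausibly means creating_pgs_epoch could be tested before it was ever assigned. A hypothetical sketch, assuming 0 serves as an "index not built yet" sentinel that never identifies a real built index; ToyCreatingState, last_committed and mark_index_built() are illustrative names, not Ceph's.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;  // stand-in for Ceph's epoch_t

// Hypothetical illustration, not Ceph's code. If creating_pgs_epoch were
// declared without an initializer in a heap-allocated object, its value
// would be indeterminate until first assigned, and branching on it is what
// Valgrind's Memcheck reports as "Conditional jump or move depends on
// uninitialised value(s)". A default member initializer gives it a
// well-defined "not set" value instead.
struct ToyCreatingState {
  epoch_t creating_pgs_epoch = 0;  // assumed sentinel: 0 == index never built
  epoch_t last_committed = 0;      // hypothetical: latest committed epoch

  void mark_index_built(epoch_t e) {
    assert(e != 0);  // 0 is reserved for "not set"
    creating_pgs_epoch = e;
  }

  // Send pg creates only if the index epoch is set and not stale.
  bool can_send_pg_creates() const {
    return creating_pgs_epoch != 0 && creating_pgs_epoch == last_committed;
  }
};
```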

@tchaikov
Contributor Author

retest this please

@tchaikov
Contributor Author

@tchaikov
Contributor Author

/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_pidfile.sh:53: TEST_pidfile:  sleep 5
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_pidfile.sh:54: TEST_pidfile:  run_mon td/pidfile a --log-to-stderr -f
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_pidfile.sh:54: TEST_pidfile:  grep 'failed to lock pidfile'
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_pidfile.sh:54: TEST_pidfile:  return 1

3 participants