Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: do not forget pg_stat acks which failed to send #16702

Merged
merged 1 commit into from Aug 10, 2017

Conversation

hjwsm1989
Copy link
Contributor

If osd get network error when sending pg_stats, osd will
resend the pg_stats with tid+1, so the former tid will remain
in outstanding_pg_stats. In osd tick(), if the outstanding_pg_stats's
size > osd_mon_report_max_in_flight(default:2), it will refuse to
send pg_stats, that will block pg states from changing.
Finally will fail qa tests like resolve_stuck_peering.py.

Signed-off-by: huangjun huangjun@xsky.com

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Maybe change the commit title to be a little more descriptive. "osd: do not forget pg_stat acks which failed to send"?

src/osd/OSD.cc Outdated
// this can happen when the pg_stats doesn't send
// successfully.
for (auto t : outstanding_pg_stats) {
if (t < ack_tid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well make this <= and remove the separate erase call preceding it?

@hjwsm1989 hjwsm1989 changed the title osd: fix the pg_stats bug osd: do not forget pg_stat acks which failed to send Jul 31, 2017
@hjwsm1989 hjwsm1989 force-pushed the wip-recover-timeout-expired branch from 4a1084a to fd65092 Compare July 31, 2017 23:54
@hjwsm1989
Copy link
Contributor Author

@gregsfortytwo updated

src/osd/OSD.cc Outdated
@@ -6090,7 +6091,13 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack)
}
}

outstanding_pg_stats.erase(ack->get_tid());
// if there are earlyer pg_stats doesn't acked,
Copy link
Contributor

@tchaikov tchaikov Aug 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// if there are pg-stats not yet acked,
// this happens when pg-stats is not sent successfully.

src/osd/OSD.cc Outdated
@@ -6058,8 +6058,9 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack)
stats_ack_timeout * cct->_conf->osd_stats_ack_timeout_decay);
dout(20) << __func__ << " timeout now " << stats_ack_timeout << dendl;

if (ack->get_tid() > pg_stat_tid_flushed) {
pg_stat_tid_flushed = ack->get_tid();
uint64_t ack_tid = ack->get_tid();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, might want to mark ack_tid as a const, like

const auto ack_tid = ack->get_tid();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but, IMHO, it would be better just to reference ack_tid by ack->get_tid().

src/osd/OSD.cc Outdated
// this can happen when the pg_stats doesn't send
// successfully.
for (auto t : outstanding_pg_stats) {
if (t <= ack_tid)
Copy link
Contributor

@tchaikov tchaikov Aug 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot erase from an std::set<> when iterating through it. instead we should use something like:

  for (auto tid = outstanding_pg_stats.cbegin();
       tid != outstanding_pg_stats.cend(); ) {
    if (*tid <= ack_tid) {
      tid = outstanding_pg_stats.erase(tid);
    } else {
      break;
    }
  }

also, since the connection does not send pg-stats out-of-order, neither does monitor ack the pg-stats out-of-order, we can assume that all tids after ack_tid in outstanding_pg_stats are greater than it. so we can break once *tid > ack_tid.

@hjwsm1989
Copy link
Contributor Author

@tchaikov updated

src/osd/OSD.cc Outdated
outstanding_pg_stats.erase(ack->get_tid());
// if there are earlier pg_stats doesn't acked,
// this can happen when the pg_stats doesn't send
// successfully.
Copy link
Contributor

@tchaikov tchaikov Aug 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjwsm1989

nit, would be ideal if we can fix the syntax error in the comment.

// if there are earlier pg-stats not yet acked, 
// this happens if they are not sent successfully.

  If osd get network error when sending pg_stats, osd will
  resend the pg_stats with tid+1, so the former tid will remain
  in outstanding_pg_stats. In osd tick(), if the outstanding_pg_stats's
  size > osd_mon_report_max_in_flight(default:2), it will refuse to
  send pg_stats, that will block pg states from changing.
  Finally will fail qa tests like resolve_stuck_peering.py.

  Signed-off-by: huangjun <huangjun@xsky.com>
@hjwsm1989
Copy link
Contributor Author

@tchaikov thank you advice

@tchaikov
Copy link
Contributor

tchaikov commented Aug 1, 2017

retest this please.

@gregsfortytwo
Copy link
Member

jenkins test this please

@yuriw
Copy link
Contributor

yuriw commented Aug 10, 2017

@yuriw yuriw merged commit 61e0999 into ceph:master Aug 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants