osd: do not forget pg_stat acks which failed to send #16702

Merged
yuriw merged 1 commit into ceph:master from hjwsm1989:wip-recover-timeout-expired on Aug 10, 2017

5 participants
@hjwsm1989
Contributor

commented Jul 31, 2017

If the OSD hits a network error while sending pg_stats, it resends
the pg_stats with tid+1, so the former tid remains in
outstanding_pg_stats. In the OSD's tick(), if the size of
outstanding_pg_stats is greater than osd_mon_report_max_in_flight
(default: 2), the OSD refuses to send pg_stats, which blocks PG
states from changing. This eventually causes QA tests such as
resolve_stuck_peering.py to fail.

Signed-off-by: huangjun <huangjun@xsky.com>
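
The failure mode described above can be reproduced with a toy model. This is only a sketch, not the real OSD code: `StatsReporter`, `record_send`, `can_send`, `ack_exact`, `ack_up_to`, and `run_scenario` are hypothetical names, the element type `uint64_t` is assumed from the `ack_tid` declaration in the diff, and the throttle check uses `>` as the description states. Three sends are lost, the fourth is acked, and the sketch reports whether tick() would still be allowed to send.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Toy stand-in for the OSD's pg_stats bookkeeping (hypothetical, not Ceph code).
struct StatsReporter {
  std::set<uint64_t> outstanding;  // plays the role of outstanding_pg_stats
  uint64_t next_tid = 0;
  std::size_t max_in_flight = 2;   // plays the role of osd_mon_report_max_in_flight

  // tick() refuses to send once too many tids are unacked.
  bool can_send() const { return outstanding.size() <= max_in_flight; }

  // Every send attempt, including a resend after a network error,
  // takes a fresh tid and records it as outstanding.
  uint64_t record_send() {
    outstanding.insert(++next_tid);
    return next_tid;
  }

  // Behaviour before the fix: forget only the tid that was acked.
  void ack_exact(uint64_t tid) { outstanding.erase(tid); }

  // Behaviour after the fix: forget the acked tid and every earlier one.
  void ack_up_to(uint64_t tid) {
    outstanding.erase(outstanding.begin(), outstanding.upper_bound(tid));
  }
};

// Three sends are lost, the fourth reaches the monitor and is acked.
// Returns whether the reporter may still send afterwards.
bool run_scenario(bool apply_fix) {
  StatsReporter r;
  r.record_send();  // tid 1, lost
  r.record_send();  // tid 2, lost
  r.record_send();  // tid 3, lost
  const uint64_t acked = r.record_send();  // tid 4, delivered
  if (apply_fix) {
    r.ack_up_to(acked);
  } else {
    r.ack_exact(acked);
  }
  return r.can_send();
}
```

In this model `run_scenario(false)` leaves tids 1 to 3 in the set, so the reporter stays throttled indefinitely, while `run_scenario(true)` clears them and reporting resumes.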

@tchaikov tchaikov self-requested a review Jul 31, 2017

@gregsfortytwo
Member

left a comment

LGTM.

Maybe change the commit title to be a little more descriptive. "osd: do not forget pg_stat acks which failed to send"?

// this can happen when the pg_stats doesn't send
// successfully.
for (auto t : outstanding_pg_stats) {
  if (t < ack_tid)

@gregsfortytwo

gregsfortytwo Jul 31, 2017

Member

Might as well make this <= and remove the separate erase call preceding it?

@hjwsm1989 hjwsm1989 changed the title from "osd: fix the pg_stats bug" to "osd: do not forget pg_stat acks which failed to send" Jul 31, 2017

@hjwsm1989 hjwsm1989 force-pushed the hjwsm1989:wip-recover-timeout-expired branch from 4a1084a to fd65092 Jul 31, 2017

@hjwsm1989

Contributor Author

commented Jul 31, 2017

@gregsfortytwo updated

@@ -6090,7 +6091,13 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack)
}
}

outstanding_pg_stats.erase(ack->get_tid());
// if there are earlyer pg_stats doesn't acked,

@tchaikov

tchaikov Aug 1, 2017

Contributor

// if there are pg-stats not yet acked,
// this happens when pg-stats is not sent successfully.

@@ -6058,8 +6058,9 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack)
stats_ack_timeout * cct->_conf->osd_stats_ack_timeout_decay);
dout(20) << __func__ << " timeout now " << stats_ack_timeout << dendl;

if (ack->get_tid() > pg_stat_tid_flushed) {
pg_stat_tid_flushed = ack->get_tid();
uint64_t ack_tid = ack->get_tid();

@tchaikov

tchaikov Aug 1, 2017

Contributor

nit, might want to mark ack_tid as a const, like

const auto ack_tid = ack->get_tid();

@tchaikov

tchaikov Aug 1, 2017

Contributor

but, IMHO, it would be better just to reference ack_tid by ack->get_tid().

// this can happen when the pg_stats doesn't send
// successfully.
for (auto t : outstanding_pg_stats) {
  if (t <= ack_tid)

@tchaikov

tchaikov Aug 1, 2017

Contributor

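we cannot erase from a std::set<> while iterating over it with a range-based for loop, because erasing the current element invalidates the iterator the loop is using. instead we should use something like: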

  for (auto tid = outstanding_pg_stats.cbegin();
       tid != outstanding_pg_stats.cend(); ) {
    if (*tid <= ack_tid) {
      tid = outstanding_pg_stats.erase(tid);
    } else {
      break;
    }
  }

also, since the connection does not send pg-stats out of order, and the monitor does not ack them out of order either, we can assume that all tids after ack_tid in outstanding_pg_stats are greater than it. so we can break once *tid > ack_tid.
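
For reference, the idiom from the comment above can be written as a standalone sketch. It assumes `outstanding_pg_stats` is a `std::set<uint64_t>` (the element type is inferred from `uint64_t ack_tid` in the diff), and `drop_acked_loop` and `drop_acked_range` are hypothetical helper names. Since `std::set` iterates in ascending order, the early `break` is safe, and a range erase up to `upper_bound(ack_tid)` should remove the same elements.

```cpp
#include <cstdint>
#include <set>

// Iterator-based loop: erase() returns the iterator that follows the
// removed element, so the loop never touches an invalidated iterator.
inline void drop_acked_loop(std::set<uint64_t>& outstanding, uint64_t ack_tid) {
  for (auto it = outstanding.cbegin(); it != outstanding.cend(); ) {
    if (*it <= ack_tid) {
      it = outstanding.erase(it);
    } else {
      break;  // remaining elements are all greater than ack_tid
    }
  }
}

// Alternative: erase the whole prefix [begin, first element > ack_tid).
inline void drop_acked_range(std::set<uint64_t>& outstanding, uint64_t ack_tid) {
  outstanding.erase(outstanding.begin(), outstanding.upper_bound(ack_tid));
}
```

Both helpers also remove `ack_tid` itself, which matches the earlier suggestion to use `<=` and drop the separate `erase(ack->get_tid())` call.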

@tchaikov tchaikov removed the needs-qa label Aug 1, 2017

@hjwsm1989 hjwsm1989 force-pushed the hjwsm1989:wip-recover-timeout-expired branch from fd65092 to da492c5 Aug 1, 2017

@hjwsm1989

Contributor Author

commented Aug 1, 2017

@tchaikov updated

@tchaikov tchaikov added the needs-qa label Aug 1, 2017

outstanding_pg_stats.erase(ack->get_tid());
// if there are earlier pg_stats doesn't acked,
// this can happen when the pg_stats doesn't send
// successfully.

@tchaikov

tchaikov Aug 1, 2017

Contributor

@hjwsm1989

nit, would be ideal if we can fix the syntax error in the comment.

// if there are earlier pg-stats not yet acked, 
// this happens if they are not sent successfully.
huangjun
osd: do not forget pg_stat acks which failed to send

  Signed-off-by: huangjun <huangjun@xsky.com>

@hjwsm1989 hjwsm1989 force-pushed the hjwsm1989:wip-recover-timeout-expired branch from da492c5 to edc7378 Aug 1, 2017

@hjwsm1989

Contributor Author

commented Aug 1, 2017

@tchaikov thank you for the advice

@tchaikov

Contributor

commented Aug 1, 2017

retest this please.

@gregsfortytwo

Member

commented Aug 2, 2017

jenkins test this please

@yuriw yuriw merged commit 61e0999 into ceph:master Aug 10, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded