New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: do not forget pg_stat acks which failed to send #16702
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Maybe change the commit title to be a little more descriptive. "osd: do not forget pg_stat acks which failed to send"?
src/osd/OSD.cc
Outdated
// this can happen when the pg_stats doesn't send | ||
// successfully. | ||
for (auto t : outstanding_pg_stats) { | ||
if (t < ack_tid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well make this <= and remove the separate erase call preceding it?
4a1084a
to
fd65092
Compare
@gregsfortytwo updated |
src/osd/OSD.cc
Outdated
@@ -6090,7 +6091,13 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack) | |||
} | |||
} | |||
|
|||
outstanding_pg_stats.erase(ack->get_tid()); | |||
// if there are earlyer pg_stats doesn't acked, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if there are pg-stats not yet acked,
// this happens when pg-stats is not sent successfully.
src/osd/OSD.cc
Outdated
@@ -6058,8 +6058,9 @@ void OSD::handle_pg_stats_ack(MPGStatsAck *ack) | |||
stats_ack_timeout * cct->_conf->osd_stats_ack_timeout_decay); | |||
dout(20) << __func__ << " timeout now " << stats_ack_timeout << dendl; | |||
|
|||
if (ack->get_tid() > pg_stat_tid_flushed) { | |||
pg_stat_tid_flushed = ack->get_tid(); | |||
uint64_t ack_tid = ack->get_tid(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, might want to mark ack_tid as a const, like
const auto ack_tid = ack->get_tid();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but, IMHO, it would be better just to reference ack_tid by ack->get_tid()
.
src/osd/OSD.cc
Outdated
// this can happen when the pg_stats doesn't send | ||
// successfully. | ||
for (auto t : outstanding_pg_stats) { | ||
if (t <= ack_tid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot erase from an std::set<>
when iterating through it. instead we should use something like:
for (auto tid = outstanding_pg_stats.cbegin();
tid != outstanding_pg_stats.cend(); ) {
if (*tid <= ack_tid) {
tid = outstanding_pg_stats.erase(tid);
} else {
break;
}
}
also, since the connection does not send pg-stats out-of-order, neither does monitor ack the pg-stats out-of-order, we can assume that all tids after ack_tid
in outstanding_pg_stats
are greater than it. so we can break
once *tid > ack_tid
.
fd65092
to
da492c5
Compare
@tchaikov updated |
src/osd/OSD.cc
Outdated
outstanding_pg_stats.erase(ack->get_tid()); | ||
// if there are earlier pg_stats doesn't acked, | ||
// this can happen when the pg_stats doesn't send | ||
// successfully. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, would be ideal if we can fix the syntax error in the comment.
// if there are earlier pg-stats not yet acked,
// this happens if they are not sent successfully.
If osd get network error when sending pg_stats, osd will resend the pg_stats with tid+1, so the former tid will remain in outstanding_pg_stats. In osd tick(), if the outstanding_pg_stats's size > osd_mon_report_max_in_flight(default:2), it will refuse to send pg_stats, that will block pg states from changing. Finally will fail qa tests like resolve_stuck_peering.py. Signed-off-by: huangjun <huangjun@xsky.com>
da492c5
to
edc7378
Compare
@tchaikov thank you advice |
retest this please. |
jenkins test this please |
If osd get network error when sending pg_stats, osd will
resend the pg_stats with tid+1, so the former tid will remain
in outstanding_pg_stats. In osd tick(), if the outstanding_pg_stats's
size > osd_mon_report_max_in_flight(default:2), it will refuse to
send pg_stats, that will block pg states from changing.
Finally will fail qa tests like resolve_stuck_peering.py.
Signed-off-by: huangjun huangjun@xsky.com