osd: add clear_shards_repaired command #54954
base: main
docs look okay, one request
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
two minor comments. LGTM otherwise.
Useful for my tests...
src/osd/OSD.cc
Outdated
@@ -4350,6 +4362,12 @@ void OSD::final_init()
asok_hook,
"debug the scrubber");
ceph_assert(r == 0);
r = admin_socket->register_command(
"clear_shards_repaired " \
no need for the '\'
TIL - fixed.
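For readers wondering why the '\' is unnecessary: outside a macro definition, C++ concatenates adjacent string literals on its own, so a line-continuation backslash between them changes nothing. A minimal sketch follows; the second literal is a made-up placeholder, since the real argument string is truncated in the diff above.

```cpp
#include <cassert>
#include <string>

// Hypothetical placeholder for the argument description; the actual text
// registered by the PR is not shown in the diff and is not reproduced here.
static const char* const kWithBackslash =
    "clear_shards_repaired " \
    "<hypothetical-args>";

// Same literal without the line continuation. Adjacent string literals are
// joined by the compiler, so the backslash above is redundant.
static const char* const kWithoutBackslash =
    "clear_shards_repaired "
    "<hypothetical-args>";

// Returns true when both spellings produce the identical string.
inline bool literals_match() {
  return std::string(kWithBackslash) == std::string(kWithoutBackslash);
}
```

The backslash only matters inside a `#define`, where the definition has to stay on one logical line.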
src/osd/PGBackend.h
Outdated
@@ -299,6 +299,7 @@ typedef std::shared_ptr<const OSDMap> OSDMapRef;
virtual bool check_failsafe_full() = 0;

virtual void inc_osd_stat_repaired() = 0;
virtual void set_osd_stat_repaired(int64_t) = 0;
Why is this needed? Unlike inc_osd_stat_repaired(), the backend never clears the count.
Good point. I think I did that mostly because I was following what the Nautilus commit was doing.
@diffs: ping?
This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert by setting the shard repair count to 0. This will help in cases where the alert was a false positive, or a condition that has since cleared at the disk level. Often, zeroing out the repair count is better than muting the alert or restarting the OSD.

Fixes: https://tracker.ceph.com/issues/54182
Co-authored-by: David Zafman <dzafman@redhat.com>
Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
Hey @ronen-fr - sorry about the delay, I've pushed an updated commit that addresses your comments.
jenkins test api
src/osd/OSD.cc
Outdated
{
  std::lock_guard l(stat_lock);
  osd_stat.num_shards_repaired = count;
  return;
(sorry for missing this in the first round)
Please remove the 'return' line.
(and - if you can - please remove it from the inc_osd_stat_repaired() above, too.)
Done!
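To illustrate the point about the trailing 'return': in a `void` function the `std::lock_guard` releases the mutex when the scope ends, so an explicit `return;` as the last statement adds nothing. The sketch below is a simplified stand-in, not the actual OSD code; `RepairStats` and its method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Simplified, hypothetical stand-in for the OSD's repair-count bookkeeping.
struct RepairStats {
  std::mutex stat_lock;
  int64_t num_shards_repaired = 0;

  // Increment under the lock; the guard unlocks at the closing brace,
  // so no trailing 'return;' is needed.
  void inc_repaired() {
    std::lock_guard<std::mutex> l(stat_lock);
    ++num_shards_repaired;
  }

  // Overwrite the count under the lock (e.g. with 0 to clear it).
  void set_repaired(int64_t count) {
    std::lock_guard<std::mutex> l(stat_lock);
    num_shards_repaired = count;
  }

  // Read the count under the same lock.
  int64_t get_repaired() {
    std::lock_guard<std::mutex> l(stat_lock);
    return num_shards_repaired;
  }
};
```

Setting the count to 0 through `set_repaired` mirrors what the PR description says the new command does with the shard repair count.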
LGTM apart from a minor comment
an extra 'return' line
@diffs - the PR is approved, but fails CI tests, including: Please sign that commit & repush. Actually - the 2nd commit fixes the 1st. Thus: please squash the two, make sure the resulting commit has the correct description and is signed.
see latest comment
Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
My bad! I had pushed the previous commit too quickly and forgot about the signoff requirement. Now it's fixed. 🙂
jenkins test api
This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert by setting the shard repair count to 0. This will help in cases where the alert was a false positive, or a condition that has since cleared at the disk level. Often, zeroing out the repair count is better than muting the alert or restarting the OSD.

Fixes: https://tracker.ceph.com/issues/54182