Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: add clear_shards_repaired command #54954

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

osd: add clear_shards_repaired command #54954

wants to merge 2 commits into from

Conversation

diffs
Copy link

@diffs diffs commented Dec 18, 2023

This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert by setting the shard repair count to 0. This will help in cases where the alert was a false positive, or a condition that has since cleared at the disk level. Often, zeroing out the repair count is better than muting the alert or restarting the OSD.

Fixes: https://tracker.ceph.com/issues/54182

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@diffs
Copy link
Author

diffs commented Dec 18, 2023

FYI this was originally based on commit 1a63a63 from PR #36379 that added this feature to Nautilus, hence the co-authoring.

Copy link
Contributor

@anthonyeleven anthonyeleven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs look okay, one request

doc/rados/operations/health-checks.rst Outdated Show resolved Hide resolved
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

Copy link
Contributor

@ronen-fr ronen-fr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two minor comments. LGTM otherwise.
Useful for my tests...

src/osd/OSD.cc Outdated
@@ -4350,6 +4362,12 @@ void OSD::final_init()
asok_hook,
"debug the scrubber");
ceph_assert(r == 0);
r = admin_socket->register_command(
"clear_shards_repaired " \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for the '\'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL - fixed.

@@ -299,6 +299,7 @@ typedef std::shared_ptr<const OSDMap> OSDMapRef;
virtual bool check_failsafe_full() = 0;

virtual void inc_osd_stat_repaired() = 0;
virtual void set_osd_stat_repaired(int64_t) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? unlike inc_osd_stat_repaired(), there is no clearing from the backend

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think I did that mostly because I was following what the Nautilus commit was doing.

@ronen-fr
Copy link
Contributor

ronen-fr commented Mar 4, 2024

@diffs : ping?

This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert
by setting the shard repair count to 0. This will help in cases where
the alert was a false positive, or a condition that has since cleared
at the disk level. Often, zeroing out the repair count is
better than muting the alert or restarting the OSD.

Fixes: https://tracker.ceph.com/issues/54182
Co-authored-by: David Zafman <dzafman@redhat.com>
Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
@diffs
Copy link
Author

diffs commented Mar 4, 2024

Hey @ronen-fr - sorry about the delay, I've pushed an updated commit that addresses your comments.

@ronen-fr
Copy link
Contributor

ronen-fr commented Mar 5, 2024

jenkins test api

src/osd/OSD.cc Outdated
{
std::lock_guard l(stat_lock);
osd_stat.num_shards_repaired = count;
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry for missing this in the first round)
Please remove the 'return' line.
(and - if you can - please remove it from the inc_osd_stat_repaired() above, too.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@ronen-fr
Copy link
Contributor

ronen-fr commented Mar 5, 2024

LGTM apart from a minor comment

Copy link
Contributor

@ronen-fr ronen-fr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an extra 'return' line

@ronen-fr
Copy link
Contributor

ronen-fr commented Mar 6, 2024

@diffs - the PR is approved, but fails CI tests, including:
"Signed-off-by — one or more commits in this PR are not signed"

Please sign that commit & repush.

Actually - the 2nd commit fixes the 1st. Thus: please squash the two, make sure the resulting commit has the correct description and is signed.

@ronen-fr ronen-fr self-requested a review March 6, 2024 11:48
Copy link
Contributor

@ronen-fr ronen-fr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see latest comment

@ronen-fr ronen-fr removed the needs-qa label Mar 6, 2024
Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
@diffs
Copy link
Author

diffs commented Mar 6, 2024

My bad! I had pushed the previous commit too quickly and forgot about the signoff requirement. Now it's fixed. 🙂

@ljflores
Copy link
Contributor

jenkins test api

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants