scrub/osd: add clearer reminders that a scrub is blocked #46643

Merged
merged 2 commits into from
Jun 23, 2022

ronen-fr
Contributor

Whenever a scrub session is waiting for an excessive length
of time for a locked object to be unlocked, the total
number of concurrent scrubs in the system is reduced.

The existing cluster warning issued on such occurrences is
easily overlooked. Here we add a constant reminder each time
the OSD tries to schedule scrubs.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
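
To illustrate the idea described above, here is a minimal sketch of a per-OSD counter of blocked scrubs that yields a reminder on every scheduling attempt. This is not the code from this PR: the class `BlockedScrubTracker`, its member names, and the message text are all hypothetical, and the sketch assumes single-threaded access.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the OSD-side bookkeeping; NOT Ceph's actual class.
// Assumes all calls come from one thread (no locking or atomics).
class BlockedScrubTracker {
 public:
  // Called when a scrub has waited too long for a locked object.
  void mark_blocked() { ++blocked_count_; }

  // Called when that scrub resumes or is aborted.
  void clear_blocked() {
    if (blocked_count_ > 0) {
      --blocked_count_;
    }
  }

  int blocked_count() const { return blocked_count_; }

  // Called on every attempt to schedule scrubs. Returns the reminder text
  // while at least one scrub is blocked, and nothing otherwise.
  std::optional<std::string> on_schedule_attempt() const {
    if (blocked_count_ == 0) {
      return std::nullopt;
    }
    // Illustrative wording only; not the message emitted by the OSD.
    return "scrub scheduling: " + std::to_string(blocked_count_) +
           " scrub(s) blocked on a locked object";
  }

 private:
  int blocked_count_ = 0;
};
```

Because the reminder is tied to the scheduling attempt rather than to the moment the scrub became blocked, it keeps reappearing for as long as the condition holds, which is the point of the change.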

@ronen-fr ronen-fr requested a review from a team as a code owner June 13, 2022 14:07
@github-actions github-actions bot added the core label Jun 13, 2022
@ronen-fr
Contributor Author

Note: the 'since' seconds counters in the 'dump pgs' listing are not updated fast enough: currently, an unrelated
event must trigger a stat update from the PG to the OSD.
This is a known issue, and it applies to the 'active scrub time' as well.
I will propose a separate PR that calls 'publish_stats' periodically (at intervals of seconds) for all PGs being scrubbed.

@ronen-fr ronen-fr changed the title scrub/osd: add clearer reminders that some scrub was blocked scrub/osd: add clearer reminders that a scrub was blocked Jun 13, 2022
@ronen-fr ronen-fr changed the title scrub/osd: add clearer reminders that a scrub was blocked scrub/osd: add clearer reminders that a scrub is blocked Jun 13, 2022
@ronen-fr
Contributor Author

jenkins retest this please

@ljflores
Contributor

jenkins test make check

@ljflores ljflores self-requested a review June 15, 2022 15:07
Contributor

@ljflores ljflores left a comment

A few minor clarification comments.

src/osd/scrubber/osd_scrub_sched.cc (outdated, resolved)
src/osd/scrubber/osd_scrub_sched.cc (outdated, resolved)
src/osd/scrubber/pg_scrubber.cc (outdated, resolved)
@ronen-fr ronen-fr force-pushed the wip-rf-blocked branch 2 times, most recently from bcb1a15 to acee06d Compare June 15, 2022 18:11
Contributor

@ljflores ljflores left a comment

Looks good to me. Might still be good to have a +1 from another reviewer.

src/osd/OSD.cc (resolved)
src/osd/scrubber/osd_scrub_sched.cc (resolved)
src/osd/scrubber/pg_scrubber.cc (resolved)
src/osd/scrubber/osd_scrub_sched.cc (resolved)
@ronen-fr
Contributor Author

@ljflores, @Matan-B, @neha-ojha: I added a 2nd commit to disable the warning message in some
thrashers. It looks like some tests block objects for minutes at a time. As I assume that is acceptable
for these tests, I had to prevent the WRN message from being triggered.

@ronen-fr
Contributor Author

See
http://pulpito.front.sepia.ceph.com/rfriedma-2022-06-20_15:04:04-rados::thrash-main-distro-default-smithi/
for the type of failures solved by the 2nd commit

src/osd/scrubber/pg_scrubber.cc (outdated, resolved)
src/osd/scrubber/pg_scrubber.cc (outdated, resolved)
@ljflores
Contributor

jenkins test make check

@ljflores ljflores self-requested a review June 21, 2022 18:57
As some Teuthology tests seem to block objects for minutes at a time,
we must not issue the "scrub is blocked for too long" warning in those tests
(the warning causes them to fail).

A new configuration parameter now controls the grace period before
the warning is issued. Some tests were modified to set this
configuration parameter to a large value.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
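
The grace-period check described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the function `should_warn_blocked` is hypothetical, and its `grace` argument stands in for the new configuration parameter, whose real name is not given in this conversation.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: returns true once a scrub has been blocked for longer
// than the grace period. `grace` stands in for the new configuration
// parameter (its real name does not appear in this thread).
inline bool should_warn_blocked(Clock::time_point blocked_since,
                                Clock::time_point now,
                                std::chrono::seconds grace) {
  return (now - blocked_since) > grace;
}
```

With a check of this shape, a test that is known to block objects for minutes only needs a large `grace` value to keep the warning from firing.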
@ronen-fr
Contributor Author

http://pulpito.front.sepia.ceph.com/?branch=wip-rf-blocked
(and some more tests as part of other branches).
All failures are accounted for (thanks to Laura).
I will go ahead and merge.
