osd: sched_scrub() lock pg only if all scrubbing conditions are fulfilled #14968

Merged
merged 1 commit into ceph:master from TsaiJin:wip-lock-pg-when-scrub-available on May 8, 2017

@TsaiJin
Contributor

TsaiJin commented May 5, 2017

There are multiple constraints to check when determining whether to schedule a scrub. We should make sure all conditional checks have passed before locking this pg.

Signed-off-by: Jin Cai caijin.caij@alibaba-inc.com
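
To make the intent concrete, here is a small standalone sketch of the pattern this PR moves to: run all the cheap per-job checks first and only take the per-PG lock once a job is actually eligible. This is a toy illustration, not Ceph code; the field and variable names merely mirror the snippets quoted below.

    #include <iostream>
    #include <mutex>
    #include <vector>

    // Toy stand-in for OSD::ScrubJob.
    struct ScrubJob {
      int pgid;
      double sched_time;
      double deadline;
    };

    // Stands in for the per-PG lock that _lookup_lock_pg() would take.
    std::mutex pg_locks[4];

    int main() {
      const double now = 100.0;
      const bool time_permit = true;
      const bool load_is_low = false;

      // Queue ordered by sched_time, as sched_scrub() expects.
      std::vector<ScrubJob> queue = {
        {0, 90.0, 95.0},   // overdue: deadline already passed, scrub regardless of load
        {1, 92.0, 120.0},  // due, but needs time_permit && load_is_low
        {2, 150.0, 200.0}, // not due yet
      };

      for (const auto& scrub : queue) {
        if (scrub.sched_time > now)
          break;                      // nothing later in the queue is due either
        // All cheap checks happen before any lock is taken.
        if (scrub.deadline >= now && !(time_permit && load_is_low)) {
          std::cout << "skip pg " << scrub.pgid << " ("
                    << (!time_permit ? "time not permit" : "high load") << ")\n";
          continue;                   // move on to the next job
        }
        std::lock_guard<std::mutex> lock(pg_locks[scrub.pgid]);  // lock only now
        std::cout << "scrubbing pg " << scrub.pgid << "\n";
      }
      return 0;
    }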

@liupan1111 liupan1111 requested a review from liewegas May 5, 2017

@liupan1111 liupan1111 added the core label May 5, 2017

if ((scrub.deadline >= now) && !(time_permit && load_is_low)) {
dout(10) << __func__ << " not scheduling scrub of " << scrub.pgid << " due to "
<< (!time_permit ? "time not permit" : "load not low") << dendl;
break;


@liupan1111

liupan1111 May 5, 2017

Contributor

The logic after your modification is changed: in the original code, if ((scrub.deadline >= now) && !(time_permit && load_is_low)) is false, this func will continue and check next_scrub_stamp, but in your change it will break out of the "while" loop ...

This deadline is dedicated to each scrub job, so it should be a continue here, not a break...
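
A concrete illustration of why the per-job deadline makes continue the right choice (hypothetical queue, not taken from the PR):

    // now = 10:00, load currently too high
    //   pg 1.a: sched_time 09:00, deadline 18:00  -> skipped (deadline not reached, load high)
    //   pg 1.b: sched_time 09:05, deadline 09:30  -> past its deadline, scrubbed regardless of load
    //
    // `continue` skips pg 1.a and still reaches pg 1.b in this pass;
    // `break` would abandon the scan and delay the overdue pg 1.b as well.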


@TsaiJin

TsaiJin May 5, 2017

Contributor

Yes, you are right. I will change the "break" to "continue".

@liupan1111 liupan1111 removed the request for review from liewegas May 5, 2017

@@ -6833,11 +6833,16 @@ void OSD::sched_scrub()
break;
}
if ((scrub.deadline >= now) && !(time_permit && load_is_low)) {


@tchaikov

tchaikov May 5, 2017

Contributor

even if it's not allowed to schedule this scrub job at this moment, we can still try the next one. so i think we should use "continue" instead of "break" here.

@@ -6833,11 +6833,16 @@ void OSD::sched_scrub()
break;
}
if ((scrub.deadline >= now) && !(time_permit && load_is_low)) {
dout(10) << __func__ << " not scheduling scrub of " << scrub.pgid << " due to "
<< (!time_permit ? "time not permit" : "load not low") << dendl;


@tchaikov

tchaikov May 5, 2017

Contributor

s/of/for/ s/load not low/high load/

PG *pg = _lookup_lock_pg(scrub.pgid);
if (!pg)
continue;
if (pg->get_pgbackend()->scrub_supported() && pg->is_active() &&
(scrub.deadline < now || (time_permit && load_is_low))) {
if (pg->get_pgbackend()->scrub_supported() && pg->is_active()) {
dout(10) << "sched_scrub scrubbing " << scrub.pgid << " at " << scrub.sched_time
<< (pg->scrubber.must_scrub ? ", explicitly requested" :
(load_is_low ? ", load_is_low" : " deadline < now"))


@liupan1111

liupan1111 May 5, 2017

Contributor

should remove out-of-date output.


@TsaiJin

TsaiJin May 5, 2017

Contributor

This output shows under which condition the ScrubJob is scheduled.
I think this modification does not conflict with the output.


@liupan1111

liupan1111 May 5, 2017

Contributor

fine to me

@tchaikov tchaikov changed the title from osd/OSD.cc: OSD::sched_scrub() should only locks this pg when all scrubbing conditional checks have passed to osd: sched_scrub() should not lock pg unless all scrubbing conditions are fulfilled May 5, 2017

@tchaikov tchaikov changed the title from osd: sched_scrub() should not lock pg unless all scrubbing conditions are fulfilled to osd: sched_scrub() lock pg only if all scrubbing conditions are fulfilled May 5, 2017

osd/OSD.cc: lock the pg only when all conditional checks are passed when
scheduling a scrub

Signed-off-by: Jin Cai <caijin.caij@alibaba-inc.com>
@TsaiJin


Contributor

TsaiJin commented May 5, 2017

@tchaikov @liupan1111 please take a look again. Thanks.

@liupan1111


Contributor

liupan1111 commented May 5, 2017

@TsaiJin lgtm, thanks.

@tchaikov tchaikov added the performance label May 5, 2017

@liewegas liewegas merged commit fd61780 into ceph:master May 8, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
default: Build finished.

@TsaiJin TsaiJin deleted the TsaiJin:wip-lock-pg-when-scrub-available branch May 8, 2017

@liupan1111


Contributor

liupan1111 commented May 9, 2017

@liewegas I have one question about scrub:
I find that scrub_random_backoff is used to control the chance of calling sched_scrub. Could you explain why? In my opinion, we already provide enough parameters to control scrub; if we add this random mechanism, scrub may not run under our control. What is your opinion? Thanks.

@tchaikov


Contributor

tchaikov commented May 9, 2017

@liupan1111 we'd better bring this kind of discussion to the mailing list instead of asking it right on the GitHub PR page; it would be more visible to other developers. and since the mails will be archived, they will benefit other developers in a better way.

i think we use scrub_random_backoff to introduce some randomness into the start time of scrubbing, to prevent multiple OSDs from starting their scrubs in a big wave. with the fix of #3946, i think this problem is resolved. but since the OSD ticks every 1 second, i don't think it's advisable to check for scrub jobs at that short an interval, especially in a large cluster.
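
For reference, a minimal sketch of how such a probabilistic gate behaves; the 0.66 ratio and the helper name here are assumptions for illustration, not quotes of the Ceph source (the real gate is the scrub_random_backoff() call shown further down):

    #include <cstdlib>
    #include <iostream>

    // Hypothetical per-tick random backoff: skip the scheduling pass with some
    // probability, so OSDs that tick in lockstep do not all enter sched_scrub()
    // in the same second.
    bool scrub_random_backoff_sketch(double backoff_ratio) {
      double coin = std::rand() / static_cast<double>(RAND_MAX);
      return coin < backoff_ratio;   // true == back off this tick
    }

    int main() {
      int scheduled = 0;
      for (int tick = 0; tick < 100; ++tick) {   // pretend 100 one-second ticks
        if (!scrub_random_backoff_sketch(0.66))
          ++scheduled;                           // would call sched_scrub() here
      }
      std::cout << "sched_scrub() attempted on " << scheduled << "/100 ticks\n";
    }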

@liewegas


Member

liewegas commented May 9, 2017

@liupan1111


Contributor

liupan1111 commented May 10, 2017

@tchaikov thanks, I will remember to send to the mailing list next time.

but since the OSD ticks every 1 second, i don't think it's advisable to check for scrub jobs at that short an interval, especially in a large cluster.

I don't think so ... I traced the code:
if (!scrub_random_backoff()) {
    sched_scrub();
}
tick_timer_without_osd_lock.add_event_after(OSD_TICK_INTERVAL, new C_Tick_WithoutOSDLock(this));

The SafeTimer will be triggered 1 second after sched_scrub();

I also researched the code in #3946; it is not related to this scrub_random_backoff. The randomness introduced by #3946 is about osd_scrub_interval_randomize_ratio.
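
For contrast, a rough sketch of what an interval-randomize ratio does; the formula below illustrates the idea and is not the exact expression in the Ceph source:

    #include <cstdlib>
    #include <iostream>

    // Interval randomization jitters the *scheduled* time of the next scrub, so
    // PGs last scrubbed at the same moment are spread over a window of roughly
    // [min_interval, min_interval * (1 + ratio)] instead of all coming due together.
    double next_scrub_time_sketch(double last_scrub_stamp,
                                  double min_interval,      // e.g. 24h in seconds
                                  double randomize_ratio) { // e.g. 0.5
      double jitter = (std::rand() / static_cast<double>(RAND_MAX)) *
                      min_interval * randomize_ratio;
      return last_scrub_stamp + min_interval + jitter;
    }

    int main() {
      for (int pg = 0; pg < 3; ++pg)
        std::cout << "pg " << pg << " next scrub at t+"
                  << next_scrub_time_sketch(0.0, 86400.0, 0.5) << "s\n";
    }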

@liupan1111


Contributor

liupan1111 commented May 10, 2017

@liewegas I will remove it locally and test it to verify the behavior.

@tchaikov


Contributor

tchaikov commented May 10, 2017

The SafeTimer will be triggered 1 second after sched_scrub();

i don't understand why this is counter-evidence to my comment.

@liupan1111 it is not. but they are related.
