Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: auto repair EC pool #6196

Merged
merged 4 commits into from Oct 20, 2015

Conversation

Projects
None yet
5 participants
@guangyy
Copy link
Member

guangyy commented Oct 7, 2015

Fixes: #12754

Signed-off-by: Guang Yang yguang@yahoo-inc.com

@ghost

This comment has been minimized.

Copy link

ghost commented Oct 8, 2015

@guangyy the failure from the bot is not very informative. You can repeat it by running ./run-make-check.sh locally. It will then be easier to figure out which test is responsible for the timeout.

@guangyy

This comment has been minimized.

Copy link
Member Author

guangyy commented Oct 8, 2015

@dachary , yep, I will do that (looks like it was caused by the new code change). Will update once I have more information.

state_set(PG_STATE_REPAIR);
scrubber.must_repair = false;

This comment has been minimized.

@dzafman

dzafman Oct 8, 2015

Member

I think this line should stay.

This comment has been minimized.

@guangyy

guangyy Oct 13, 2015

Author Member

The reason I removed it is because later (when finish scrubbing) I would like distinguish the repair request is manually or auto, and if it is the former case, we always go ahead to fix the corruptions (via recovery), for later one, we put a threshold to cancel the repair for manual triage (e.g. if lots of corruptions in a PG, there might be a broken disk which needs replacement).

I didn't find a better way other using this flag this purpose (adding a new flag should work but seems overkill). Do you see any potential problem by using this flag for such purpose?

This comment has been minimized.

@dzafman

dzafman Oct 13, 2015

Member

I'd rather keep must_scrub, must_deepscrub, and must_repair functioning consistently with each other. All you need to do is make sure that auto_repair is set to false when doing a must_repair, so that later if auto_repair is true you know it can only be an auto_repair.

This comment has been minimized.

@guangyy

guangyy Oct 13, 2015

Author Member

Good point! Let me do that then. Thanks.

@@ -3161,7 +3160,7 @@ bool PG::sched_scrub()
return false;
}

bool time_for_deep = (ceph_clock_now(cct) >
bool time_for_deep = (ceph_clock_now(cct) >=

This comment has been minimized.

@dzafman

dzafman Oct 8, 2015

Member

No need to make this change.

This comment has been minimized.

@guangyy

guangyy Oct 13, 2015

Author Member

It is kinda confusing if osd_deep_scrub_interval and osd_scrub_min_interval have the same value, but not each time the scrub is a deep scrub..

This comment has been minimized.

@dzafman

dzafman Oct 13, 2015

Member

ok, that's good then.

@dzafman

This comment has been minimized.

Copy link
Member

dzafman commented Oct 8, 2015

To be honest I'm not sure I like adding this complication to the scrubbing stuff. In the next release (Jewel) there are going to be improvements to save the objects with scrub errors and allow an administrator to control each individual repair. This change runs counter to that approach.

@dzafman

This comment has been minimized.

Copy link
Member

dzafman commented Oct 8, 2015

Now that I see that this is for EC only, I can see the benefit.

@dzafman dzafman self-assigned this Oct 8, 2015

@guangyy guangyy force-pushed the guangyy:wip-12754 branch from 2935247 to 48a15ef Oct 13, 2015

@guangyy

This comment has been minimized.

Copy link
Member Author

guangyy commented Oct 13, 2015

@dzafman , @tchaikov , thanks for the review:)

}

// the part that actually finalizes a scrub
void PG::scrub_finish()
{
bool repair = state_test(PG_STATE_REPAIR);
// if the repair request comes from auto-repair and large number of errors,
// we would like to cancel auto-repair
if (repair && !scrubber.must_repair

This comment has been minimized.

@dzafman

dzafman Oct 13, 2015

Member

if (repair && scrubber.auto_repair...

@guangyy guangyy force-pushed the guangyy:wip-12754 branch from 48a15ef to 6293e83 Oct 14, 2015

@guangyy

This comment has been minimized.

Copy link
Member Author

guangyy commented Oct 14, 2015

Hi @dzafman , updated according to the review comments, please help to review again. Thanks!

@dzafman

This comment has been minimized.

Copy link
Member

dzafman commented Oct 14, 2015

LGTM What about testing?

@guangyy

This comment has been minimized.

Copy link
Member Author

guangyy commented Oct 14, 2015

Thanks @dzafman , I tested locally (with a vstart cluster), I didn't find a good way to automate the test though and put it into 'make check', the main challenges I came across is that I couldn't find a way to inject the settings of scrub, and at the same time make them take effect immediately (with the default setting, the deep scrub would be scheduled to a week later and the injection of settings (scrub_interval, etc) does not impact that).. Any suggestion?

@dzafman

This comment has been minimized.

Copy link
Member

dzafman commented Oct 14, 2015

set osd_scrub_min_interval, osd_scrub_max_interval and osd_deep_scrub_interval to the same value. Also set osd_scrub_interval_randomize_ratio to 0 to turn of randomization. Setting the intervals to say 120 (or even less) should cause a deep scrub on a test setup to occur every 2 minutes.

Guang Yang added some commits Oct 6, 2015

Guang Yang
pg: only queue for recovery if there is any objects to repair after s…
…crubbing

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Guang Yang
pg: add auto-repair for EC pool
Fixes: #12754
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Guang Yang
test: add integration test for the auto repair feature
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Guang Yang
osd: off-by-one when check deep scrubbing
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

@guangyy guangyy force-pushed the guangyy:wip-12754 branch 2 times, most recently from 04551a2 to e826213 Oct 16, 2015

@guangyy

This comment has been minimized.

Copy link
Member Author

guangyy commented Oct 16, 2015

Hi @dzafman , the test cases is added for the change, 'make check' is happy locally on my host.

@dzafman

This comment has been minimized.

Copy link
Member

dzafman commented Oct 20, 2015

@liewegas I actually ran this in wip-zafman-testing but didn't mark it.

@dzafman dzafman removed the needs-qa label Oct 20, 2015

dzafman added a commit that referenced this pull request Oct 20, 2015

Merge pull request #6196 from guangyy/wip-12754
osd: auto repair EC pool

Reviewed-by: David Zafman <dzafman@redhat.com>

@dzafman dzafman merged commit 6e002c6 into ceph:master Oct 20, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.