osd: allow FULL_TRY after failsafe #17177

Merged
liewegas merged 1 commit into ceph:master from liupan1111:wip-fix-rm on Aug 29, 2017

3 participants
@liupan1111
Contributor

commented Aug 23, 2017

In #12627 and #14193, I added support for "rbd rm" when an OSD is full. But I found that this support is not enough: "rbd rm" only works when the full OSD is not the primary. I ran an experiment: use vstart to create a cluster with only one OSD, write until it is full, then run rm. It still hangs. The fix in this PR resolves that.

Signed-off-by: Pan Liu <wanjun.lp@alibaba-inc.com>

@liupan1111 liupan1111 requested a review from liewegas Aug 23, 2017

@liupan1111

Contributor Author

commented Aug 23, 2017

retest this please

@liewegas

Member

commented Aug 23, 2017

Hmm, it doesn't seem like you should be hitting the failsafe threshold.

Oh, it's because vstart sets the thresholds too high:

        osd failsafe full ratio = .99
        mon osd nearfull ratio = .99
        mon osd backfillfull ratio = .99
        mon osd full ratio = .99

These should be .99, .96, .97, .98, or similar. Update vstart.sh?

@liupan1111

Contributor Author

commented Aug 23, 2017

@liewegas I got the test result by setting all of these options to 15, so that the OSD could be filled quickly. I don't think this issue is related to the option values... Could you suggest another way to resolve it if we don't make this change?

@liewegas

Member

commented Aug 24, 2017

The important thing is that the full_ratio is less than the failsafe ratio, so that the cluster is marked full and clients stop writing before hitting the failsafe.

The failsafe is a last-ditch safety check to prevent the OSD from filling itself up. You shouldn't be allowed to override it with the force flag.
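To make the ordering concrete, here is a minimal Python sketch (not Ceph code) that checks a set of ratios. The function name `check_full_ratios` is hypothetical. The only constraint stated outright above is that the full ratio sits below the failsafe ratio; the rest of the ordering (nearfull, then backfillfull, then full) is an assumption inferred from the .96, .97, .98 values suggested earlier in this thread.

```python
def check_full_ratios(nearfull, backfillfull, full, failsafe):
    """Return a list of ordering problems; an empty list means the ratios look sane.

    Hypothetical helper for illustration only, not part of Ceph.
    """
    problems = []
    # Stated in the thread: the cluster must be marked full before the failsafe trips.
    if not full < failsafe:
        problems.append("full ratio must be below the failsafe ratio")
    # Assumed from the suggested values .96, .97, .98.
    if not nearfull < backfillfull:
        problems.append("nearfull ratio should be below the backfillfull ratio")
    if not backfillfull < full:
        problems.append("backfillfull ratio should be below the full ratio")
    return problems


# The vstart values quoted in this thread (everything at .99) versus the suggested ones.
vstart_values = check_full_ratios(nearfull=0.99, backfillfull=0.99, full=0.99, failsafe=0.99)
suggested = check_full_ratios(nearfull=0.96, backfillfull=0.97, full=0.98, failsafe=0.99)
print(vstart_values)
print(suggested)
```

With every ratio at .99 all three checks fail, while the suggested values pass.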

@liupan1111

Contributor Author

commented Aug 24, 2017

@liewegas Yes, full_ratio is normally less than the failsafe ratio, but in my case it is possible for the failsafe to be reached first. That is because "statfs" is called in the OSD (every one or five seconds?) and sets cur_stat of this OSD to full based on the failsafe ratio. The stat is then sent to the monitor, the osdmap change is checked against full_ratio and sent to the client, and only then is client I/O paused... So there is a time interval...

In addition, I didn't override it with full_force, but with full_try.

I won't insist if you think we should tune failsafe and full_ratio to avoid this issue... But I think that tuning may be a little tricky...

@liupan1111

Contributor Author

commented Aug 24, 2017

@liewegas And in this case, I set all these options to 15%, but I found both full_ratio and full_try were 20% when fio paused... I used a 1M block size for the writes... For this case, I think a 1 or 2 percent difference between full_try and fail_safe could not resolve it...
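The time-interval argument can be illustrated with a rough Python model. This is only a sketch under stated assumptions: the function name, the capacity, write rate, check interval, and propagation delay are all made-up numbers, and the model ignores everything about real Ceph except "usage is checked periodically, and writes continue until the client hears about it".

```python
import math


def usage_pct_when_writes_stop(capacity_mb, threshold_pct, write_mb_per_s,
                               poll_interval_s, propagation_delay_s):
    """Return usage (percent of capacity) at the moment client writes stop.

    Hypothetical model: a client writes at a constant rate, the OSD notices
    the threshold only at its next periodic check, and writes keep going
    until the resulting map change has reached the client.
    """
    threshold_mb = capacity_mb * threshold_pct / 100
    # Time at which usage first crosses the threshold.
    cross_s = threshold_mb / write_mb_per_s
    # The crossing is only noticed at the next periodic check.
    detect_s = math.ceil(cross_s / poll_interval_s) * poll_interval_s
    # Writes continue while the change propagates to the client.
    stop_s = detect_s + propagation_delay_s
    used_mb = min(capacity_mb, write_mb_per_s * stop_s)
    return 100 * used_mb / capacity_mb


# All values below are assumptions chosen for illustration.
final_pct = usage_pct_when_writes_stop(
    capacity_mb=10_000, threshold_pct=15, write_mb_per_s=100,
    poll_interval_s=5, propagation_delay_s=3)
overshoot_pct = final_pct - 15
print(final_pct, overshoot_pct)
```

With these made-up numbers writes stop at 18% usage, 3 points past the 15% threshold, so a gap of 1 or 2 points between the two ratios would not absorb the overshoot. A small, fast-filling test OSD makes the effect larger.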

@liewegas
Member

left a comment

Ok, since this is just a FULL_TRY, it's probably harmless... we will only proceed if the transaction is a net reduction in usage. There is still some risk, though: it may be that the operation forces recovery of an object that then fills things up. The failsafe should block that from happening, though!
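The behavior described in this review comment can be sketched as a small decision function. This is a hypothetical Python model, not the actual OSD code: `Op`, `admit`, the `net_bytes` estimate, and the "drop"/"reject" outcome names are all invented for illustration, and it assumes the failsafe check applies only to writes. It does not model the recovery risk mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Op:
    is_write: bool
    full_try: bool   # stands in for CEPH_OSD_FLAG_FULL_TRY on the request
    net_bytes: int   # estimated change in space used; negative frees space


def admit(op: Op, failsafe_full: bool) -> str:
    """Decide what happens to an op on an OSD (hypothetical model)."""
    # Assumption: reads, and any op on an OSD below the failsafe, are unaffected.
    if not op.is_write or not failsafe_full:
        return "proceed"
    # Past the failsafe, ordinary writes are not allowed through.
    if not op.full_try:
        return "drop"
    # FULL_TRY gets past the failsafe check, but only a net reduction proceeds.
    return "proceed" if op.net_bytes < 0 else "reject"


print(admit(Op(is_write=True, full_try=False, net_bytes=4096), failsafe_full=True))
print(admit(Op(is_write=True, full_try=True, net_bytes=-4096), failsafe_full=True))
print(admit(Op(is_write=True, full_try=True, net_bytes=4096), failsafe_full=True))
```

In this model a delete sent with FULL_TRY (negative `net_bytes`) goes through on a failsafe-full OSD, while a plain write or a FULL_TRY op that would grow usage does not.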

@liewegas liewegas added the needs-qa label Aug 25, 2017

@liewegas liewegas changed the title from 'osd: support "rbd rm" when osd is full' to 'osd: allow FULL_TRY after failsafe' Aug 25, 2017

@liewegas

Member

commented Aug 25, 2017

Do you mind updating the commit description?

osd: allow FULL_TRY after failsafe
Signed-off-by: Pan Liu <wanjun.lp@alibaba-inc.com>

@liupan1111 liupan1111 force-pushed the liupan1111:wip-fix-rm branch from dd1c347 to 40cf32b Aug 26, 2017

@liupan1111

Contributor Author

commented Aug 26, 2017

@liewegas commit description has been updated, thanks.

> Ok, since this is just a FULL_TRY, it's probably harmless... we will only proceed if the transaction is a net reduction in usage. There is still some risk, though: it may be that the operation forces recovery of an object that then fills things up. The failsafe should block that from happening, though!

I searched the code and found no other places that set CEPH_OSD_FLAG_FULL_TRY... I think we could avoid this risk by strictly limiting which operations set the FULL_TRY/FULL_FORCE flags, so that the failsafe can safely block that case.

@liewegas liewegas merged commit 7abe19e into ceph:master Aug 29, 2017

4 of 5 checks passed

make check: make check failed
Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check (arm64): make check succeeded

@liupan1111 liupan1111 deleted the liupan1111:wip-fix-rm branch Aug 30, 2017
