New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kraken: core: improve control and throttling of the snap trimmer #14597
Conversation
The jewel version (much more complicated) was reviewed and merged in #14492. This one hasn't gone through any integration testing so needs that, but should be fine. |
Needs rebase (please remove DNM when the conflict is gone) |
c7723c5
to
54301f3
Compare
@smithfarm, rebased. |
@gregsfortytwo This PR is included in the latest |
@gregsfortytwo That run doesn't include a bunch of new commits I just added to #14612 |
sigh @smithfarm, any preferences? |
This patch introduces an AsyncReserver for snap trimming to limit the number of pgs on any single OSD which can be trimming, as with backfill. Unlike backfill, we don't take remote reservations on the assumption that the set of pgs with trimming work to do is already well distributed, so it doesn't seem worth the implementation overhead to get reservations from the peers as well. Signed-off-by: Samuel Just <sjust@redhat.com> (cherry picked from commit 21cc515) Conflicts: src/osd/PrimaryLogPG.cc src/osd/PrimaryLogPG.h Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com> (cherry picked from commit c2eac34)
Signed-off-by: Samuel Just <sjust@redhat.com> (cherry picked from commit 4aebf59)
Rather than blocking the main op queue, just pause for that amount of time between state machine cycles. Also, add osd_snap_trim_sleep to a few of the thrasher yamls. Signed-off-by: Samuel Just <sjust@redhat.com> (cherry picked from commit 2ed7759) Conflicts: src/osd/PrimaryLogPG.cc Signed-off-by: Greg Farnum <gfarnum@redhat.com>
My preference would be to fix the tests. I opened master PR #14811 and jewel backport #14812 - these double the timeout from 600 to 1200 seconds. Let me know if you think that's too much, or too little. Also, I merged Sage's PendingReleaseNotes change - could you please rebase and remove DNM. Then I'll repopulate the integration branch and do another round of tests. |
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
54301f3
to
33c211b
Compare
Testing at #14813 |
Jenkins re-test this please |
@gregsfortytwo The make check failure has hopefully been fixed by #16069 - could you please rebase to pick it up? |
@gregsfortytwo Nevermind; in the meantime I learned that Jenkins merges the PR into the latest version of the base branch. |
@jdurgin This passed a rados suite at http://tracker.ceph.com/issues/19009#note-51http://tracker.ceph.com/issues/19009#note-51 with two failures that I believe are unrelated. Please review/approve the PR if you agree it can be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we'll need http://tracker.ceph.com/issues/19931 backport on top of this - added kraken as a backport version to the ticket
JFYI, this PR is holding up #15526 because the two PRs (this one and that one) conflict with eachother. Until now I assumed this one should get priority, but since we're now waiting for something to be added, I'm tempted to mark this one DNM and put the other one in the next integration branch. |
Merging #15526 first is fine with me |
Cherry-picked all the stuff from https://github.com/ceph/ceph/pull/15214/commits (didn't test the changes myself, but it was pretty clean). I'm happy to rebase on top of #15526 if you let me know (the conflicts are pretty minimal) but am not doing so now since it's marked DNM. Let me know, @smithfarm! :) |
DNM removed and will re-test |
test this please |
@gregsfortytwo Compilation error:
|
Signed-off-by: Greg Farnum <gfarnum@redhat.com> (cherry picked from commit a3b028a)
We were failing to exit various wait states which held PGRefs. Error! Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com> (cherry picked from commit b0e9dee)
This finisher thread has a lot of callbacks which can hold PGRefs. Make sure we drain them out before checking that all the PGs have finished and have no outstanding references. Moving this should be safe; we've already stopped the op thread et al and the only things still running are the OSDService's objecter_finisher, recovery_request_timer, and snap_sleep_timer (which has definitely been emptied by the time we get here as it's synchronously cleared out on PG shutdown). Signed-off-by: Greg Farnum <gfarnum@redhat.com> (cherry picked from commit 66ea9c1)
Signed-off-by: Greg Farnum <gfarnum@redhat.com> (cherry picked from commit 4caf2df)
6f18526
to
838d784
Compare
Whoops, that bit doesn't exist until Luminous. My bad, dropped the patch! |
@gregsfortytwo @jdurgin Good news and bad news. The good news is: this passed a rados suite at http://tracker.ceph.com/issues/19009#note-71 The bad news is that this PR causes some I already made one attempt to fix this by increasing the timeout, but it had no effect. @gregsfortytwo suggested that we could just ignore these failures. Given the time frame and the desirability of having a 11.2.1 point release before kraken goes EOL, I'd say just ignore the failures and merge. What do you think? |
Note that the "failed to complete snap trimming before timeout" failures happen even when the ugprade tests are run on smithi instead of vps. See http://pulpito.front.sepia.ceph.com/smithfarm-2017-07-21_06:55:36-upgrade:jewel-x-wip-kraken-backports-distro-basic-smithi/ (rerun of dead/failed jobs from the run cited in the previous comment) |
This PR drops two upgrade/jewel-x test cases that are not compatible with ceph#14597 Signed-off-by: Nathan Cutler <ncutler@suse.com>
Update: the test failures happen in only two test cases within |
Yeah, I approved the other PR and spot-checked one of the failures and it didn't have any crashes so this all seems fine to me. :) |
Fixes: http://tracker.ceph.com/issues/19329