Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kraken: core: improve control and throttling of the snap trimmer #14597

Merged
merged 11 commits into from Jul 25, 2017

Conversation

gregsfortytwo
Copy link
Member

@gregsfortytwo gregsfortytwo added this to the kraken milestone Apr 18, 2017
@gregsfortytwo
Copy link
Member Author

The jewel version (much more complicated) was reviewed and merged in #14492. This one hasn't gone through any integration testing so needs that, but should be fine.

@smithfarm
Copy link
Contributor

Needs rebase (please remove DNM when the conflict is gone)

@smithfarm smithfarm changed the title Wip kraken snaptrim [DNM] Wip kraken snaptrim Apr 18, 2017
@gregsfortytwo gregsfortytwo changed the title [DNM] Wip kraken snaptrim Wip kraken snaptrim Apr 19, 2017
@gregsfortytwo
Copy link
Member Author

@smithfarm, rebased.

@smithfarm
Copy link
Contributor

smithfarm commented Apr 20, 2017

@gregsfortytwo This PR is included in the latest wip-kraken-backports upon which I scheduled an upgrade/jewel-x run. That run [1] has a number of "failed to complete snap trimming before timeout" failures that were not appearing in previous runs.

[1] http://pulpito.ceph.com/smithfarm-2017-04-20_19:15:47-upgrade:jewel-x-wip-kraken-backports-distro-basic-vps/

@smithfarm
Copy link
Contributor

@gregsfortytwo That run doesn't include a bunch of new commits I just added to #14612

@gregsfortytwo
Copy link
Member Author

sigh
Just searching that string in my email, it's popping up pretty frequently in upgrade tests to master as well. My best guess is the VPS machines are just taking too long and we'd have to bump up the fatal timeout value. Or we could drop this test, or ignore it on upgrade/vps runs.

@smithfarm, any preferences?

gregsfortytwo and others added 4 commits April 26, 2017 11:25
This patch introduces an AsyncReserver for snap trimming to limit the
number of pgs on any single OSD which can be trimming, as with backfill.
Unlike backfill, we don't take remote reservations on the assumption
that the set of pgs with trimming work to do is already well
distributed, so it doesn't seem worth the implementation overhead to get
reservations from the peers as well.

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 21cc515)

Conflicts:
	src/osd/PrimaryLogPG.cc
	src/osd/PrimaryLogPG.h

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit c2eac34)
Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 4aebf59)
Rather than blocking the main op queue, just pause for that amount of
time between state machine cycles.

Also, add osd_snap_trim_sleep to a few of the thrasher yamls.

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 2ed7759)

Conflicts:
	src/osd/PrimaryLogPG.cc

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@smithfarm
Copy link
Contributor

any preferences?

My preference would be to fix the tests. I opened master PR #14811 and jewel backport #14812 - these double the timeout from 600 to 1200 seconds. Let me know if you think that's too much, or too little.

Also, I merged Sage's PendingReleaseNotes change - could you please rebase and remove DNM.

Then I'll repopulate the integration branch and do another round of tests.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@smithfarm smithfarm changed the title Wip kraken snaptrim kraken: core: improve control and throttling of the snap trimmer Apr 26, 2017
@smithfarm smithfarm changed the title kraken: core: improve control and throttling of the snap trimmer [DNM] kraken: core: improve control and throttling of the snap trimmer Apr 26, 2017
@smithfarm
Copy link
Contributor

Testing at #14813

@liewegas liewegas removed the needs-qa label Apr 28, 2017
@smithfarm
Copy link
Contributor

Jenkins re-test this please

@smithfarm
Copy link
Contributor

@gregsfortytwo The make check failure has hopefully been fixed by #16069 - could you please rebase to pick it up?

@smithfarm
Copy link
Contributor

@gregsfortytwo Nevermind; in the meantime I learned that Jenkins merges the PR into the latest version of the base branch.

@smithfarm smithfarm changed the title [DNM] kraken: core: improve control and throttling of the snap trimmer kraken: core: improve control and throttling of the snap trimmer Jul 3, 2017
@smithfarm
Copy link
Contributor

@jdurgin This passed a rados suite at http://tracker.ceph.com/issues/19009#note-51http://tracker.ceph.com/issues/19009#note-51 with two failures that I believe are unrelated. Please review/approve the PR if you agree it can be merged.

Copy link
Member

@jdurgin jdurgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we'll need http://tracker.ceph.com/issues/19931 backport on top of this - added kraken as a backport version to the ticket

@smithfarm
Copy link
Contributor

smithfarm commented Jul 5, 2017

JFYI, this PR is holding up #15526 because the two PRs (this one and that one) conflict with eachother. Until now I assumed this one should get priority, but since we're now waiting for something to be added, I'm tempted to mark this one DNM and put the other one in the next integration branch.

@jdurgin
Copy link
Member

jdurgin commented Jul 5, 2017

Merging #15526 first is fine with me

@smithfarm smithfarm changed the title kraken: core: improve control and throttling of the snap trimmer [DNM] kraken: core: improve control and throttling of the snap trimmer Jul 6, 2017
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit 306ad85)
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit ec4185d)

Conflicts:
	src/osd/OSD.cc
	src/osd/Session.h
@gregsfortytwo
Copy link
Member Author

Cherry-picked all the stuff from https://github.com/ceph/ceph/pull/15214/commits (didn't test the changes myself, but it was pretty clean).

I'm happy to rebase on top of #15526 if you let me know (the conflicts are pretty minimal) but am not doing so now since it's marked DNM. Let me know, @smithfarm! :)

@smithfarm smithfarm changed the title [DNM] kraken: core: improve control and throttling of the snap trimmer kraken: core: improve control and throttling of the snap trimmer Jul 10, 2017
@smithfarm
Copy link
Contributor

DNM removed and will re-test

@smithfarm
Copy link
Contributor

test this please

@smithfarm
Copy link
Contributor

@gregsfortytwo Compilation error:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/PrimaryLogPG.cc: In member function ‘virtual void PrimaryLogPG::on_shutdown()’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/PrimaryLogPG.cc:10143:18: error: ‘clear_backoffs’ was not declared in this scope
   clear_backoffs(); 
                  ^

@smithfarm smithfarm changed the title kraken: core: improve control and throttling of the snap trimmer [DNM] kraken: core: improve control and throttling of the snap trimmer Jul 18, 2017
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit a3b028a)
We were failing to exit various wait states which held PGRefs. Error!

Fixes: http://tracker.ceph.com/issues/19931

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit b0e9dee)
This finisher thread has a lot of callbacks which can hold PGRefs. Make
sure we drain them out before checking that all the PGs have finished
and have no outstanding references.

Moving this should be safe; we've already stopped the op thread et al
and the only things still running are the OSDService's objecter_finisher,
recovery_request_timer, and snap_sleep_timer (which has definitely been emptied
by the time we get here as it's synchronously cleared out on PG shutdown).

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit 66ea9c1)
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit 4caf2df)
@gregsfortytwo
Copy link
Member Author

Whoops, that bit doesn't exist until Luminous. My bad, dropped the patch!

@smithfarm smithfarm changed the title [DNM] kraken: core: improve control and throttling of the snap trimmer kraken: core: improve control and throttling of the snap trimmer Jul 19, 2017
@smithfarm
Copy link
Contributor

smithfarm commented Jul 19, 2017

@smithfarm
Copy link
Contributor

@gregsfortytwo @jdurgin Good news and bad news. The good news is: this passed a rados suite at http://tracker.ceph.com/issues/19009#note-71

The bad news is that this PR causes some upgrade/jewel-x jobs to fail with "failed to complete snap trimming before timeout". See http://pulpito.front.sepia.ceph.com/smithfarm-2017-07-20_08:53:03-upgrade:jewel-x-wip-kraken-backports-distro-basic-vps/

I already made one attempt to fix this by increasing the timeout, but it had no effect. @gregsfortytwo suggested that we could just ignore these failures.

Given the time frame and the desirability of having a 11.2.1 point release before kraken goes EOL, I'd say just ignore the failures and merge. What do you think?

@smithfarm
Copy link
Contributor

smithfarm commented Jul 21, 2017

Note that the "failed to complete snap trimming before timeout" failures happen even when the ugprade tests are run on smithi instead of vps. See http://pulpito.front.sepia.ceph.com/smithfarm-2017-07-21_06:55:36-upgrade:jewel-x-wip-kraken-backports-distro-basic-smithi/ (rerun of dead/failed jobs from the run cited in the previous comment)

smithfarm added a commit to smithfarm/ceph that referenced this pull request Jul 21, 2017
This PR drops two upgrade/jewel-x test cases that are not compatible with
ceph#14597

Signed-off-by: Nathan Cutler <ncutler@suse.com>
@smithfarm
Copy link
Contributor

Update: the test failures happen in only two test cases within upgrade/jewel-x: parallel and stress-split-erasure-code on Trusty. Opened #16493 to drop those two.

@gregsfortytwo
Copy link
Member Author

Yeah, I approved the other PR and spot-checked one of the failures and it didn't have any crashes so this all seems fine to me. :)

@smithfarm smithfarm merged commit 0242721 into ceph:kraken Jul 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants