Add multimds:thrash sub-suite and fix bugs in thrasher for multimds #13262

Merged
jcsp merged 18 commits into ceph:master from batrick:multimds-thrasher on Mar 7, 2017

batrick commented Feb 5, 2017

No description provided.

@batrick batrick force-pushed the batrick:multimds-thrasher branch from 140817a to 3fe8d3e Feb 6, 2017

batrick commented Feb 6, 2017

Rebased onto the slightly older 9431987^ to avoid kclient failures.

batrick added some commits Jan 12, 2017

qa: turn on multimds thrashing
Fixes: http://tracker.ceph.com/issues/10792

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: remove snap tests from multimds:thrash
Snapshots are known to not work with multimds presently.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: handle thrashing ranks with holes
During the course of thrashing max_mds, the ranks assigned to MDSs may
develop holes. This causes the thrasher to wrongly try to deactivate
ranks that are not assigned.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
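
A minimal sketch of the "holes" problem described in this commit, not the thrasher's actual code. The helper `pick_rank_to_deactivate` and the `assigned_ranks` set are hypothetical names introduced for illustration. The idea is to choose the victim from the ranks that are really assigned instead of counting up from zero.

```python
import random


def pick_rank_to_deactivate(assigned_ranks, rng=random):
    """Pick a victim rank from the ranks that are actually assigned.

    `assigned_ranks` is a hypothetical set of ranks currently held by an MDS.
    Returns None when nothing is assigned.
    """
    candidates = sorted(assigned_ranks)
    if not candidates:
        return None
    return rng.choice(candidates)


# With holes, counting up from zero names a rank that nobody holds.
assigned = {0, 2, 3}
naive = set(range(len(assigned)))      # {0, 1, 2}
unassigned_targets = naive - assigned  # {1}: rank 1 is a hole
```

Here rank 1 is a hole, so a thrasher that iterates over `range(len(assigned))` would try to deactivate a rank that no MDS holds.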
qa: check replacement MDS is active in thrasher
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: timeout thrasher if fs does not stabilize
After 5 minutes of waiting, it's reasonable to stop as the cluster is
probably stuck.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
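
The timeout described in this commit can be sketched as a bounded polling loop. This is an illustration under assumptions, not the code from the PR: `wait_until`, its parameters, and the choice to return False on timeout are all hypothetical. The 300 second default mirrors the 5 minutes mentioned in the commit message.

```python
import time


def wait_until(predicate, timeout=300.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass a predicate that checks whether the file system has stabilized and stop thrashing when the function returns False.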
qa: avoid infinite wait if no repl. can be made
The thrasher can enter an infinite loop waiting for an MDS to take a
certain rank when a replacement may not be possible, for example when
max_mds actives are already running.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
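
One way to avoid waiting forever is to check up front whether a replacement is possible at all. The sketch below is hypothetical: `replacement_possible` and its parameters are not from the PR, and the standby condition is an assumption added on top of the max_mds example given in the commit message.

```python
def replacement_possible(num_actives, max_mds, num_standbys):
    """Return True if some MDS could still take over a vacated rank.

    num_actives:  actives still running after the kill (hypothetical input)
    max_mds:      the file system's max_mds setting
    num_standbys: standby daemons available (assumed to be required)
    """
    if num_actives >= max_mds:
        # max_mds actives are already running, so nobody will be promoted.
        return False
    return num_standbys > 0
```

When this returns False, the thrasher can skip the wait instead of looping on a rank that will never be filled.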
qa: add deactivation log message
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: configure thrashing while MDS are stopping
Currently multimds is prone to many failures when an active or stopping
MDS is killed while other MDSs in the cluster have been deactivated
(are stopping). Have this turned off by default for now.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: timeout waiting for thrashed MDS to revive
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: allow revived MDS to be up:active
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: add standbys to take over during thrashing
In some scenarios the thrasher expects the cluster to stabilize with a
new MDS taking over even though no standbys are available. This can
cause the thrasher to quit because the cluster never stabilizes.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: remove old comment
Filesystem is now cluster aware.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: use fs methods for setting configs
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: use gevent.sleep so greenlet yields
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: do not pretty the json to shorten stdout log
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: disable max_mds changes during thrashing
While the thrasher supports the behavior desired by issue 10792 [1], the
bugs uncovered due to deactivating MDS (and sometimes killing
deactivating MDS) are presently a distraction from addressing issues
during normal failures. So now thrashing max_mds is turned off by
default. I have added a TODO to deactivate ranks in order (configurably)
as random deactivation causes a lot of other problems.

This also fixes a bug: random.randrange(0.0, 1.0) always returns 0.
Oops.

[1] http://tracker.ceph.com/issues/10792

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
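
To illustrate the `random.randrange` bug mentioned in this commit: `randrange` selects integers from a range, so it cannot produce a probability sample between 0.0 and 1.0. `random.random()` returns a float in [0.0, 1.0) and is the usual choice for this kind of check. The helper name `should_thrash` is hypothetical and not taken from the PR.

```python
import random


def should_thrash(probability, rng=random):
    """Return True with roughly the given probability (0.0 to 1.0).

    rng.random() yields a float in [0.0, 1.0), so a probability of 0.0
    never fires and a probability of 1.0 always fires.
    """
    return rng.random() < probability
```

With the buggy call, a comparison such as `randrange(0.0, 1.0) < probability` is true for every positive probability, so the configured value has no effect.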
qa: add DaemonWatchdog to stop tests on failure
Thrashing MDS often results in failures that do not stop the test. A
failure may also cause the test to stall, which needlessly keeps the
machines locked until a timeout is reached. This watchdog unmounts
mounts and kills daemons when a failure is detected.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
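
The watchdog idea can be sketched roughly as follows. This is a hypothetical stand-in, not the DaemonWatchdog added by this PR: the class name, constructor arguments, and callbacks are all assumptions made for illustration.

```python
class WatchdogSketch:
    """Hypothetical stand-in for a daemon watchdog (not the PR's class)."""

    def __init__(self, daemons, unmount_all, kill_all):
        # daemons: mapping of daemon name -> callable that returns True
        # while that daemon is healthy (an assumed interface).
        self.daemons = daemons
        self.unmount_all = unmount_all
        self.kill_all = kill_all
        self.tripped = False

    def check(self):
        """Run one health check; clean up once if any daemon has failed."""
        failed = sorted(name for name, healthy in self.daemons.items()
                        if not healthy())
        if failed and not self.tripped:
            self.tripped = True
            # Same order as the commit message: unmount, then kill.
            self.unmount_all()
            self.kill_all()
        return failed
```

A test harness would call `check()` periodically. Once it trips, the mounts are gone and the daemons are dead, so the test fails promptly instead of holding machines until the timeout.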

batrick commented Feb 6, 2017

http://pulpito.ceph.com/pdonnell-2017-02-06_16:25:22-multimds:thrash-wip-multimds-thrasher-testing-basic-smithi/

The run above was made after the rebase.

I'm rebasing to master again now that a fix for the kclient issue is merged. I will also add debug_ms so Haomai can debug http://tracker.ceph.com/issues/18690.

qa: increase debug_ms level for thrashing
This is to help locate the cause of [1].

[1] http://tracker.ceph.com/issues/18690

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

@batrick batrick force-pushed the batrick:multimds-thrasher branch from 3fe8d3e to 1183f09 Feb 6, 2017

jcsp approved these changes Mar 7, 2017

@jcsp jcsp merged commit 7310030 into ceph:master Mar 7, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodifed Submodules: submodules for project are unmodified
default: Build finished.

@batrick batrick deleted the batrick:multimds-thrasher branch Mar 7, 2017
