tasks/cephfs: time out on ceph-fuses that don't die #453

Merged
merged 2 commits into from Sep 21, 2015

@jcsp
Contributor

jcsp commented Jun 3, 2015

For cases where we have e.g. poked the fuse abort
file for a process, but it's still not dying. Because
this is a special class of error (unlike e.g. when
we force umount something because the network is gone),
raise the error instead of trying again to kill
the client.

Fixes: #11835
Signed-off-by: John Spray <john.spray@redhat.com>
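
For readers following along, the behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `wait_for_exit` and `umount_wait` are hypothetical helpers, and the `MaxWhileTries` class here is a local stand-in for the exception of the same name that appears in the diff.

```python
import logging
import time

log = logging.getLogger(__name__)


class MaxWhileTries(Exception):
    """Local stand-in (assumption) for the exception named in the diff."""


def wait_for_exit(proc, timeout, interval=0.1):
    """Poll proc until it exits; raise MaxWhileTries after timeout seconds.

    proc is any object with poll(), pid and returncode, such as a
    subprocess.Popen instance.
    """
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            raise MaxWhileTries(
                "process %s still running after %s seconds" % (proc.pid, timeout))
        time.sleep(interval)
    return proc.returncode


def umount_wait(proc, timeout):
    """Wait for the client process after unmount (hypothetical helper).

    A process that never exits is treated as a distinct error: it is
    logged and re-raised instead of being killed a second time.
    """
    try:
        return wait_for_exit(proc, timeout)
    except MaxWhileTries:
        log.error("process failed to terminate after unmount. "
                  "This probably indicates a bug within the client.")
        raise
```

With a `subprocess.Popen` handle for a process that has already exited, `umount_wait(proc, 30)` returns its exit status; for one that stays alive, the call raises `MaxWhileTries` once the timeout passes and leaves cleanup of the stray process to the caller.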

@ceph-jenkins

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/384/
Tests passed for this pull request.

except MaxWhileTries:
    log.error("process failed to terminate after unmount. This probably "
              "indicates a bug within ceph-fuse.")
    raise
Member

What does this actually do to the teuthology test runs that hit it? I've no idea but I think most throws in the post-yield cleanup result in hung jobs and nothing getting cleaned up...

Contributor Author

@gregsfortytwo in run_tasks.py, in the part following the "Unwinding manager..." log message, it looks to me like exceptions in teardown are explicitly handled.

Entirely possible for a rogue process to be left lying around, but that's why we have nuke.
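
As a rough illustration of the teardown handling described here, the sketch below unwinds a stack of context managers and keeps going when one of them raises. It is a simplified assumption about the general pattern, not a copy of run_tasks.py: `unwind` and `demo_task` are hypothetical names, and each manager is exited with an empty exception triple for brevity.

```python
import contextlib
import logging

log = logging.getLogger(__name__)


def unwind(stack):
    """Exit each (name, manager) pair in reverse order of entry.

    An exception raised during one manager's teardown is logged and
    recorded, and the remaining managers are still unwound.
    """
    failed = []
    while stack:
        name, manager = stack.pop()
        log.debug("Unwinding manager %s", name)
        try:
            manager.__exit__(None, None, None)
        except Exception:
            log.exception("Manager failed: %s", name)
            failed.append(name)
    return failed


@contextlib.contextmanager
def demo_task(name, events, fail=False):
    """Hypothetical task whose post-yield cleanup may raise."""
    yield
    events.append(name)
    if fail:
        raise RuntimeError("teardown of %s failed" % name)
```

If a `demo_task` named `ceph-fuse` is entered after one named `ceph` and raises during cleanup, `unwind` logs the failure and still runs the `ceph` teardown, which matches the expectation that a throw in post-yield cleanup does not by itself hang the job.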

@gregsfortytwo
Member

Looks like this is busting things up; several failures in ubuntu-2015-06-09_11:25:56-fs-greg-fs-testing---basic-multi.

2015-06-09T12:18:44.892 INFO:teuthology.orchestra.run.plana91:Running: 'sudo fusermount -u /home/ubuntu/cephtest/mnt.0'
2015-06-09T12:18:45.017 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.plana91.stderr:ceph-fuse[19072]: fuse finished with error 0 and tester_r 0
2015-06-09T12:18:45.041 INFO:teuthology.orchestra.run.plana91:Running: "stat --file-system '--printf=%T\n' -- /home/ubuntu/cephtest/mnt.0"
2015-06-09T12:18:45.050 DEBUG:tasks.cephfs.fuse_mount:ceph-fuse not mounted, got fs type 'ext2/ext3'
2015-06-09T12:18:45.050 ERROR:teuthology.run_tasks:Manager failed: ceph-fuse
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 125, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_greg-fs-testing/tasks/ceph_fuse.py", line 135, in task
    mount.umount_wait()
  File "/var/lib/teuthworker/src/ceph-qa-suite_greg-fs-testing/tasks/cephfs/fuse_mount.py", line 239, in umount_wait
    run.wait(self.fuse_daemon, 30)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 393, in wait
    not_ready = list(processes)
TypeError: 'RemoteProcess' object is not iterable

@gregsfortytwo gregsfortytwo assigned jcsp and unassigned gregsfortytwo Jun 9, 2015
@ceph-jenkins

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/395/
Tests passed for this pull request.

@jcsp
Contributor Author

jcsp commented Jun 10, 2015

Oops, that was a typo in the call to run.wait() (it wants a list of processes). Updated.
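
The failure mode in the traceback can be reproduced in isolation with a small sketch. Everything here is a stand-in: `FakeProcess` mimics a non-iterable process handle, and `wait` is a hypothetical function that, like the line shown in the traceback, begins by calling `list(processes)`. It is not teuthology's implementation.

```python
class FakeProcess:
    """Stand-in for a single process handle; deliberately not iterable."""

    def __init__(self, status=0):
        self.status = status

    def wait(self):
        return self.status


def wait(processes, timeout=None):
    """Hypothetical wait helper that expects an iterable of processes.

    timeout is accepted only to mirror the call shape; this sketch
    ignores it.
    """
    not_ready = list(processes)  # TypeError if given a single handle
    return [proc.wait() for proc in not_ready]


daemon = FakeProcess()

try:
    wait(daemon, 30)          # the mistaken call shape: a bare handle
except TypeError as err:
    print("bare handle rejected:", err)

statuses = wait([daemon], 30)  # the corrected call shape: a one-element list
```

Wrapping the single handle in a list is enough to satisfy a helper written this way, which is the shape of the fix described in the comment.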

Our ffsb and fsync tests contain so many small writes at random offsets
that it can take >10 minutes to commit all of them to disk if we get
a slower OSD cluster. 15 minutes is still a plenty-fast timeout for
this stage compared to just hanging and losing the logs!

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@ceph-jenkins

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-qa-suite-pull-requests/640/
Test FAILed.

gregsfortytwo added a commit that referenced this pull request Sep 21, 2015
tasks/cephfs: time out on ceph-fuses that don't die

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo gregsfortytwo merged commit e3c9947 into master Sep 21, 2015
@gregsfortytwo gregsfortytwo deleted the wip-11835 branch September 21, 2015 23:09