If we're going to fail, fail sooner #1059

Merged
merged 2 commits into from Apr 10, 2017

@zmc zmc commented Apr 5, 2017

The change to misc.wait_until_osds_up() is the follow-up to #1056, #1045 and #1033.

The change to parallel fixes a long-standing bug that prevented jobs from failing properly. If one of the parallel greenlets raised an exception, the others were not interrupted. In certain common situations, e.g. the rados task running ceph_test_rados, this meant the job kept running until it hit the global timeout and was killed, without any useful error message.

An example of this in action:
- This upgrade run using teuthology master timed out.
- This run, which is otherwise identical, failed very quickly and with a much more useful error message: Command failed on ovh011 with status 1: 'sudo [...] ceph-osd -f --cluster ceph -i 0'

@zmc zmc requested a review from dmick April 5, 2017 18:56
# Emit message here because traceback gets stomped when we re-raise
log.exception("Exception in parallel execution")
raise
# raises if any greenlets exited with an exception

@dmick dmick Apr 5, 2017

worth noting that it also blocks until completion, IMO, so readers of __exit__ understand that this is essentially a join().

Also, I think this implies that if there are N tasks, and 1..N-1 are still running, but N has raised, we will not notice the exception until N is visited (last)? I'm not saying that matters, but it's a thing to be aware of if true.

Member

I was misreading this as greenlet.get(), not queue.get(). The way the queuing works, any reason for greenlet termination is available immediately. So I rescind my request for a comment too; either you understand queues or you don't, and I didn't, but now I do
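
The difference between waiting on each worker in start order and waiting on a shared queue can be sketched as below. This is only an illustration, assuming stdlib threads as a stand-in for greenlets; `run_fail_fast` and `slow_task` are hypothetical names, not teuthology's API, and this sketch cannot interrupt the remaining workers, it only stops waiting for them.

```python
import queue
import threading
import time


def run_fail_fast(tasks):
    """Run callables concurrently and re-raise the first reported exception.

    Hypothetical helper: threads stand in for greenlets in this sketch.
    """
    results = queue.Queue()

    def worker(index, func):
        # Every worker reports exactly once, whether it succeeded or failed.
        try:
            results.put((index, func(), None))
        except Exception as exc:
            results.put((index, None, exc))

    for index, func in enumerate(tasks):
        threading.Thread(target=worker, args=(index, func), daemon=True).start()

    values = {}
    for _ in tasks:
        # Blocks like a join(), but hands back results in completion order,
        # not in the order the tasks were started.
        index, value, exc = results.get()
        if exc is not None:
            raise exc
        values[index] = value
    return [values[i] for i in range(len(tasks))]


def slow_task():
    # Hypothetical long-running task.
    time.sleep(3)
    return "done"
```

With tasks `[slow_task, slow_task, failing_task]`, the exception from the last task is raised as soon as it is put on the queue, even though the first two are still running. That is why the "N is visited last" concern does not apply to a queue.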

@@ -911,6 +911,9 @@ def wait_until_osds_up(ctx, cluster, remote, ceph_cluster='ceph'):
     testdir = get_testdir(ctx)
     with safe_while(sleep=6, tries=50) as proceed:
         while proceed():
             daemons = ctx.daemons.iter_daemons_of_role('osd', ceph_cluster)
Member

maybe we also want to do something similar in wait_until_healthy, perhaps with more daemon types (don't remember if non-osds are also started without waiting)? Can be a followon, but seems similarly useful

Member Author

Possibly - but that would warrant more testing IMO

Member

yeah, hence 'followon'
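
For anyone picking up that follow-on, the general shape of "fail sooner inside a polling loop" might look like the sketch below. It is not the code from this PR: `wait_until_up`, `DaemonExited`, and the `exited()` method are hypothetical names chosen for illustration, and the defaults only mirror the `sleep=6, tries=50` visible in the diff context.

```python
import time


class DaemonExited(RuntimeError):
    """Raised when a watched daemon has already died (hypothetical type)."""


def wait_until_up(is_up, daemons, sleep=6, tries=50):
    """Poll is_up() until it returns True, giving up at once if a daemon died.

    `daemons` is a list of objects with a `name` attribute and an `exited()`
    method; both are assumptions made for this sketch.
    Returns the number of sleeps that were needed.
    """
    for attempt in range(tries):
        # Check liveness first: there is no point waiting out the remaining
        # tries for a daemon that has already exited.
        for daemon in daemons:
            if daemon.exited():
                raise DaemonExited(f"{daemon.name} exited while waiting")
        if is_up():
            return attempt
        time.sleep(sleep)
    raise TimeoutError(f"not up after {tries} tries")
```

The same loop could be given daemons of several roles rather than only OSDs, which is roughly what the suggestion for wait_until_healthy amounts to.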

@dmick dmick left a comment

Filing an issue for the followon "do it for other daemons in other places"; this looks fine to me

@dmick dmick left a comment

Oops, need to fix the unit test first

zmc added 2 commits April 5, 2017 17:03
Signed-off-by: Zack Cerza <zack@redhat.com>
Signed-off-by: Zack Cerza <zack@redhat.com>
@dmick dmick self-assigned this Apr 10, 2017
@dmick dmick merged commit 0bc1a1b into master Apr 10, 2017
@dmick dmick deleted the wip-daemon-failure branch April 10, 2017 17:55