If we're going to fail, fail sooner #1059

zmc · 2017-04-05T18:56:20Z

The change to misc.wait_until_osds_up() is the follow-up to #1056, #1045 and #1033.

The change to parallel fixes a longstanding bug that prevented jobs from failing properly: if one of the parallel greenlets raised an exception, the others were not interrupted. In certain common situations, e.g. the rados task running ceph_test_rados, the job kept running until it hit the global timeout and was killed, without any useful error message.

An example of this in action:
This upgrade run using teuthology master timed out.
This run, which is otherwise identical, failed very quickly and with a much more useful error message: Command failed on ovh011 with status 1: 'sudo [...] ceph-osd -f --cluster ceph -i 0'
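
A rough sketch of the fail-fast behaviour described above, using standard-library threads as a stand-in for gevent greenlets. This is not the code in teuthology/parallel.py: `run_fail_fast`, `long_running` and `failing` are made-up names, and because threads cannot be interrupted from outside, the sketch assumes each task cooperates by watching a shared `threading.Event`.

```python
import threading
import time
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def run_fail_fast(tasks):
    """Run callables concurrently; re-raise the first failure promptly.

    Each task receives a threading.Event and is expected to return soon
    after it is set (cooperative cancellation; threads cannot be killed).
    """
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task, stop) for task in tasks]
        # Returns as soon as any task raises, or once all have finished.
        done, _ = wait(futures, return_when=FIRST_EXCEPTION)
        stop.set()  # ask the remaining tasks to wind down
    # Leaving the "with" block joins every worker, so all futures are done here.
    for future in done:
        if future.exception() is not None:
            raise future.exception()
    return [future.result() for future in futures]


def long_running(stop):
    # Stand-in for a workload that would otherwise run until a global timeout.
    stop.wait(timeout=30)
    return "stopped early" if stop.is_set() else "ran to completion"


def failing(stop):
    # Stand-in for a task whose daemon dies shortly after start.
    time.sleep(0.1)
    raise RuntimeError("daemon died")
```

Calling `run_fail_fast([long_running, failing])` should raise the `RuntimeError` well before the 30-second wait in `long_running` would have elapsed, which is the property the description is after: the job ends with the real error rather than a timeout.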

dmick · 2017-04-05T19:53:08Z

teuthology/parallel.py

-            # Emit message here because traceback gets stomped when we re-raise
-            log.exception("Exception in parallel execution")
-            raise
+        # raises if any greenlets exited with an exception


worth noting that it also blocks until completion, IMO, so readers of __exit__ understand that this is essentially a join().

Also, I think this implies that if there are N tasks, and 1..N-1 are still running, but N has raised, we will not notice the exception until N is visited (last)? I'm not saying that matters, but it's a thing to be aware of if true.

I was misreading this as greenlet.get(), not queue.get(). The way the queuing works, any reason for greenlet termination is available immediately. So I rescind my request for a comment too; either you understand queues or you don't, and I didn't, but now I do.
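
To illustrate the point about queues, here is a small sketch, again with standard-library threads and `queue.Queue` in place of gevent (an assumption made so the example runs without extra dependencies). `spawn_all` and the task names are hypothetical. Every worker reports its outcome to one shared queue when it ends, so the consumer sees terminations in completion order, not spawn order.

```python
import queue
import threading
import time


def spawn_all(tasks):
    """Start every task in its own thread; return a queue of outcomes.

    Each worker puts (name, exception_or_None) on the queue when it ends,
    so outcomes arrive in completion order rather than spawn order.
    """
    results = queue.Queue()

    def worker(name, func):
        try:
            func()
        except Exception as exc:
            results.put((name, exc))
        else:
            results.put((name, None))

    for name, func in tasks:
        threading.Thread(target=worker, args=(name, func), daemon=True).start()
    return results


def slow():
    time.sleep(1.0)


def fails_quickly():
    time.sleep(0.05)
    raise RuntimeError("boom")


# The failing task is spawned last but finishes first.
outcomes = spawn_all([("task-1", slow), ("task-2", slow), ("task-N", fails_quickly)])
first_name, first_error = outcomes.get(timeout=5)
```

Here `first_name` is `"task-N"` even though tasks 1 and 2 are still running: a single `get()` on the shared queue returns whichever outcome arrives first, unlike calling `get()` on each task in turn.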

dmick · 2017-04-05T19:54:15Z

teuthology/misc.py

@@ -911,6 +911,9 @@ def wait_until_osds_up(ctx, cluster, remote, ceph_cluster='ceph'):
    testdir = get_testdir(ctx)
    with safe_while(sleep=6, tries=50) as proceed:
        while proceed():
+            daemons = ctx.daemons.iter_daemons_of_role('osd', ceph_cluster)


maybe we also want to do something similar in wait_until_healthy, perhaps with more daemon types (don't remember if non-osds are also started without waiting)? Can be a followon, but seems similarly useful

Possibly - but that would warrant more testing IMO

yeah, hence 'followon'
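
For the follow-on, the general shape of "wait for readiness, but bail out if a daemon has died" might look like the sketch below. It is not teuthology code: `wait_until_ready` and its parameters are hypothetical, and it assumes daemons are plain `subprocess.Popen` objects rather than teuthology's daemon wrappers.

```python
import subprocess
import sys
import time


def wait_until_ready(is_ready, daemons, sleep=0.1, tries=50):
    """Poll is_ready(), but give up at once if a watched process has exited.

    `daemons` maps a label to a subprocess.Popen object; both the function
    and its parameters are hypothetical, not teuthology's API.
    """
    for _ in range(tries):
        for label, proc in daemons.items():
            returncode = proc.poll()  # None while the process is still running
            if returncode is not None:
                raise RuntimeError(f"{label} exited with status {returncode}")
        if is_ready():
            return
        time.sleep(sleep)
    raise TimeoutError(f"not ready after {tries} tries")
```

Because the liveness check runs on every iteration, a dead daemon of any type ends the wait on the next poll instead of consuming all remaining tries.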

dmick

Filing an issue for the followon "do it for other daemons in other places"; this looks fine to me

dmick

Oops, need to fix the unit test first

Signed-off-by: Zack Cerza <zack@redhat.com>

zmc requested a review from dmick April 5, 2017 18:56

dmick reviewed Apr 5, 2017

dmick approved these changes Apr 5, 2017

dmick requested changes Apr 5, 2017

zmc added 2 commits April 5, 2017 17:03

misc.wait_until_osds_up(): Check for daemon death

d897abb

Signed-off-by: Zack Cerza <zack@redhat.com>

parallel: Let exceptions interrupt other greenlets

5cc0774

Signed-off-by: Zack Cerza <zack@redhat.com>

zmc force-pushed the wip-daemon-failure branch from 0a0df93 to 5cc0774 Compare April 5, 2017 23:03

dmick approved these changes Apr 6, 2017

dmick self-assigned this Apr 10, 2017

dmick merged commit 0bc1a1b into master Apr 10, 2017

dmick deleted the wip-daemon-failure branch April 10, 2017 17:55
