If we're going to fail, fail sooner #1059
Conversation
# Emit message here because traceback gets stomped when we re-raise
log.exception("Exception in parallel execution")
raise
# raises if any greenlets exited with an exception
worth noting that it also blocks until completion, IMO, so readers of __exit__
understand that this is essentially a join().
Also, I think this implies that if there are N tasks, and 1..N-1 are still running, but N has raised, we will not notice the exception until N is visited (last)? I'm not saying that matters, but it's a thing to be aware of if true.
I was misreading this as greenlet.get(), not queue.get(). The way the queuing works, any reason for greenlet termination is available immediately. So I rescind my request for a comment too; either you understand queues or you don't, and I didn't, but now I do
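To make that concrete for later readers, here is a minimal runnable sketch of the pattern (gevent assumed; this is an illustration, not teuthology's actual parallel code): each greenlet pushes its outcome onto a shared queue the moment it finishes, so draining the queue in __exit__ both joins the group and surfaces the first failure in completion order, not spawn order.

import logging

import gevent
from gevent.queue import Queue

log = logging.getLogger(__name__)


class ParallelSketch(object):
    """Illustrative only: outcomes arrive on a shared queue in
    completion order, so __exit__ acts as a join() and sees the
    first failure immediately."""

    def __init__(self):
        self.results = Queue()
        self.count = 0

    def spawn(self, func, *args):
        self.count += 1
        gevent.spawn(self._capture, func, args)

    def _capture(self, func, args):
        try:
            self.results.put((None, func(*args)))
        except Exception as e:
            self.results.put((e, None))

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Blocks until every greenlet has reported, i.e. a join().
        # If the last-spawned task dies first, its exception is the
        # first item we pull off the queue.
        for _ in range(self.count):
            exc, _value = self.results.get()
            if exc is not None:
                try:
                    raise exc
                except Exception:
                    # Emit message here because traceback gets stomped
                    # when we re-raise
                    log.exception("Exception in parallel execution")
                    raise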
@@ -911,6 +911,9 @@ def wait_until_osds_up(ctx, cluster, remote, ceph_cluster='ceph'):
     testdir = get_testdir(ctx)
     with safe_while(sleep=6, tries=50) as proceed:
         while proceed():
+            daemons = ctx.daemons.iter_daemons_of_role('osd', ceph_cluster)
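For context, a rough sketch of how the added line sits in the polling loop (not the verbatim function; check_status() and the all_osds_up() predicate here are placeholders for however the daemon object and the cluster poll actually work):

from teuthology.contextutil import safe_while

def wait_until_osds_up_sketch(ctx, ceph_cluster='ceph'):
    # Fail fast if an OSD process has already died, instead of polling
    # until the global timeout. check_status() is assumed to raise when
    # the daemon's process has exited.
    with safe_while(sleep=6, tries=50) as proceed:
        while proceed():
            daemons = ctx.daemons.iter_daemons_of_role('osd', ceph_cluster)
            for daemon in daemons:
                daemon.check_status()
            if all_osds_up(ctx, ceph_cluster):  # hypothetical predicate
                break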
maybe we also want to do something similar in wait_until_healthy, perhaps with more daemon types (don't remember if non-osds are also started without waiting)? Can be a followon, but seems similarly useful
Possibly - but that would warrant more testing IMO
yeah, hence 'followon'
Filing an issue for the followon "do it for other daemons in other places"; this looks fine to me
Oops, need to fix the unit test first
Signed-off-by: Zack Cerza <zack@redhat.com>
The change to `misc.wait_until_osds_up()` is the follow-up to #1056, #1045 and #1033.

The change to `parallel` fixes a very longstanding bug that prevented jobs from failing properly: if one of the parallel greenlets encountered an exception, it did not interrupt any of the others. In certain common situations, e.g. the `rados` task running `ceph_test_rados`, this meant the job kept running until it hit the global timeout and was killed - without any useful error message.

An example of this in action:
This upgrade run using teuthology master timed out.
This run, which is otherwise identical, failed very quickly and with a much more useful error message:
Command failed on ovh011 with status 1: 'sudo [...] ceph-osd -f --cluster ceph -i 0'
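To illustrate the difference, a small hypothetical usage sketch (assuming teuthology.parallel's with/spawn API; run_unit is a stand-in workload, not a real task):

import gevent
from teuthology.parallel import parallel

def run_unit(role):
    # Stand-in for a long-running workload such as ceph_test_rados.
    if role == 'osd.1':
        raise RuntimeError('%s: simulated crash' % role)
    gevent.sleep(3600)

with parallel() as p:
    for role in ('osd.0', 'osd.1', 'osd.2'):
        p.spawn(run_unit, role)
# With this fix, the RuntimeError from osd.1 propagates out of the with
# block as soon as that greenlet dies; previously the job could sit in
# the other greenlets' sleeps until the global timeout killed it.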