New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Better track worker failures in SpecificationCluster #2768
base: main
Are you sure you want to change the base?
Conversation
This refactores `SpecificationCluster` to allow responding to errors raised during worker startup, allowing better error messages to be reported to the user. All started tasks now have their results inspected, removing `asyncio` default handling of logging background errors. The end goal of this is to be able to provide better user errors during cluster startup failure (not necessarily cluster scale up errors).
This currently restructures the code to allow tracking errors (the diff is larger than it needed to be, as I reordered some methods to group things - sorry). I'm currently struggling with debugging the Nanny process, the restart behavior seems and process management seems resistant to shutting down during these error cases. I'm going to take a break from this and come back later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. I had a review sitting here un-submitted.
if self.status == "closed": | ||
raise ValueError("Cluster is closed") | ||
def __repr__(self): | ||
return "SpecCluster(%r, workers=%d)" % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return "SpecCluster(%r, workers=%d)" % ( | |
return "%s(%r, workers=%d)" % ( | |
type(self).__name__, |
async with SpecCluster( | ||
asynchronous=True, | ||
workers={"good": {"cls": Worker}, "bad": {"cls": BrokenWorker}}, | ||
scheduler={"cls": Scheduler, "options": {"port": 0}}, | ||
) as cluster: | ||
pass | ||
|
||
assert "Broken" in str(info.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "Broken" still be in the exception?
@jcrist this is marked as WIP. Are there aspects here that you're still uncomfortable with? |
Also, do you think that this should block a 2.0 release? |
No, this shouldn't be a blocker.
Yeah, this still doesn't handle detecting the original issue (some logic in the nanny is complicating things). Hope to get back to this soon. |
Cool. Thanks for surfacing for the update.
…On Tue, Jun 25, 2019 at 5:59 PM Jim Crist ***@***.***> wrote:
Also, do you think that this should block a 2.0 release?
No, this shouldn't be a blocker.
Are there aspects here that you're still uncomfortable with?
Yeah, this still doesn't handle detecting the original issue (some logic
in the nanny is complicating things). Hope to get back to this soon.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2768?email_source=notifications&email_token=AACKZTDUDZVBWTLJT4KW7ELP4I6FHA5CNFSM4HXDIQK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYQXM3I#issuecomment-505509485>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AACKZTGSPSYUNJNS6HGERSDP4I6FHANCNFSM4HXDIQKQ>
.
|
This refactores
SpecificationCluster
to allow responding to errorsraised during worker startup, allowing better error messages to be
reported to the user. All started tasks now have their results
inspected, removing
asyncio
default handling of logging backgrounderrors.
The end goal of this is to be able to provide better user errors during
cluster startup failure (not necessarily cluster scale up errors).
Aims to address #2708.