[WIP] Better track worker failures in SpecificationCluster #2768

Open
wants to merge 1 commit into
base: main

jcrist
Member

@jcrist jcrist commented Jun 11, 2019

This refactors SpecificationCluster to allow responding to errors
raised during worker startup, so that better error messages can be
reported to the user. All started tasks now have their results
inspected, replacing asyncio's default behavior of only logging
errors raised in background tasks.

The end goal is to provide better user-facing errors when cluster
startup fails (not necessarily when cluster scale-up fails).

Aims to address #2708.
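
A minimal sketch of the general idea, not the code in this PR: start each worker as a task, gather the results with `return_exceptions=True`, and inspect them so a startup failure is raised to the caller instead of only being logged by the event loop. The names `start_worker`, `start_all`, and `WorkerStartupError` are hypothetical and exist only for illustration.

```python
import asyncio


class WorkerStartupError(Exception):
    """Hypothetical error type for this sketch; not part of distributed."""


async def start_worker(name, fail=False):
    # Stand-in for a worker's startup coroutine (assumption for illustration).
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError("Broken worker: %s" % name)
    return name


async def start_all(specs):
    # Launch every startup coroutine as a task, keyed by worker name.
    tasks = {
        name: asyncio.create_task(start_worker(name, fail))
        for name, fail in specs.items()
    }
    # return_exceptions=True hands exceptions back as results, so every
    # task's outcome is inspected here rather than left to asyncio's logging.
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)

    started, failures = [], {}
    for name, result in zip(tasks, results):
        if isinstance(result, Exception):
            failures[name] = result
        else:
            started.append(result)

    if failures:
        details = "; ".join("%s: %s" % (n, e) for n, e in failures.items())
        raise WorkerStartupError("Failed to start workers: " + details)
    return started
```

Calling `asyncio.run(start_all({"good": False, "bad": True}))` raises `WorkerStartupError` with the original message included, which is the kind of user-facing error this PR is aiming for.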

@jcrist
Member Author

jcrist commented Jun 11, 2019

This currently restructures the code to allow tracking errors (the diff is larger than it needed to be, as I reordered some methods to group things - sorry). I'm currently struggling with debugging the Nanny process: its restart behavior and process management seem resistant to shutting down in these error cases. I'm going to take a break from this and come back later.

Member

@mrocklin mrocklin left a comment

Whoops. I had a review sitting here un-submitted.

        if self.status == "closed":
            raise ValueError("Cluster is closed")

    def __repr__(self):
        return "SpecCluster(%r, workers=%d)" % (

Suggested change
-        return "SpecCluster(%r, workers=%d)" % (
+        return "%s(%r, workers=%d)" % (
+            type(self).__name__,
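
A small sketch of why this suggestion helps: with `type(self).__name__`, subclasses report their own class name in the repr instead of a hard-coded `SpecCluster`. The class `ClusterSketch`, its attributes, and the remaining format arguments are assumptions for illustration, since the rest of the real `__repr__` is not shown in the diff.

```python
class ClusterSketch:
    """Hypothetical stand-in for the cluster class; not the real SpecCluster."""

    def __init__(self, scheduler_address, workers):
        self.scheduler_address = scheduler_address
        self.workers = workers

    def __repr__(self):
        # type(self).__name__ resolves to the runtime class, so subclasses
        # do not need to override __repr__ to show the right name.
        return "%s(%r, workers=%d)" % (
            type(self).__name__,
            self.scheduler_address,  # assumed argument, elided in the diff
            len(self.workers),       # assumed argument, elided in the diff
        )


class MyCluster(ClusterSketch):
    pass


print(repr(MyCluster("tcp://127.0.0.1:8786", {"a": None})))
# prints: MyCluster('tcp://127.0.0.1:8786', workers=1)
```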

        async with SpecCluster(
            asynchronous=True,
            workers={"good": {"cls": Worker}, "bad": {"cls": BrokenWorker}},
            scheduler={"cls": Scheduler, "options": {"port": 0}},
        ) as cluster:
            pass

    assert "Broken" in str(info.value)

Should "Broken" still be in the exception?

@mrocklin
Member

@jcrist this is marked as WIP. Are there aspects here that you're still uncomfortable with?

@mrocklin
Member

Also, do you think that this should block a 2.0 release?

@jcrist
Member Author

jcrist commented Jun 25, 2019

> Also, do you think that this should block a 2.0 release?

No, this shouldn't be a blocker.

> Are there aspects here that you're still uncomfortable with?

Yeah, this still doesn't handle detecting the original issue (some logic in the nanny is complicating things). Hope to get back to this soon.

@mrocklin
Member

mrocklin commented Jun 25, 2019 via email

@jakirkham jakirkham self-assigned this Jul 30, 2019
Base automatically changed from master to main March 8, 2021 19:03
@jakirkham jakirkham removed their assignment Jun 15, 2023