Remove bad nodes from cluster during the test execution #325
Conversation
Thanks so much stan, partial review for now, going to go look into scheduler and runner this afternoon but appreciate the fix!
allocated, bad, err = self._available_nodes.remove_spec(cluster_spec)
if err:
    raise InsufficientResourcesError("Not enough nodes available to allocate. " + err)
man, not a huge fan of this go-like error handling here. Why don't we do error handling the more pythonic way here?
not a fan either, but that's how it worked before in node_container (check the can_remove_spec method). The problem is that outside of a larger refactor, I don't see a good way of returning multiple types of information here, because there are multiple outcomes of this method. I guess I could wrap this in an AllocationResult object with a 'success' field set to True or False, I'll try that.
There's a more fundamental issue here of using exceptions for control flow, which is not a good pattern in general, but we do use it since I didn't want to modify the signature of the alloc method on the cluster. This method really should not raise an exception when there are not enough nodes - that is one of the expected conditions, not an exceptional one - but we'd have to modify the signatures of do_alloc and alloc to get rid of that.
using exceptions for control flow is considered pretty pythonic: https://docs.quantifiedcode.com/python-anti-patterns/readability/asking_for_permission_instead_of_forgiveness_when_working_with_files.html
generally, try/except is preferred in Python in my understanding, as it's more of a "do the thing and see if it works" language rather than the other way around.
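For illustration, here is a minimal, self-contained sketch of the two styles being discussed. The names below (allocate_go_style, allocate_pythonic, NotEnoughNodesError) are hypothetical and are not the actual ducktape API:

class NotEnoughNodesError(Exception):
    """Hypothetical exception for the sketch: carries the unresponsive nodes."""
    def __init__(self, message, bad_nodes=None):
        super().__init__(message)
        self.bad_nodes = bad_nodes or []


def allocate_go_style(nodes, needed):
    """Go-like style: return a value plus an error string the caller must check."""
    good = [n for n in nodes if n.get("healthy", True)]
    bad = [n for n in nodes if not n.get("healthy", True)]
    if len(good) < needed:
        return None, bad, "not enough healthy nodes"
    return good[:needed], bad, None


def allocate_pythonic(nodes, needed):
    """Pythonic style: raise an exception for the failure case instead of returning err."""
    good = [n for n in nodes if n.get("healthy", True)]
    bad = [n for n in nodes if not n.get("healthy", True)]
    if len(good) < needed:
        raise NotEnoughNodesError("not enough healthy nodes", bad_nodes=bad)
    return good[:needed], bad


nodes = [{"name": "a", "healthy": True}, {"name": "b", "healthy": False}]
try:
    allocated, bad = allocate_pythonic(nodes, needed=2)
except NotEnoughNodesError as e:
    print("allocation failed:", e, "bad nodes:", e.bad_nodes)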
left some more comments for you - thanks for this work Stan, it'll make a huge difference on ducktape run stability! One question I do have: how are these unscheduled tests reported? Should we create a new category for unscheduled?
ducktape/cluster/node_container.py
@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)
could be convenient to just add
@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)

    def __iter__(self):
        yield self.good_nodes
        yield self.bad_nodes
if you want to do
good_nodes, bad_nodes = remove_spec
cool!
but then why would you even need a result spec? can simply do return good_nodes, bad_nodes
did exactly that, seems more pythonic and occam-ic :)
well, sometimes you want both, I guess. If you only want the bad nodes you can do get_results().bad_nodes,
but if you want both you can do good_nodes, bad_nodes = get_results().
But hey, whatever you'd like here to be honest, it's not super important.
This is kinda the use case for the named tuple here: that way it's still explicit, and if you're a third party it's easier to understand what you are holding after the return. But it's kinda a small detail.
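To make that concrete, here is a small sketch of the named-tuple variant being described (illustrative only; as the rest of the thread shows, the PR ultimately settled on a plain tuple):

from typing import List, NamedTuple


class RemoveSpecResult(NamedTuple):
    """Sketch: a named result supports both attribute access and tuple unpacking."""
    good_nodes: List
    bad_nodes: List


result = RemoveSpecResult(good_nodes=["node1", "node2"], bad_nodes=["node3"])
print(result.bad_nodes)          # explicit access when you only need one field
good_nodes, bad_nodes = result   # unpacking when you want both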
yeah, I see your point. It's expressiveness vs simplicity I think. In this case I think an unnamed result is a bit more 'pythonic', but it can be argued about for sure. You're right about it being a small detail, so I'll leave it unnamed :)
looks good thanks stan!
Oh, I missed this question:
They are reported as FAILED, same as before - any test that doesn't fit the existing cluster will be marked as failed, and the failure message should include the reason (i.e. not enough nodes available). I haven't changed the code to report unscheduled tests; I've only made the check run every time the cluster changes size, rather than once at the beginning.
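As a rough sketch of that reporting behaviour, under my own assumptions (the function and field names below are hypothetical and not the actual runner code):

def check_schedulability(remaining_tests, cluster_size):
    """Hypothetical sketch: whenever the cluster shrinks, re-check each remaining
    test against the new size and fail the ones that can no longer fit, with the
    reason in the failure message (no separate 'unscheduled' category)."""
    failures = []
    for test in remaining_tests:
        if test["min_nodes"] > cluster_size:
            failures.append((test["name"],
                             "Not enough nodes available: needs %d, cluster has %d"
                             % (test["min_nodes"], cluster_size)))
    return failures


print(check_schedulability([{"name": "test_rolling_restart", "min_nodes": 3}], cluster_size=2))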
…#325)
* work in progress - hacking around removing nodes from the cluster
* temporarily revert to older paramiko
* work in progress - hacking around removing nodes from the cluster
* fixed most of this crap
* fixed the rest of the issues
* added couple of tests; more needed
* fixed more tests, fixed a bug when cluster becomes empty
* fixed the rest of the tests
* added another test, plus comments and simplify check cluster size changed
* removed unused var
* style
* merge fixes
* remove debug output
* refactor and more tests
* pr comment
* moved kwarg after positional args in json and vagrant
* updated vagrant to ubuntu20 and fixed network discovery to account for other possible interface names
* rever requirements.txt change
* use exception instead of success variable
* another test case + style
* just create ssh client instead of sending ping
* pr comments
* removed a separate class for a remove_node result and return a tuple instead; occams razor
* added type annotation
* unused import
Summary
If a node stops responding, we don't want to schedule any more tests on it.
The way we do this is we ping every node as we allocate a subcluster for the test. If the node does not respond, we remove it from the cluster's available nodes permanently.
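A minimal sketch of what such a health check could look like. The PR's commit history mentions creating an SSH client instead of sending a ping, and ducktape uses paramiko for SSH, but the function below and its parameters are my own illustration, not the actual implementation:

import paramiko


def node_is_reachable(host, port=22, username="vagrant", timeout=5):
    """Hypothetical sketch of the health check: try to open an SSH connection;
    a node that fails here would be removed from the available pool permanently."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, port=port, username=username, timeout=timeout)
        return True
    except Exception:
        return False
    finally:
        client.close()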
Implementation notes
I've considered multiple ways of implementing this.
The cleanest by far would be to move remove_spec and its "sister" methods out of node_container and into the cluster. However, it would also be the most disruptive, especially if other people are using their own custom cluster implementations.
Instead, I've opted to do a relatively minimalistic patch to node_container, the cluster implementations and the runner. The core of the change is in the node_container.remove_spec() method, where we now check node health before allocating a node - if the node is of type RemoteAccount. It will return a list of good nodes and a list of bad nodes, or raise an exception with the bad nodes included as part of the exception object. I'm not super happy about this implementation for two reasons:
Testing
Tested with unit tests and live runs with vagrant.