
Remove bad nodes from cluster during the test execution #325

Merged
merged 29 commits into master from tainted-nodes Jun 29, 2022

Conversation

@stan-is-hate (Member) commented Jun 10, 2022

Summary
If a node stops responding, we don't want to schedule any more tests on it.
We do this by pinging every node as we allocate a subcluster for the test; if a node does not respond, we remove it from the cluster's available nodes permanently.
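
(For illustration, a minimal sketch of the kind of per-node health check described above. The helper name and parameters are made up for this example, not the actual ducktape API; per the commit history below, the real check opens an SSH client rather than sending an ICMP ping.)

    # Illustrative sketch only: a node is "healthy" if we can open an SSH
    # connection to it within a short timeout.
    import paramiko

    def node_is_healthy(hostname, username, timeout_sec=5):
        """Return True if an SSH connection to the node can be established."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(hostname, username=username, timeout=timeout_sec)
            return True
        except Exception:
            # Any failure to connect marks the node as unhealthy.
            return False
        finally:
            client.close()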

Implementation notes
I've considered multiple ways of implementing this.
The cleanest by far would be to move remove_spec and its "sister" methods out from node_container and into the cluster.

However, it would also be the most disruptive, especially if other people are using their own custom cluster implementations.

Instead, I've opted for a relatively minimalistic patch to node_container, the cluster implementations and the runner. The core of the change is in the node_container.remove_spec() method, where we now check node health before allocating a node, if the node is of type RemoteAccount. The method returns a list of good nodes and a list of bad nodes, or raises an exception with the bad nodes included as part of the exception object.
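
(A rough, self-contained sketch of that flow, with simplified names; the real signatures live in ducktape/cluster/node_container.py and differ in detail, so treat this as illustration only.)

    # Illustrative only: return the good and bad nodes, or raise with the bad
    # nodes attached to the exception. Not the actual ducktape implementation.
    class InsufficientResourcesError(Exception):
        def __init__(self, message, bad_nodes=None):
            super().__init__(message)
            self.bad_nodes = bad_nodes or []

    def remove_spec(available_nodes, num_nodes_needed, is_healthy):
        """Allocate num_nodes_needed healthy nodes out of available_nodes."""
        good_nodes, bad_nodes = [], []
        for node in list(available_nodes):
            if len(good_nodes) == num_nodes_needed:
                break
            if is_healthy(node):
                good_nodes.append(node)
                available_nodes.remove(node)  # allocated to the test
            else:
                bad_nodes.append(node)
                available_nodes.remove(node)  # dropped from the cluster permanently
        if len(good_nodes) < num_nodes_needed:
            raise InsufficientResourcesError(
                "Not enough healthy nodes available to allocate.", bad_nodes=bad_nodes)
        return good_nodes, bad_nodes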

I'm not super happy about this implementation:

  • NodeContainer behaves differently depending on whether it contains ClusterNode or RemoteAccount objects, and so may or may not perform health checks

Testing
Tested with unit tests and live runs with vagrant.

@stan-is-hate changed the title from "DRAFT: Remove bad nodes from cluster during the test execution" to "Remove bad nodes from cluster during the test execution" Jun 16, 2022
@stan-is-hate marked this pull request as ready for review June 16, 2022 05:02
@stan-is-hate requested a review from a team June 16, 2022 05:02
@imcdo (Member) left a comment:

Thanks so much Stan, partial review for now; I'm going to look into the scheduler and runner this afternoon, but I appreciate the fix!

Comment on lines 37 to 39
allocated, bad, err = self._available_nodes.remove_spec(cluster_spec)
if err:
    raise InsufficientResourcesError("Not enough nodes available to allocate. " + err)
@imcdo (Member):

Man, not a huge fan of this Go-like error handling here. Why don't we do error handling the more Pythonic way?

@stan-is-hate (Member, Author):

Not a fan either, but that's how it worked before in node_container (check the can_remove_spec method). The problem is that, outside of a larger refactor, I don't see a good way of returning multiple types of information here, because there are multiple possible outcomes of this method. I guess I could wrap this in an AllocationResult object, which would have a 'success' field set to True or False; I'll try that.

There's a more fundamental issue here of using exceptions for control flow, which is not a good pattern in general, but we do use it, since I didn't want to modify the signature of the cluster's alloc method. Arguably this method should not raise an exception when there are not enough nodes, since that's one of the expected conditions rather than an exceptional one, but we'd have to modify the signatures of do_alloc and alloc to get rid of that.
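
(Caller-side sketch of what that looks like, with hypothetical names, reusing the illustrative InsufficientResourcesError above; the point is that the exception carries the bad nodes and is treated as an expected outcome.)

    # Illustrative only: how a caller might consume the exception-based API.
    def try_alloc(cluster, cluster_spec):
        try:
            return cluster.alloc(cluster_spec)
        except InsufficientResourcesError as e:
            # Expected condition used for control flow: drop the nodes that
            # failed the health check and let the runner fail the test.
            for node in e.bad_nodes:
                cluster.remove_bad_node(node)  # hypothetical helper
            return None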

@imcdo (Member), Jun 16, 2022:

Using exceptions for control flow is considered pretty Pythonic: https://docs.quantifiedcode.com/python-anti-patterns/readability/asking_for_permission_instead_of_forgiveness_when_working_with_files.html

Generally, try/except is preferred in Python, in my understanding, since it's more of a "do the thing and see if it works" language rather than the other way around.
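
(For reference, the pattern from the linked article, using its file-handling example.)

    import os

    # LBYL ("look before you leap"): check first, then act.
    if os.path.exists("results.txt"):
        with open("results.txt") as f:
            data = f.read()

    # EAFP ("easier to ask forgiveness than permission"): just try it.
    try:
        with open("results.txt") as f:
            data = f.read()
    except FileNotFoundError:
        data = ""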

ducktape/cluster/node_container.py (resolved, outdated)
ducktape/cluster/node_container.py (resolved, outdated)
@imcdo (Member) left a comment:

Left some more comments for you. Thanks for this work, Stan; it'll make a huge difference on ducktape run stability! One question I do have: how are these unscheduled tests reported? Should we create a new category for unscheduled?

ducktape/cluster/node_container.py (resolved, outdated)
ducktape/cluster/node_container.py (resolved, outdated)
ducktape/cluster/node_container.py (resolved, outdated)
ducktape/cluster/remoteaccount.py (resolved, outdated)
ducktape/tests/runner.py (resolved, outdated)
requirements.txt (resolved, outdated)
@stan-is-hate requested review from imcdo and a team June 17, 2022 22:48
ducktape/cluster/node_container.py (resolved)
Comment on lines 37 to 41
@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)

@imcdo (Member):

could be convenient to just add:

Suggested change:

@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)

    def __iter__(self):
        yield self.good_nodes
        yield self.bad_nodes

if you want to do

good_nodes, bad_nodes = remove_spec

@stan-is-hate (Member, Author):

cool!

@stan-is-hate (Member, Author):

But then why would you even need a result class? We can simply do return good_nodes, bad_nodes.

@stan-is-hate (Member, Author):

Did exactly that; seems more Pythonic and Occam-ic :)

@imcdo (Member), Jun 27, 2022:

Well, sometimes you want both, I guess. If you only want the bad nodes you can do get_results().bad_nodes, but if you want both you can do good_nodes, bad_nodes = get_results(). But hey, whatever you'd like here, to be honest; it's not super important.
This is kind of the use case for the named tuple: that way it's still explicit, and if you're a third party it's easier to understand what you are holding after the return. But it's kind of a small detail.
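
(For reference, a sketch of the named-tuple variant being discussed: fields are accessible by name, but the result still unpacks like a plain tuple. Names here are illustrative.)

    from typing import List, NamedTuple

    class RemoveSpecResult(NamedTuple):
        good_nodes: List
        bad_nodes: List

    result = RemoveSpecResult(good_nodes=["node-1"], bad_nodes=["node-2"])
    only_bad = result.bad_nodes         # access just the bad nodes by name
    good_nodes, bad_nodes = result      # or unpack both, tuple-style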

@stan-is-hate (Member, Author):

Yeah, I see your point. It's expressiveness vs. simplicity, I think. In this case I think an unnamed result is a bit more 'Pythonic', but it can certainly be argued either way. You're right about it being a small detail, so I'll leave it unnamed :)

ducktape/cluster/node_container.py (resolved)
ducktape/cluster/node_container.py (resolved, outdated)
ducktape/cluster/remoteaccount.py (resolved)
ducktape/tests/runner.py (resolved, outdated)
ducktape/tests/runner_client.py (resolved)
requirements.txt (resolved, outdated)
systests/cluster/test_debug.py (resolved, outdated)
systests/cluster/test_debug.py (resolved, outdated)
@imcdo (Member) left a comment:

Looks good, thanks Stan!

@stan-is-hate (Member, Author):

Oh, I missed this question:

One question I do have is how are these unscheduled tests reported? Should we create a new category for unscheduled?

They are reported as FAILED, same as before: any test that doesn't fit the existing cluster will be marked as failed, and the failure message should include the reason (i.e. not enough nodes available).

I haven't changed how these tests are reported; I've only made that check run every time the cluster changes size, rather than once at the beginning.
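
(A hypothetical sketch of that re-check, with made-up names, just to illustrate re-running the "does this test still fit?" check whenever the cluster shrinks; it is not the actual runner code.)

    def fail_unschedulable_tests(pending_tests, cluster_size, mark_failed):
        """Mark any pending test that no longer fits the shrunken cluster as FAILED."""
        for test in list(pending_tests):
            if test.min_cluster_size > cluster_size:  # hypothetical attribute
                mark_failed(test, "Not enough nodes available to schedule this test")
                pending_tests.remove(test)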

@stan-is-hate merged commit dc44ae9 into master Jun 29, 2022
@stan-is-hate deleted the tainted-nodes branch June 29, 2022 22:58
gousteris pushed a commit to gousteris/ducktape that referenced this pull request Aug 30, 2023
Remove bad nodes from cluster during the test execution (#325)

* work in progress - hacking around removing nodes from the cluster

* temporarily revert to older paramiko

* work in progress - hacking around removing nodes from the cluster

* fixed most of this crap

* fixed the rest of the issues

* added couple of tests; more needed

* fixed more tests, fixed a bug when cluster becomes empty

* fixed the rest of the tests

* added another test, plus comments and simplify check cluster size changed

* removed unused var

* style

* merge fixes

* remove debug output

* refactor and more tests

* pr comment

* moved kwarg after positional args in json and vagrant

* updated vagrant to ubuntu20 and fixed network discovery to account for other possible interface names

* rever requirements.txt change

* use exception instead of success variable

* another test case + style

* just create ssh client instead of sending ping

* pr comments

* removed a separate class for a remove_node result and return a tuple instead; occams razor

* added type annotation

* unused import