Remove bad nodes from cluster during the test execution #325
Conversation
Thanks so much stan, partial review for now, going to go look into scheduler and runner this afternoon but appreciate the fix!
allocated, bad, err = self._available_nodes.remove_spec(cluster_spec)
if err:
    raise InsufficientResourcesError("Not enough nodes available to allocate. " + err)
man, not a huge fan of this go-like error handling here. Why don't we do error handling the more pythonic way here?
not a fan either, but that's how it worked before in node_container (check the can_remove_spec method). The problem is that outside of a larger refactor, I don't see a good way of returning multiple types of information here, because there are multiple outcomes of this method. I guess I could wrap this in an AllocationResult object with a 'success' field set to True or False, I'll try that.
There's a more fundamental issue here of using exceptions for control flow, which is not a good pattern in general, but we do use it since I didn't want to modify the signature of the alloc method on the cluster. This method really should not raise an exception when there are not enough nodes - that is one of the expected conditions, not an exceptional one - but we'd have to modify the signatures of do_alloc and alloc to get rid of that.
using exceptions for control flow is considered pretty pythonic: https://docs.quantifiedcode.com/python-anti-patterns/readability/asking_for_permission_instead_of_forgiveness_when_working_with_files.html
generally, try/except is preferred in Python in my understanding, as it's more of a "do the thing and see if it works" language rather than the other way around.
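For illustration, here is a minimal, self-contained sketch of the two styles being discussed. The names below (allocate_go_style, allocate_pythonic, NotEnoughNodesError) are hypothetical and are not the actual ducktape API:

class NotEnoughNodesError(Exception):
    """Hypothetical exception for the sketch: carries the unresponsive nodes."""
    def __init__(self, message, bad_nodes=None):
        super().__init__(message)
        self.bad_nodes = bad_nodes or []


def allocate_go_style(nodes, needed):
    """Go-like style: return a value plus an error string the caller must check."""
    good = [n for n in nodes if n.get("healthy", True)]
    bad = [n for n in nodes if not n.get("healthy", True)]
    if len(good) < needed:
        return None, bad, "not enough healthy nodes"
    return good[:needed], bad, None


def allocate_pythonic(nodes, needed):
    """Pythonic style: raise an exception for the failure case instead of returning err."""
    good = [n for n in nodes if n.get("healthy", True)]
    bad = [n for n in nodes if not n.get("healthy", True)]
    if len(good) < needed:
        raise NotEnoughNodesError("not enough healthy nodes", bad_nodes=bad)
    return good[:needed], bad


nodes = [{"name": "a", "healthy": True}, {"name": "b", "healthy": False}]
try:
    allocated, bad = allocate_pythonic(nodes, needed=2)
except NotEnoughNodesError as e:
    print("allocation failed:", e, "bad nodes:", e.bad_nodes)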
left some more comments for you - thanks for this work Stan, it'll make a huge difference on ducktape run stability! One question I do have: how are these unscheduled tests reported? Should we create a new category for unscheduled?
ducktape/cluster/node_container.py
@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)
could be convenient to just add
@dataclass
class RemoveSpecResult:
    good_nodes: List = field(default_factory=list)
    bad_nodes: List = field(default_factory=list)

    def __iter__(self):
        yield self.good_nodes
        yield self.bad_nodes
if you want to do
good_nodes, bad_nodes = remove_spec
cool!
but then why would you even need a result spec? can simply do return good_nodes, bad_nodes
did exactly that, seems more pythonic and occam-ic :)
well, sometimes you want both, I guess. If you only want the bad nodes you can do get_results().bad_nodes,
but if you want both you can do good_nodes, bad_nodes = get_results().
But hey, whatever you'd like here to be honest, it's not super important.
This is kinda the use case for the named tuple here: that way it's still explicit, and if you're a third party it's easier to understand what you are holding after the return. But it's kinda a small detail.
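To make that concrete, here is a small sketch of the named-tuple variant being described (illustrative only; as the rest of the thread shows, the PR ultimately settled on a plain tuple):

from typing import List, NamedTuple


class RemoveSpecResult(NamedTuple):
    """Sketch: a named result supports both attribute access and tuple unpacking."""
    good_nodes: List
    bad_nodes: List


result = RemoveSpecResult(good_nodes=["node1", "node2"], bad_nodes=["node3"])
print(result.bad_nodes)          # explicit access when you only need one field
good_nodes, bad_nodes = result   # unpacking when you want both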
yeah, I see your point. It's expressiveness vs simplicity I think. In this case I think an unnamed result is a bit more 'pythonic', but it can be argued about for sure. You're right about it being a small detail, so I'll leave it unnamed :)
looks good thanks stan!
Oh, I missed this question:
They are reported as FAILED, same as before - any test that doesn't fit the existing cluster will be marked as failed, and the failure message should include the reason (i.e. not enough nodes available). I haven't changed the code to report unscheduled tests; I've only made the check run every time the cluster changes size, rather than once at the beginning.
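As a rough sketch of that reporting behaviour, under my own assumptions (the function and field names below are hypothetical and not the actual runner code):

def check_schedulability(remaining_tests, cluster_size):
    """Hypothetical sketch: whenever the cluster shrinks, re-check each remaining
    test against the new size and fail the ones that can no longer fit, with the
    reason in the failure message (no separate 'unscheduled' category)."""
    failures = []
    for test in remaining_tests:
        if test["min_nodes"] > cluster_size:
            failures.append((test["name"],
                             "Not enough nodes available: needs %d, cluster has %d"
                             % (test["min_nodes"], cluster_size)))
    return failures


print(check_schedulability([{"name": "test_rolling_restart", "min_nodes": 3}], cluster_size=2))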
…#325)
* work in progress - hacking around removing nodes from the cluster
* temporarily revert to older paramiko
* work in progress - hacking around removing nodes from the cluster
* fixed most of this crap
* fixed the rest of the issues
* added couple of tests; more needed
* fixed more tests, fixed a bug when cluster becomes empty
* fixed the rest of the tests
* added another test, plus comments and simplify check cluster size changed
* removed unused var
* style
* merge fixes
* remove debug output
* refactor and more tests
* pr comment
* moved kwarg after positional args in json and vagrant
* updated vagrant to ubuntu20 and fixed network discovery to account for other possible interface names
* rever requirements.txt change
* use exception instead of success variable
* another test case + style
* just create ssh client instead of sending ping
* pr comments
* removed a separate class for a remove_node result and return a tuple instead; occams razor
* added type annotation
* unused import
Summary
If a node stops responding, we don't want to schedule any more tests on it.
The way we do this is we ping every node as we allocate a subcluster for the test. If the node does not respond, we remove it from the cluster's available nodes permanently.
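A minimal sketch of what such a health check could look like. The PR's commit history mentions creating an SSH client instead of sending a ping, and ducktape uses paramiko for SSH, but the function below and its parameters are my own illustration, not the actual implementation:

import paramiko


def node_is_reachable(host, port=22, username="vagrant", timeout=5):
    """Hypothetical sketch of the health check: try to open an SSH connection;
    a node that fails here would be removed from the available pool permanently."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, port=port, username=username, timeout=timeout)
        return True
    except Exception:
        return False
    finally:
        client.close()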
Implementation notes
I've considered multiple ways of implementing this.
The cleanest by far would be to move remove_spec and its "sister" methods out of node_container and into the cluster. However, it would also be the most disruptive, especially if other people are using their own custom cluster implementations.
Instead, I've opted to do a relatively minimalistic patch to node_container, the cluster implementations and the runner. The core of the change is in the node_container.remove_spec() method, where we now check node health before allocating a node - if the node is of type RemoteAccount. It will return a list of good nodes and a list of bad nodes, or raise an exception with the bad nodes included as part of the exception object. I'm not super happy about this implementation for two reasons:
Testing
Tested with unit tests and live runs with vagrant.