Node health checks during pre-allocation take time and lead to runner client timeouts #339

Closed
stan-is-hate opened this issue Jul 15, 2022 · 0 comments

Symptoms
Unable to receive response from driver when running many tests in parallel, especially if the tests require many nodes.

Cause
This error is thrown by the runner_client when it sends a request to the runner and does not receive a response in time. The timeout is 3 seconds per attempt with 5 retries, so 15 seconds total.
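
For context, the runner_client's request loop is essentially zmq's lazy-pirate retry pattern. The sketch below is illustrative only (the constants and helper are not ducktape's actual code), but it shows where the 3 s × 5 = 15 s budget comes from:

```python
import zmq

REQUEST_TIMEOUT_MS = 3000   # 3-second wait per attempt (illustrative constant)
NUM_RETRIES = 5             # 5 attempts -> 15 seconds total before giving up

def send_with_retries(runner_addr, payload):
    """Hypothetical helper: send a request and retry until the runner replies."""
    context = zmq.Context.instance()
    for _ in range(NUM_RETRIES):
        socket = context.socket(zmq.REQ)
        socket.connect(runner_addr)
        socket.send(payload)
        poller = zmq.Poller()
        poller.register(socket, zmq.POLLIN)
        if poller.poll(REQUEST_TIMEOUT_MS):   # reply arrived within 3 seconds
            reply = socket.recv()
            socket.close()
            return reply
        socket.setsockopt(zmq.LINGER, 0)      # timed out: drop socket and retry
        socket.close()
    raise RuntimeError("Unable to receive response from driver")
```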

The issue started happening when v0.10.0 introduced health checks for nodes, removing any node that fails the check from the cluster, with the intention of not scheduling tests on unresponsive nodes.

However, when scheduling multiple tests in a row (the --max-parallel flag), and each test takes multiple nodes, this process can take some time.

Each test that the runner triggers spawns a runner_client process, which then sends back a ready message. Due to the design of the runner, it does not attempt to receive messages until it finishes scheduling all available tests, so the first test that was triggered will not receive the response to its ready message until the runner finishes scheduling all remaining tests.
And since the runner pings each node for each test, that process can take longer than 15 seconds, leading to the first runner_client timing out while waiting for a response from the runner.
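
A quick back-of-the-envelope example (the numbers are made up, not measured) shows how easily the pre-allocation health checks can exceed that budget:

```python
# Hypothetical numbers purely for illustration.
tests_to_schedule = 20      # queued tests with --max-parallel
nodes_per_test = 5          # each test needs several nodes
health_check_seconds = 0.2  # one ping/probe per node

total = tests_to_schedule * nodes_per_test * health_check_seconds
print(total)  # 20.0 seconds -- already past the 15-second client budget
```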

Mitigation
We've already removed v0.10.0 from PyPI.

We've also updated the internal release process guide to include running the full suite of Apache Kafka tests before each ducktape release. We will move this guide to the official documentation or GitHub wiki in the near future.

We will be releasing v0.10.1 shortly, which disables health checks for nodes, to unblock master branch development (the master branch has since moved on and more PRs have been merged).
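
Conceptually, the mitigation amounts to turning the node availability check into a no-op (see the "make available() a noop in remoteaccount.py" commit referenced below). A minimal sketch, not the exact diff:

```python
# ducktape/cluster/remoteaccount.py -- sketch of the v0.10.1 mitigation
class RemoteAccount:
    def available(self):
        # v0.10.0 performed a health check against the node here;
        # returning True unconditionally skips the check, so pre-allocation
        # no longer pings every node for every test.
        return True
```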

Long-term solutions
Details TBD; the current idea is to process zmq responses in a separate thread in the runner, so that scheduling stays synchronous while responses are handled asynchronously. This should work, but any async and threading work can introduce new issues, so care needs to be taken.
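
A rough sketch of that idea (names are hypothetical, not an actual implementation): a daemon thread owns the receive side of the zmq socket and drains incoming messages into a queue, while the scheduling loop stays synchronous and polls the queue between scheduling steps.

```python
import queue
import threading

def start_receiver(socket, inbox):
    """Drain incoming zmq messages into inbox from a background thread.

    Note: a zmq socket must only be used from one thread, so only this
    receiver thread calls recv() on it.
    """
    def _recv_loop():
        while True:
            inbox.put(socket.recv())   # blocks here, not in the scheduler
    t = threading.Thread(target=_recv_loop, daemon=True)
    t.start()
    return t

# Usage sketch inside the runner:
# inbox = queue.Queue()
# start_receiver(receive_socket, inbox)
# while there are tests left:           # scheduling stays synchronous
#     schedule_next_test()
#     while not inbox.empty():          # responses handled as they arrive
#         handle_message(inbox.get())
```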

cc @confluentinc/quality-eng

stan-is-hate self-assigned this Jul 15, 2022
stan-is-hate added a commit that referenced this issue Jul 18, 2022
* dont use copy

* dont use continue

* also log scheduling errors

* small fixes

* make available() a noop in remoteaccount.py

* bump version to 0.10.1 + changelog

* style
gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 30, 2023
…onfluentinc#343)

* dont use copy

* dont use continue

* also log scheduling errors

* small fixes

* make available() a noop in remoteaccount.py

* bump version to 0.10.1 + changelog

* style