Node health checks during pre-allocation take time and lead to runner client timeouts #339

Closed
stan-is-hate opened this issue Jul 15, 2022 · 0 comments

Symptoms
Unable to receive response from driver when running many tests in parallel, especially if the tests require many nodes.

Cause
This error is thrown by the runner_client when it sends a request to the runner and does not receive a response in time. The timeout is 3 seconds per attempt with 5 retries, so 15 seconds total.
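
For context, the runner_client's request loop is essentially zmq's lazy-pirate retry pattern. The sketch below is illustrative only (the constants and helper are not ducktape's actual code), but it shows where the 3 s × 5 = 15 s budget comes from:

```python
import zmq

REQUEST_TIMEOUT_MS = 3000   # 3-second wait per attempt (illustrative constant)
NUM_RETRIES = 5             # 5 attempts -> 15 seconds total before giving up

def send_with_retries(runner_addr, payload):
    """Hypothetical helper: send a request and retry until the runner replies."""
    context = zmq.Context.instance()
    for _ in range(NUM_RETRIES):
        socket = context.socket(zmq.REQ)
        socket.connect(runner_addr)
        socket.send(payload)
        poller = zmq.Poller()
        poller.register(socket, zmq.POLLIN)
        if poller.poll(REQUEST_TIMEOUT_MS):   # reply arrived within 3 seconds
            reply = socket.recv()
            socket.close()
            return reply
        socket.setsockopt(zmq.LINGER, 0)      # timed out: drop socket and retry
        socket.close()
    raise RuntimeError("Unable to receive response from driver")
```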

The issue started happening when v0.10.0 introduced health checks for nodes, removing any node that fails the check from the cluster, with the intention of not scheduling tests on unresponsive nodes.

However, when scheduling multiple tests in a row (the --max-parallel flag), and each test takes multiple nodes, this process can take some time.

Each test that the runner triggers spawns a runner_client process, which then sends back a ready message. Due to the design of the runner, it does not attempt to receive messages until it finishes scheduling all available tests, so the first test that was triggered will not receive the response to its ready message until the runner finishes scheduling all remaining tests.
And since the runner pings each node for each test, that process can take longer than 15 seconds, leading to the first runner_client timing out while waiting for a response from the runner.
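
A quick back-of-the-envelope example (the numbers are made up, not measured) shows how easily the pre-allocation health checks can exceed that budget:

```python
# Hypothetical numbers purely for illustration.
tests_to_schedule = 20      # queued tests with --max-parallel
nodes_per_test = 5          # each test needs several nodes
health_check_seconds = 0.2  # one ping/probe per node

total = tests_to_schedule * nodes_per_test * health_check_seconds
print(total)  # 20.0 seconds -- already past the 15-second client budget
```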

Mitigation
We've already removed v0.10.0 from PyPI.

We've also updated the internal release process guide to include running the full suite of Apache Kafka tests before each ducktape release. We will move this guide to the official documentation or GitHub wiki in the near future.

We will be releasing v0.10.1 shortly, which disables health checks for nodes, to unblock master branch development (the master branch has since moved on and more PRs have been merged).
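
Conceptually, the mitigation amounts to turning the node availability check into a no-op (see the "make available() a noop in remoteaccount.py" commit referenced below). A minimal sketch, not the exact diff:

```python
# ducktape/cluster/remoteaccount.py -- sketch of the v0.10.1 mitigation
class RemoteAccount:
    def available(self):
        # v0.10.0 performed a health check against the node here;
        # returning True unconditionally skips the check, so pre-allocation
        # no longer pings every node for every test.
        return True
```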

Long-term solutions
Details TBD; the current idea is to process zmq responses in a separate thread in the runner, so that scheduling stays synchronous while responses are handled asynchronously. This should work, but any async and threading work can introduce new issues, so care needs to be taken.
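
A rough sketch of that idea (names are hypothetical, not an actual implementation): a daemon thread owns the receive side of the zmq socket and drains incoming messages into a queue, while the scheduling loop stays synchronous and polls the queue between scheduling steps.

```python
import queue
import threading

def start_receiver(socket, inbox):
    """Drain incoming zmq messages into inbox from a background thread.

    Note: a zmq socket must only be used from one thread, so only this
    receiver thread calls recv() on it.
    """
    def _recv_loop():
        while True:
            inbox.put(socket.recv())   # blocks here, not in the scheduler
    t = threading.Thread(target=_recv_loop, daemon=True)
    t.start()
    return t

# Usage sketch inside the runner:
# inbox = queue.Queue()
# start_receiver(receive_socket, inbox)
# while there are tests left:           # scheduling stays synchronous
#     schedule_next_test()
#     while not inbox.empty():          # responses handled as they arrive
#         handle_message(inbox.get())
```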

cc @confluentinc/quality-eng

stan-is-hate self-assigned this Jul 15, 2022
stan-is-hate added a commit that referenced this issue Jul 18, 2022
* dont use copy

* dont use continue

* also log scheduling errors

* small fixes

* make available() a noop in remoteaccount.py

* bump version to 0.10.1 + changelog

* style
gousteris pushed a commit to gousteris/ducktape that referenced this issue Aug 30, 2023
…onfluentinc#343)

* dont use copy

* dont use continue

* also log scheduling errors

* small fixes

* make available() a noop in remoteaccount.py

* bump version to 0.10.1 + changelog

* style