Fix Issues with CS Handling in ILM Async Steps #68361
A few findings from debugging an SLM/ILM test issue:

* No need to spin on all CS changes when waiting for a snapshot to go away, just use an appropriate predicate
* Stop waiting for a new state if we are no longer master, since we can't complete the action anyway at this point (worse yet, it will potentially run concurrently on local and master if it uses the `client` to execute its requests)
* Properly fail the CS listener if the node is shutting down, to get logging that is easier to debug
* Stop throwing exceptions in `clusterStateProcessed` in the error step mover, this is forbidden
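A rough, self-contained sketch of the first three points, for illustration only. This is not the Elasticsearch code touched by this PR: `State`, `Listener`, `Observer`, and `run` are hypothetical stand-ins that only model the idea of waiting on a predicate, bailing out when the local node is no longer master, and failing the listener on shutdown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class PredicateWaitSketch {

    /** Hypothetical, heavily simplified stand-in for a cluster state. */
    static final class State {
        final boolean localNodeIsMaster;
        final boolean snapshotInProgress;

        State(boolean localNodeIsMaster, boolean snapshotInProgress) {
            this.localNodeIsMaster = localNodeIsMaster;
            this.snapshotInProgress = snapshotInProgress;
        }
    }

    /** Hypothetical listener with a success and a failure callback. */
    interface Listener {
        void onNewState(State state);
        void onFailure(Exception e);
    }

    /** Hypothetical observer: notifies once, and only when the predicate matches. */
    static final class Observer {
        private Predicate<State> predicate;
        private Listener listener;

        void waitFor(Predicate<State> predicate, Listener listener) {
            this.predicate = predicate;
            this.listener = listener;
        }

        /** Called for every state change; non-matching states are ignored. */
        void publish(State state) {
            if (listener != null && predicate.test(state)) {
                Listener l = listener;
                listener = null; // one-shot: stop observing after the first match
                l.onNewState(state);
            }
        }

        /** Node shutdown: fail the pending listener instead of dropping it silently. */
        void close() {
            if (listener != null) {
                Listener l = listener;
                listener = null;
                l.onFailure(new IllegalStateException("node is shutting down"));
            }
        }
    }

    /** Feeds the given states to an observer and records what the listener saw. */
    static List<String> run(List<State> states, boolean closeAtEnd) {
        List<String> events = new ArrayList<>();
        Observer observer = new Observer();
        observer.waitFor(
            // Wake up when the snapshot is gone OR we lost mastership.
            s -> s.snapshotInProgress == false || s.localNodeIsMaster == false,
            new Listener() {
                @Override
                public void onNewState(State s) {
                    // Not master any more: give up rather than retry locally.
                    events.add(s.localNodeIsMaster ? "retry" : "abort: no longer master");
                }

                @Override
                public void onFailure(Exception e) {
                    events.add("failed: " + e.getMessage());
                }
            });
        states.forEach(observer::publish);
        if (closeAtEnd) {
            observer.close();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(new State(true, true), new State(true, false)), false));
        System.out.println(run(Arrays.asList(new State(false, true)), false));
        System.out.println(run(Arrays.asList(new State(true, true)), true));
    }
}
```

The predicate is what removes the need to re-register on every CS change: states that match neither condition never reach the listener.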
Pinging @elastic/es-core-features (Team:Core/Features)
…nc-retry-during-snapshot-action-step
LGTM, thanks for this cleanup and fix @original-brownbear
LGTM also, thanks Armin
npnp + thanks Andrei + Lee!
Marking as non-issue for now since it mostly makes test debugging noisy.