Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete old shutdowns before creating new ones #5629

Merged
merged 3 commits into from Apr 29, 2022
Merged

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Apr 28, 2022

Fixes #5628

Moves the cleanup at the beginning of the upgrade function to make sure we don't end up with left over shutdowns and then don't get to clean them up in time because we exit early after node deletions.

@pebrc pebrc added >bug Something isn't working v2.2.0 labels Apr 28, 2022
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 28, 2022

run/e2e-tests tags=es

@pebrc pebrc marked this pull request as draft April 28, 2022 15:02
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 28, 2022

I need to clean up the tests but I think the principle works. Just wanted to raise this already to run the e2e tests.

@pebrc
Copy link
Collaborator Author

pebrc commented Apr 28, 2022

Test failed due to an issue with the new pipeline. Will try to run manual tests for the time being.

@thbkrkr
Copy link
Contributor

thbkrkr commented Apr 28, 2022

run/e2e-tests tags=es

@pebrc
Copy link
Collaborator Author

pebrc commented Apr 28, 2022

Screenshot 2022-04-28 at 21 35 49

Test run of all es e2e test was successful. There is a more ambitious version of this PR that simply moves the maybeCompleteNodeUpgrades function to the same position as the call to nodeShutdown.Clear. It has the advantage of keeping the clean up code together. I am still testing this locally but the current change set would be a minimal fix IMO.

@pebrc pebrc marked this pull request as ready for review April 28, 2022 19:43
Copy link
Contributor

@naemono naemono left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running test locally, but can't see a scenario where clearing old shutdowns where the node is back in the cluster at the start of any upgrade loop would cause future problems.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
I did some manual testing while keeping an eye on /_nodes/shutdown to ensure that there was at most one shutdown record. Maybe this is something we could include in a test, ideally in a unit test or/and maybe along the continuous health check. (in another PR ofc)

@pebrc
Copy link
Collaborator Author

pebrc commented Apr 29, 2022

Thanks @barkbay and @naemono for the quick reviews!

Maybe this is something we could include in a test, ideally in a unit test or/and maybe along the continuous health check. (in another PR ofc)

Yes I think I am going to follow up on this PR with something a bit more comprehensive targeting the main branch/next release and will include improved e2e testing as well. I think your idea of an additional continuous health check is really good.

@pebrc pebrc merged commit 53bf9a4 into elastic:main Apr 29, 2022
pebrc added a commit to pebrc/cloud-on-k8s that referenced this pull request Apr 29, 2022
Moves the cleanup at the beginning of the upgrade function to make sure we don't end up with left over shutdowns and then don't get to clean them up in time because we exit early after node deletions.

(cherry picked from commit 53bf9a4)
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 29, 2022

💚 All backports created successfully

Status Branch Result
2.2

Questions ?

Please refer to the Backport tool documentation

thbkrkr pushed a commit that referenced this pull request Apr 29, 2022
Moves the cleanup at the beginning of the upgrade function to make sure we don't end up with left over shutdowns and then don't get to clean them up in time because we exit early after node deletions.

(cherry picked from commit 53bf9a4)
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
Moves the cleanup at the beginning of the upgrade function to make sure we don't end up with left over shutdowns and then don't get to clean them up in time because we exit early after node deletions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working v2.2.0
Projects
None yet
4 participants