Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Open
12 of 15 tasks
Tracked by #2428
pierrehilbert opened this issue Jan 25, 2023 · 12 comments · Fixed by #2618 or #3274
Open
12 of 15 tasks
Tracked by #2428

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

pierrehilbert opened this issue Jan 25, 2023 · 12 comments · Fixed by #2618 or #3274
Assignees
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@pierrehilbert
Copy link
Contributor

pierrehilbert commented Jan 25, 2023

We need to create a basic integration test in the Agent repository (because we are testing Agent capability) to ensure that the upgrade are working as expected:

This integration test should cover upgrade from the Agent itself and not test upgrade triggered by Fleet Server.
Upgrade versions should be configurable but by default it should be current_version to next_version.

This test is crucial: upgrade is a really impacting feature and we faced some issue in the past.

For reference the majority of the upgrade process is defined in https://github.com/elastic/elastic-agent/tree/main/internal/pkg/agent/application/upgrade.

Blocked By:

@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jan 25, 2023
@blakerouse
Copy link
Contributor

I think there is an issue with writing this test. Currently the built version of Elastic Agent stores that hash of the running elastic-agent, an upgrade within the same repository would result in it trying to upgrade to the same hash value. That will not work.

We need a way to have upgrade use a different hash value, but possibly it is the same build, hmm...

@jlind23 jlind23 changed the title Create an integration test to ensure that Elastic Agent's upgrade are working as expected Create an e2e test to ensure that Elastic Agent's upgrade are working as expected May 9, 2023
@jlind23 jlind23 linked a pull request May 9, 2023 that will close this issue
7 tasks
@ycombinator
Copy link
Contributor

Reopening this issue as #2618 only covers the first test scenario mentioned in this issue:

Tests that the normal upgrade process can proceed without issues

@ycombinator
Copy link
Contributor

There is a test listed in this issue's description that goes like so:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version fails to start.

I've been working on this test for the past couple of weeks and have found a few bugs in the process. Fixing these bugs will unblock this test.

However, speaking with @pierrehilbert today, he clarified that the intent of the test is to ensure that we have a graceful rollback process in place, not so much how the upgraded Agent fails to cause this rollback. As such, I've left this test as is but added another test to the list that is similar:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version reports as unhealthy.

Both these tests will ensure that we have a graceful rollback process. The difference in them is in the setup of the test scenario - in the first one, the rollback should be initiated due to the Agent process crashing whereas in the second one, the rollback should be initiated due to the Agent process remaining up but reporting as unhealthy.

@ycombinator
Copy link
Contributor

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

@pierrehilbert @cmacknz @jlind23 Not sure who added this item in the list of tests so pinging all of you. Thanks!

I'd like to get some clarity on what we mean by "current version" and "interrupted" exactly.

For example, let's say the user requests an upgrade, and while the Agent is downloading the new artifact, we kill the Agent process. This would've been the old (pre-upgrade) version of the Agent that we killed. In this case, do we want the test to verify that the Agent is restarted reliably with the old version or (eventually) the new version?

Similarly, let's say the new version (post-upgrade) of the Agent is running. But the Upgrade Watcher is also running. What does it mean to interrupt the upgrade in this scenario? Should the test try killing the Upgrade Watcher process and verify that it eventually restarted? Or should the test try killing the new version of the Agent (only once, otherwise the Upgrade Watcher will roll back the upgrade thinking the new Agent is unhealthy — and we already have an integration test for this scenario) and verify that the new version of the Agent is running again and does not get rolled back by the Upgrade Watcher?

Perhaps we want to test all of these different scenarios? In that case, I think we should replace this one test in the list with multiple ones, one for each specific scenario.

@jlind23
Copy link
Contributor

jlind23 commented Oct 2, 2023

For example, let's say the user requests an upgrade, and while the Agent is downloading the new artifact, we kill the Agent process. This would've been the old (pre-upgrade) version of the Agent that we killed. In this case, do we want the test to verify that the Agent is restarted reliably with the old version or (eventually) the new version?

I'm not the one who wrote it, but I believe that in such case we want the Agent to restart reliably with the old version as we don't know if what state the new one ended up.

Perhaps we want to test all of these different scenarios? In that case, I think we should replace this one test in the list with multiple ones, one for each specific scenario.

I lean towards that option, we should probably list all specific scenarii and prioritise the work.

@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

Ideally we test all three scenarios to know the result, but the most likely reason for the upgrade process to be interrupted is the host system shutting down or hibernating mid-upgrade. This means we are unlikely to see the watcher and new version of the agent be interrupted individually, it is more likely they will both be terminated together.

The overarching goal with the upgrade process is eventual consistency. If the user instructed the agent to upgrade it should perform the upgrade, eventually, regardless of any disruption to any process or system involved in the upgrade.

Note that "perform the upgrade" means progressing through every stage of the upgrade process, and it is valid for the end state to be a successful upgrade to a new version or roll back to the previous version if the new version does not work as expected.

This starts to overlap with https://github.com/elastic/ingest-dev/issues/1889 (fault tolerant action processing scenarios), I suspect to get this to work properly we will possibly need to make changes to Fleet as well as the agent since it will be directly tied to how and when we acknowledge actions and where they are persisted.

I would prefer we prioritize making the upgrade process observable before we do this type of testing, both because doing so will make these problems easier to debug and because making this work correctly under all conditions may take significant time.

@jlind23
Copy link
Contributor

jlind23 commented May 15, 2024

@ycombinator is this issue still required even with all the work we achieved on upgrades lately?

@blakerouse
Copy link
Contributor

@jlind23 @ycombinator The integration testing framework covers all of this, even the checkboxes that where not checked. I would say this one is done.

@ycombinator
Copy link
Contributor

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.
Tests that the upgrade process can detect corrupted or unsigned upgrade packages.
Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

Could you please point me to the tests that cover each of these scenarios?

@blakerouse
Copy link
Contributor

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Tests that the upgrade process can detect corrupted or unsigned upgrade packages.

This one I do believe is missing. Was thinking we had it, but was unable to find it.

Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L42

Could you please point me to the tests that cover each of these scenarios?

See above.

@ycombinator
Copy link
Contributor

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Actually, this test was written to cover the following scenario:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version fails to start: #3085

We need to write another test that interrupts the upgrade process itself, e.g. restarting the Agent while downloading or right before switching the symlink and ensuring that the pre-upgrade Agent version is the one that comes up. I'll clarify in the task list item to remove the ambiguity.

@ycombinator
Copy link
Contributor

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Tests that the upgrade process can detect corrupted or unsigned upgrade packages.

This one I do believe is missing. Was thinking we had it, but was unable to find it.

Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L42

Could you please point me to the tests that cover each of these scenarios?

See above.

Actually, this test was written to cover the following scenario:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version reports as unhealthy: #3274

We need to write another test that deliberately builds a corrupted Agent package, tries to upgrade to it, and ensures that the upgrade is rolled back to the pre-upgrade Agent version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent Label for the Agent team
Projects
None yet
5 participants