Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

pierrehilbert · 2023-01-25T14:30:52Z

We need to create a basic integration test in the Agent repository (because we are testing Agent capability) to ensure that the upgrade are working as expected:

This integration test should cover upgrade from the Agent itself and not test upgrade triggered by Fleet Server.
Upgrade versions should be configurable but by default it should be current_version to next_version.

This test is crucial: upgrade is a really impacting feature and we faced some issue in the past.

For reference the majority of the upgrade process is defined in https://github.com/elastic/elastic-agent/tree/main/internal/pkg/agent/application/upgrade.

Blocked By:

The text was updated successfully, but these errors were encountered:

blakerouse · 2023-01-25T14:54:09Z

I think there is an issue with writing this test. Currently the built version of Elastic Agent stores that hash of the running elastic-agent, an upgrade within the same repository would result in it trying to upgrade to the same hash value. That will not work.

We need a way to have upgrade use a different hash value, but possibly it is the same build, hmm...

ycombinator · 2023-05-17T09:47:43Z

Reopening this issue as #2618 only covers the first test scenario mentioned in this issue:

Tests that the normal upgrade process can proceed without issues

ycombinator · 2023-08-08T16:31:14Z

There is a test listed in this issue's description that goes like so:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version fails to start.

I've been working on this test for the past couple of weeks and have found a few bugs in the process. Fixing these bugs will unblock this test.

However, speaking with @pierrehilbert today, he clarified that the intent of the test is to ensure that we have a graceful rollback process in place, not so much how the upgraded Agent fails to cause this rollback. As such, I've left this test as is but added another test to the list that is similar:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version reports as unhealthy.

Both these tests will ensure that we have a graceful rollback process. The difference in them is in the setup of the test scenario - in the first one, the rollback should be initiated due to the Agent process crashing whereas in the second one, the rollback should be initiated due to the Agent process remaining up but reporting as unhealthy.

ycombinator · 2023-09-29T21:32:18Z

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

@pierrehilbert @cmacknz @jlind23 Not sure who added this item in the list of tests so pinging all of you. Thanks!

I'd like to get some clarity on what we mean by "current version" and "interrupted" exactly.

For example, let's say the user requests an upgrade, and while the Agent is downloading the new artifact, we kill the Agent process. This would've been the old (pre-upgrade) version of the Agent that we killed. In this case, do we want the test to verify that the Agent is restarted reliably with the old version or (eventually) the new version?

Similarly, let's say the new version (post-upgrade) of the Agent is running. But the Upgrade Watcher is also running. What does it mean to interrupt the upgrade in this scenario? Should the test try killing the Upgrade Watcher process and verify that it eventually restarted? Or should the test try killing the new version of the Agent (only once, otherwise the Upgrade Watcher will roll back the upgrade thinking the new Agent is unhealthy — and we already have an integration test for this scenario) and verify that the new version of the Agent is running again and does not get rolled back by the Upgrade Watcher?

Perhaps we want to test all of these different scenarios? In that case, I think we should replace this one test in the list with multiple ones, one for each specific scenario.

jlind23 · 2023-10-02T18:32:22Z

For example, let's say the user requests an upgrade, and while the Agent is downloading the new artifact, we kill the Agent process. This would've been the old (pre-upgrade) version of the Agent that we killed. In this case, do we want the test to verify that the Agent is restarted reliably with the old version or (eventually) the new version?

I'm not the one who wrote it, but I believe that in such case we want the Agent to restart reliably with the old version as we don't know if what state the new one ended up.

Perhaps we want to test all of these different scenarios? In that case, I think we should replace this one test in the list with multiple ones, one for each specific scenario.

I lean towards that option, we should probably list all specific scenarii and prioritise the work.

cmacknz · 2023-10-03T14:41:32Z

Ideally we test all three scenarios to know the result, but the most likely reason for the upgrade process to be interrupted is the host system shutting down or hibernating mid-upgrade. This means we are unlikely to see the watcher and new version of the agent be interrupted individually, it is more likely they will both be terminated together.

The overarching goal with the upgrade process is eventual consistency. If the user instructed the agent to upgrade it should perform the upgrade, eventually, regardless of any disruption to any process or system involved in the upgrade.

Note that "perform the upgrade" means progressing through every stage of the upgrade process, and it is valid for the end state to be a successful upgrade to a new version or roll back to the previous version if the new version does not work as expected.

This starts to overlap with https://github.com/elastic/ingest-dev/issues/1889 (fault tolerant action processing scenarios), I suspect to get this to work properly we will possibly need to make changes to Fleet as well as the agent since it will be directly tied to how and when we acknowledge actions and where they are persisted.

I would prefer we prioritize making the upgrade process observable before we do this type of testing, both because doing so will make these problems easier to debug and because making this work correctly under all conditions may take significant time.

jlind23 · 2024-05-15T09:38:34Z

@ycombinator is this issue still required even with all the work we achieved on upgrades lately?

blakerouse · 2024-05-15T18:24:43Z

@jlind23 @ycombinator The integration testing framework covers all of this, even the checkboxes that where not checked. I would say this one is done.

ycombinator · 2024-05-15T18:40:39Z

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.
Tests that the upgrade process can detect corrupted or unsigned upgrade packages.
Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

Could you please point me to the tests that cover each of these scenarios?

blakerouse · 2024-05-15T19:17:47Z

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Tests that the upgrade process can detect corrupted or unsigned upgrade packages.

This one I do believe is missing. Was thinking we had it, but was unable to find it.

Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L42

Could you please point me to the tests that cover each of these scenarios?

See above.

ycombinator · 2024-05-20T16:26:38Z

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Actually, this test was written to cover the following scenario:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version fails to start: #3085

We need to write another test that interrupts the upgrade process itself, e.g. restarting the Agent while downloading or right before switching the symlink and ensuring that the pre-upgrade Agent version is the one that comes up. I'll clarify in the task list item to remove the ambiguity.

ycombinator · 2024-05-20T16:29:03Z

The integration testing framework covers all of this, even the checkboxes that where not checked

Thanks @blakerouse. I found the test that covers air gapped upgrades and have checked off that box. But I'm not sure we have tests that cover the remaining unchecked items:

Tests that the agent reliably restarts with the current version when an upgrade to a new version is interrupted unexpectedly.

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L157

Tests that the upgrade process can detect corrupted or unsigned upgrade packages.

This one I do believe is missing. Was thinking we had it, but was unable to find it.

Tests that the upgrade process either detects or automatically recovers from corrupted agent installations (missing directories, symlinks, files, etc.).

https://github.com/elastic/elastic-agent/blob/main/testing/integration/upgrade_rollback_test.go#L42

Could you please point me to the tests that cover each of these scenarios?

See above.

Actually, this test was written to cover the following scenario:

Tests that the upgrade process can successfully roll back to the previous version when the newly installed agent version reports as unhealthy: #3274

We need to write another test that deliberately builds a corrupted Agent package, tries to upgrade to it, and ensures that the upgrade is rolled back to the pre-upgrade Agent version.

pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jan 25, 2023

pierrehilbert assigned blakerouse Jan 25, 2023

This was referenced Apr 19, 2023

Log watcher properly #2518

Merged

Add integration and unit test coverage for the Elastic agent upgrade process. #1359

Closed

jlind23 changed the title ~~Create an integration test to ensure that Elastic Agent's upgrade are working as expected~~ Create an e2e test to ensure that Elastic Agent's upgrade are working as expected May 9, 2023

jlind23 linked a pull request May 9, 2023 that will close this issue

Port over Agent Upgrade test #2618

Merged

7 tasks

ycombinator closed this as completed in #2618 May 12, 2023

faec mentioned this issue May 12, 2023

[E2E][Linux][Windows][Mac] Upgrade managed Elastic Agent 8.X and 7.X #2429

Open

3 tasks

ycombinator reopened this May 17, 2023

ycombinator mentioned this issue May 17, 2023

[META] Testing scenarios for Elastic Agent E2E testing framework #2428

Open

1 task

jlind23 unassigned blakerouse May 17, 2023

pierrehilbert mentioned this issue May 24, 2023

Add an integration test for the upgrade rollback proces #2705

Closed

pierrehilbert assigned ycombinator Jun 4, 2023

pierrehilbert mentioned this issue Jun 7, 2023

Integration tests: test upgrades including grace period #2796

Closed

pierrehilbert changed the title ~~Create an e2e test to ensure that Elastic Agent's upgrade are working as expected~~ Create tests to ensure that Elastic Agent's upgrade are working as expected Jul 11, 2023

ycombinator mentioned this issue Jul 14, 2023

[Integration Test] New Agent after upgrade fails to start #3085

Merged

7 tasks

cmacknz mentioned this issue Jul 27, 2023

Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd #3124

Closed

ycombinator mentioned this issue Aug 23, 2023

Adding upgrade test for unhealthy upgraded Agent #3274

Merged

7 tasks

pierrehilbert linked a pull request Aug 27, 2023 that will close this issue

Adding upgrade test for unhealthy upgraded Agent #3274

Merged

7 tasks

ycombinator closed this as completed in #3274 Sep 8, 2023

ycombinator reopened this Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

pierrehilbert commented Jan 25, 2023 •

edited by ycombinator

blakerouse commented Jan 25, 2023

ycombinator commented May 17, 2023

ycombinator commented Aug 8, 2023

ycombinator commented Sep 29, 2023

jlind23 commented Oct 2, 2023

cmacknz commented Oct 3, 2023 •

edited

jlind23 commented May 15, 2024

blakerouse commented May 15, 2024

ycombinator commented May 15, 2024

blakerouse commented May 15, 2024

ycombinator commented May 20, 2024

ycombinator commented May 20, 2024

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Comments

pierrehilbert commented Jan 25, 2023 • edited by ycombinator

blakerouse commented Jan 25, 2023

ycombinator commented May 17, 2023

ycombinator commented Aug 8, 2023

ycombinator commented Sep 29, 2023

jlind23 commented Oct 2, 2023

cmacknz commented Oct 3, 2023 • edited

jlind23 commented May 15, 2024

blakerouse commented May 15, 2024

ycombinator commented May 15, 2024

blakerouse commented May 15, 2024

ycombinator commented May 20, 2024

ycombinator commented May 20, 2024

pierrehilbert commented Jan 25, 2023 •

edited by ycombinator

cmacknz commented Oct 3, 2023 •

edited