From 6c84db732b2822d4f90ab0c7dd7ad8a4742b3a27 Mon Sep 17 00:00:00 2001
From: Karen Metts
Date: Wed, 17 Sep 2025 20:13:18 -0400
Subject: [PATCH 1/3] Doc: Add KI 8.18.7: Agent stuck on failed upgrade

---
 .../release-notes/release-notes-8.18.asciidoc | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
index 299257b8c..5f08fe3d6 100644
--- a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
+++ b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
@@ -33,6 +33,45 @@ Also see:
 [[release-notes-8.18.7]]
 == {fleet} and {agent} 8.18.7
 
+[discrete]
+[[known-issues-8.18.7]]
+=== Known issues
+
+[[known-issue-2978-8-18-7]]
+.Failed upgrades leave {agent} stuck until restart
+[%collapsible]
+====
+
+This known issue applies to {agent} 8.18.7 and 9.0.7. {agent} versions 8.19.x and 9.1.x are not affected.
+
+On September 17, 2025, a known issue was discovered that can cause {agent} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator's overrideState remains set, leaving the agent in a state that appears to be upgrading.
+
+**Conditions**
+
+This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:
+
+- The agent is not upgradeable
+- Capabilities check denies the upgrade
+- Most commonly: When {agent} is tamper-protected and Endpoint returns an error during action proxying, for example, because the upgrade action signature is invalid, missing, or fails verification. This causes the coordinator's override state to be stuck.
+
+**Symptoms**
+
+- {fleet} shows the upgrade action in progress, even though the upgrade remains stuck
+- No further upgrade attempts succeed
+- Elastic-agent status shows an override state indicating upgrade
+
+**Workaround**
+
+Restart the {agent} to clear the coordinator's overrideState and allow new upgrade attempts to proceed.
+
+**Resolution**
+
+This issue was fixed in link:https://github.com/elastic/elastic-agent/pull/9992[#9992], which ensures that the coordinator clears its override state whenever an early failure occurs.
+
+This fix will be included in versions 9.1.4, 8.19.4, 9.0.8, and 8.18.8.
+
+====
+
 [discrete]
 [[features-enhancements-8.18.7]]
 === New features and enhancements

From 061785f0536a67367b0c47c4857090a0a4bb91c7 Mon Sep 17 00:00:00 2001
From: Karen Metts <35154725+karenzone@users.noreply.github.com>
Date: Fri, 19 Sep 2025 15:30:09 -0400
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: Craig MacKenzie
---
 .../release-notes/release-notes-8.18.asciidoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
index 5f08fe3d6..a5d302023 100644
--- a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
+++ b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
@@ -44,7 +44,7 @@ Also see:
 
 This known issue applies to {agent} 8.18.7 and 9.0.7. {agent} versions 8.19.x and 9.1.x are not affected.
 
-On September 17, 2025, a known issue was discovered that can cause {agent} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator's overrideState remains set, leaving the agent in a state that appears to be upgrading.
+On September 17, 2025, a known issue was discovered that can cause {agent} upgrades to get stuck if an upgrade attempt fails under specific conditions. This happens because the coordinator's overrideState remains set, leaving the agent in a state that appears to be upgrading.
 
 **Conditions**
 
@@ -52,7 +52,7 @@ This issue is triggered if the upgrade fails during one of the early checks insi
 
 - The agent is not upgradeable
 - Capabilities check denies the upgrade
-- Most commonly: When {agent} is tamper-protected and Endpoint returns an error during action proxying, for example, because the upgrade action signature is invalid, missing, or fails verification. This causes the coordinator's override state to be stuck.
+- Most commonly: When {agent} is tamper-protected and Endpoint fails to validate that the upgrade action was correctly signed by Kibana to allow the upgrade, for example, because the signature is missing, invalid, or the connection between {agent} and endpoint was interrupted. This causes the agent coordinator's override state to become stuck until the agent is restarted.
 
 **Symptoms**
 

From 2bec103751c4f45eb8668e980d088e4748b2ce73 Mon Sep 17 00:00:00 2001
From: Karen Metts <35154725+karenzone@users.noreply.github.com>
Date: Fri, 19 Sep 2025 16:14:09 -0400
Subject: [PATCH 3/3] Porting over review changes from 9.0.7 known issue PR

---
 .../release-notes/release-notes-8.18.asciidoc | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
index a5d302023..ae331c74b 100644
--- a/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
+++ b/docs/en/ingest-management/release-notes/release-notes-8.18.asciidoc
@@ -44,31 +44,31 @@ Also see:
 
 This known issue applies to {agent} 8.18.7 and 9.0.7. {agent} versions 8.19.x and 9.1.x are not affected.
 
-On September 17, 2025, a known issue was discovered that can cause {agent} upgrades to get stuck if an upgrade attempt fails under specific conditions. This happens because the coordinator's overrideState remains set, leaving the agent in a state that appears to be upgrading.
+On September 17, 2025, a known issue was discovered that can cause {agent} upgrades to get stuck if an upgrade attempt fails under specific conditions. This happens because the coordinator's `overrideState` remains set, leaving the agent in a state that appears to be upgrading.
 
 **Conditions**
 
-This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:
+This issue is triggered if the upgrade fails during one of the early checks inside `Coordinator.Upgrade`, for example:
 
 - The agent is not upgradeable
 - Capabilities check denies the upgrade
-- Most commonly: When {agent} is tamper-protected and Endpoint fails to validate that the upgrade action was correctly signed by Kibana to allow the upgrade, for example, because the signature is missing, invalid, or the connection between {agent} and endpoint was interrupted. This causes the agent coordinator's override state to become stuck until the agent is restarted.
+- When {agent} is tamper-protected, Endpoint must validate that the upgrade action was correctly signed by Kibana to allow the upgrade. If the signature is missing, invalid, or the connection between {agent} and Endpoint was interrupted, the validation fails. This causes the agent coordinator's override state to become stuck until the agent is restarted.
 
 **Symptoms**
 
 - {fleet} shows the upgrade action in progress, even though the upgrade remains stuck
 - No further upgrade attempts succeed
-- Elastic-agent status shows an override state indicating upgrade
+- Elastic Agent status shows an override state indicating upgrade
 
 **Workaround**
 
-Restart the {agent} to clear the coordinator's overrideState and allow new upgrade attempts to proceed.
+Restart the {agent} to clear the coordinator's `overrideState` and allow new upgrade attempts to proceed.
 
 **Resolution**
 
 This issue was fixed in link:https://github.com/elastic/elastic-agent/pull/9992[#9992], which ensures that the coordinator clears its override state whenever an early failure occurs.
 
-This fix will be included in versions 9.1.4, 8.19.4, 9.0.8, and 8.18.8.
+The fix is included in versions 9.1.4 and 8.19.4, and planned for versions 9.0.8 and 8.18.8.
 
 ====
 
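
Reviewer note: the failure mode described in the known issue can be illustrated with a minimal Go sketch. This is not the elastic-agent source. The `coordinator` type, its fields, the `stateUpgrading` constant, and both `upgrade*` methods are hypothetical stand-ins, and only one early check (upgradeability) is modeled. The sketch assumes the override state is set before the early checks run, which is what the release note implies.

```go
package main

import (
	"errors"
	"fmt"
)

// stateUpgrading is a hypothetical marker for "an upgrade override is active".
const stateUpgrading = "UPGRADING"

var errNotUpgradeable = errors.New("agent is not upgradeable")

// coordinator is a simplified, hypothetical stand-in for the agent coordinator.
type coordinator struct {
	overrideState string // empty string means no override is set
	upgradeable   bool
}

// upgradeBuggy models the known issue: the override state is set first,
// and an early failure returns without clearing it.
func (c *coordinator) upgradeBuggy() error {
	c.overrideState = stateUpgrading
	if !c.upgradeable {
		return errNotUpgradeable // override state stays set
	}
	return nil // upgrade would continue from here
}

// upgradeFixed models the described resolution: any early failure
// clears the override state before returning.
func (c *coordinator) upgradeFixed() (err error) {
	c.overrideState = stateUpgrading
	defer func() {
		if err != nil {
			c.overrideState = ""
		}
	}()
	if !c.upgradeable {
		return errNotUpgradeable
	}
	return nil // upgrade would continue from here
}

func main() {
	buggy := &coordinator{upgradeable: false}
	fmt.Printf("buggy: err=%v overrideState=%q\n", buggy.upgradeBuggy(), buggy.overrideState)

	fixed := &coordinator{upgradeable: false}
	fmt.Printf("fixed: err=%v overrideState=%q\n", fixed.upgradeFixed(), fixed.overrideState)
}
```

In the buggy variant the coordinator keeps reporting an upgrade override after the error, which matches the symptom list; in the fixed variant the state is empty again, so a later attempt starts clean. Restarting the process has the same effect as the fix because the in-memory state is discarded.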