Known Issue: Failed upgrades leave Elastic Agent stuck until restart #2978

@pkoutsovasilis

Description:

A regression introduced in #9634 can cause Elastic Agent upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator’s overrideState remains set, leaving the agent in a state that appears to be upgrading.

Affected Versions:

  • Versions 8.18.7, 9.0.7
  • Fixed in #9992, which will be included in 9.1.4 (raised blocker), 8.19.4 (raised blocker), 9.0.8 (next patch release), and 8.18.8 (next patch release)

Conditions:

This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:

  • The agent is not upgradeable
  • Capabilities check denies the upgrade
  • Most commonly: Elastic Agent is tamper-protected and Endpoint returns an error during action proxying, for example because the upgrade action signature is missing, invalid, or fails verification. The coordinator’s override state then stays set.
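
To make the failure mode concrete, here is a minimal Python model of the behaviour described above. It is not the Elastic Agent source (which is Go): the class, field, and state names are hypothetical, and the ordering (override state set before the early checks run) is an assumption inferred from the description.

```python
from dataclasses import dataclass
from typing import Optional


class UpgradeError(Exception):
    """Raised when an early upgrade check fails."""


@dataclass
class BuggyCoordinator:
    # Hypothetical model of the regression; all names are illustrative.
    upgradeable: bool = True
    capabilities_allow: bool = True
    endpoint_accepts_action: bool = True
    override_state: Optional[str] = None

    def upgrade(self, version: str) -> None:
        # Assumption: the agent is marked as upgrading before any check runs.
        self.override_state = "UPGRADING"

        # Early checks: each failure exits without resetting override_state.
        if not self.upgradeable:
            raise UpgradeError("agent is not upgradeable")
        if not self.capabilities_allow:
            raise UpgradeError("capabilities check denied the upgrade")
        if not self.endpoint_accepts_action:
            raise UpgradeError("Endpoint rejected the proxied upgrade action")

        # Later stages (download, verification, re-exec) are omitted here;
        # this simplified model just clears the state once the checks pass.
        self.override_state = None


if __name__ == "__main__":
    coord = BuggyCoordinator(endpoint_accepts_action=False)
    try:
        coord.upgrade("9.0.8")
    except UpgradeError as err:
        print(f"upgrade failed: {err}")
    print(f"override_state after failure: {coord.override_state}")
```

In this model the upgrade call fails, yet the override state still reports an upgrade in progress, which matches the symptoms listed below.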

Symptoms:

  • Upgrade remains stuck (Fleet shows the upgrade action in progress)
  • No further upgrade attempts succeed
  • elastic-agent status shows an override state indicating upgrade

Workaround:

Restarting the Elastic Agent clears the coordinator’s overrideState and allows new upgrade attempts to proceed.
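
For anyone scripting the workaround across hosts, a rough Python sketch follows. It assumes the elastic-agent binary is on PATH and that the script runs with sufficient privileges; the helper name and the injectable run parameter are hypothetical conveniences, not part of any Elastic tooling. It only calls the existing `elastic-agent restart` and `elastic-agent status` commands.

```python
import subprocess
from typing import Callable


def restart_agent(run: Callable[..., object] = subprocess.run,
                  binary: str = "elastic-agent") -> None:
    """Restart the agent daemon, then print its status (hypothetical helper)."""
    # Restarting clears the coordinator's in-memory override state.
    run([binary, "restart"], check=True)
    # Status may report an error while the daemon is still coming back up,
    # so a non-zero exit code is tolerated here.
    run([binary, "status"], check=False)


# Usage (needs a host with Elastic Agent installed):
#     restart_agent()
```

Once the agent is back up, the upgrade can be retried from Fleet.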

Resolution:

This bug is fixed by #9992, which ensures that the coordinator clears its override state whenever an early failure occurs.
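
The idea behind the fix can be sketched in the same spirit. This is a simplified Python model, not the code from #9992; the class, the early_checks parameter, and the state name are hypothetical.

```python
from typing import Callable, Iterable, Optional


class UpgradeError(Exception):
    """Raised when an early upgrade check fails."""


class Coordinator:
    """Hypothetical model of the fixed behaviour; not the Elastic Agent source."""

    def __init__(self) -> None:
        self.override_state: Optional[str] = None

    def upgrade(self, version: str,
                early_checks: Iterable[Callable[[], None]]) -> None:
        self.override_state = "UPGRADING"
        try:
            for check in early_checks:
                check()  # each check raises UpgradeError on failure
        except UpgradeError:
            # The fix: reset the override state on every early failure.
            self.override_state = None
            raise
        # Later upgrade stages (download, verification, re-exec) are omitted;
        # the override state stays set while they would run.


def reject_signature() -> None:
    # Stand-in for Endpoint rejecting the proxied upgrade action.
    raise UpgradeError("upgrade action signature failed verification")


if __name__ == "__main__":
    coord = Coordinator()
    try:
        coord.upgrade("9.0.8", [reject_signature])
    except UpgradeError as err:
        print(f"upgrade failed: {err}")
    print(f"override_state after failure: {coord.override_state}")
```

Handling the reset in one place, rather than before each individual early return, means any check added later is covered too; whether the actual fix is structured this way is not stated here.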

Action:

  • Please add a Known Issue entry to the release notes for affected versions.
  • Include the workaround (manual restart of Elastic Agent).

cc @lucabelluccini for visibility so that Support is aware
cc @ebeahan @cmacknz
