[Feature Request] Enable a 'repair agent' button in fleet #697

Open
aarju opened this issue Jul 8, 2022 · 3 comments
Labels
Team:Fleet Label for the Fleet team

Comments

@aarju

aarju commented Jul 8, 2022

Describe the enhancement:
Fleet would offer an additional drop-down option, 'repair agent', that tasks unhealthy remote agents to attempt a repair. The user would first get a confirmation prompt warning that this could remove unsent logs from the agent. When 'repair agent' is used, the agent would run multiple actions to attempt to repair or reset itself to a working state. These actions could include deleting all downloads and any cached files that may be corrupted. It would also reset the agent state within Fleet.

Describe a specific use case for the enhancement or feature:

  • An agent is in an unhealthy state following an attempted upgrade where the system powered off during the upgrade leaving partial downloads on the system. Using the 'repair agent' button would clear all downloads and restore the agent to a healthy state.
  • An agent is upgraded but the final 'ack' for the upgrade doesn't make it to fleet. This leaves the agent in a healthy state but the status with fleet isn't correctly synchronized. The repair agent button will re-synchronize the agent and fleet so the current state is accurate.
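
As a rough illustration of the "delete all downloads and cached files" action described above, here is a minimal Python sketch. The directory names `downloads` and `cache` and the function `repair_agent_dirs` are hypothetical and chosen only for this example; they are not the actual Elastic Agent layout or API.

```python
import shutil
from pathlib import Path
from typing import Iterable, List

# Hypothetical subdirectory names, assumed for illustration only.
REPAIR_TARGETS = ("downloads", "cache")


def repair_agent_dirs(agent_home: Path, targets: Iterable[str] = REPAIR_TARGETS) -> List[str]:
    """Empty each target directory under agent_home and return the names that were cleared."""
    cleared = []
    for name in targets:
        target = agent_home / name
        if not target.is_dir():
            # Nothing to clean if the directory does not exist.
            continue
        for entry in target.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # remove a nested directory tree
            else:
                entry.unlink()  # remove a regular file or a symlink
        cleared.append(name)
    return cleared
```

The sketch keeps the directories themselves and removes only their contents, so a later download can reuse the same paths. It does not cover resetting the agent state within Fleet.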
pierrehilbert added the Team:Fleet label May 12, 2023
@amitkanfer
Contributor

@cmacknz in light of the recent redesign - does this ER make sense?

@cmacknz
Member

cmacknz commented Jun 1, 2023

> An agent is in an unhealthy state following an attempted upgrade where the system powered off during the upgrade leaving partial downloads on the system

We don't download inputs or integrations anymore, only the next agent version during upgrades. That process should be well behaved during poorly timed reboots, but there have been bugs that left the agent in a partially working state because of missing symlinks or Unix socket paths. I don't have a good example off the top of my head to link to.

The repair here would just be triggering the post-install process again to fix up any paths or broken links. This is probably still useful to handle "unknown unknown" bugs, but I would prefer to get the upgrade to a place where we don't need this.
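
For illustration only, a repair step of this kind could start by detecting dangling symlinks under the install directory. The Python sketch below assumes a hypothetical helper name, `find_broken_symlinks`; it is not part of the agent, and it only reports broken links rather than re-running any post-install logic.

```python
import os
from pathlib import Path
from typing import List


def find_broken_symlinks(root: Path) -> List[Path]:
    """Return every symlink under root whose target does not exist, sorted by path."""
    broken = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling symlink.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)
```

A repair routine could then recreate or remove the reported links. How the targets are restored is left open here because it depends on the install layout.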

> An agent is upgraded but the final 'ack' for the upgrade doesn't make it to fleet. This leaves the agent in a healthy state but the status with fleet isn't correctly synchronized. The repair agent button will re-synchronize the agent and fleet so the current state is accurate.

I think this is covered by elastic/kibana#135539.

@cmacknz
Member

cmacknz commented Jun 5, 2023

We should consider whether having the currently installed agent version upgrade to itself would repair most issues. That would simplify the repair process into a variation of the upgrade process. #2780
