systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@pierrehilbert pulling this into the current sprint since this blocks #3123
The crux of the matter with this bug is that, today, the upgrade watcher process is started by the main Agent process and lives inside the same service, so its lifetime is tied to the main process. Now, the main process does start up the upgrade watcher process if we're in the midst of an upgrade, which is determined by the existence of the upgrade marker file. But in situations where the main process might crash even before it gets around to checking the existence of the upgrade marker file, and therefore (re)starting the upgrade watcher process, the upgrade watcher process will never get (re)started.

The more I think through this bug and potential ways to ensure that the upgrade watcher process is always running in the midst of an upgrade, the more I'm starting to think that we might need to invert the process hierarchy and shift some of the upgrade responsibilities from the main process to the upgrade watcher process.

We might want the elastic agent service to start the upgrade watcher process instead of the main agent process. In the context of the upgrade process, the main process would still be in charge of talking to Fleet and being the entrypoint for the CLI. However, when the main process receives a request to upgrade (either from Fleet or the CLI), it would just write out the upgrade marker file with enough details to facilitate an upgrade and potentially a downgrade back to itself. It would not download the new artifact, switch symlinks, re-exec, or start the upgrade watcher, as it does today. Those steps would become the responsibility of the upgrade watcher instead.

Also, the main process would no longer be responsible for looking for the upgrade marker file and (re)starting the upgrade watcher process (as mentioned in the second paragraph above). It would become the service's responsibility to ensure that the upgrade watcher process is always running.

The upgrade watcher process would be watching (inotify, etc.) for the presence of the upgrade marker file. When its presence is detected, it would download the artifact, switch the symlink, and re-exec the main process. It would likely also update a state field in the upgrade marker file for bookkeeping/debugging purposes, so that the running Agent has the opportunity to convey this information to Fleet or report it via its status output.

Since the upgrade watcher process will now be the one that's controlled by the service and is expected to be kept running by the service, the upgrade watcher process will need to ensure that the main Agent process is always running. Another way to think of the upgrade watcher in this proposal is that it's a supervisor that ensures the correct version of Agent is always running and does the necessary work to achieve that desired state of the world.

The benefits I see of this approach vs. today's implementation are:
Thoughts @cmacknz @elastic/elastic-agent?
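To make the marker-driven flow above a bit more concrete, here is a minimal sketch of what the watcher's wait loop could look like using fsnotify. The data directory path, marker file name, and follow-up actions are placeholders for illustration, not the Agent's actual layout or implementation.

```go
// Minimal sketch: a watcher loop that waits for an upgrade marker to appear.
// Paths and the marker name are illustrative, not the Agent's real layout.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const dataDir = "/opt/Elastic/Agent/data" // assumed install layout
const markerName = ".upgrade-marker"      // hypothetical marker file name

func main() {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Watch the directory; fsnotify reports events for files created inside it.
	if err := w.Add(dataDir); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Op&fsnotify.Create != 0 && filepath.Base(ev.Name) == markerName {
				log.Printf("upgrade marker detected: %s", ev.Name)
				// Here the watcher would download the artifact, switch the
				// symlink, re-exec the main Agent, and update the marker state.
			}
		case err := <-w.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```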
Using a separate supervisor process makes sense to me, and I don't think this is an uncommon approach. I believe @pchila mentioned he had seen this architecture before at one point. In general I like the idea of consolidating the upgrade logic into one process.

In a previous life we called the parallel process the watchdog. It wasn't actually responsible for doing the upgrade, but it served a similar function of monitoring the main agent process and checked in with a special endpoint at a low frequency to allow us to recover from total failure of the main process. It wasn't the parent process; it was actually installed as a totally separate service on the host, so it was fully independent.

The one thing you didn't touch on is how we would safely update the watcher process, and how we do that without creating the same problem we have with the agent upgrade. How should we handle this in a fail-safe way?
I like this approach of a separate service for the watchdog process. At the moment I can't think of concrete arguments in favor of either approach — parent-child or parallel processes — but just instinctively a completely independent, parallel process seems more robust for some reason. I'll think more about the two approaches and I'm hoping others will have opinions as well to help pick a direction.
Yes, indeed — thanks for catching this rather large hole in my proposal 😂. My thinking is that the watcher process's primary responsibility is to ensure that the correct version of Agent is running at any given time. In concrete terms, that means that for most of its life it does nothing, just waiting for the upgrade marker to show up. When the upgrade marker shows up, the watcher performs the following steps. Steps 4-7 cover how/when the watcher process would upgrade itself.
Note that in the Agent rollback path, the watcher does not roll back itself as well. I think it's to our benefit to always have the latest watcher running, and by keeping the responsibilities of the watcher limited (relative to the main Agent), the odds of the watcher itself experiencing problems should be smaller 🤞.

We could deliberately introduce a small delay after every step that updates the state in the upgrade marker. Such a delay would allow the main Agent process to read the updated state in the upgrade marker and communicate it to Fleet (in managed mode) or report it via its status output.
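As a rough illustration of the "update state, then pause" idea, here is a sketch under the assumption of a simple JSON marker with a single state field; the real upgrade marker's format and fields differ, and the delay value is arbitrary.

```go
// Illustrative only: how the watcher might record progress in the upgrade
// marker and pause briefly so the running Agent can pick the change up and
// report it. The real marker format and field names will differ.
package watcher

import (
	"encoding/json"
	"os"
	"time"
)

type markerState struct {
	State     string    `json:"state"` // e.g. "downloading", "re-execing", "rolled-back"
	UpdatedAt time.Time `json:"updated_at"`
}

func setMarkerState(path, state string) error {
	b, err := json.Marshal(markerState{State: state, UpdatedAt: time.Now().UTC()})
	if err != nil {
		return err
	}
	if err := os.WriteFile(path, b, 0o600); err != nil {
		return err
	}
	// Deliberate small delay so the main Agent has a chance to read the new
	// state and forward it to Fleet / expose it via its status output.
	time.Sleep(2 * time.Second)
	return nil
}
```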
How would the Elastic Agent at this step know if the Watcher re-execing itself worked correctly? Currently, with the watcher being a subprocess, the Elastic Agent can know whether it is able to run successfully.

Another thought I have: should we just consolidate all of this into a single process that manages the whole upgrade, instead of splitting the knowledge between the watcher and the Elastic Agent? Maybe we should just make the watcher actually perform the entire upgrade process.
It wouldn't, you're right. But the thinking is that the Watcher is relatively simpler (does fewer things and changes less frequently) than the Agent so it should be more stable. We could consider some kind of periodic check-in or heartbeat from the Watcher to the Agent too.
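For what a periodic check-in could look like, here is a hypothetical sketch; the endpoint, port, and interval are made up, and the real mechanism (if we add one) could just as well be a unix socket, gRPC, or a file touch.

```go
// Hypothetical heartbeat from the Watcher to the Agent. Everything here
// (URL, port, interval) is illustrative, not an existing Agent API.
package watcher

import (
	"net/http"
	"time"
)

func heartbeat(stop <-chan struct{}) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			// A failed request just means the Agent is not reachable right
			// now, which is itself a signal the Watcher could act on.
			if resp, err := http.Get("http://localhost:6791/watcher/heartbeat"); err == nil {
				resp.Body.Close()
			}
		case <-stop:
			return
		}
	}
}
```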
Yes, I mentioned this in my original proposal (#3123 (comment)) as well:
@ycombinator Sorry I missed some of that detail in my previous reading. I really like that separation! If we want to make the watcher simpler we could even make it its own binary instead of a subcommand of the Elastic Agent. That might make it even simpler.
I have been thinking about this a bit. To solve the problem in this issue, the watcher cannot be a subprocess. It must exist outside of the agent's process tree so that its lifetime is decoupled from that of the agent. The best way to do this seems to be to install the watcher as a separate service alongside the agent itself. At this point it can be its own binary, as suggested above.

The new watcher service should be installed and monitored by the agent service runtime as an implicit component, much like the monitoring components. This will give us a way to observe it and ensure it is running. This will also create mutual supervision between the agent and the watcher, to solve the "who watches the watcher" problem.

We will need to be very clear about the separation of responsibilities between the two processes. The watcher service should not duplicate any functionality currently provided by OS-level service managers like systemd, the Windows service manager, etc. Restarting the agent service when it exits is the OS service manager's job, for example.
I like the idea of consolidating most of the upgrade process into the watcher; however, I think this process fundamentally needs to be cooperative between the two services to guarantee fault tolerance. We must be able to safely upgrade both the watcher and the agent and recover from the situation in this issue where a new release of either is completely broken and fails to start at all. It may be simpler to keep the division of responsibility the same as it is today. I believe we can achieve this as follows:
The initial condition that ensures we can always get back to a good state is that a running version of the agent that is capable of upgrading is currently installed. Once we have that, it should always be possible to recover.
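Purely as an illustration of the "separate service" shape being discussed, a standalone watcher unit might look something like the following; the unit name, binary path, and directives are hypothetical, not what the Agent installer actually writes.

```ini
# /etc/systemd/system/elastic-agent-watcher.service (hypothetical example)
[Unit]
Description=Elastic Agent Upgrade Watcher
# Ordering only; the watcher must not *require* a healthy agent,
# otherwise it could not supervise a broken upgrade.
After=elastic-agent.service

[Service]
ExecStart=/usr/share/elastic-agent/bin/elastic-agent-watcher
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```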
Just spoke with @leehinman and he made a good point that having the upgrade watcher act purely based on the presence of a file in the file system isn't particularly secure. This would become especially important in a world where the agent may not run as root.

We could instead consider writing out the signed upgrade action received from Fleet and using that as the signal for the watcher to begin actively monitoring the agent and considering whether to roll it back. We already have the signing infrastructure in place for this.

We would still need a way for the watcher to communicate to the agent whether it was rolled back, but it is likely acceptable to do this in an unsigned file. This way an attacker could only cause the agent to report an upgrade that did not happen, rather than having a way to instruct the watcher to directly take action on the running agent. This would add complexity, so it could be something we optimize for after the initial implementation.
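For illustration, verification on the watcher side could look roughly like the sketch below. This is not the Agent's actual signing scheme; the key format, algorithm, and detached-signature layout are assumptions chosen to keep the example self-contained.

```go
// Sketch of verifying a signed upgrade action before the watcher acts on it.
// Assumptions: PKIX-encoded ECDSA public key, SHA-256 digest, detached ASN.1
// signature stored next to the action file.
package watcher

import (
	"crypto/ecdsa"
	"crypto/sha256"
	"crypto/x509"
	"errors"
	"os"
)

func verifyAction(actionPath, sigPath string, pubDER []byte) error {
	pub, err := x509.ParsePKIXPublicKey(pubDER)
	if err != nil {
		return err
	}
	ecdsaPub, ok := pub.(*ecdsa.PublicKey)
	if !ok {
		return errors.New("not an ECDSA public key")
	}

	payload, err := os.ReadFile(actionPath)
	if err != nil {
		return err
	}
	sig, err := os.ReadFile(sigPath)
	if err != nil {
		return err
	}

	digest := sha256.Sum256(payload)
	if !ecdsa.VerifyASN1(ecdsaPub, digest[:], sig) {
		return errors.New("upgrade action signature verification failed")
	}
	return nil
}
```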
@blakerouse Wanted to follow up on a couple of the suggestions you made during today's Agent core meeting. First, I confirmed that it's
So, my conclusion is that it's
As a next step, I'm going to dig into the "State 'stop-sigterm' timed out." message from the systemd logs.
I believe the I will experiment with setting The other
So I'm not going to go this route, at least for now.
Just for experimentation purposes, I tried a value of 100s and, sure enough, the Upgrade Watcher process stayed alive for 100s after the Agent process died. Here it is at the 95s (=1m 35s) mark:
Next, I tested setting the
Nevertheless, I think we have a reasonable fix for this bug here. The fix is to simply set
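For anyone reproducing these experiments on their own install: a candidate systemd setting can be tried without rebuilding the package by dropping in an override unit. The directive and value below only illustrate the mechanism (TimeoutStopSec is the setting that governs how long systemd waits in the stop-sigterm state before escalating to SIGKILL); they are not necessarily the exact change that ends up in the fix.

```ini
# /etc/systemd/system/elastic-agent.service.d/override.conf (experiment only)
[Service]
# Give remaining processes (e.g. the Upgrade Watcher) longer to finish before
# systemd escalates to SIGKILL after the main process dies or is stopped.
TimeoutStopSec=600
```

After adding the file, run `sudo systemctl daemon-reload` so systemd picks up the override.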
I'm trying to work on a PR for this simple fix. I can't figure out which file(s) to add the setting to.

[UPDATE] Never mind, I figured it out. Looks like the relevant place is elastic-agent/internal/pkg/agent/install/svc.go, lines 47 to 48 (at decadc9). I just got confused by the other two unit template files.
Quick question: does increasing
I think we are taking advantage of the second purpose, but is there any case where the change in the first would be unexpected for our users?
We are unaffected by the first purpose because we're not specifying an
However, @blakerouse has brought up similar questions about unintended side effects in #3220 (review). I want to run some more tests for the scenarios he outlined, as well as try the alternative idea of specifying
It looks like the elastic-agent.unit.tmpl is consumed by the RPM/DEB build, so we probably have to update it as well; it just doesn't affect the use case you are testing. It is a bit annoying that this is defined twice. I'm also not sure if the systemd.unit.tmpl file is actually used or if it is just left over from when we forked out of the Beats repo. (See elastic-agent/dev-tools/packaging/packages.yml, lines 55 to 57, at ebf2fe7.)
This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).
For confirmed bugs, please report:
Version: main / 8.10.0-SNAPSHOT
To reproduce this bug, we need to upgrade to an Agent whose binary fails to start. The integration test in this PR builds such a fake, failing Agent binary and packages it up.
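(For context, the fake binary only needs to exit with a non-zero status; a sketch along these lines would do, though the actual test fixture in the PR may differ.)

```go
// A trivially failing "Agent" binary, similar in spirit to what the
// integration test's fake package contains.
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Fprintln(os.Stderr, "simulated startup failure")
	os.Exit(1) // exit non-zero so systemd sees the service as failed
}
```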
Try to upgrade to the Agent on Linux using:
elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz
While the upgrade is in progress, monitor the status of the Elastic Agent service:
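The exact status command isn't reproduced above; on a systemd host, something like the following works (assuming the default elastic-agent service name):

```sh
watch -n 2 systemctl status elastic-agent
```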
For about a minute and a half, the above command will show that the fake Agent process is failing with a non-zero status code. It will also show the Upgrade Watcher process running.
Then, after about a minute and a half, systemd will attempt to restart the fake Agent process. You can tell this by seeing the Main PID change in the status command's output. However, as a result of this attempted restart, the Upgrade Watcher process also gets killed; you can see it disappearing from the output.
As a consequence of this premature killing of the Upgrade Watcher process, the upgraded Elastic Agent will keep failing to start, but there will be no Upgrade Watcher around to monitor these failures and perform a rollback to the Agent that was running prior to the upgrade.