
systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123

Closed
Tracked by #2176
ycombinator opened this issue Jul 25, 2023 · 17 comments · Fixed by #3220
Labels: bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@ycombinator
Contributor

This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).

For confirmed bugs, please report:

  • Version: main / 8.10.0-SNAPSHOT
  • Operating System: Linux
  • Steps to Reproduce:
  1. To reproduce this bug, we need to upgrade to an Agent whose binary fails to start. The integration test in this PR builds such a failing, fake Agent binary, and packages it up.

  2. Try to upgrade to the Agent on Linux using elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz.

  3. While the upgrade is in progress, monitor the status of the Elastic Agent service:

    $ watch -n1 systemctl status elastic-agent.service
    
  4. For about a minute and a half, the above command will show that the fake Agent process is failing with a non-zero status code. It will also show the Upgrade Watcher process running.

    Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:51:38 2023
    
    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
          Tasks: 6 (limit: 4637)
         Memory: 105.8M
            CPU: 3.386s
         CGroup: /system.slice/elastic-agent.service
                 └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent
    

    Then, after about a minute and a half, systemd will attempt to restart the fake Agent process. You can tell because the Main PID changes in the status command's output. However, as a result of this attempted restart, the Upgrade Watcher process also gets killed; you can see it disappear from the output.

    Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:52:32 2023
    
    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: activating (auto-restart) (Result: exit-code) since Fri 2023-07-14 18:52:20 UTC; 12s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
            CPU: 3.396s
    
    Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98043 (elastic-agent) with signal SIGKILL.
    Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98044 (elastic-agent) with signal SIGKILL.
    Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Killing process 98045 (elastic-agent) with signal SIGKILL.
    Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
    Jul 14 18:52:20 tdwijr systemd[1]: elastic-agent.service: Consumed 3.396s CPU time.
    
  5. As a consequence of this premature killing of the Upgrade Watcher process, the upgraded Elastic Agent will keep failing to start but there will be no Upgrade Watcher around to monitor these failures and perform a rollback to the Agent that was running prior to the upgrade.

@ycombinator added the bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) labels on Jul 25, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Jul 27, 2023

@pierrehilbert pulling this into the current sprint since this blocks #3123

@ycombinator
Contributor Author

ycombinator commented Aug 4, 2023

The crux of the matter with this bug is that, today, the upgrade watcher process (elastic-agent watch) is a child process of the "main" elastic agent process (elastic-agent run). As such, if the main process exits for some reason, it will also bring down its child process(es) with it. We could try to prevent the cascade by having the upgrade watcher process trap and ignore SIGTERM or SIGINT signals, allowing the upgrade watcher process to deliberately become an orphan and outlive the main agent process. But this feels hacky, and I'm not sure what shape such a solution would take on non-POSIX systems.
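To illustrate what that rejected workaround would look like on POSIX systems, here is a minimal sketch (hypothetical code, not the Agent's actual implementation); note that SIGKILL, which systemd ultimately sends in this bug, cannot be caught or ignored at all:

    // Hypothetical sketch of the "orphaned watcher" workaround: the watcher
    // ignores termination signals so a dying parent doesn't take it down.
    // SIGKILL cannot be ignored, and this doesn't map cleanly to Windows.
    package main

    import (
        "os/signal"
        "syscall"
    )

    func main() {
        // Ignore SIGTERM/SIGINT forwarded when the parent Agent process dies
        // or when systemd stops the unit.
        signal.Ignore(syscall.SIGTERM, syscall.SIGINT)

        runWatcher() // hypothetical stand-in for the `elastic-agent watch` loop
    }

    func runWatcher() { /* ... */ }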

Now, the main process does start up the upgrade watcher process if we're in the midst of an upgrade, which is determined by the existence of the upgrade marker file. But if the main process crashes before it even gets around to checking for the upgrade marker file, and therefore (re)starting the upgrade watcher process, the upgrade watcher process will never get (re)started.

The more I think through this bug and potential ways to ensure that the upgrade watcher process is always running in the midst of an upgrade, the more I'm starting to think that we might need to invert the process hierarchy and shift some of the upgrade responsibilities from the main process to the upgrade watcher process. We might want the elastic agent service to start the upgrade watcher process instead of the main agent process.

In the context of the upgrade process, the main process would still be in charge of talking to Fleet and being the entrypoint for the CLI. However, when the main process receives a request to upgrade (either from Fleet or the CLI), it would just write out the upgrade marker file with enough details necessary to facilitate an upgrade and potentially a downgrade back to itself. It would not download the new artifact, switch symlinks, re-exec, or start the upgrade watcher, as it does today. Those steps would become the responsibility of the upgrade watcher instead.

Also, the main process would no longer be responsible for looking for the upgrade marker file and (re)starting the upgrade watcher process (as mentioned in the second paragraph above). It would become the service's responsibility to ensure that the upgrade watcher process is always running.

The upgrade watcher process would be watching (inotify, etc.) for the presence of the upgrade marker file. When its presence is detected, it would download the artifact, switch the symlink, and re-exec the main process. Likely it would also update a state field in the upgrade marker file for bookkeeping/debugging purposes, but also so the running Agent has the opportunity to convey this information to Fleet or report it via elastic-agent status. The upgrade watcher would then start monitoring the health of the upgraded main process, as it does today. Should these health checks fail, it would initiate a rollback, as it does today. When the agent is either successfully upgraded or a rollback is successfully executed, the upgrade watcher process would clean up any inactive Agent files. The upgrade watcher would write a new terminal state (upgraded successfully or rolled back successfully) to the upgrade marker file. Once the Agent has read this terminal state from the upgrade marker file, and conveyed it to Fleet (in managed mode), it would delete the file.

Since the upgrade watcher process will now be the one that's controlled by the service and is expected to be kept running by the service, the upgrade watcher process will need to ensure that the main Agent process is always running. Another way to think of the upgrade watcher in this proposal is that it's a supervisor that ensures the correct version of Agent is always running and does the necessary work to achieve that desired state of the world.
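To make this concrete, here is a rough sketch of what the watcher's main loop could look like under this proposal (a sketch only: the marker path, poll interval, and helper functions are assumptions, and a real implementation might use inotify rather than polling):

    // Sketch of the proposed watcher-as-supervisor loop: ensure the correct
    // Agent is running, and drive an upgrade whenever the marker appears.
    package main

    import (
        "errors"
        "io/fs"
        "log"
        "os"
        "time"
    )

    // Assumed location of the upgrade marker; the real path may differ.
    const markerPath = "/opt/Elastic/Agent/data/.update-marker"

    func main() {
        for {
            _, err := os.Stat(markerPath)
            switch {
            case err == nil:
                // Marker present: the main Agent has requested an upgrade.
                if err := performUpgrade(markerPath); err != nil {
                    log.Printf("upgrade failed, rolling back: %v", err)
                    rollback(markerPath)
                }
            case errors.Is(err, fs.ErrNotExist):
                // No upgrade in progress; just keep the current Agent running.
                ensureAgentRunning()
            default:
                log.Printf("checking upgrade marker: %v", err)
            }
            time.Sleep(10 * time.Second)
        }
    }

    // Hypothetical helpers standing in for the steps described above.
    func performUpgrade(marker string) error { return nil }
    func rollback(marker string)             {}
    func ensureAgentRunning()                {}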

The benefits I see of this approach vs. today's implementation are:

  • Relatively speaking, the upgrade watcher will be less complex than the main Agent process (in terms of their respective responsibilities). Thus, the odds of the upgrade watcher crashing should be lower than those of the main Agent crashing. So it makes sense to have the upgrade watcher process be the parent of the main Agent process, rather than the other way around as we have it today (which is what's leading to the bug detailed in this issue).
  • All the responsibilities of upgrading/downgrading are consolidated into the upgrade watcher. Today upgrading is handled by the main Agent process while downgrading is handled by the upgrade watcher process. So there would be a clearer separation of responsibilities than we have today:
    • the Agent is responsible for supervising the components, exposing a CLI interface, and talking to Fleet (in managed mode). As far as upgrades go, its only job is to "declare", via the upgrade marker file, that it wants to be upgraded.
    • the Upgrade Watcher is responsible for supervising the main Agent and ensuring that the correct version is running at any given time.
    • whether an upgrade is in progress or not can be determined by the presence or absence, respectively, of the upgrade marker file, as is the case today. However, unlike today, the presence (creation) and absence (deletion) of the upgrade marker file would solely be the responsibility of the main Agent process.

Thoughts @cmacknz @elastic/elastic-agent?

@ycombinator changed the title from "Systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start" to "systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start" on Aug 4, 2023
@cmacknz
Member

cmacknz commented Aug 4, 2023

Using a separate supervisor process makes sense to me, and I don't think this is an uncommon approach. I believe @pchila mentioned he had seen this architecture before at one point. In general I like the idea of consolidating the upgrade logic into one process.

In a previous life we called the parallel process the watchdog. It wasn't actually responsible for doing the upgrade, but it served a similar function of monitoring the main agent process and checked in with a special endpoint at a low frequency to allow us to recover from total failure of the main process. It wasn't the parent process, it was actually installed as a totally separate service on the host so it was fully independent.

The one thing you didn't touch on is how we would safely update the watcher process, and how we do that without creating the same problem we have with the agent upgrade. How should we handle this in a fail safe way?

@ycombinator
Contributor Author

ycombinator commented Aug 4, 2023

In a previous life we called the parallel process the watchdog. It wasn't actually responsible for doing the upgrade, but it served a similar function of monitoring the main agent process and checked in with a special endpoint at a low frequency to allow us to recover from total failure of the main process. It wasn't the parent process, it was actually installed as a totally separate service on the host so it was fully independent.

I like this approach of a separate service for the watchdog process. At the moment I can't think of concrete arguments in favor of either approach — parent-child or parallel processes — but just instinctively a completely independent, parallel process seems more robust for some reason. I'll think more about the two approaches and I'm hoping others will have opinions as well to help pick a direction.

The one thing you didn't touch on is how we would safely update the watcher process, and how we do that without creating the same problem we have with the agent upgrade. How should we handle this in a fail safe way?

Yes, indeed — thanks for catching this rather large hole in my proposal 😂.

My thinking is that the watcher process's primary responsibility is to ensure that the correct version of Agent is running at any given time. In concrete terms, that means that for most of its life it does nothing, just waiting for the upgrade marker to show up. When the upgrade marker shows up, the watcher performs the following steps. Steps 4-7 cover how/when the watcher process would upgrade itself.

  1. Update state in the upgrade marker to indicate it's downloading the new Agent artifact.
  2. Download the new Agent artifact.
  3. Update state in the upgrade marker to indicate it's switching symlinks.
  4. Switch symlinks for the elastic-agent executable. Since the watcher process is simply elastic-agent watch, re-exec'ing itself after this step would have the effect of running the upgraded watcher.
  5. Update state in the upgrade marker to indicate it's re-exec'ing itself.
  6. Re-exec itself. The re-exec'd Watcher should be the updated version at this point and, seeing the upgrade marker file and the state in it, it should resume the upgrade process from where its older counterpart left off.
  7. Check that its own version matches the target version in the upgrade marker file. If so, its own re-exec was successful; otherwise something went wrong and it should probably retry re-exec'ing itself.
  8. Update state in the upgrade marker to indicate it's re-exec'ing the main Agent process.
  9. Re-exec the main Agent process (whether that's as a child or a parallel process).
  10. Update state in the upgrade marker to indicate it's monitoring the upgraded main Agent process.
  11. If the monitoring is successful, i.e. the upgraded Agent main process is healthy, update state in the upgrade marker to indicate a successful upgrade. Agent will read this state, communicate it to Fleet (in managed mode), and delete the upgrade marker file.
  12. If the monitoring fails, update state in the upgrade marker to indicate it's rolling back.
  13. Switch symlinks.
  14. Re-exec the main Agent process (whether that's as a child or a parallel process).
  15. Update state in the upgrade marker to indicate the rollback was successful. The rolled back Agent will read this state, communicate it to Fleet (in managed mode), and delete the upgrade marker file.

Note that in the Agent rollback path, the watcher does not roll back itself as well. I think it's to our benefit to always have the latest watcher running, and by keeping the responsibilities of the watcher limited (relative to the main Agent), the odds of the watcher itself experiencing problems should be smaller 🤞.

We could deliberately introduce a small delay after every step that updates the state in the upgrade marker. Such a delay would allow the main Agent process to read the updated state in the upgrade marker and communicate it to Fleet (in managed mode) or via the elastic-agent status CLI output.
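As a rough illustration of that bookkeeping (the field names, state values, JSON encoding, and delay here are all assumptions for the sketch; the real marker format may differ):

    // Sketch: the watcher records its current step in the upgrade marker and
    // pauses briefly so the running Agent can read the new state and convey
    // it to Fleet or via `elastic-agent status`.
    package watcher

    import (
        "encoding/json"
        "os"
        "time"
    )

    // Illustrative marker contents; the actual marker format may differ.
    type upgradeMarker struct {
        PrevVersion   string `json:"prev_version"`
        TargetVersion string `json:"target_version"`
        State         string `json:"state"` // e.g. "downloading", "switching_symlink", "watching", "rolled_back"
    }

    func setMarkerState(path, state string) error {
        data, err := os.ReadFile(path)
        if err != nil {
            return err
        }
        var m upgradeMarker
        if err := json.Unmarshal(data, &m); err != nil {
            return err
        }
        m.State = state
        out, err := json.Marshal(&m)
        if err != nil {
            return err
        }
        if err := os.WriteFile(path, out, 0o600); err != nil {
            return err
        }
        // Deliberate small delay so the main Agent has a chance to observe
        // this state before the watcher moves on to the next step.
        time.Sleep(2 * time.Second)
        return nil
    }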

@blakerouse
Contributor

In a previous life we called the parallel process the watchdog. It wasn't actually responsible for doing the upgrade, but it served a similar function of monitoring the main agent process and checked in with a special endpoint at a low frequency to allow us to recover from total failure of the main process. It wasn't the parent process, it was actually installed as a totally separate service on the host so it was fully independent.

I like this approach of a separate service for the watchdog process. At the moment I can't think of concrete arguments in favor of either approach — parent-child or parallel processes — but just instinctively a completely independent, parallel process seems more robust for some reason. I'll think more about the two approaches and I'm hoping others will have opinions as well to help pick a direction.

The one thing you didn't touch on is how we would safely update the watcher process, and how we do that without creating the same problem we have with the agent upgrade. How should we handle this in a fail safe way?

Yes, indeed — thanks for catching this rather large hole in my proposal 😂.

My thinking is that the watcher process's primary responsibility is to ensure that the correct version of Agent is running at any given time. In concrete terms, that means that for most of its life it does nothing, just waiting for the upgrade marker to show up. When the upgrade marker shows up, the watcher performs the following steps. Steps 4-7 cover how/when the watcher process would upgrade itself.

  1. Update state in the upgrade marker to indicate it's downloading the new Agent artifact.
  2. Download the new Agent artifact.
  3. Update state in the upgrade marker to indicate it's switching symlinks.
  4. Switch symlinks for the elastic-agent executable. Since the watcher process is simply elastic-agent watch, re-exec'ing itself after this step would have the effect of running the upgraded watcher.

How would the Elastic Agent at this step know whether the Watcher re-exec'ing itself worked correctly? Currently, with the watcher being a subprocess, the Elastic Agent can tell whether it is able to run successfully.

Another thought I have: should we just consolidate all of this into a single process that manages the whole upgrade, instead of splitting the knowledge between the watcher and the Elastic Agent? Maybe we should just make the watcher actually perform the entire upgrade process.

@ycombinator
Contributor Author

ycombinator commented Aug 7, 2023

How would the Elastic Agent at this step know whether the Watcher re-exec'ing itself worked correctly? Currently, with the watcher being a subprocess, the Elastic Agent can tell whether it is able to run successfully.

It wouldn't, you're right. But the thinking is that the Watcher is relatively simpler (does fewer things and changes less frequently) than the Agent so it should be more stable.

We could consider some kind of periodic check-in or heartbeat from the Watcher to the Agent too.
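Purely as an illustration of that idea (the mechanism, path, and interval are assumptions), the check-in could be as simple as the watcher refreshing a heartbeat file that the Agent inspects:

    // Sketch: watcher-side heartbeat plus an Agent-side freshness check.
    package watcher

    import (
        "os"
        "time"
    )

    // Assumed heartbeat location; any agreed-upon path would do.
    const heartbeatPath = "/opt/Elastic/Agent/data/.watcher-heartbeat"

    // Heartbeat runs in the watcher and refreshes the file's mtime periodically.
    func Heartbeat(interval time.Duration, stop <-chan struct{}) {
        t := time.NewTicker(interval)
        defer t.Stop()
        for {
            select {
            case <-t.C:
                now := time.Now()
                if err := os.Chtimes(heartbeatPath, now, now); err != nil {
                    if f, err := os.Create(heartbeatPath); err == nil {
                        f.Close() // create the file on first use
                    }
                }
            case <-stop:
                return
            }
        }
    }

    // WatcherAlive runs in the Agent and reports whether the heartbeat is recent.
    func WatcherAlive(maxAge time.Duration) bool {
        fi, err := os.Stat(heartbeatPath)
        return err == nil && time.Since(fi.ModTime()) < maxAge
    }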

Another thought I have: should we just consolidate all of this into a single process that manages the whole upgrade, instead of splitting the knowledge between the watcher and the Elastic Agent? Maybe we should just make the watcher actually perform the entire upgrade process.

Yes, I mentioned this in my original proposal (#3123 (comment)) as well:

In the context of the upgrade process, the main process would still be in charge of talking to Fleet and being the entrypoint for the CLI. However, when the main process receives a request to upgrade (either from Fleet or the CLI), it would just write out the upgrade marker file with enough details necessary to facilitate an upgrade and potentially a downgrade back to itself. It would not download the new artifact, switch symlinks, re-exec, or start the upgrade watcher, as it does today. Those steps would become the responsibility of the upgrade watcher instead.

...

All the responsibilities of upgrading/downgrading are consolidated into the upgrade watcher. Today upgrading is handled by the main Agent process while downgrading is handled by the upgrade watcher process. So there would be a clearer separation of responsibilities than we have today:

  • the Agent is responsible for supervising the components, exposing a CLI interface, and talking to Fleet (in managed mode). As far as upgrades go, its only job is to "declare", via the upgrade marker file, that it wants to be upgraded.
  • the Upgrade Watcher is responsible for supervising the main Agent and ensuring that the correct version is running at any given time.
  • whether an upgrade is in progress or not can be determined by the presence or absence, respectively, of the upgrade marker file, as is the case today. However, unlike today, the presence (creation) and absence (deletion) of the upgrade marker file would solely be the responsibility of the main Agent process.

@blakerouse
Contributor

@ycombinator Sorry I missed some of that detail in my previous reading. I really like that separation!

If we want to make the watcher simpler, we could even make it its own binary instead of a subcommand of the Elastic Agent. That might simplify things further.

@cmacknz
Member

cmacknz commented Aug 8, 2023

I have been thinking about this a bit. To solve the problem in this issue, the watcher cannot be a subprocess. It must exist outside of the agent's process tree so that its lifetime is decoupled from that of the agent.

The best way to do this seems to be to install the watcher as a separate service alongside the agent itself. At this point it can be its own binary as suggested above.

The new watcher service should be installed and monitored by the agent service runtime as an implicit component much like the monitoring components. This will give us a way to observe it and ensure it is running. This will also create mutual supervision between the agent and the watcher, to solve the "who watches the watcher" problem.

We will need to be very clear about the separation of responsibilities between the two processes. The watcher service should not duplicate any functionality currently provided by OS level service managers like systemd, Windows service manager, etc. Restarting the agent service when it exits is the OS service manager's job for example.

Another thought I have: should we just consolidate all of this into a single process that manages the whole upgrade, instead of splitting the knowledge between the watcher and the Elastic Agent? Maybe we should just make the watcher actually perform the entire upgrade process.

I like the idea of consolidating most of the upgrade process into the watcher, however I think this process fundamentally needs to be cooperative between the two services to guarantee fault tolerance. We must be able to safely upgrade both the watcher and the agent and recover from the situation in this issue where a new release of either is completely broken and fails to start at all. It may be simpler to keep the division of responsibility the same as it is today.

I believe we can achieve this as follows:

  1. At startup the agent ensures that the upgrade watcher service is running and healthy.
  2. When an upgrade is requested, the agent process downloads and verifies the new upgrade artifact.
    i. The upgrade action should be persisted to disk before being acknowledged to guarantee we attempt it at least once.
  3. The agent upgrades the watcher to the new version from the artifact it just downloaded. An up-to-date and healthy upgrade watcher service is a precondition of the agent itself upgrading.
    i. The agent could watch the watcher service for a period of time afterwards, but the watcher should be simple enough that having it check in as healthy is enough to proceed.
    ii. Upgrading the watcher first protects us from the situation where we release a broken version of the upgrade watcher. A new release allows us to distribute a fixed version.
  4. The agent upgrade begins. This is started by the agent writing the upgrade marker file to disk, which is read by the watcher service. The watcher begins actively monitoring the agent, waiting for it to re-exec into the new version.
  5. The watcher either observes the upgrade complete successfully, or observes it fail and rolls back the agent to the previous version. The upgrade marker file is updated to indicate the final result.
    i. I don't like the idea of two concurrent processes writing to the same file. We should consider implementing the IPC here with something like file-based messaging, where the agent writes out an upgrade request file and the watcher writes out an upgrade response file (a rough sketch follows this list). This avoids collisions, and the existence of the response file indicates the upgrade has completed.
    ii. There is an edge case here where the agent writes out the upgrade marker but never actually reexecs to the next version. We will need to account for this, possibly writing out the upgrade response indicating nothing happened and restarting the agent from the watcher is enough.
  6. The upgrade marker files are removed by the agent after being reported to Fleet.
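A rough sketch of the file-based request/response messaging mentioned in point 5.i (file locations and payload fields are assumptions; the point is only that each process writes its own file, so there are no concurrent writers):

    // Sketch: the Agent writes an upgrade request file; the watcher writes a
    // separate response file once the upgrade has completed or been rolled
    // back. Neither process ever writes to the other's file.
    package upgradeipc

    import (
        "encoding/json"
        "errors"
        "io/fs"
        "os"
    )

    // Assumed file locations and schemas; the real implementation may differ.
    const (
        requestPath  = "/opt/Elastic/Agent/data/.upgrade-request"
        responsePath = "/opt/Elastic/Agent/data/.upgrade-response"
    )

    type UpgradeRequest struct {
        TargetVersion string `json:"target_version"`
        PrevVersion   string `json:"prev_version"`
        ActionID      string `json:"action_id,omitempty"` // Fleet action, if any
    }

    type UpgradeResponse struct {
        ActionID string `json:"action_id,omitempty"`
        Outcome  string `json:"outcome"` // e.g. "upgraded", "rolled_back", "noop"
        Detail   string `json:"detail,omitempty"`
    }

    // WriteRequest is called by the Agent to request an upgrade.
    func WriteRequest(req UpgradeRequest) error {
        data, err := json.Marshal(req)
        if err != nil {
            return err
        }
        // Write to a temp file and rename so the watcher never reads a partial file.
        tmp := requestPath + ".tmp"
        if err := os.WriteFile(tmp, data, 0o600); err != nil {
            return err
        }
        return os.Rename(tmp, requestPath)
    }

    // ReadResponse is polled by the Agent; ok stays false until the watcher
    // has written the response file, i.e. until the upgrade has completed.
    func ReadResponse() (resp UpgradeResponse, ok bool, err error) {
        data, err := os.ReadFile(responsePath)
        if errors.Is(err, fs.ErrNotExist) {
            return resp, false, nil
        }
        if err != nil {
            return resp, false, err
        }
        if err := json.Unmarshal(data, &resp); err != nil {
            return resp, false, err
        }
        return resp, true, nil
    }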

The initial condition that ensures we can always get back to a good state is that a running version of the agent capable of upgrading is currently installed. Once we have that, it should always be possible to recover.

  1. We can attempt to upgrade the watcher service as many times as we need to before upgrading the agent.
  2. The upgrade watcher always has a known good version of the agent to go back to.
  3. The agent process dying unexpectedly or failing to start at all no longer matters because the watcher is a completely independent process.

@cmacknz
Member

cmacknz commented Aug 9, 2023

The watcher either observes the upgrade complete successfully, or observes it fail and rolls back the agent to the previous version. The upgrade marker file is updated to indicate the final result.
i. I don't like the idea of two concurrent processes writing to the same file. We should consider implementing the IPC here with something like file based messaging, where the agent writes out an upgrade request file, and the watcher writes out an upgrade response file. This avoids collisions and the existence of the response file indicates the upgrade has completed.
ii. There is an edge case here where the agent writes out the upgrade marker but never actually reexecs to the next version. We will need to account for this, possibly writing out the upgrade response indicating nothing happened and restarting the agent from the watcher is enough.

Just spoke with @leehinman and he made a good point that having the upgrade watcher act purely based on the presence of a file in the file system isn't particularly secure. This would become especially important in a world where the agent may not run as root.

We could instead consider writing out the signed upgrade action received from Fleet and using that as the signal for the watcher to begin actively monitoring the agent and considering whether to roll it back. We already have the signing infrastructure in place for this.
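As a purely generic illustration of that kind of gate (this is not the actual Fleet signing scheme; the key type, file layout, and function are assumptions), the watcher would only act on an action file whose detached signature verifies against a pinned public key:

    // Sketch: verify a detached signature over the persisted upgrade action
    // before the watcher treats it as a trigger. Generic Ed25519 is used here
    // only for illustration.
    package watcher

    import (
        "crypto/ed25519"
        "os"
    )

    func verifySignedAction(actionPath, sigPath string, pubKey ed25519.PublicKey) (bool, error) {
        action, err := os.ReadFile(actionPath)
        if err != nil {
            return false, err
        }
        sig, err := os.ReadFile(sigPath)
        if err != nil {
            return false, err
        }
        return ed25519.Verify(pubKey, action, sig), nil
    }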

We would still need a way for the watcher to communicate to the agent whether it was rolled back, but it is likely acceptable to do this in an unsigned file. This way an attacker could only cause the agent to report an upgrade that did not happen, rather than having a way to instruct the watcher to directly take action on the running agent.

This would add complexity so it could be something we optimize for after the initial implementation.

@ycombinator
Contributor Author

ycombinator commented Aug 9, 2023

@blakerouse Wanted to follow up on a couple of the suggestions you made during today's Agent core meeting.

First, I confirmed that it's systemd that's sending a SIGKILL to the Upgrade Watcher process. Here is the complete sequence of events (with screenshots and logs):

  1. The pre-upgraded Agent, 8.10.0-SNAPSHOT in my test, is running normally for some time. Its PID is 295569.
     [Screenshot 2023-08-09 at 16 02 38]
  2. I start the upgrade to an 8.13.0 Agent, which I've coded up to crash 15s after it starts up.
  3. The upgrade process starts, and eventually the Agent process is re-exec'd. Also, the Upgrade Watcher process is started. Its PID is 296543.
     [Screenshot 2023-08-09 at 16 03 08]
  4. As expected, 15s later, the Agent process crashes. But the Upgrade Watcher process keeps running.
     [Screenshot 2023-08-09 at 16 03 23]
  5. Exactly 1m30s after the Agent process crashes, the Upgrade Watcher process is killed by systemd. You can tell this from the systemd logs (notice the timestamps and the Upgrade Watcher PID in the logs).
    $ journalctl -u elastic-agent.service --no-pager --since '2023-08-09 23:03:20'
    ...
    Aug 09 23:03:20 shaunak-ubuntu-22-arm elastic-agent[295569]: prematurely crashing agent
    Aug 09 23:03:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
    Aug 09 23:04:50 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: State 'stop-sigterm' timed out. Killing.
    Aug 09 23:04:50 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Killing process 296543 (elastic-agent) with signal SIGKILL.
    ...
    

So, my conclusion is that it's systemd that's killing the Upgrade Watcher process via SIGKILL, as seen in the systemd logs above. Removing or changing the line where Agent propagates SIGINT to the Upgrade Watcher process will not make any difference, IMO.

As a next step, I'm going to dig into the "State 'stop-sigterm' timed out." message from the systemd log. If we could make that timeout be longer than the default Upgrade Watcher grace period (10m), I think we'd be in business.

@ycombinator
Contributor Author

As a next step, I'm going to dig into the "State 'stop-sigterm' timed out." message from the systemd log. If we could make that timeout be longer than the default Upgrade Watcher grace period (10m), I think we'd be in business.

I believe the systemd unit configuration setting we need to tweak to increase the timeout is TimeoutStopSec. Its default value is the value of DefaultTimeoutStopSec, which is documented as 90s; that's consistent with how long systemd keeps the Upgrade Watcher process alive after the Agent process has crashed.
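(For what it's worth, the effective timeout on a given host can be confirmed with something like the following; the exact output will vary:)

    $ systemctl show elastic-agent.service --property=TimeoutStopUSec
    TimeoutStopUSec=1min 30s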

I will experiment with setting TimeoutStopSec in the elastic-agent.service unit configuration file to something longer than the default of 90s, and eventually to something like 10m10s — just a little bit longer than the Upgrade Watcher's default grace period of 10m, giving it enough time to perform its checks and execute the rollback steps or exit gracefully, as needed.
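Concretely, the experiment amounts to something like the following in the [Service] section of the generated elastic-agent.service unit (shown here only as a sketch; exactly where this gets templated from is worked out further down in this thread):

    [Service]
    # Keep the service (and thus the Upgrade Watcher) around slightly longer
    # than the watcher's default 10m grace period before systemd gives up.
    TimeoutStopSec=610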


The other systemd unit configuration setting I discovered that might be useful for keeping the Upgrade Watcher process alive indefinitely, until it exits on its own, is KillMode=process. The documentation for this says:

If [this setting is] set to process, only the main process itself is killed (not recommended!)
...
Note that it is not recommended to set KillMode= to process..., as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

So I'm not going to go this route, at least for now.

@ycombinator
Contributor Author

I will experiment with setting TimeoutStopSec in the elastic-agent.service unit configuration file to something longer than the default of 90s...

Just for experimentation purposes, I tried a value of 100s and, sure enough, the Upgrade Watcher process stayed alive for 100s after the Agent process died. Here it is at the 95s (=1m 35s) mark:

[Screenshot 2023-08-09 at 16 49 21]

... and eventually to something like 10m10s — just a little bit longer than the Upgrade Watcher's default grace period of 10m, giving it enough time to perform its checks and execute the rollback steps or exit gracefully, as needed.

Next, I tested setting the TimeoutStopSec value to 610s (=10m 10s) and, indeed, the Upgrade Watcher runs long enough after the Agent process crashes to at least initiate the rollback to the previous version of Agent. Here are some screenshots and logs to prove this is happening with the new setting in place:

  1. Here is the upgraded Agent process (PID = 307788) running, right after the upgrade command was issued. And we see the Upgrade Watcher process (PID = 308181) as well.
     [Screenshot 2023-08-09 at 17 15 49]
  2. Then the upgraded Agent process crashes 15s later, but the Upgrade Watcher process is still running, 144s (=2m 24s) after the Agent process crashed. This is the effect of the new setting in the elastic-agent.service unit configuration file.
     [Screenshot 2023-08-09 at 17 18 26]
  3. At the same time, I checked the Upgrade Watcher logs and you can see that a rollback was initiated, which was not happening before with this bug because systemd just wasn't allowing the Upgrade Watcher to run long enough to get to this point.
     [Screenshot 2023-08-09 at 17 18 34]
  4. Unfortunately, that's as far as the Upgrade Watcher is able to take the rollback process. It gets stuck trying to restart the Agent (see the final line in the Upgrade Watcher log above) because it's now running into another bug: [Upgrade Watcher][Crash Checker] Consider Agent process as crashed if its PID remains 0 #3166 (comment).

Nevertheless, I think we have a reasonable fix for this bug here. The fix is to simply set TimeoutStopSec=610 (=10m 10s) in the elastic-agent.service unit configuration file. This value is just a little bit longer than the Upgrade Watcher's default grace period (=10m). Essentially, we're asking systemd to keep the Upgrade Watcher process alive long enough to do its job. I'm going to put up a PR for this fix.

@ycombinator
Contributor Author

ycombinator commented Aug 10, 2023

Nevertheless, I think we have a reasonable fix for this bug here. The fix is to simply set TimeoutStopSec=610 (=10m 10s) in the elastic-agent.service unit configuration file. This value is just a little bit longer than the Upgrade Watcher's default grace period (=10m). Essentially, we're asking systemd to keep the Upgrade Watcher process alive long enough to do its job. I'm going to put up a PR for this fix.

I'm trying to work on a PR for this simple fix. I can't figure out which file(s) to add the TimeoutStopSec=610 setting to. I've tried adding it to https://github.com/elastic/elastic-agent/blob/main/dev-tools/packaging/templates/linux/elastic-agent.unit.tmpl and https://github.com/elastic/elastic-agent/blob/main/dev-tools/packaging/templates/linux/systemd.unit.tmpl both, rebuilt the agent package, and installed it. But when I run cat /etc/systemd/system/elastic-agent.service I don't see the setting. @blakerouse @cmacknz do you know the correct file(s) for this setting?

[UPDATE] Never mind, I figured it out. Looks like the systemd unit configuration file for elastic-agent.service is created programmatically over here:

// Linux (systemd) always restart on failure
"Restart": "always",

I just got confused by the other two *.systemd.unit.tmpl files under dev-tools/packaging/templates/linux/, that's all.

@leehinman
Contributor

Quick question: does increasing TimeoutStopSec by such a large value have any unintended side effects? From the man page:

This option serves two purposes.

First, it configures the time to wait for each ExecStop= command. If any of them times out, subsequent ExecStop= commands are skipped and the service will be terminated by SIGTERM. If no ExecStop= commands are specified, the service gets the SIGTERM immediately. This default behavior can be changed by the TimeoutStopFailureMode= option.

Second, it configures the time to wait for the service itself to stop.

I think we are taking advantage of the second purpose, but is there any case where the change in the first would be unexpected for our users?

@ycombinator
Contributor Author

I think we are taking advantage of the second purpose, but is there any case where the change in the first would be unexpected for our users?

We are unaffected by the first purpose because we're not specifying any ExecStop= commands in our unit configuration file:

$ grep ExecStop /etc/systemd/system/elastic-agent.service | wc -l
0

However, @blakerouse has brought up similar questions about unintended side effects in #3220 (review). I want to run some more tests for the scenarios he outlined as well as with trying the alternative idea of specifying KillMode=process instead of the TimeoutStopSec setting.

@cmacknz
Member

cmacknz commented Aug 10, 2023

I'm trying to work on a PR for this simple fix. I can't figure out which file(s) to add the TimeoutStopSec=610 setting to. I've tried adding it to https://github.com/elastic/elastic-agent/blob/main/dev-tools/packaging/templates/linux/elastic-agent.unit.tmpl and https://github.com/elastic/elastic-agent/blob/main/dev-tools/packaging/templates/linux/systemd.unit.tmpl both, rebuilt the agent package, and installed it. But when I run cat /etc/systemd/system/elastic-agent.service I don't see the setting. @blakerouse @cmacknz do you know the correct file(s) for this setting?

[UPDATE] Never mind, I figured it out. Looks like the systemd unit configuration file for elastic-agent.service is created programmatically over here:

It looks like the elastic-agent.unit.tmpl is consumed by the RPM/DEB build, so we probably have to update it as well; it just doesn't affect the use case you are testing.

It is a bit annoying that this is defined twice. I'm also not sure if the systemd.unit.tmpl file is actually used or if it is just left over from when we forked out of the Beats repo.

    /lib/systemd/system/{{.BeatServiceName}}.service:
      template: '{{ elastic_beats_dir }}/dev-tools/packaging/templates/linux/elastic-agent.unit.tmpl'
      mode: 0644
