Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed #3220

Merged

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Aug 10, 2023

What does this PR do?

This PR adds the KillMode=process setting to Elastic Agent's systemd unit configuration file, installed at /etc/systemd/system/elastic-agent.service.

Why is it important?

The KillMode setting governs which processes under the service will be killed by systemd. The default value of the setting is control-group. With this value, all processes under the service are killed. However, with the process value, only the main process under the service is killed while the others are left running.

In the context of Elastic Agent, the main process it the Agent process. During an ongoing upgrade, there is a second process in the same cgroup as the main process: the Upgrade Watcher process.

In the scenario when the main process crashes or is not reporting a healthy status repeatedly, we need the Upgrade Watcher process to keep running so it can detect this repeated crashing or unhealthy status, and then initiate a rollback.

By setting KillMode=process in Elastic Agent's systemd unit configuration file, we ensure that when the main process is repeatedly crashing, systemd keeps the Upgrade Watcher process running so it can fulfill it's destiny.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Test A

For this test, we will upgrade from a pre-built version of Agent, say 8.9.0 to the build of Agent with this PR (8.10.0) and then to another build of Agent that deliberately crashes (say, 8.11.0).

The upgrade from the first to the second build of Agent will test that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it and that the setting is added after the new (post-upgrade) Agent starts running.

The upgrade from the second to the third build of Agent will test that the KillMode=process setting is behaving as expected, by ensuring that the Upgrade Watcher process keeps running even after the new (post-upgrade) Agent process crashes.

  1. Start with a pre-built version of Elastic Agent < 8.10.0

  2. Unpack and install this Elastic Agent on Linux.

  3. Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.

    grep KillMode /etc/systemd/system/elastic-agent.service
    
  4. Check out this PR's branch and build Elastic Agent for Linux.

    DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  5. Upgrade to the newly-built Agent.

    sudo elastic-agent upgrade 8.10.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
    
  6. Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).

    grep KillMode /etc/systemd/system/elastic-agent.service
    
  7. Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that the following warning message is NOT shown:

    Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload units.
    
  8. Ensure that the upgrade succeeds before moving on. For this, wait until the Upgrade Watcher process has finished running and ensure that the running Elastic Agent's version is 8.10.0.

    systemctl status elastic-agent.service | grep 'elastic-agent watch'
    sudo elastic-agent version
    
  9. Bump up the minor version in version/version.go (to, say, 8.11.0), and also return an error at the very start of

    RunE: func(cmd *cobra.Command, _ []string) error {
    to ensure that the built Agent will crash immediately upon start up.

  10. Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.

  11. Build Elastic Agent for Linux.

    DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  12. Upgrade to the newly-built Agent.

    sudo elastic-agent upgrade 8.11.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
    
  13. Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.

    watch -n1 systemctl status elastic-agent.service
    

Test B

For this test, we will upgrade from a pre-built version of Agent, say 8.9.0 to a build of Agent that has the changes in this PR but also crashes upon start up (say, 8.10.0).

This will test that that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it, that the setting is added after the new (post-upgrade) Agent starts running, and that the setting takes effect immediately so as to keep the Upgrade Watcher process running after the Agent process has crashed.

  1. Start with a pre-built version of Elastic Agent < 8.10.0

  2. Unpack and install this Elastic Agent on Linux.

  3. Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.

    grep KillMode /etc/systemd/system/elastic-agent.service
    
  4. Check out this PR's branch.

  5. Modify the code to return an error right after the changes in this PR have a chance to take effect, to ensure that the built Agent will crash:

    if err := handleUpgrade(); err != nil {
    return fmt.Errorf("error checking for and handling upgrade: %w", err)
    }

  6. Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.

  7. Build Elastic Agent for Linux.

    DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  8. Upgrade to the newly-built Agent.

    sudo elastic-agent upgrade 8.10.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
    
  9. Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).

    grep KillMode /etc/systemd/system/elastic-agent.service
    
  10. Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that the following warning message is NOT shown:

    Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload units.
    
  11. Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.

    watch -n1 systemctl status elastic-agent.service
    

Related issues

@ycombinator ycombinator added bug Something isn't working Team:Elastic-Agent Label for the Agent team backport-v8.9.0 Automated backport with mergify labels Aug 10, 2023
@ycombinator ycombinator requested a review from a team as a code owner August 10, 2023 05:59
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 3d58988 to d570694 Compare August 10, 2023 06:00
@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 10, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-08-15T19:25:10.358+0000

  • Duration: 51 min 48 sec

Test stats 🧪

Test Results
Failed 0
Passed 6277
Skipped 47
Total 6324

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@ycombinator ycombinator changed the title Systemd increase upg watcher kill timeout Increase the time before systemd will kill the Upgrade Watcher process after the main Agent process has crashed Aug 10, 2023
@ycombinator ycombinator removed the backport-v8.9.0 Automated backport with mergify label Aug 10, 2023
@mergify
Copy link
Contributor

mergify bot commented Aug 10, 2023

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 10, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.75% (79/80) 👍
Files 67.832% (194/286) 👍 0.113
Classes 66.292% (354/534) 👍 0.063
Methods 52.874% (1104/2088) 👎 -0.028
Lines 38.709% (12651/32682) 👎 -0.047
Conditionals 100.0% (0/0) 💚

Copy link
Contributor

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ycombinator
Copy link
Contributor Author

buildkite test this

@ycombinator ycombinator enabled auto-merge (squash) August 10, 2023 13:52
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to talk about this change a little bit more.

I read all of the details from the issue that you provided on changing this value. I do worry that this change of the timeout does have a side-effect of systemctl stop elastic-agent. I believe that if stopping the Elastic Agent using the above command hangs that it would block for the full 10 minutes. I don't know if that is the behavior we want to provide in that situation.

I also worry about this having an effect on uninstall. We stop the service, if the watcher is running and uninstall is performed will it block for 10mins waiting for the watcher to die before being able to continue.

I actually thinking changing the KillMode=process is the proper solution. Being that Elastic Agent is a supervisor and it spawns all of it subprocess with the proper Pkill in the situations that require it. It would also equal the same behavior that we have on macOS and WIndows.

@ycombinator
Copy link
Contributor Author

@blakerouse Thanks for taking the time to read through all the investigation details in the issue and bringing up your concerns here. Let me do some more testing for the scenarios you outlined as well as with the alternative idea of specifying KillMode=process instead of the TimeoutStopSec setting and post my findings here.

@ycombinator ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 36b424f to 40fb28e Compare August 10, 2023 21:16
@ycombinator
Copy link
Contributor Author

ycombinator commented Aug 10, 2023

I tested the three scenarios @blakerouse brought up in #3220 (review). Observations and conclusions from each of the three tests are below but tl;dr is that it seems better to go with the KillMode=process setting than the TimeoutStopSec setting. I've updated this PR accordingly.

I do worry that this change of the timeout does have a side-effect of systemctl stop elastic-agent. I believe that if stopping the Elastic Agent using the above command hangs that it would block for the full 10 minutes. I don't know if that is the behavior we want to provide in that situation.

I built a version of Agent (on top of the changes in this PR) that will block for 30 minutes (<-time.After(30 * time.Minute)) very early in the elastic-agent run code path.

I installed this Agent made sure it started running under the service.

Then I immediately called systemctl stop elastic-agent and then looked at the output of systemctl status elastic-agent.

It appears the service is stopped and the Agent process is terminated.

$ sudo systemctl status elastic-agent.service
Unit elastic-agent.service could not be found.
$ sudo ./elastic-agent install
Elastic Agent will be installed at /opt/Elastic/Agent and will run as a service. Do you want to continue? [Y/n]:y
Do you want to enroll this Agent into Fleet? [Y/n]:n
Elastic Agent has been successfully installed.
$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-10 19:50:35 UTC; 26s ago
   Main PID: 478576 (elastic-agent)
      Tasks: 5 (limit: 4637)
     Memory: 10.9M
        CPU: 60ms
     CGroup: /system.slice/elastic-agent.service
             └─478576 /opt/Elastic/Agent/elastic-agent

Aug 10 19:50:35 shaunak-ubuntu-22-arm systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
$  sudo systemctl stop elastic-agent.service
$ sudo systemctl status elastic-agent.service
○ elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-08-10 19:51:11 UTC; 5s ago
    Process: 478576 ExecStart=/usr/bin/elastic-agent (code=killed, signal=TERM)
   Main PID: 478576 (code=killed, signal=TERM)
        CPU: 65ms

Aug 10 19:50:35 shaunak-ubuntu-22-arm systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: Stopping Elastic Agent is a unified agent to observe, monitor and protect your system....
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Deactivated successfully.
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..

Conclusion: it would appear that setting TimeoutStopSec=610 does not have an adverse impact on the behavior of systemctl stop elastic-agent.


I also worry about this having an effect on uninstall. We stop the service, if the watcher is running and uninstall is performed will it block for 10mins waiting for the watcher to die before being able to continue.

I tested this scenario by starting with a build of Agent from this PR (so it includes the TimeoutStopSec=610 setting) and upgrading to a build of Agent that crashes immediately upon starting up.

$ grep TimeoutStopSec /etc/systemd/system/elastic-agent.service
TimeoutStopSec=610
$ sudo elastic-agent upgrade 8.13.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
Upgrade triggered to version 8.13.0, Elastic Agent is currently restarting
$ 

As expected, due to the timeout change in this PR, the Upgrade Watcher process kept running after the upgraded Agent process crashed.

$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) (Result: exit-code) since Thu 2023-08-10 20:29:06 UTC; 3min 10s ago
    Process: 492383 ExecStart=/usr/bin/elastic-agent (code=exited, status=1/FAILURE)
   Main PID: 492383 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 103.8M
        CPU: 6.234s
     CGroup: /system.slice/elastic-agent.service
             └─493106 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --environment environmentVar   set environment being ran in (default default)
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -h, --help                         help for elastic-agent
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.config string           Config path is the directory Agent looks for its >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.downloads string        Downloads path contains binaries Agent downloads >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.install string          Install path contains binaries Agent extracts
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.logs string             Logs path contains Agent log output (default "/op>
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -v, --v                            Log at INFO level
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: Use "elastic-agent [command] --help" for more information about a command.
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: prematurely crashing agent

At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.

$ sudo elastic-agent uninstall
Elastic Agent is installed but currently broken: service is not installed
Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
Elastic Agent has been uninstalled.

Then I checked the service status.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: deactivating (stop-sigterm) (Result: exit-code) since Thu 2023-08-10 20:29:06 UTC; 3min 31s ago
   Main PID: 492383 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 103.4M
        CPU: 6.239s
     CGroup: /system.slice/elastic-agent.service
             └─493106 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --environment environmentVar   set environment being ran in (default default)
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -h, --help                         help for elastic-agent
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.config string           Config path is the directory Agent looks for its >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.downloads string        Downloads path contains binaries Agent downloads >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.install string          Install path contains binaries Agent extracts
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.logs string             Logs path contains Agent log output (default "/op>
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -v, --v                            Log at INFO level
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: Use "elastic-agent [command] --help" for more information about a command.
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: prematurely crashing agent

As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running.

Interestingly, however, about 2m16s later, the Upgrade Watcher process stopped running, much sooner than the 660s timeout. Given the odd timing (2m16s), most likely the Upgrade Watcher itself failed somehow. Unfortunately since the elastic-agent uninstall had wiped out everything under /opt/Elastic/Agent, I couldn't inspect the Upgrade Watcher logs. However, this does indicate to me, that had the Upgrade Watcher process not naturally exited it would likely have been kept running by systemd for 660s.

Here we can see that the service is still running but there are no processes under it.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2023-08-10 20:35:21 UTC; 1s ago
   Main PID: 492383 (code=exited, status=1/FAILURE)
        CPU: 6.266s

Aug 10 20:35:21 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 20:35:21 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 6.266s CPU time.

Exactly 2 minutes after the Upgrade Watcher process stopped running, the service itself got uninstalled.

$ sudo systemctl status elastic-agent.service
Unit elastic-agent.service could not be found.

Conclusion: it would appear that setting TimeoutStopSec=610 does have some impact on the uninstall process but it does not block the uninstall process.


I actually thinking changing the KillMode=process is the proper solution. Being that Elastic Agent is a supervisor and it spawns all of it subprocess with the proper Pkill in the situations that require it. It would also equal the same behavior that we have on macOS and WIndows.

I tested this scenario by starting with a build of Agent that set KillMode=process instead of TimeoutStopSec=610 in the unit configuration file, and upgrading to a build of Agent that crashes immediately upon starting up.

$ grep TimeoutStopSec /etc/systemd/system/elastic-agent.service
$ grep KillMode /etc/systemd/system/elastic-agent.service
KillMode=process
$ sudo elastic-agent upgrade 8.13.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
Upgrade triggered to version 8.13.0, Elastic Agent is currently restarting

As expected, due to KillMode=process, the Upgrade Watcher process kept running after the upgraded Agent process crashed.

$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2023-08-10 20:59:03 UTC; 1min 57s ago
    Process: 503266 ExecStart=/usr/bin/elastic-agent (code=exited, status=1/FAILURE)
   Main PID: 503266 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 104.3M
        CPU: 6.068s
     CGroup: /system.slice/elastic-agent.service
             └─504544 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Unit process 504544 (elastic-agent) remains running after unit stopped.
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 6.047s CPU time.

At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.

$ sudo elastic-agent uninstall
Elastic Agent is installed but currently broken: service is not installed
Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
Elastic Agent has been uninstalled.

Then I checked the service status.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
○ elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Thu 2023-08-10 21:05:33 UTC; 15s ago
   Main PID: 506788 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 102.9M
        CPU: 5.377s
     CGroup: /system.slice/elastic-agent.service
             └─507047 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Unit process 507047 (elastic-agent) remains running after unit stopped.
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 5.370s CPU time.
Aug 10 21:05:33 shaunak-ubuntu-22-arm systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
Aug 10 21:05:33 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 5.370s CPU time.

As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running.

I kept watching the output of sudo systemctl status elastic-agent.service. Exactly 2 minutes after the Agent crash happened, the Upgrade Watcher process was killed AND the service was also uninstalled. This is definitely a better experience than with using the TimeoutStopSec setting. This behavior also alleviates a concern I had with using the KillMode=process setting: what if one of the Agent's component sub-processes, e.g. Metricbeat, hangs? Will it prevent the service from being uninstalled? That doesn't appear to be the case, given what happened to the Upgrade Watcher process.

I also wanted to test that, with the KillMode=process setting a rollback would've still been executed had I not uninstalled the Agent while the Upgrade Watcher was running. I essentially repeated all the steps above, skipping over the Agent uninstall step. Within a couple of minutes after the upgraded Agent crashed, the Upgrade Watcher initiated the rollback. I confirmed this by checking the Upgrade Watcher logs.

Conclusions:

  • Using KillMode=process does prevent systemd from prematurely killing off the Upgrade Watcher process, thus allowing it enough time to do its job.
  • Using KillMode=process, when elastic-agent uninstall is executed while the Upgrade Watcher is running, the service (and any lingering processes, e.g. the Upgrade Watcher) are removed/killed faster than when using the TimeoutStopSec setting.

Copy link
Contributor

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just 2 small updates in the comments, looks good otherwise

internal/pkg/agent/install/svc.go Outdated Show resolved Hide resolved
internal/pkg/agent/install/svc.go Outdated Show resolved Hide resolved
@ycombinator ycombinator changed the title Increase the time before systemd will kill the Upgrade Watcher process after the main Agent process has crashed Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed Aug 11, 2023
Copy link
Contributor

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ycombinator ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from ad152a4 to a5da642 Compare August 11, 2023 13:06
@ycombinator ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from fc59ab4 to 1745599 Compare August 15, 2023 16:33
@ycombinator
Copy link
Contributor Author

Could we modify one of the upgrade integration tests to check that KillMode process change takes effect? The best case would be to always do this on a test from a pre-8.10 version to 8.10+.

Sure, that will be done as part of #3085. It was while writing the test in that PR that I found this bug.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comment.

switch service.ChosenSystem().String() {
case "linux-systemd":
unitFilePath := "/etc/systemd/system/" + paths.ServiceName + ".service"
if err := ensureSystemdServiceConfigUpToDate(unitFilePath); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should change this to also return a boolean informing the caller if the file was updated.

}

// Reload systemd unit configuration files
cmd := exec.Command("systemctl", "daemon-reload")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my comment above, I think we should use that boolean value to only call daemon-reload in the case that we actually updated the file. I think calling this every time the Elastic Agent is started is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. Will change!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c1986bc.

@ycombinator ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 43e99c6 to c1986bc Compare August 15, 2023 18:45
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for addressing all the feedback.

@ycombinator ycombinator enabled auto-merge (squash) August 15, 2023 19:09
@ycombinator ycombinator merged commit d4e444e into elastic:main Aug 15, 2023
27 checks passed
@ycombinator ycombinator deleted the systemd-increase-upg-watcher-kill-timeout branch August 15, 2023 20:22
cmacknz added a commit to cmacknz/elastic-agent that referenced this pull request Aug 29, 2023
…ter the main Agent process has crashed (elastic#3220)"

This reverts commit d4e444e.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip bug Something isn't working Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start
5 participants