Prevent `systemd` from killing the Upgrade Watcher process after the main Agent process has crashed #3220

ycombinator · 2023-08-10T05:59:29Z

What does this PR do?

This PR adds the KillMode=process setting to Elastic Agent's systemd unit configuration file, installed at /etc/systemd/system/elastic-agent.service.

Why is it important?

The KillMode setting governs which processes under the service will be killed by systemd. The default value of the setting is control-group. With this value, all processes under the service are killed. However, with the process value, only the main process under the service is killed while the others are left running.

In the context of Elastic Agent, the main process it the Agent process. During an ongoing upgrade, there is a second process in the same cgroup as the main process: the Upgrade Watcher process.

In the scenario when the main process crashes or is not reporting a healthy status repeatedly, we need the Upgrade Watcher process to keep running so it can detect this repeated crashing or unhealthy status, and then initiate a rollback.

By setting KillMode=process in Elastic Agent's systemd unit configuration file, we ensure that when the main process is repeatedly crashing, systemd keeps the Upgrade Watcher process running so it can fulfill it's destiny.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
~~I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in ./changelog/fragments using the changelog tool
~~I have added an integration test or an E2E test~~

How to test this PR locally

Test A

For this test, we will upgrade from a pre-built version of Agent, say 8.9.0 to the build of Agent with this PR (8.10.0) and then to another build of Agent that deliberately crashes (say, 8.11.0).

The upgrade from the first to the second build of Agent will test that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it and that the setting is added after the new (post-upgrade) Agent starts running.

The upgrade from the second to the third build of Agent will test that the KillMode=process setting is behaving as expected, by ensuring that the Upgrade Watcher process keeps running even after the new (post-upgrade) Agent process crashes.

Start with a pre-built version of Elastic Agent < 8.10.0
Unpack and install this Elastic Agent on Linux.
Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.
```
grep KillMode /etc/systemd/system/elastic-agent.service
```

Check out this PR's branch and build Elastic Agent for Linux.

DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package

Upgrade to the newly-built Agent.

sudo elastic-agent upgrade 8.10.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify

Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).
```
grep KillMode /etc/systemd/system/elastic-agent.service
```
Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that the following warning message is NOT shown:
```
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload units.
```
Ensure that the upgrade succeeds before moving on. For this, wait until the Upgrade Watcher process has finished running and ensure that the running Elastic Agent's version is 8.10.0.
```
systemctl status elastic-agent.service | grep 'elastic-agent watch'
sudo elastic-agent version
```
Bump up the minor version in version/version.go (to, say, 8.11.0), and also return an error at the very start of

elastic-agent/internal/pkg/agent/cmd/run.go

Line 65 in af31709

RunE: func(cmd *cobra.Command, _ []string) error {

to ensure that the built Agent will crash immediately upon start up.
Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.

Build Elastic Agent for Linux.

DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package

Upgrade to the newly-built Agent.

sudo elastic-agent upgrade 8.11.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify

Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.
```
watch -n1 systemctl status elastic-agent.service
```

Test B

For this test, we will upgrade from a pre-built version of Agent, say 8.9.0 to a build of Agent that has the changes in this PR but also crashes upon start up (say, 8.10.0).

This will test that that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it, that the setting is added after the new (post-upgrade) Agent starts running, and that the setting takes effect immediately so as to keep the Upgrade Watcher process running after the Agent process has crashed.

Start with a pre-built version of Elastic Agent < 8.10.0
Unpack and install this Elastic Agent on Linux.
Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.
```
grep KillMode /etc/systemd/system/elastic-agent.service
```
Check out this PR's branch.

Modify the code to return an error right after the changes in this PR have a chance to take effect, to ensure that the built Agent will crash:

elastic-agent/internal/pkg/agent/cmd/run.go

Lines 108 to 111 in af31709

    
           if err := handleUpgrade(); err != nil { 
        
           	return fmt.Errorf("error checking for and handling upgrade: %w", err) 
        
           }

Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.

Build Elastic Agent for Linux.

DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package

Upgrade to the newly-built Agent.

sudo elastic-agent upgrade 8.10.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify

Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).
```
grep KillMode /etc/systemd/system/elastic-agent.service
```
Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that the following warning message is NOT shown:
```
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload units.
```
Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.
```
watch -n1 systemctl status elastic-agent.service
```

Related issues

Fixes systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123

elasticmachine · 2023-08-10T05:59:33Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine · 2023-08-10T06:02:53Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-08-15T19:25:10.358+0000
Duration: 51 min 48 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6277
Skipped	47
Total	6324

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2023-08-10T06:16:29Z

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2023-08-10T06:31:05Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.75% (`79/80`)	👍
Files	67.832% (`194/286`)	👍 0.113
Classes	66.292% (`354/534`)	👍 0.063
Methods	52.874% (`1104/2088`)	👎 -0.028
Lines	38.709% (`12651/32682`)	👎 -0.047
Conditionals	100.0% (`0/0`)	💚

pchila

LGTM

ycombinator · 2023-08-10T13:42:38Z

buildkite test this

blakerouse

I would like to talk about this change a little bit more.

I read all of the details from the issue that you provided on changing this value. I do worry that this change of the timeout does have a side-effect of systemctl stop elastic-agent. I believe that if stopping the Elastic Agent using the above command hangs that it would block for the full 10 minutes. I don't know if that is the behavior we want to provide in that situation.

I also worry about this having an effect on uninstall. We stop the service, if the watcher is running and uninstall is performed will it block for 10mins waiting for the watcher to die before being able to continue.

I actually thinking changing the KillMode=process is the proper solution. Being that Elastic Agent is a supervisor and it spawns all of it subprocess with the proper Pkill in the situations that require it. It would also equal the same behavior that we have on macOS and WIndows.

ycombinator · 2023-08-10T16:28:59Z

@blakerouse Thanks for taking the time to read through all the investigation details in the issue and bringing up your concerns here. Let me do some more testing for the scenarios you outlined as well as with the alternative idea of specifying KillMode=process instead of the TimeoutStopSec setting and post my findings here.

ycombinator · 2023-08-10T21:20:43Z

I tested the three scenarios @blakerouse brought up in #3220 (review). Observations and conclusions from each of the three tests are below but tl;dr is that it seems better to go with the KillMode=process setting than the TimeoutStopSec setting. I've updated this PR accordingly.

I do worry that this change of the timeout does have a side-effect of systemctl stop elastic-agent. I believe that if stopping the Elastic Agent using the above command hangs that it would block for the full 10 minutes. I don't know if that is the behavior we want to provide in that situation.

I built a version of Agent (on top of the changes in this PR) that will block for 30 minutes (<-time.After(30 * time.Minute)) very early in the elastic-agent run code path.

I installed this Agent made sure it started running under the service.

Then I immediately called systemctl stop elastic-agent and then looked at the output of systemctl status elastic-agent.

It appears the service is stopped and the Agent process is terminated.

$ sudo systemctl status elastic-agent.service
Unit elastic-agent.service could not be found.
$ sudo ./elastic-agent install
Elastic Agent will be installed at /opt/Elastic/Agent and will run as a service. Do you want to continue? [Y/n]:y
Do you want to enroll this Agent into Fleet? [Y/n]:n
Elastic Agent has been successfully installed.
$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-10 19:50:35 UTC; 26s ago
   Main PID: 478576 (elastic-agent)
      Tasks: 5 (limit: 4637)
     Memory: 10.9M
        CPU: 60ms
     CGroup: /system.slice/elastic-agent.service
             └─478576 /opt/Elastic/Agent/elastic-agent

Aug 10 19:50:35 shaunak-ubuntu-22-arm systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
$  sudo systemctl stop elastic-agent.service
$ sudo systemctl status elastic-agent.service
○ elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-08-10 19:51:11 UTC; 5s ago
    Process: 478576 ExecStart=/usr/bin/elastic-agent (code=killed, signal=TERM)
   Main PID: 478576 (code=killed, signal=TERM)
        CPU: 65ms

Aug 10 19:50:35 shaunak-ubuntu-22-arm systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: Stopping Elastic Agent is a unified agent to observe, monitor and protect your system....
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Deactivated successfully.
Aug 10 19:51:11 shaunak-ubuntu-22-arm systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..

Conclusion: it would appear that setting TimeoutStopSec=610 does not have an adverse impact on the behavior of systemctl stop elastic-agent.

I also worry about this having an effect on uninstall. We stop the service, if the watcher is running and uninstall is performed will it block for 10mins waiting for the watcher to die before being able to continue.

I tested this scenario by starting with a build of Agent from this PR (so it includes the TimeoutStopSec=610 setting) and upgrading to a build of Agent that crashes immediately upon starting up.

$ grep TimeoutStopSec /etc/systemd/system/elastic-agent.service
TimeoutStopSec=610
$ sudo elastic-agent upgrade 8.13.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
Upgrade triggered to version 8.13.0, Elastic Agent is currently restarting
$

As expected, due to the timeout change in this PR, the Upgrade Watcher process kept running after the upgraded Agent process crashed.

$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: deactivating (stop-sigterm) (Result: exit-code) since Thu 2023-08-10 20:29:06 UTC; 3min 10s ago
    Process: 492383 ExecStart=/usr/bin/elastic-agent (code=exited, status=1/FAILURE)
   Main PID: 492383 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 103.8M
        CPU: 6.234s
     CGroup: /system.slice/elastic-agent.service
             └─493106 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --environment environmentVar   set environment being ran in (default default)
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -h, --help                         help for elastic-agent
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.config string           Config path is the directory Agent looks for its >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.downloads string        Downloads path contains binaries Agent downloads >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.install string          Install path contains binaries Agent extracts
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.logs string             Logs path contains Agent log output (default "/op>
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -v, --v                            Log at INFO level
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: Use "elastic-agent [command] --help" for more information about a command.
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: prematurely crashing agent

At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.

$ sudo elastic-agent uninstall
Elastic Agent is installed but currently broken: service is not installed
Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
Elastic Agent has been uninstalled.

Then I checked the service status.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: deactivating (stop-sigterm) (Result: exit-code) since Thu 2023-08-10 20:29:06 UTC; 3min 31s ago
   Main PID: 492383 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 103.4M
        CPU: 6.239s
     CGroup: /system.slice/elastic-agent.service
             └─493106 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --environment environmentVar   set environment being ran in (default default)
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -h, --help                         help for elastic-agent
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.config string           Config path is the directory Agent looks for its >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.downloads string        Downloads path contains binaries Agent downloads >
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.install string          Install path contains binaries Agent extracts
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:       --path.logs string             Logs path contains Agent log output (default "/op>
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]:   -v, --v                            Log at INFO level
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: Use "elastic-agent [command] --help" for more information about a command.
Aug 10 20:29:06 shaunak-ubuntu-22-arm elastic-agent[492383]: prematurely crashing agent

As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running.

Interestingly, however, about 2m16s later, the Upgrade Watcher process stopped running, much sooner than the 660s timeout. Given the odd timing (2m16s), most likely the Upgrade Watcher itself failed somehow. Unfortunately since the elastic-agent uninstall had wiped out everything under /opt/Elastic/Agent, I couldn't inspect the Upgrade Watcher logs. However, this does indicate to me, that had the Upgrade Watcher process not naturally exited it would likely have been kept running by systemd for 660s.

Here we can see that the service is still running but there are no processes under it.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2023-08-10 20:35:21 UTC; 1s ago
   Main PID: 492383 (code=exited, status=1/FAILURE)
        CPU: 6.266s

Aug 10 20:35:21 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 20:35:21 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 6.266s CPU time.

Exactly 2 minutes after the Upgrade Watcher process stopped running, the service itself got uninstalled.

$ sudo systemctl status elastic-agent.service
Unit elastic-agent.service could not be found.

Conclusion: it would appear that setting TimeoutStopSec=610 does have some impact on the uninstall process but it does not block the uninstall process.

I actually thinking changing the KillMode=process is the proper solution. Being that Elastic Agent is a supervisor and it spawns all of it subprocess with the proper Pkill in the situations that require it. It would also equal the same behavior that we have on macOS and WIndows.

I tested this scenario by starting with a build of Agent that set KillMode=process instead of TimeoutStopSec=610 in the unit configuration file, and upgrading to a build of Agent that crashes immediately upon starting up.

$ grep TimeoutStopSec /etc/systemd/system/elastic-agent.service
$ grep KillMode /etc/systemd/system/elastic-agent.service
KillMode=process
$ sudo elastic-agent upgrade 8.13.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
Upgrade triggered to version 8.13.0, Elastic Agent is currently restarting

As expected, due to KillMode=process, the Upgrade Watcher process kept running after the upgraded Agent process crashed.

$ sudo systemctl status elastic-agent.service
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2023-08-10 20:59:03 UTC; 1min 57s ago
    Process: 503266 ExecStart=/usr/bin/elastic-agent (code=exited, status=1/FAILURE)
   Main PID: 503266 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 104.3M
        CPU: 6.068s
     CGroup: /system.slice/elastic-agent.service
             └─504544 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Unit process 504544 (elastic-agent) remains running after unit stopped.
Aug 10 20:59:03 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 6.047s CPU time.

At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.

$ sudo elastic-agent uninstall
Elastic Agent is installed but currently broken: service is not installed
Continuing will uninstall the broken Elastic Agent at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
Elastic Agent has been uninstalled.

Then I checked the service status.

$ sudo systemctl status elastic-agent.service
Warning: The unit file, source configuration file or drop-ins of elastic-agent.service changed on disk. Run 'systemctl daemon-reload' to reload uni>
○ elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; bad; vendor preset: enabled)
     Active: inactive (dead) (Result: exit-code) since Thu 2023-08-10 21:05:33 UTC; 15s ago
   Main PID: 506788 (code=exited, status=1/FAILURE)
      Tasks: 6 (limit: 4637)
     Memory: 102.9M
        CPU: 5.377s
     CGroup: /system.slice/elastic-agent.service
             └─507047 /opt/Elastic/Agent/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent

Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Unit process 507047 (elastic-agent) remains running after unit stopped.
Aug 10 21:05:20 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 5.370s CPU time.
Aug 10 21:05:33 shaunak-ubuntu-22-arm systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
Aug 10 21:05:33 shaunak-ubuntu-22-arm systemd[1]: elastic-agent.service: Consumed 5.370s CPU time.

As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running.

I kept watching the output of sudo systemctl status elastic-agent.service. Exactly 2 minutes after the Agent crash happened, the Upgrade Watcher process was killed AND the service was also uninstalled. This is definitely a better experience than with using the TimeoutStopSec setting. This behavior also alleviates a concern I had with using the KillMode=process setting: what if one of the Agent's component sub-processes, e.g. Metricbeat, hangs? Will it prevent the service from being uninstalled? That doesn't appear to be the case, given what happened to the Upgrade Watcher process.

I also wanted to test that, with the KillMode=process setting a rollback would've still been executed had I not uninstalled the Agent while the Upgrade Watcher was running. I essentially repeated all the steps above, skipping over the Agent uninstall step. Within a couple of minutes after the upgraded Agent crashed, the Upgrade Watcher initiated the rollback. I confirmed this by checking the Upgrade Watcher logs.

Conclusions:

Using KillMode=process does prevent systemd from prematurely killing off the Upgrade Watcher process, thus allowing it enough time to do its job.
Using KillMode=process, when elastic-agent uninstall is executed while the Upgrade Watcher is running, the service (and any lingering processes, e.g. the Upgrade Watcher) are removed/killed faster than when using the TimeoutStopSec setting.

pchila

Just 2 small updates in the comments, looks good otherwise

internal/pkg/agent/install/svc.go

pchila

LGTM

This reverts commit 4682050.

…s with Agent rollback

ycombinator · 2023-08-15T16:34:58Z

Could we modify one of the upgrade integration tests to check that KillMode process change takes effect? The best case would be to always do this on a test from a pre-8.10 version to 8.10+.

Sure, that will be done as part of #3085. It was while writing the test in that PR that I found this bug.

blakerouse

See inline comment.

blakerouse · 2023-08-15T17:38:48Z

internal/pkg/agent/application/upgrade/service_update_linux.go

+	switch service.ChosenSystem().String() {
+	case "linux-systemd":
+		unitFilePath := "/etc/systemd/system/" + paths.ServiceName + ".service"
+		if err := ensureSystemdServiceConfigUpToDate(unitFilePath); err != nil {


I think you should change this to also return a boolean informing the caller if the file was updated.

blakerouse · 2023-08-15T17:39:31Z

internal/pkg/agent/application/upgrade/service_update_linux.go

+		}
+
+		// Reload systemd unit configuration files
+		cmd := exec.Command("systemctl", "daemon-reload")


With my comment above, I think we should use that boolean value to only call daemon-reload in the case that we actually updated the file. I think calling this every time the Elastic Agent is started is wrong.

Yeah, good point. Will change!

Done in c1986bc.

blakerouse

Looks good! Thanks for addressing all the feedback.

…ter the main Agent process has crashed (elastic#3220)" This reverts commit d4e444e.

ycombinator added bug Something isn't working Team:Elastic-Agent Label for the Agent team backport-v8.9.0 Automated backport with mergify labels Aug 10, 2023

ycombinator requested a review from a team as a code owner August 10, 2023 05:59

ycombinator requested review from AndersonQ and pchila August 10, 2023 05:59

mergify bot assigned ycombinator Aug 10, 2023

ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 3d58988 to d570694 Compare August 10, 2023 06:00

ycombinator changed the title ~~Systemd increase upg watcher kill timeout~~ Increase the time before systemd will kill the Upgrade Watcher process after the main Agent process has crashed Aug 10, 2023

ycombinator removed the backport-v8.9.0 Automated backport with mergify label Aug 10, 2023

mergify bot added the backport-skip label Aug 10, 2023

pchila approved these changes Aug 10, 2023

View reviewed changes

ycombinator enabled auto-merge (squash) August 10, 2023 13:52

blakerouse requested changes Aug 10, 2023

View reviewed changes

ycombinator mentioned this pull request Aug 10, 2023

systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123

Closed

ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 36b424f to 40fb28e Compare August 10, 2023 21:16

ycombinator requested review from blakerouse and pchila August 10, 2023 21:23

pchila approved these changes Aug 11, 2023

View reviewed changes

internal/pkg/agent/install/svc.go Outdated Show resolved Hide resolved

internal/pkg/agent/install/svc.go Outdated Show resolved Hide resolved

ycombinator requested a review from pchila August 11, 2023 12:13

ycombinator changed the title ~~Increase the time before systemd will kill the Upgrade Watcher process after the main Agent process has crashed~~ Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed Aug 11, 2023

pchila approved these changes Aug 11, 2023

View reviewed changes

ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from ad152a4 to a5da642 Compare August 11, 2023 13:06

ycombinator added 19 commits August 15, 2023 09:32

Fix option path in template

d948345

Add CHANGELOG fragment

d2f0fac

[Testing] Block for 30 min on startup

29a1199

Revert "[Testing] Block for 30 min on startup"

cf586b6

This reverts commit 4682050.

Use KillMode=process instead of TimeoutStopSec setting

65e9105

Fix comments

9dac008

Adding explanation of why KillMode=process is being used

4a936b4

Adding KillMode=process to unit template file used for RPM/DEB build

b96b919

Refactor: extract function for ensuring install marker is present

2683a05

Update service file on start up, during upgrade

bba8110

Running mage fmt

4f94931

Add explanatory comment for using INI parser/writer

f34307c

Reload systemd unit configuration files after change

fbba101

Updating NOTICE.txt

fc48e75

[Testing] Crash agent

7b26cf0

Refactoring: remove reload step from config change function

aaead7f

Remove code added for testing crash

9203a16

Moving files to upgrade package

6b07ccb

Add comment noting inability to rollback service configuration change…

1745599

…s with Agent rollback

ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from fc59ab4 to 1745599 Compare August 15, 2023 16:33

blakerouse requested changes Aug 15, 2023

View reviewed changes

ycombinator requested a review from blakerouse August 15, 2023 18:43

Only reload systemd daemon if unit file changed

c1986bc

ycombinator force-pushed the systemd-increase-upg-watcher-kill-timeout branch from 43e99c6 to c1986bc Compare August 15, 2023 18:45

blakerouse approved these changes Aug 15, 2023

View reviewed changes

ycombinator enabled auto-merge (squash) August 15, 2023 19:09

ycombinator merged commit d4e444e into elastic:main Aug 15, 2023
27 checks passed

ycombinator deleted the systemd-increase-upg-watcher-kill-timeout branch August 15, 2023 20:22

cmacknz added a commit to cmacknz/elastic-agent that referenced this pull request Aug 29, 2023

Revert "Prevent systemd from killing the Upgrade Watcher process af…

a827fe0

…ter the main Agent process has crashed (elastic#3220)" This reverts commit d4e444e.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent `systemd` from killing the Upgrade Watcher process after the main Agent process has crashed #3220

Prevent `systemd` from killing the Upgrade Watcher process after the main Agent process has crashed #3220

ycombinator commented Aug 10, 2023 •

edited

elasticmachine commented Aug 10, 2023

elasticmachine commented Aug 10, 2023 •

edited

Build stats

Test stats 🧪

mergify bot commented Aug 10, 2023

elasticmachine commented Aug 10, 2023 •

edited

pchila left a comment

ycombinator commented Aug 10, 2023

blakerouse left a comment

ycombinator commented Aug 10, 2023

ycombinator commented Aug 10, 2023 •

edited

pchila left a comment

pchila left a comment

ycombinator commented Aug 15, 2023

blakerouse left a comment

blakerouse Aug 15, 2023

blakerouse Aug 15, 2023

ycombinator Aug 15, 2023

ycombinator Aug 15, 2023 •

edited

blakerouse left a comment

	if err := handleUpgrade(); err != nil {
	return fmt.Errorf("error checking for and handling upgrade: %w", err)
	}

Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed #3220

Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed #3220

Conversation

ycombinator commented Aug 10, 2023 • edited

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Test A

Test B

Related issues

elasticmachine commented Aug 10, 2023

elasticmachine commented Aug 10, 2023 • edited

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

mergify bot commented Aug 10, 2023

elasticmachine commented Aug 10, 2023 • edited

🌐 Coverage report

pchila left a comment

Choose a reason for hiding this comment

ycombinator commented Aug 10, 2023

blakerouse left a comment

Choose a reason for hiding this comment

ycombinator commented Aug 10, 2023

ycombinator commented Aug 10, 2023 • edited

pchila left a comment

Choose a reason for hiding this comment

pchila left a comment

Choose a reason for hiding this comment

ycombinator commented Aug 15, 2023

blakerouse left a comment

Choose a reason for hiding this comment

blakerouse Aug 15, 2023

Choose a reason for hiding this comment

blakerouse Aug 15, 2023

Choose a reason for hiding this comment

ycombinator Aug 15, 2023

Choose a reason for hiding this comment

ycombinator Aug 15, 2023 • edited

Choose a reason for hiding this comment

blakerouse left a comment

Choose a reason for hiding this comment

Prevent `systemd` from killing the Upgrade Watcher process after the main Agent process has crashed #3220

Prevent `systemd` from killing the Upgrade Watcher process after the main Agent process has crashed #3220

ycombinator commented Aug 10, 2023 •

edited

elasticmachine commented Aug 10, 2023 •

edited

elasticmachine commented Aug 10, 2023 •

edited

ycombinator commented Aug 10, 2023 •

edited

ycombinator Aug 15, 2023 •

edited