Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed #3220
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Force-pushed from 3d58988 to d570694
systemd will kill the Upgrade Watcher process after the main Agent process has crashed
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
🌐 Coverage report
LGTM
buildkite test this
I would like to talk about this change a little bit more.
I read all of the details from the issue that you provided on changing this value. I do worry that this change of the timeout has a side effect on systemctl stop elastic-agent: if stopping the Elastic Agent with that command hangs, I believe it would block for the full 10 minutes. I don't know if that is the behavior we want to provide in that situation.
I also worry about this having an effect on uninstall. We stop the service; if the watcher is running when uninstall is performed, will it block for 10 minutes waiting for the watcher to die before being able to continue?
I actually think changing to KillMode=process is the proper solution, given that Elastic Agent is a supervisor and it spawns all of its subprocesses with the proper kill handling in the situations that require it. It would also match the behavior that we have on macOS and Windows.
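In unit-file terms, the two approaches being weighed could look roughly like this (values illustrative; 660s is the timeout figure mentioned later in this thread):

```ini
[Service]
# Approach in the PR as originally written: give processes under the
# service longer to exit when the unit is stopped.
# TimeoutStopSec=660

# Approach suggested here: kill only the main Agent process on stop;
# other processes in the same cgroup (the Upgrade Watcher) keep running.
KillMode=process
```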
@blakerouse Thanks for taking the time to read through all the investigation details in the issue and bringing up your concerns here. Let me do some more testing for the scenarios you outlined as well as with the alternative idea of specifying KillMode=process.
Force-pushed from 36b424f to 40fb28e
I tested the three scenarios @blakerouse brought up in #3220 (review). Observations and conclusions from each of the three tests are below, but the tl;dr is that it seems better to go with the KillMode=process approach.
Scenario 1: I built a version of Agent (on top of the changes in this PR) that will block for 30 minutes when asked to stop. I installed this Agent and made sure it started running under the service. Then I immediately called systemctl stop elastic-agent. It appears the service is stopped and the Agent process is terminated.
Conclusion: it would appear that setting the longer stop timeout does not cause systemctl stop elastic-agent to block in this situation.
Scenario 2: I tested this scenario by starting with a build of Agent from this PR (so it includes the timeout change), then upgrading to a build of Agent that crashes on startup.
As expected, due to the timeout change in this PR, the Upgrade Watcher process kept running after the upgraded Agent process crashed.
At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.
Then I checked the service status.
As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running. Interestingly, however, about 2m16s later, the Upgrade Watcher process stopped running, much sooner than the 660s timeout. Given the odd timing (2m16s), most likely the Upgrade Watcher itself failed somehow, although I wasn't able to confirm why. Here we can see that the service is still running but there are no processes under it.
Exactly 2 minutes after the Upgrade Watcher process stopped running, the service itself got uninstalled.
Conclusion: it would appear that setting the longer timeout does not block the uninstall command itself, but the service lingers until the Upgrade Watcher process dies.
Scenario 3: I tested this scenario by starting with a build of Agent that set KillMode=process in the unit file, then upgrading to a build of Agent that crashes on startup.
As expected, the Upgrade Watcher process kept running after the upgraded Agent process crashed.
At this point, I uninstalled the Agent. The command returned almost immediately, as usual, without blocking for too long.
Then I checked the service status.
As the output above shows, the service was still running and, notably, the Upgrade Watcher process was still running. I kept watching the output of systemctl status elastic-agent.service. I also wanted to test that, with the KillMode=process setting, the rest of the upgrade behavior was unaffected.
Conclusions:
Just 2 small updates in the comments, looks good otherwise
Changed the title from "systemd will kill the Upgrade Watcher process after the main Agent process has crashed" to "Prevent systemd from killing the Upgrade Watcher process after the main Agent process has crashed"
LGTM
Force-pushed from ad152a4 to a5da642
This reverts commit 4682050.
…s with Agent rollback
Force-pushed from fc59ab4 to 1745599
Sure, that will be done as part of #3085. It was while writing the test in that PR that I found this bug.
See inline comment.
switch service.ChosenSystem().String() {
case "linux-systemd":
	unitFilePath := "/etc/systemd/system/" + paths.ServiceName + ".service"
	if err := ensureSystemdServiceConfigUpToDate(unitFilePath); err != nil {
I think you should change this to also return a boolean informing the caller if the file was updated.
}

// Reload systemd unit configuration files
cmd := exec.Command("systemctl", "daemon-reload")
With my comment above, I think we should use that boolean value to only call daemon-reload in the case that we actually updated the file. I think calling this every time the Elastic Agent is started is wrong.
Yeah, good point. Will change!
Done in c1986bc.
Force-pushed from 43e99c6 to c1986bc
Looks good! Thanks for addressing all the feedback.
…ter the main Agent process has crashed (elastic#3220)" This reverts commit d4e444e.
What does this PR do?

This PR adds the KillMode=process setting to Elastic Agent's systemd unit configuration file, installed at /etc/systemd/system/elastic-agent.service.

Why is it important?

The KillMode setting governs which processes under the service will be killed by systemd. The default value of the setting is control-group. With this value, all processes under the service are killed. However, with the process value, only the main process under the service is killed while the others are left running.

In the context of Elastic Agent, the main process is the Agent process. During an ongoing upgrade, there is a second process in the same cgroup as the main process: the Upgrade Watcher process. In the scenario where the main process crashes or repeatedly fails to report a healthy status, we need the Upgrade Watcher process to keep running so it can detect the repeated crashing or unhealthy status and then initiate a rollback.

By setting KillMode=process in Elastic Agent's systemd unit configuration file, we ensure that when the main process is repeatedly crashing, systemd keeps the Upgrade Watcher process running so it can fulfill its destiny.
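For reference, the relevant portion of the unit file after this change would look roughly like this (ExecStart path illustrative; other directives elided):

```ini
# /etc/systemd/system/elastic-agent.service (excerpt)
[Service]
ExecStart=/usr/bin/elastic-agent
# Kill only the main Agent process on stop/crash; other processes in
# the service's cgroup (e.g. the Upgrade Watcher) keep running.
KillMode=process
```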
Checklist

- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in ./changelog/fragments using the changelog tool
- I have added an integration test or an E2E test

How to test this PR locally
Test A
For this test, we will upgrade from a pre-built version of Agent, say 8.9.0, to the build of Agent with this PR (8.10.0) and then to another build of Agent that deliberately crashes (say, 8.11.0).

The upgrade from the first to the second build of Agent will test that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it and that the setting is added after the new (post-upgrade) Agent starts running.

The upgrade from the second to the third build of Agent will test that the KillMode=process setting is behaving as expected, by ensuring that the Upgrade Watcher process keeps running even after the new (post-upgrade) Agent process crashes.

Start with a pre-built version of Elastic Agent < 8.10.0.
Unpack and install this Elastic Agent on Linux.
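The unit-file checks in the following steps can be scripted, for example (check_killmode is a hypothetical helper; the default path is where Agent installs its unit file):

```shell
# Report whether KillMode is set in a systemd unit file.
check_killmode() {
  if grep -q '^[[:space:]]*KillMode[[:space:]]*=' "$1" 2>/dev/null; then
    echo "KillMode is set in $1"
  else
    echo "KillMode not set in $1"
    return 1
  fi
}

# Pass another path as the first argument to override the default.
check_killmode "${1:-/etc/systemd/system/elastic-agent.service}" || true
```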
Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.

Check out this PR's branch and build Elastic Agent for Linux.

Upgrade to the newly-built Agent.

Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).

Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that it does not print the warning about the unit file having changed on disk (which would mean daemon-reload is still needed).

Ensure that the upgrade succeeds before moving on. For this, wait until the Upgrade Watcher process has finished running and ensure that the running Elastic Agent's version is 8.10.0.
Bump up the minor version in version/version.go (to, say, 8.11.0), and also return an error at the very start of elastic-agent/internal/pkg/agent/cmd/run.go (line 65 in af31709).
Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.
Build Elastic Agent for Linux.
Upgrade to the newly-built Agent.
Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.

Test B
For this test, we will upgrade from a pre-built version of Agent, say 8.9.0, to a build of Agent that has the changes in this PR but also crashes upon start up (say, 8.10.0).

This will test that the /etc/systemd/system/elastic-agent.service file starts off without the KillMode=process setting in it, that the setting is added after the new (post-upgrade) Agent starts running, and that the setting takes effect immediately so as to keep the Upgrade Watcher process running after the Agent process has crashed.

Start with a pre-built version of Elastic Agent < 8.10.0.
Unpack and install this Elastic Agent on Linux.
Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is absent.

Check out this PR's branch.
Modify the code to return an error right after the changes in this PR have a chance to take effect, to ensure that the built Agent will crash: elastic-agent/internal/pkg/agent/cmd/run.go (lines 108 to 111 in af31709).
Make sure to make a new commit with these changes so the build ID (SHA) in the Agent build is different. Otherwise the upgrade won't work.
Build Elastic Agent for Linux.
Upgrade to the newly-built Agent.
Check the contents of the /etc/systemd/system/elastic-agent.service file and ensure that the KillMode=process setting is now present. Don't worry about any whitespace around the = sign (systemd ignores it).

Ensure that the systemd unit configuration files have been reloaded. For this, run systemctl status elastic-agent.service >/dev/null and ensure that it does not print the warning about the unit file having changed on disk.

Monitor the output of systemctl status elastic-agent.service. Watch the Upgrade Watcher process being spawned, then the new (post-upgrade) Agent process crashing, and ensure that the Upgrade Watcher process keeps running for at least 2 minutes.

Related issues
systemd prematurely kills Upgrade Watcher when upgraded Agent fails to start #3123