@michalpristas
Contributor

@michalpristas michalpristas commented Sep 1, 2025

This PR is huge, but it does just a few things.

It no longer uses elastic-agent-libs to call ProcessWindowsControlEvents. ProcessWindowsControlEvents creates an object that serves as the service instance and communicates with the service manager. Having this in libs means the init sections of the whole elastic-agent-libs dependency tree run before we register with the service manager.

It splits the agent/cmd package into subpackages, isolating each command. Service manager communication is then moved out to internal/pkg/agent/agentservice, as described in the first step. With this split, and with the service living in a deliberately named package, the init section of agentservice is called sooner during initialization.

In the agentservice init we use a WaitGroup when spinning up the ProcessWindowsControlEvents goroutine. This blocks loading of subsequent packages and avoids the possibility of starving ourselves of resources (when new goroutines, subprocesses, and so on are started later). We cannot guarantee the ordering of goroutines or that this one will be up in time; this is a best-effort way of achieving that.
It's essential to have a package named like this because of how init section loading is implemented (alphabetical sorting plays a significant role).
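
A minimal sketch of the init-time blocking described above, assuming a hypothetical `processControlEvents` function standing in for the real ProcessWindowsControlEvents goroutine (the names `started`, `record`, and `events` are illustrative and not taken from the PR). It only shows that an `init` function can start a goroutine and wait on a WaitGroup until that goroutine has come up, before initialization moves on:

```go
package main

import (
	"fmt"
	"sync"
)

// started is released once the simulated service goroutine is running.
var started sync.WaitGroup

var (
	mu     sync.Mutex
	events []string // records the order of steps for inspection
)

func record(step string) {
	mu.Lock()
	defer mu.Unlock()
	events = append(events, step)
}

// processControlEvents is a hypothetical stand-in for the goroutine that
// talks to the Windows service manager. It is NOT the real implementation.
func processControlEvents() {
	record("service goroutine running")
	started.Done()
	// Real code would keep handling service control requests here.
}

func init() {
	started.Add(1)
	go processControlEvents()
	// Block this package's init (and so every package initialized after it)
	// until the goroutine has actually started.
	started.Wait()
	record("init returned")
}

func main() {
	record("main started")
	for _, e := range events {
		fmt.Println(e)
	}
}
```

The WaitGroup only guarantees that the goroutine has started before `init` returns; it says nothing about how early this package's `init` runs relative to others, which is where the package naming comes in.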

The result is that we not only moved the communication much earlier in the init section (normally it happens at the end of init because of the big dependency tree),
but also moved it before the initialization of

  • composables
  • azure, aws sdks,
  • k8s,
  • prometheus and cockroachdb,
  • >95% of otel dependencies,
  • beats

This makes separating otel into a custom binary not critical for issues related to Windows service manager communication timeouts.

To make review easier, I added a comment on each actual code change, as these are not that frequent.

Related #4971

@michalpristas michalpristas self-assigned this Sep 1, 2025
@michalpristas michalpristas added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-skip skip-changelog labels Sep 1, 2025
@michalpristas michalpristas marked this pull request as ready for review September 2, 2025 10:25
@michalpristas michalpristas requested a review from a team as a code owner September 2, 2025 10:25
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ebeahan ebeahan removed the request for review from nkvoll September 2, 2025 15:42
@michalpristas michalpristas added backport-active-all Automated backport with mergify to all the active branches and removed backport-skip labels Sep 3, 2025
@michalpristas michalpristas marked this pull request as draft September 3, 2025 13:46
@michalpristas michalpristas marked this pull request as ready for review September 4, 2025 11:43
@mergify
Contributor

mergify bot commented Sep 5, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b play/init-svc-start upstream/play/init-svc-start
git merge upstream/main
git push upstream play/init-svc-start

@mergify
Contributor

mergify bot commented Sep 10, 2025

This pull request is now in conflicts. Could you fix it? 🙏

// couldNotConnect is the errno for ERROR_FAILED_SERVICE_CONTROLLER_CONNECT.
const couldNotConnect syscall.Errno = 1063

type beatService struct {
Contributor Author

code change: bringing in the dependency

// After this is run, the service is considered by the OS to be stopped.
// This must be the first deferred cleanup task (last to execute).
defer func() {
agentservice.NotifyTermination()
Contributor Author

code change: calling internal implementation

agentservice.WaitExecutionDone()
}()

service.BeforeRun()
Contributor Author

these are original
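
The hunk above relies on Go running deferred calls in last-in, first-out order, which is why the termination notification must be deferred first in order to execute last. A minimal sketch of that ordering, using a hypothetical `runWithCleanup` function and placeholder step names rather than the agent's real cleanup tasks:

```go
package main

import "fmt"

// runWithCleanup returns the order in which the steps executed.
// The step names are placeholders, not the agent's real cleanup tasks.
func runWithCleanup() (order []string) {
	// Deferred first, so it executes last (defers run in LIFO order).
	defer func() { order = append(order, "notify termination") }()
	// Deferred second, so it executes before the notification.
	defer func() { order = append(order, "stop components") }()

	order = append(order, "run")
	return order
}

func main() {
	for i, step := range runWithCleanup() {
		fmt.Printf("%d: %s\n", i+1, step)
	}
}
```

Because `order` is a named result, the deferred closures can still append to it after the `return` statement has been evaluated, so the returned slice reflects the full execution order.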

// or more contributor license agreements. Licensed under the Elastic License 2.0;
// you may not use this file except in compliance with the Elastic License 2.0.

package agentrun
Contributor Author

code change: run -> agentrun to help with alphabetical ordering during init

"github.com/elastic/elastic-agent/testing/integration"
)

func TestInitOrderNotDegraded(t *testing.T) {
Contributor Author

code change: added test
I'm not entirely happy with the testing criteria, but at least it checks that this is called during init and not at the end of it.
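
The body of TestInitOrderNotDegraded is not shown in this thread, so the following is only a sketch of one way such a check could work, not the PR's actual test. It assumes the binary is run with `GODEBUG=inittrace=1`, which makes the Go runtime print one `init <package> @...` line per package with init work to stderr (the runtime documents this line format as subject to change). The `initPosition` helper, the sample trace, and the `example.com/...` package paths are made up for illustration:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// initPosition scans GODEBUG=inittrace=1 style output and returns the
// zero-based position of pkg among the "init" lines plus the total number
// of init lines. pos is -1 if pkg does not appear.
func initPosition(trace, pkg string) (pos, total int) {
	pos = -1
	sc := bufio.NewScanner(strings.NewReader(trace))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "init" {
			continue // not an inittrace line
		}
		if fields[1] == pkg && pos < 0 {
			pos = total
		}
		total++
	}
	return pos, total
}

// sampleTrace is fabricated example input, not output from the agent.
const sampleTrace = `
init internal/bytealg @0.010 ms, 0 ms clock, 0 bytes, 0 allocs
init example.com/agentservice @0.50 ms, 0.02 ms clock, 512 bytes, 4 allocs
init example.com/heavy/sdk @1.2 ms, 3.1 ms clock, 20480 bytes, 120 allocs
init main @5.0 ms, 0 ms clock, 0 bytes, 0 allocs
`

func main() {
	pos, total := initPosition(sampleTrace, "example.com/agentservice")
	fmt.Printf("initialized at position %d of %d\n", pos, total)
}
```

A test built on this idea could fail when the service package's position drifts past some threshold, which is a weaker criterion than a timing guarantee but does catch the package sliding to the end of initialization.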

@mergify
Contributor

mergify bot commented Sep 26, 2025

This pull request is now in conflicts. Could you fix it? 🙏

@elastic-sonarqube

Quality Gate failed

Failed conditions
21.1% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

cc @michalpristas

@mergify
Contributor

mergify bot commented Oct 1, 2025

This pull request is now in conflicts. Could you fix it? 🙏

@ebeahan
Member

ebeahan commented Nov 5, 2025

@michalpristas do we still need to pursue the changes here? Or can we close?

@michalpristas
Contributor Author

My preference would be to close this and go with Blake's proposal, as it is less fragile and more deterministic.

@ebeahan ebeahan closed this Nov 6, 2025
Labels

backport-active-all Automated backport with mergify to all the active branches bug Something isn't working skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

4 participants