@mergify mergify bot commented Jul 11, 2025

What does this PR do?

This PR fixes multiple deadlock conditions in the runtime communicator logic and improves overall lifecycle handling. Specifically, it:

  • Handles potential deadlock when sending the initial observed message to the runtime if the communicator is destroyed.
  • Handles deadlock when the runtime calls CheckinExpected (init expected check-in) and the communicator is already destroyed.
  • Handles deadlock when the runtime calls CheckinExpected (init expected check-in) and the server has been disconnected.
  • Ensures that additional init check-in messages arriving after the first one has completed the init check-in process do not block (highly unlikely given the existing synchronisation primitives, but now handled).
  • Removes a redundant goroutine in the checkin method, which also allows accurate gRPC status codes to be returned to the client. (PS: this optimisation can also be applied to the actions handling.)
  • Adds comprehensive unit tests covering all the above scenarios to prevent regressions.

Why is it important?

These fixes ensure the runtime communicator behaves predictably and safely across lifecycle boundaries like shutdown, reconnection, and concurrent access. Without these changes, users could experience hangs or lost check-in signals in rare but critical failure modes.

You can see that the previous implementation fails to complete these scenarios under test in CI here

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This is a bug-fix for internal lifecycle logic and is not expected to introduce changes in behaviour or configuration requirements for end users.

How to test this PR locally

mage unitTest

Related issues

* ci: write unit-tests for runtime_comm.go

* fix: blocking issues in runtime_comm.go

* fix: QF1004 use strings.ReplaceAll

* fix: guard closing c.runtimeCheckinDone with a local variable

(cherry picked from commit 0d316ce)
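
As an illustration of the last commit message above, one common way to guard a channel close with a local variable is sketched below. This is an assumption about the general pattern, not a copy of the change: `checkinState`, its `done` field, and `complete` are hypothetical names, and the real `c.runtimeCheckinDone` handling may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// checkinState is a hypothetical holder for a "check-in done" channel.
type checkinState struct {
	mu   sync.Mutex
	done chan struct{} // set to nil once the init check-in has completed
}

func newCheckinState() *checkinState {
	return &checkinState{done: make(chan struct{})}
}

// complete closes the done channel at most once and reports whether this
// call performed the close. The field is copied to a local variable and
// cleared under the lock, so a second caller sees nil and returns without
// closing again (closing a closed channel would panic).
func (s *checkinState) complete() bool {
	s.mu.Lock()
	done := s.done
	s.done = nil
	s.mu.Unlock()

	if done == nil {
		return false
	}
	close(done)
	return true
}

func main() {
	s := newCheckinState()
	fmt.Println(s.complete(), s.complete()) // prints "true false"
}
```

A `sync.Once` would give the same at-most-once guarantee; the local-variable form is shown here because it matches the wording of the commit message.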
@mergify mergify bot added the backport label Jul 11, 2025
@mergify mergify bot requested a review from a team as a code owner July 11, 2025 03:12
@mergify mergify bot requested review from blakerouse and michalpristas and removed request for a team July 11, 2025 03:12
@github-actions github-actions bot added the bug and Team:Elastic-Agent-Control-Plane labels Jul 11, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis merged commit 9db3b0a into 8.18 Jul 11, 2025
18 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.18/pr-8881 branch July 11, 2025 05:03