Increase the Windows OnFailureDelayDuration delay to 15s #3657

Merged 3 commits on Oct 26, 2023

cmacknz commented Oct 25, 2023

Increase the Windows OnFailureDelayDuration delay to 15s. This is the delay before the service is restarted when it exits unexpectedly. This is the same value used by endpoint-security by default.

Note that this change only applies to new agent installations. We would need to add code to migrate existing agent installations to the new value.
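
For reference, a minimal sketch of how a restart-on-failure delay like this might be expressed at install time. It assumes the setting is handed to the service layer as a duration string in a key/value options map, with `OnFailure` as a companion key. The helper `windowsServiceOptions` and the constant `onFailureDelay` are hypothetical names for illustration and are not taken from this PR's diff.

```go
package main

import (
	"fmt"
	"time"
)

// onFailureDelay is the delay before Windows restarts the service after an
// unexpected exit. 15s is the value this PR moves to.
const onFailureDelay = 15 * time.Second

// windowsServiceOptions is a hypothetical helper that builds the options
// passed to the service layer when the service is installed. The key names
// and the "restart" action are assumptions for this sketch.
func windowsServiceOptions() map[string]interface{} {
	return map[string]interface{}{
		"OnFailure":              "restart",
		"OnFailureDelayDuration": onFailureDelay.String(),
	}
}

func main() {
	opts := windowsServiceOptions()
	fmt.Println("OnFailureDelayDuration =", opts["OnFailureDelayDuration"])
}
```

If options like these are only read when the service is registered, changing the constant does not touch services that are already installed, which is consistent with the note above about existing installations.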

It was originally suggested that we increase this to 30s+ in #3307 (comment), but I am confident the root cause of that problem was addressed by elastic/elastic-agent-libs#155.

Regardless, I don't think our current default for this value was chosen for any particular reason, and we can at least have agent behave consistently with endpoint. According to @bjmcnic, a 15s delay would also mitigate the original problem in all but the most extreme cases, even if it were to remain unfixed.

This matches the value that endpoint uses and helps mitigate bugs where
agent unexpectedly restarts during a system shutdown.
cmacknz added the Team:Elastic-Agent label Oct 25, 2023
cmacknz self-assigned this Oct 25, 2023
cmacknz requested a review from a team as a code owner October 25, 2023 19:00

elasticmachine commented Oct 25, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-10-26T18:19:48.651+0000

  • Duration: 25 min 23 sec

Test stats 🧪

Test Results
Failed 0
Passed 6553
Skipped 59
Total 6612

💚 Flaky test report

Tests succeeded.

elasticmachine commented Oct 25, 2023

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      98.824% (84/85)             👍
Files         66.885% (204/305)           👍
Classes       65.95% (368/558)            👍
Methods       53.041% (1160/2187)         👍
Lines         39.386% (13670/34708)       👍 0.026
Conditionals  100.0% (0/0)                💚

cmacknz commented Oct 25, 2023

(linux-arm64-ubuntu-2204) Failed for instance linux-arm64-ubuntu-2204 (@ 34.31.33.167): ogc-linux-arm64-ubuntu-2204-9441 unable to continue because stack never became ready: failed to check for cloud 8.12.0-SNAPSHOT to be ready: context deadline exceeded

Arg

cmacknz commented Oct 25, 2023

buildkite test it

pchila left a comment

LGTM

…rvice-restarts-on-failure-to-15s-on-Windows.yaml

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>

cmacknz commented Oct 26, 2023

Force merging, test failure is unrelated. #3657

cmacknz merged commit 910d17b into elastic:main Oct 26, 2023
20 of 21 checks passed
mergify bot pushed a commit that referenced this pull request Oct 26, 2023
* Update Windows OnFailureDelayDuration to 15s.

This matches the value that endpoint uses and helps mitigate bugs where
agent unexpectedly restarts during a system shutdown.

* Add changelog.

* Update changelog/fragments/1698259940-Increase-wait-period-between-service-restarts-on-failure-to-15s-on-Windows.yaml

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>

---------

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>
(cherry picked from commit 910d17b)
cmacknz deleted the agent-win-on-failure-duration branch October 27, 2023 18:44
cmacknz added a commit that referenced this pull request Oct 27, 2023
* Update Windows OnFailureDelayDuration to 15s.

This matches the value that endpoint uses and helps mitigate bugs where
agent unexpectedly restarts during a system shutdown.

* Add changelog.

* Update changelog/fragments/1698259940-Increase-wait-period-between-service-restarts-on-failure-to-15s-on-Windows.yaml

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>

---------

Co-authored-by: Paolo Chilà <paolo.chila@elastic.co>
(cherry picked from commit 910d17b)

Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>