Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Upgrade Watcher] Try restarting Agent in multiple ways during rollback #3268

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Aug 18, 2023

What does this PR do?

This PR fixes a potential bug where, during an upgrade, if the upgraded Agent process crashes, the Upgrade Watcher initiates a rollback but isn't able to restart the rolled back Agent.

Before this PR, the Upgrade Watcher would attempt to restart the rolled back Agent via a GRPC call to the Agent's Restart() function. However, if the Agent process has crashed, there is no GRPC server available to receive such a call, thus preventing the Agent from being restarted.

With this PR, the Upgrade Watcher continues to use the GRPC Restart() call to restart the Agent. However, if that call fails, the Upgrade Watcher uses a fallback method of asking the Elastic Agent service to restart the Agent.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test This bug was found while writing the integration test in [Integration Test] New Agent after upgrade fails to start #3085. As such, this bug fix will be tested by that test.

How to test this PR locally

  1. Build an Agent package from main (currently at version 8.11.0), unpack it, and install it. This will serve as the version of Agent we will upgrade from.

    SNAPSHOT=true EXTERNAL=true DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  2. Build an Agent package from the branch for this PR (say at version 8.12.0). This will serve as the version of Agent we will upgrade to. This Agent will deliberately crash upon start up but it also has the bugfix in the Upgrade Watcher code path.

    DEV=true PLATFORMS=linux/arm64 PACKAGES=tar.gz mage package
    
  3. Upgrade to the 8.12.0 Agent.

    sudo elastic-agent upgrade 8.11.0 --source-uri file://$(pwd)/build/distributions/ --skip-verify
    
  4. Ensure that the Agent version is now 8.12.0. Because the Agent will be crashing, you may only see the binary version, not the daemon version.

    sudo elastic-agent version
    
  5. Tail the Upgrade Watcher logs. Ensure that the Upgrade Watcher eventually initiates rollback.

    sudo tail -F /opt/Elastic/Agent/data/elastic-agent-06be89/logs/elastic-agent-watcher-$(date +%Y%m%d).ndjson
    
  6. Ensure that the rolled back Agent gets restarted.

  7. Ensure that the Agent version is now back to 8.11.0.

    sudo elastic-agent version
    

Related issues

@ycombinator ycombinator added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Aug 18, 2023
@ycombinator ycombinator requested a review from a team as a code owner August 18, 2023 01:44
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Comment on lines 214 to 217
// FIXME: for testing only!
time.Sleep(10 * time.Second)
return errors.New("deliberately crashing agent (after invoking watcher)")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed once this PR is tested and approved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 2af8c30.

@@ -4,5 +4,5 @@

package version

const defaultBeatVersion = "8.11.0"
const defaultBeatVersion = "8.12.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be removed once this PR is tested and approved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 2af8c30.

@ycombinator ycombinator force-pushed the bugfix-rollback-restart-agent-no-process branch from 7b20499 to 06be899 Compare August 18, 2023 01:46
@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 18, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-08-28T18:37:50.894+0000

  • Duration: 28 min 55 sec

Test stats 🧪

Test Results
Failed 0
Passed 6217
Skipped 47
Total 6264

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Comment on lines 148 to 149
install.StopService(topPath)
err := install.StartService(topPath)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it better to create a new install.RestartServive(topPath) wrapper around these two calls?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I like that idea. Logging the error for stop service might be helpful as well, just for debugging issues in the future, having the contents of that error in the log could be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added install.RestartService in 6cb1746. I'm taking care of adding logging to the restart step in #3245.

Copy link
Contributor

@michalpristas michalpristas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, restarting by service should work in rollback cases without a need for check if service is there as for now we don't support upgrade scenario without agent being installed

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me.

Comment on lines 148 to 149
install.StopService(topPath)
err := install.StartService(topPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I like that idea. Logging the error for stop service might be helpful as well, just for debugging issues in the future, having the contents of that error in the log could be useful.

@ycombinator ycombinator enabled auto-merge (squash) August 21, 2023 17:00
@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 21, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.765% (80/81) 👍
Files 67.368% (192/285) 👍
Classes 66.105% (353/534) 👍
Methods 52.896% (1105/2089) 👍
Lines 38.53% (12605/32715) 👎 -0.024
Conditionals 100.0% (0/0) 💚

Comment on lines 176 to 189
func RestartService(topPath string) error {
err := StopService(topPath)
if err != nil {
return err
}

return StartService(topPath)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are dealing with unlikely problems already, if the agent were to stop for whatever reason between the Stop and Start calls it would be left stopped.

The most likely reason this could happen is a laptop system being powered off.

It would be much safer to use systemctl restart or equivalent if it exists because it should prevent this situation. The service manager knows you want the services end state to be running in this case, which is not true with an isolated call to stop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed in d81fe70.

@ycombinator ycombinator force-pushed the bugfix-rollback-restart-agent-no-process branch from 2af8c30 to f5560d0 Compare August 22, 2023 01:28
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ycombinator ycombinator force-pushed the bugfix-rollback-restart-agent-no-process branch from 45f327a to 8917a6e Compare August 24, 2023 21:12
@ycombinator ycombinator force-pushed the bugfix-rollback-restart-agent-no-process branch from 8917a6e to 14ea553 Compare August 25, 2023 14:30
@mergify
Copy link
Contributor

mergify bot commented Aug 28, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix-rollback-restart-agent-no-process upstream/bugfix-rollback-restart-agent-no-process
git merge upstream/main
git push upstream bugfix-rollback-restart-agent-no-process

@ycombinator ycombinator force-pushed the bugfix-rollback-restart-agent-no-process branch from 14ea553 to 44bd7ec Compare August 28, 2023 18:37
@ycombinator ycombinator added backport-v8.9.0 Automated backport with mergify backport-v8.10.0 and removed backport-skip backport-v8.9.0 Automated backport with mergify labels Aug 28, 2023
@pierrehilbert
Copy link
Contributor

Unrelated CI failures. Force merge this one as it will bring some additional stability to our upgrade process.

@pierrehilbert pierrehilbert merged commit 9377dca into elastic:main Aug 29, 2023
28 of 29 checks passed
mergify bot pushed a commit that referenced this pull request Aug 29, 2023
…ck (#3268)

* Try restarting via the service as a backup

* [Testing] Crashing agent

* Adding CHANGELOG fragment

* [Testing] Bump up version

* Adding install.RestartService

* Removing testing code

* Use service manager's restart

* Updating comment

(cherry picked from commit 9377dca)
@ycombinator ycombinator deleted the bugfix-rollback-restart-agent-no-process branch August 29, 2023 13:07
pierrehilbert pushed a commit that referenced this pull request Aug 29, 2023
…ck (#3268) (#3311)

* Try restarting via the service as a backup

* [Testing] Crashing agent

* Adding CHANGELOG fragment

* [Testing] Bump up version

* Adding install.RestartService

* Removing testing code

* Use service manager's restart

* Updating comment

(cherry picked from commit 9377dca)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.10.0 bug Something isn't working Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants