Configure Orbit (component of fleetd) enrollment retry backoff #16594
Comments
Hey @chuckyz, thanks for opening the PR with the improvement. I updated this issue to use Fleet's standard user story template and moved your original issue description below. Please let me know if I'm missing anything in the updated description!

Problem

When Orbit is mass deployed and an issue during that deployment causes the enroll step to retry, the retries happen at a constant interval. In some cases this interval is too short, which puts a lot of stress on the server cluster.

Potential solutions
Hey @sharon-fdm and @lucasmrod heads up, since there's an open PR for this user story, I pulled this user story into the release board. This way, we can track the progress of getting the PR reviewed and merged in the upcoming sprint. cc @lukeheath
@lucasmrod @sharon-fdm Should this be in the "In review" column?
@lukeheath This is a PR from the community that needs some modifications.
Hi @chuckyz! Next week I will be working on this during the current sprint. Let me know what works for you.
@noahtalerman I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd (https://github.com/fleetdm/fleet/pull/17368/files#r1512885699). (I don't see a reason not to back off when there are fleetd enroll failures.) Let me know if it makes sense.
@lucasmrod nice! Not adding any new configuration and instead updating the behavior for everyone (default) is always a win. Makes sense that we should back off by default. @chuckyz what do you think? Heads up, Lucas opened a fresh PR here: #17368. If you get the chance, would love your feedback.
Following are the scenarios to test for QA:

@xpkoala/@sabrinabuckets All tests must be performed on all three OSs.

Scenarios:

A. Test a package with an invalid enroll secret:

    SYSTEMS="macos windows linux" \
    PKG_FLEET_URL=https://localhost:8080 \
    PKG_TUF_URL=http://localhost:8081 \
    DEB_FLEET_URL=https://host.docker.internal:8080 \
    DEB_TUF_URL=http://host.docker.internal:8081 \
    RPM_FLEET_URL=https://host.docker.internal:8080 \
    RPM_TUF_URL=http://host.docker.internal:8081 \
    MSI_FLEET_URL=https://host.docker.internal:8080 \
    MSI_TUF_URL=http://host.docker.internal:8081 \
    GENERATE_PKG=1 \
    GENERATE_DEB=1 \
    GENERATE_RPM=1 \
    GENERATE_MSI=1 \
    ENROLL_SECRET=INVALID_ENROLL_SECRET \
    FLEET_DESKTOP=1 \
    USE_FLEET_SERVER_CERTIFICATE=1 \
    DEBUG=1 \
    ./tools/tuf/test/main.sh

Expected result: You should see enroll failures and retries with a backoff: 10s, 20s, 40s, 80s, 160s, and then it starts over.

B. After (A) is done, push a dummy update to orbit and it should auto-update (even if it hasn't enrolled to Fleet):

    # Dummy change to change the output of `sudo orbit version`
    sed -i '' 's/fmt.Println("orbit "/fmt.Println("orbit2 "/' ./orbit/cmd/orbit/orbit.go
    # Use GOARCH=arm64 if building on an M1 Mac
    GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
    ./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42
    # To verify that it auto-updated successfully, run:
    sudo orbit version

C. Smoke test packages with a valid enroll secret (fleetd should enroll successfully).

D. After testing (C), delete the three hosts from Fleet and they should re-enroll successfully.

E. After (D) is done, push a dummy update to orbit and it should auto-update successfully.
#16594

- [X] Changes file added for user-visible changes in `changes/` or `orbit/changes/`. See [Changes files](https://fleetdm.com/docs/contributing/committing-changes#changes-files) for more information.
- [X] Added/updated tests
- [X] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
    - [X] Manual QA must be performed in the three main OSs, macOS, Windows and Linux.
    - [X] Auto-update manual QA, from released version of component to new version (see [tools/tuf/test](../tools/tuf/test/README.md)).
@noahtalerman was this supposed to be closed out?
Hey @zayhanlon, yes. Looking at the date this was moved to the drafting board (Apr 4), I think this one got lost in the ZenHub boards. @lucasmrod, did this story release a new config or did we update the default behavior for everyone? If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).
I just realized that Lucas is OOO.
Hey @sharon-fdm do you know the answer to the above?
See #16594 (comment). No new configuration. By default, upon enroll failures, fleetd uses the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s, and then returns the failure (max attempts = 6), so a host makes no more than ~6 failed enroll requests every ~5 minutes.
Thanks Lucas! Closing this issue. cc @zayhanlon