Configure Orbit (component of fleetd) enrollment retry backoff #16594

Closed
1 of 6 tasks
chuckyz opened this issue Feb 5, 2024 · 14 comments
Assignees
Labels
customer-rocher #g-endpoint-ops Endpoint ops product group :product Product Design department (shows up on 🦢 Drafting board) story A user story defining an entire feature
Milestone

Comments

@chuckyz
Contributor

chuckyz commented Feb 5, 2024

Goal

User story
As an endpoint operator deploying Fleet's agent (fleetd) on thousands of hosts,
I want to configure the enrollment retry backoff of Orbit (a component of fleetd) when enrollment fails
so that I can reduce the amount of stress on the Fleet server.

Changes

Product

Engineering

  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

  • Requestor(s): _________________________

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@chuckyz chuckyz added the ~feature fest Will be reviewed at next Feature Fest label Feb 5, 2024
@noahtalerman noahtalerman added story A user story defining an entire feature :product Product Design department (shows up on 🦢 Drafting board) and removed ~feature fest Will be reviewed at next Feature Fest labels Feb 15, 2024
@noahtalerman noahtalerman self-assigned this Feb 16, 2024
@noahtalerman
Member

Hey @chuckyz, thanks for opening the PR with the improvement. I updated this issue to use Fleet's standard user story template.

I moved your original issue description below. Please let me know if I'm missing anything in the updated description!

Problem

When Orbit is mass deployed and an issue during the deployment causes the enroll step to retry, the retries happen at a fixed interval. In some cases this interval is too short, which puts a lot of stress on the server cluster.

Potential solutions

  1. Increase FLEETD_ENROLL_RETRY_INTERVAL
  2. Add a basic backoff mechanism into the retry package.
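
A minimal sketch of what the second option could look like, assuming a doubling interval and a cap on attempts. `retryWithBackoff` and `recordedWaits` are hypothetical names for illustration and are not the API of Fleet's retry package; the 10s starting interval and the limit of 6 attempts are illustrative values. The sleep function is injected so the example runs without real delays.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff is a hypothetical helper (not Fleet's retry package API).
// It calls fn up to maxAttempts times. After each failed attempt except the
// last, it waits, then multiplies the wait by multiplier.
func retryWithBackoff(fn func() error, initial time.Duration, multiplier, maxAttempts int, sleep func(time.Duration)) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	wait := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			sleep(wait)
			wait *= time.Duration(multiplier)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// recordedWaits simulates an enroll call that fails `failures` times before
// succeeding, and returns the waits (in seconds) that would have been slept.
func recordedWaits(failures, maxAttempts int) ([]int, error) {
	var waits []int
	calls := 0
	enroll := func() error {
		calls++
		if calls <= failures {
			return errors.New("enroll failed")
		}
		return nil
	}
	// Record the requested wait instead of actually sleeping.
	fakeSleep := func(d time.Duration) {
		waits = append(waits, int(d/time.Second))
	}
	err := retryWithBackoff(enroll, 10*time.Second, 2, maxAttempts, fakeSleep)
	return waits, err
}

func main() {
	// An enroll call that never succeeds within 6 attempts.
	waits, err := recordedWaits(100, 6)
	fmt.Println(waits, err)
}
```

With an enroll call that keeps failing, the recorded waits are 10, 20, 40, 80, and 160 seconds before the error is returned. In real use, `time.Sleep` would be passed as the `sleep` argument.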

@noahtalerman noahtalerman changed the title Reduce impact on FleetDM server when many clients enroll with errors Configure Orbit (component of fleetd) enrollment retry backoff Feb 19, 2024
@noahtalerman noahtalerman added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group and removed :product Product Design department (shows up on 🦢 Drafting board) labels Feb 19, 2024
@noahtalerman
Member

Hey @sharon-fdm and @lucasmrod heads up, since there's an open PR for this user story, I pulled this user story into the release board.

This way, we can track the progress of getting the PR reviewed and merged in the upcoming sprint.

cc @lukeheath

@sharon-fdm sharon-fdm assigned lucasmrod and unassigned sharon-fdm Feb 20, 2024
@sharon-fdm sharon-fdm added this to the 4.46.0-tentative milestone Feb 20, 2024
@lukeheath
Member

@lucasmrod @sharon-fdm Should this be in the "In review" column?

@sharon-fdm
Collaborator

@lukeheath This is a PR from the community that needs some modifications.
Work on it hasn't started yet.

@lucasmrod
Member

Hi @chuckyz!

Next week I will be working on this during the current sprint.
I may have to make a separate PR because I can't push to your fork (unless you have the time and are planning on making the requested changes).

Let me know what works for you.

@lucasmrod
Member

@noahtalerman I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd (https://github.com/fleetdm/fleet/pull/17368/files#r1512885699).

(I don't see a reason to not do backoff when there are fleetd enroll failures.)

Let me know if it makes sense.

@noahtalerman
Member

I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd

@lucasmrod nice! Not adding any new configuration and instead updating the behavior for everyone (default) is always a win.

Makes sense that we should back off by default.

@chuckyz what do you think? Heads up, Lucas opened a fresh PR here: #17368

If you get the chance, would love your feedback.

@lucasmrod
Member

Following are the scenarios to test for QA:

@xpkoala/@sabrinabuckets

All tests must be performed in the three OSs.

Scenarios:

A. Test a package with an invalid enroll secret:

SYSTEMS="macos windows linux" \
PKG_FLEET_URL=https://localhost:8080 \
PKG_TUF_URL=http://localhost:8081 \
DEB_FLEET_URL=https://host.docker.internal:8080 \
DEB_TUF_URL=http://host.docker.internal:8081 \
RPM_FLEET_URL=https://host.docker.internal:8080 \
RPM_TUF_URL=http://host.docker.internal:8081 \
MSI_FLEET_URL=https://host.docker.internal:8080 \
MSI_TUF_URL=http://host.docker.internal:8081 \
GENERATE_PKG=1 \
GENERATE_DEB=1 \
GENERATE_RPM=1 \
GENERATE_MSI=1 \
ENROLL_SECRET=INVALID_ENROLL_SECRET \
FLEET_DESKTOP=1 \
USE_FLEET_SERVER_CERTIFICATE=1 \
DEBUG=1 \
./tools/tuf/test/main.sh

Expected result: You should see enroll failures and retries with a backoff: 10s, 20s, 40s, 80s, 160s, and then it starts over.

B. After (A) is done, push a dummy update to orbit and it should auto-update (even if it hasn't enrolled to Fleet)
(It may take up to 5 minutes for it to auto-update.)

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit "/fmt.Println("orbit2 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead on Apple silicon (e.g. M1)
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify that it auto-updated successfully, you can run:
sudo orbit version

C. Smoke test packages with a valid enroll secret (fleetd should enroll successfully).

D. After testing (C), delete the three hosts from Fleet and they should re-enroll successfully.

E. After (D) is done, push a dummy update to orbit and it should auto-update successfully.

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit2 "/fmt.Println("orbit3 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead on Apple silicon (e.g. M1)
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify it auto-updated successfully you can run:
sudo orbit version

@lukeheath lukeheath modified the milestones: 4.47.0, 4.48.0-tentative Mar 11, 2024
lucasmrod added a commit that referenced this issue Mar 13, 2024
#16594

- [X] Changes file added for user-visible changes in `changes/` or `orbit/changes/`. See [Changes files](https://fleetdm.com/docs/contributing/committing-changes#changes-files) for more information.
- [X] Added/updated tests
- [X] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
    - [X] Manual QA must be performed in the three main OSs, macOS, Windows and Linux.
    - [X] Auto-update manual QA, from released version of component to new version (see [tools/tuf/test](../tools/tuf/test/README.md)).
mostlikelee pushed a commit that referenced this issue Mar 22, 2024
mostlikelee pushed a commit that referenced this issue Mar 22, 2024
@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. labels Apr 4, 2024
@zayhanlon
Contributor

@noahtalerman was this supposed to be closed out?

@noahtalerman
Member

Hey @zayhanlon, yes. Looking at the date this was moved to the drafting board (Apr 4), I think this one got lost in the ZenHub boards.

@lucasmrod, did this story release a new config or did we update the default behavior for everyone?

If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

@noahtalerman
Member

I just realized that Lucas is OOO.

did this story release a new config or did we update the default behavior for everyone?

If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

Hey @sharon-fdm do you know the answer to the above?

@lucasmrod
Member

See #16594 (comment).

No new configuration.

By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, and 160s. It then returns the failure (max attempts = 6), so it makes no more than ~6 failed enroll requests every ~5 minutes.
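
The figures above can be checked with a short calculation. This is a sketch, not fleetd code: `backoffSchedule` and `sum` are hypothetical helpers, and the arithmetic assumes the time spent on each request is negligible and that a new cycle starts right after the failure is returned.

```go
package main

import "fmt"

// backoffSchedule returns the waits, in seconds, between consecutive attempts
// when the wait starts at initialSeconds and is multiplied after every failure.
// With maxAttempts attempts there are maxAttempts-1 waits.
func backoffSchedule(initialSeconds, multiplier, maxAttempts int) []int {
	var waits []int
	wait := initialSeconds
	for i := 1; i < maxAttempts; i++ {
		waits = append(waits, wait)
		wait *= multiplier
	}
	return waits
}

// sum adds up all values in xs.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	waits := backoffSchedule(10, 2, 6)
	total := sum(waits)
	fmt.Println(waits)                                   // [10 20 40 80 160]
	fmt.Printf("%ds = %dm%ds\n", total, total/60, total%60) // 310s = 5m10s
}
```

Six attempts spread over 310 seconds of waiting matches the "~6 failed requests every ~5 minutes" estimate.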

@noahtalerman
Member

No new configuration.

By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, and 160s. It then returns the failure (max attempts = 6), so it makes no more than ~6 failed enroll requests every ~5 minutes.

Thanks Lucas! Closing this issue.

cc @zayhanlon

@fleet-release
Contributor

Orbit's steady pulse,
Tamed by thoughtful code and care,
Servers breathe easier.


7 participants