Configure Orbit (component of fleetd) enrollment retry backoff #16594

chuckyz · 2024-02-05T16:07:20Z

Goal

User story
As an endpoint operator deploying Fleet's agent (fleetd) on thousands of hosts,
I want to configure Orbit's (component of fleetd) enrollment retry backoff if enrollment fails
so that I can reduce the amount of stress on the Fleet server.

Changes

Product

fleetd changes: Add exponential backoff to orbit enroll retries #17368
Outdated documentation changes: No documentation needed.

Engineering

Database schema migrations: TODO
Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

Requestor(s): _________________________

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

Engineer (@____): Added comment to user story confirming successful completion of QA.
QA (@____): Added comment to user story confirming successful completion of QA.

noahtalerman · 2024-02-19T22:11:42Z

Hey @chuckyz, thanks for opening the PR with the improvement. I updated this issue to use Fleet's standard user story template.

I moved your original issue description below. Please let me know if I'm missing anything in the updated description!

Problem

When Orbit is mass deployed and an issue during the deployment causes the enroll step to retry, the retries happen at a constant interval. In some cases this interval is too short, which puts a lot of stress on the server cluster.

Potential solutions

Increase FLEETD_ENROLL_RETRY_INTERVAL
Add a basic backoff mechanism into the retry package.

noahtalerman · 2024-02-19T22:16:09Z

Hey @sharon-fdm and @lucasmrod heads up, since there's an open PR for this user story, I pulled this user story into the release board.

This way, we can track the progress of getting the PR reviewed and merged in the upcoming sprint.

cc @lukeheath

lukeheath · 2024-02-22T16:24:17Z

@lucasmrod @sharon-fdm Should this be in the "In review" column?

sharon-fdm · 2024-02-22T16:27:08Z

@lukeheath This is a PR from the community that needs some modifications.
Work on it hasn't started yet.

lucasmrod · 2024-02-22T16:29:27Z

Hi @chuckyz!

Next week I will be working on this during the current sprint.
I may have to make a separate PR because I can't push to your fork (unless you have the time and are planning on making the requested changes).

Let me know what works for you.

lucasmrod · 2024-03-05T14:12:04Z

@noahtalerman I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd (https://github.com/fleetdm/fleet/pull/17368/files#r1512885699).

(I don't see a reason to not do backoff when there are fleetd enroll failures.)

Let me know if it makes sense.

noahtalerman · 2024-03-05T14:42:34Z

> I didn't add any new configuration. The default will now be to always do exponential backoff whenever there are enroll failures in fleetd

@lucasmrod nice! Not adding any new configuration and instead updating the behavior for everyone (default) is always a win.

Makes sense that we should back off by default.

@chuckyz what do you think? Heads up, Lucas opened a fresh PR here: #17368

If you get the chance, would love your feedback.

lucasmrod · 2024-03-05T14:59:25Z

Following are the scenarios to test for QA:

@xpkoala/@sabrinabuckets

All tests must be performed on all three OSs (macOS, Windows, and Linux).

Scenarios:

A. Test a package with an invalid enroll secret:

SYSTEMS="macos windows linux" \
PKG_FLEET_URL=https://localhost:8080 \
PKG_TUF_URL=http://localhost:8081 \
DEB_FLEET_URL=https://host.docker.internal:8080 \
DEB_TUF_URL=http://host.docker.internal:8081 \
RPM_FLEET_URL=https://host.docker.internal:8080 \
RPM_TUF_URL=http://host.docker.internal:8081 \
MSI_FLEET_URL=https://host.docker.internal:8080 \
MSI_TUF_URL=http://host.docker.internal:8081 \
GENERATE_PKG=1 \
GENERATE_DEB=1 \
GENERATE_RPM=1 \
GENERATE_MSI=1 \
ENROLL_SECRET=INVALID_ENROLL_SECRET \
FLEET_DESKTOP=1 \
USE_FLEET_SERVER_CERTIFICATE=1 \
DEBUG=1 \
./tools/tuf/test/main.sh

Expected result: You should see enroll failures and retries with a backoff: 10s, 20s, 40s, 80s, 160s, and then it starts over.

B. After (A) is done, push a dummy update to orbit and it should auto-update (even if it hasn't enrolled to Fleet)
(It may take up to 5 minutes for it to auto-update.)

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit "/fmt.Println("orbit2 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead if building on an M1 (Apple Silicon) Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify that it auto-updated successfully, you can run:
sudo orbit version

C. Smoke test packages with a valid enroll secret (fleetd should enroll successfully).

D. After testing (C), delete the three hosts from Fleet and they should re-enroll successfully.

E. After (D) is done, push a dummy update to orbit and it should auto-update successfully.

# Dummy change to change the output of `sudo orbit version`
sed -i '' 's/fmt.Println("orbit2 "/fmt.Println("orbit3 "/' ./orbit/cmd/orbit/orbit.go

# Use GOARCH=arm64 instead if building on an M1 (Apple Silicon) Mac
GOOS=darwin GOARCH=amd64 go build -o orbit-darwin ./orbit/cmd/orbit
./tools/tuf/test/push_target.sh macos orbit orbit-darwin 42

# To verify it auto-updated successfully you can run:
sudo orbit version

#16594

- [X] Changes file added for user-visible changes in `changes/` or `orbit/changes/`. See [Changes files](https://fleetdm.com/docs/contributing/committing-changes#changes-files) for more information.
- [X] Added/updated tests
- [X] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
    - [X] Manual QA must be performed in the three main OSs, macOS, Windows and Linux.
    - [X] Auto-update manual QA, from released version of component to new version (see [tools/tuf/test](../tools/tuf/test/README.md)).

zayhanlon · 2024-07-29T18:50:38Z

@noahtalerman was this supposed to be closed out?

noahtalerman · 2024-07-30T13:17:03Z

Hey @zayhanlon, yes. Looking at the date this was moved to the drafting board (Apr 4), I think this one got lost in the ZenHub boards.

@lucasmrod, did this story release a new config or did we update the default behavior for everyone?

If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

noahtalerman · 2024-08-01T15:13:24Z

I just realized that Lucas is OOO.

> did this story release a new config or did we update the default behavior for everyone?
>
> If we added a new config, what's the config? I think we want to make sure we updated any reference docs (server config options, API, GitOps, Agent options).

Hey @sharon-fdm do you know the answer to the above?

lucasmrod · 2024-08-07T13:16:21Z

See #16594 (comment).

No new configuration.

By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s, and then return the failure (max attempts = 6). This means it makes no more than ~6 failed enroll requests every ~5 minutes.

noahtalerman · 2024-08-08T17:41:12Z

> No new configuration.
>
> By default, upon enroll failures, fleetd will use the following retry intervals (exponential backoff): 10s, 20s, 40s, 80s, 160s, and then return the failure (max attempts = 6). This means it makes no more than ~6 failed enroll requests every ~5 minutes.

Thanks Lucas! Closing this issue.

cc @zayhanlon

fleet-release · 2024-08-08T17:41:15Z

Orbit's steady pulse,
Tamed by thoughtful code and care,
Servers breathe easier.

chuckyz added the ~feature fest Will be reviewed at next Feature Fest label Feb 5, 2024

chuckyz mentioned this issue Feb 5, 2024

add a backoff during orbit enroll retries #16596

Closed

5 tasks

noahtalerman added story A user story defining an entire feature :product Product Design department (shows up on 🦢 Drafting board) and removed ~feature fest Will be reviewed at next Feature Fest labels Feb 15, 2024

noahtalerman self-assigned this Feb 16, 2024

noahtalerman changed the title ~~Reduce impact on FleetDM server when many clients enroll with errors~~ Configure Orbit (component of fleetd) enrollment retry backoff Feb 19, 2024

noahtalerman assigned sharon-fdm and unassigned noahtalerman Feb 19, 2024

noahtalerman added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group and removed :product Product Design department (shows up on 🦢 Drafting board) labels Feb 19, 2024

sharon-fdm assigned lucasmrod and unassigned sharon-fdm Feb 20, 2024

sharon-fdm added this to the 4.46.0-tentative milestone Feb 20, 2024

noahtalerman added the customer-rocher label Feb 22, 2024

lucasmrod mentioned this issue Mar 5, 2024

Add exponential backoff to orbit enroll retries #17368

Merged

5 tasks

lukeheath modified the milestones: 4.47.0, 4.48.0-tentative Mar 11, 2024

lukeheath added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. labels Apr 4, 2024

noahtalerman assigned noahtalerman and unassigned lucasmrod Jul 30, 2024

noahtalerman closed this as completed Aug 8, 2024
