
On Fleet Server reboot gives a grace period #1605

Merged: 4 commits merged into elastic:main on Jun 30, 2022

Conversation

@ph ph (Contributor) commented Jun 27, 2022

When Fleet Server is offline for a period greater than or equal to the unenrollTimeout, on reboot the server immediately starts unenrolling Elastic Agents without giving them a chance to communicate with the system. Instead, on reboot we now give a grace period equal to the unenrollTimeout, which gives the Elastic Agents enough time to reach Fleet Server again and update their last check-in time.

Fixes: #1500
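
Below is a minimal Go sketch of the grace-period idea described above. The names runUnenroller and waitWithContext come from the diff later in this PR, but the body here is an illustration under simplified assumptions, not the actual fleet-server implementation (which lives in internal/pkg/coordinator/monitor.go and takes more parameters such as the bulker, policy ID, logger, and agents index).

```go
// Hedged sketch only: parameter list and helpers are simplified.
package main

import (
	"context"
	"log"
	"time"
)

// waitWithContext sleeps for d unless ctx is cancelled first,
// in which case it returns the context error.
func waitWithContext(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// runUnenroller gives agents one full unenrollTimeout of grace after
// startup before it starts enforcing the timeout, so agents that were
// unreachable only because Fleet Server was down can check in again.
func runUnenroller(ctx context.Context, unenrollTimeout, checkInterval time.Duration) {
	log.Printf("giving a %s grace period before enforcing unenrollTimeout", unenrollTimeout)
	if err := waitWithContext(ctx, unenrollTimeout); err != nil {
		return // shutting down before the grace period elapsed
	}
	// ...then run the periodic check (every checkInterval) that unenrolls
	// agents whose last check-in is older than unenrollTimeout.
}

func main() {
	// Tiny demo with short durations so the sketch runs quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	runUnenroller(ctx, 30*time.Millisecond, 10*time.Millisecond)
}
```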

What is the problem this PR solves?

How does this PR solve the problem?

How to test this PR locally

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@ph ph added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Jun 27, 2022
@ph ph requested a review from a team June 27, 2022 20:56
@ph ph self-assigned this Jun 27, 2022
@ph ph requested review from aleksmaus, narph and joshdover and removed request for a team June 27, 2022 20:56
@ph ph requested a review from a team as a code owner June 27, 2022 20:57
@ph ph (Contributor, Author) commented Jun 27, 2022

@joshdover I think this will help with the API key issues we saw.

@elasticmachine elasticmachine (Collaborator) commented Jun 27, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-06-30T14:29:04.699+0000

  • Duration: 11 min 45 sec

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 304
  • Skipped: 1
  • Total: 305

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@ph ph (Contributor, Author) commented Jun 28, 2022

@Mergifyio update

@mergify mergify bot (Contributor) commented Jun 28, 2022

update

✅ Branch has been successfully updated

@jlind23 jlind23 (Contributor) commented Jun 28, 2022

@ph I have relabelled the issue appropriately to make it land in 8.4.

@ph ph (Contributor, Author) commented Jun 29, 2022

The test is fixed and this is ready for review.

@ph ph closed this Jun 29, 2022
@ph ph reopened this Jun 29, 2022
@michalpristas michalpristas (Contributor) left a comment

I like the change.

@@ -453,6 +453,23 @@ func runCoordinatorOutput(ctx context.Context, cord Coordinator, bulker bulk.Bul
}

func runUnenroller(ctx context.Context, bulker bulk.Bulk, policyID string, unenrollTimeout time.Duration, l zerolog.Logger, checkInterval time.Duration, agentsIndex string) {
// When fleet-server is offline for a long period and finally recover, it means that the connected
Member commented:

nit: recover -> recovers

internal/pkg/coordinator/monitor.go: 3 review threads (outdated, resolved)
ph added 3 commits June 30, 2022 10:06
When Fleet Server is offline for a period greater than or equal to the unenrollTimeout, on reboot the server immediately starts unenrolling Elastic Agents without giving them a chance to communicate with the system. Instead, on reboot we give a grace period equal to the unenrollTimeout, which gives the Elastic Agents enough time to reach Fleet Server again and update their last check-in time.

Fixes: elastic#1500
Consider the grace period when executing the test.
@ph ph merged commit e7622db into elastic:main Jun 30, 2022
Msg("giving a grace period to Elastic Agent before enforcing unenrollTimeout monitor")

if err := waitWithContext(ctx, unenrollTimeout); err != nil {
l.Err(err).Dur("unenroll_timeout", unenrollTimeout).
Member commented:

nit: maybe log "context canceled" at debug level?
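
One way the suggestion could look, as a hedged sketch reusing the zerolog logger l, ctx, unenrollTimeout, and waitWithContext from the diff above (it assumes the errors and context packages are imported in monitor.go; the actual change, if any, may differ):

```go
// Sketch of the reviewer's suggestion, not the merged code.
if err := waitWithContext(ctx, unenrollTimeout); err != nil {
	if errors.Is(err, context.Canceled) {
		// Expected during shutdown, so keep it at debug level.
		l.Debug().Dur("unenroll_timeout", unenrollTimeout).
			Msg("grace period interrupted: context canceled")
	} else {
		l.Err(err).Dur("unenroll_timeout", unenrollTimeout).
			Msg("grace period interrupted before unenrollTimeout monitor started")
	}
	return
}
```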

Labels
  • backport-v8.3.0 (Automated backport with mergify)
  • Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not invalidate API Key right after restarting the Elastic Agent.
5 participants