
On Fleet Server reboot gives a grace period #1605

Merged: 4 commits merged into elastic:main on Jun 30, 2022

Conversation

@ph ph (Contributor) commented Jun 27, 2022

When Fleet Server is offline for a period greater than or equal to the unenrollTimeout, on reboot the server immediately starts unenrolling Elastic Agents without giving them a chance to communicate with the system. Instead, on reboot we now give a grace period equal to the unenrollTimeout, which gives the Elastic Agents enough time to reach Fleet Server again and update their last check-in time.

Fixes: #1500
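
Below is a minimal Go sketch of the grace-period idea described above. The names runUnenroller and waitWithContext come from the diff later in this PR, but the body here is an illustration under simplified assumptions, not the actual fleet-server implementation (which lives in internal/pkg/coordinator/monitor.go and takes more parameters such as the bulker, policy ID, logger, and agents index).

```go
// Hedged sketch only: parameter list and helpers are simplified.
package main

import (
	"context"
	"log"
	"time"
)

// waitWithContext sleeps for d unless ctx is cancelled first,
// in which case it returns the context error.
func waitWithContext(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// runUnenroller gives agents one full unenrollTimeout of grace after
// startup before it starts enforcing the timeout, so agents that were
// unreachable only because Fleet Server was down can check in again.
func runUnenroller(ctx context.Context, unenrollTimeout, checkInterval time.Duration) {
	log.Printf("giving a %s grace period before enforcing unenrollTimeout", unenrollTimeout)
	if err := waitWithContext(ctx, unenrollTimeout); err != nil {
		return // shutting down before the grace period elapsed
	}
	// ...then run the periodic check (every checkInterval) that unenrolls
	// agents whose last check-in is older than unenrollTimeout.
}

func main() {
	// Tiny demo with short durations so the sketch runs quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	runUnenroller(ctx, 30*time.Millisecond, 10*time.Millisecond)
}
```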

What is the problem this PR solves?

How does this PR solve the problem?

How to test this PR locally

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@ph ph added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Jun 27, 2022
@ph ph requested a review from a team June 27, 2022 20:56
@ph ph self-assigned this Jun 27, 2022
@ph ph requested review from aleksmaus, narph and joshdover and removed request for a team June 27, 2022 20:56
@ph ph requested a review from a team as a code owner June 27, 2022 20:57
@ph ph (Contributor, Author) commented Jun 27, 2022

@joshdover I think this will help with the API key issues we saw.

@elasticmachine elasticmachine (Collaborator) commented Jun 27, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-06-30T14:29:04.699+0000

  • Duration: 11 min 45 sec

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 304
  • Skipped: 1
  • Total: 305

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@ph ph (Contributor, Author) commented Jun 28, 2022

@Mergifyio update

@mergify mergify bot (Contributor) commented Jun 28, 2022

update

✅ Branch has been successfully updated

@jlind23 jlind23 (Contributor) commented Jun 28, 2022

@ph I have relabelled the issue appropriately to make it land in 8.4.

@ph ph (Contributor, Author) commented Jun 29, 2022

The test is fixed and this is ready for review.

@ph ph closed this Jun 29, 2022
@ph ph reopened this Jun 29, 2022
@michalpristas michalpristas (Contributor) left a comment

I like the change.

@@ -453,6 +453,23 @@ func runCoordinatorOutput(ctx context.Context, cord Coordinator, bulker bulk.Bul
}

func runUnenroller(ctx context.Context, bulker bulk.Bulk, policyID string, unenrollTimeout time.Duration, l zerolog.Logger, checkInterval time.Duration, agentsIndex string) {
// When fleet-server is offline for a long period and finally recover, it means that the connected
Member commented:

nit: recover -> recovers

internal/pkg/coordinator/monitor.go: 3 review threads (outdated, resolved)
ph added 3 commits June 30, 2022 10:06
When Fleet Server is offline for a period greater than or equal to the unenrollTimeout, on reboot the server immediately starts unenrolling Elastic Agents without giving them a chance to communicate with the system. Instead, on reboot we give a grace period equal to the unenrollTimeout, which gives the Elastic Agents enough time to reach Fleet Server again and update their last check-in time.

Fixes: elastic#1500
Consider the grace period when executing the test.
@ph ph merged commit e7622db into elastic:main Jun 30, 2022
Msg("giving a grace period to Elastic Agent before enforcing unenrollTimeout monitor")

if err := waitWithContext(ctx, unenrollTimeout); err != nil {
l.Err(err).Dur("unenroll_timeout", unenrollTimeout).
Member commented:

nit: maybe log "context canceled" at debug level?
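
One way the suggestion could look, as a hedged sketch reusing the zerolog logger l, ctx, unenrollTimeout, and waitWithContext from the diff above (it assumes the errors and context packages are imported in monitor.go; the actual change, if any, may differ):

```go
// Sketch of the reviewer's suggestion, not the merged code.
if err := waitWithContext(ctx, unenrollTimeout); err != nil {
	if errors.Is(err, context.Canceled) {
		// Expected during shutdown, so keep it at debug level.
		l.Debug().Dur("unenroll_timeout", unenrollTimeout).
			Msg("grace period interrupted: context canceled")
	} else {
		l.Err(err).Dur("unenroll_timeout", unenrollTimeout).
			Msg("grace period interrupted before unenrollTimeout monitor started")
	}
	return
}
```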

Labels
  • backport-v8.3.0 (Automated backport with mergify)
  • Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not invalidate API Key right after restarting the Elastic Agent.
5 participants