
Avoid crashing on startup if Elasticsearch is not available #2693

Merged: 28 commits merged into elastic:main on Jul 6, 2023

Conversation

@jsoriano (Member) commented Jun 13, 2023

What is the problem this PR solves?

Keep Fleet Server process running if Elasticsearch is not available on startup.

How does this PR solve the problem?

  • The initial Info request is removed, so the client can eventually be used even if Elasticsearch is not available during startup.
  • Subsystems that need initialization keep retrying if they fail (see the sketch after this list).
  • The standalone monitor keeps running once the service is healthy. If it loses the connection with Elasticsearch, it moves to a degraded state.
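
A minimal sketch of the retry-until-ready pattern described above (illustrative only; the init callback, the logging, and the 5-second interval are assumptions, not the PR's actual code):

    package main

    import (
        "context"
        "log"
        "time"
    )

    // runWithRetry keeps retrying an initialization step until it succeeds or
    // the context is cancelled, instead of failing fatally on the first error.
    func runWithRetry(ctx context.Context, init func(context.Context) error) error {
        for {
            err := init(ctx)
            if err == nil {
                return nil
            }
            log.Printf("initialization failed, will retry: %v", err)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(5 * time.Second):
            }
        }
    }

    func main() {
        attempts := 0
        err := runWithRetry(context.Background(), func(ctx context.Context) error {
            attempts++
            if attempts < 3 {
                return context.DeadlineExceeded // stand-in for a connection error
            }
            return nil
        })
        if err == nil {
            log.Printf("initialized after %d attempts", attempts)
        }
    }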

How to test this PR locally

  • An e2e test is added that uses toxiproxy to simulate connectivity loss (see the sketch below).
  • In general, this can be tested by:
    • Starting Fleet Server without connectivity to Elasticsearch (or with Elasticsearch down); it should still start, but without reaching the healthy state.
    • Starting Fleet Server and, once it is healthy, interrupting connectivity with Elasticsearch.

Locally I have simulated connectivity interruptions with iptables.
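
For reference, the connectivity loss that the e2e test simulates can be driven from Go with the toxiproxy client. This is a hedged sketch, not the PR's test code: the listen/upstream ports, the proxy name, and the admin address are assumptions.

    package main

    import (
        "log"
        "time"

        toxiproxy "github.com/Shopify/toxiproxy/v2/client"
    )

    func main() {
        // Assumes a toxiproxy daemon running on its default admin port.
        client := toxiproxy.NewClient("localhost:8474")

        // Point Fleet Server at localhost:9201; toxiproxy forwards the
        // traffic to the real Elasticsearch on localhost:9200.
        proxy, err := client.CreateProxy("elasticsearch", "localhost:9201", "localhost:9200")
        if err != nil {
            log.Fatal(err)
        }

        // Cut connectivity: Fleet Server should move to a degraded state.
        if err := proxy.Disable(); err != nil {
            log.Fatal(err)
        }
        time.Sleep(30 * time.Second)

        // Restore connectivity: Fleet Server should become healthy again.
        if err := proxy.Enable(); err != nil {
            log.Fatal(err)
        }
    }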

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

  • Fleet Server should not crash on startup if the connection to Elasticsearch times out

@jsoriano jsoriano requested a review from a team as a code owner June 13, 2023 21:18
@elasticmachine (Collaborator) commented Jun 13, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-07-06T16:19:14.121+0000

  • Duration: 41 min 18 sec

Test stats 🧪

  • Failed: 0
  • Passed: 745
  • Skipped: 1
  • Total: 746

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@mergify[bot] (Contributor) commented Jun 14, 2023

This pull request now has conflicts. Could you fix it @jsoriano? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b retry-connections-startup upstream/retry-connections-startup
git merge upstream/main
git push upstream retry-connections-startup

@michel-laterman (Contributor) left a comment

Can you add a changelog fragment?

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

internal/pkg/es/client.go (comment thread resolved)
@@ -25,6 +25,7 @@ func TestStandAloneSelfMonitor(t *testing.T) {
 	title string
 	searchResult *es.ResultT
 	searchErr error
+	initialState client.UnitState
Contributor:

👍

-	err = m.ensureLeadership(ctx)
-	if err != nil {
-		return err
+	for {
Contributor:

The coordinator and monitors get started in separate goroutines (in server/fleet.go), so I don't think anything should go wrong with these changes.
The only difference should be that the API is "available" but should return a 503 if a non-status endpoint (like enroll) is called. Can you add that as an e2e test?

Member Author:

The only difference should be that the API is "available" but should return a 503 if a non-status endpoint (like enroll) is called. Can you add that as an e2e test?

Added a middleware that makes any non-status endpoint return a 503 when the service is not available, plus some assertions in the test to check this (see the sketch below). Let me know what you think.
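
A minimal sketch of what such a middleware could look like (the UnitStater interface and the /api/status path are assumptions; this is not the PR's actual code):

    package middleware

    import "net/http"

    // UnitStater is an illustrative stand-in for the self-monitor; the real
    // monitor in fleet-server exposes its unit state in a similar way.
    type UnitStater interface {
        Healthy() bool
    }

    // statusChecker returns a 503 for any non-status endpoint while the
    // service is not healthy, so clients cannot enroll or check in before
    // Elasticsearch is reachable.
    func statusChecker(sm UnitStater) func(http.Handler) http.Handler {
        return func(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // Always let the status endpoint through so state stays observable.
                if r.URL.Path != "/api/status" && !sm.Healthy() {
                    http.Error(w, "service unavailable", http.StatusServiceUnavailable)
                    return
                }
                next.ServeHTTP(w, r)
            })
        }
    }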

Member Author:

And removed after #2693 (comment) and the discussion about health checks.

Not sure then if there is an e2e test that we can add here; maybe in the controllers that use it?

	// stand-alone mode.
	if _, isStandAlone := sm.(*policy.StandAloneSelfMonitor); isStandAlone {
		r.Use(statusChecker(sm))
	}
Member Author:

Not sure I like having this difference in behaviour, but I think we may need it; we don't have the same healthcheck needs when running standalone and when running inside the agent.

Contributor:

I think our alternative is to create a health endpoint and have the platform direct traffic to fleet-server only once ES is healthy. What do you think?

Member Author:

Yeah, once we have healthchecks in place, the platform won't send traffic to us when we cannot reach ES. Should I remove this?
I started adding this after this comment about returning 503s when the healthcheck fails. But I see now that maybe you were referring to the platform and not directly here?

Member Author:

I am removing the middleware I added, and we will rely on readiness probes.

@joshdover (Contributor)

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

Did we have this discussion? I think we need readiness not to be successful until Fleet Server can accept traffic (so its connection to ES is working), while liveness should still pass to avoid unnecessary container restarts.

@jsoriano (Member Author)

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

Did we have this discussion? I think we need readiness not to be successful until Fleet Server can accept traffic (so its connection to ES is working), while liveness should still pass to avoid unnecessary container restarts.

Not really, thanks for the heads-up on this.

TL;DR: I would add a healthcheck script to the Docker image to check readiness based on the current status endpoint, and I wouldn't add a liveness probe, at least for the moment.

For readiness, we can check now whether the status is healthy. We would need a command probe for this.
If we want to use a simpler HTTP probe and rely only on the status code, we would need to modify the current handler, because it now returns 200 when in a degraded state, or define a new /healthz endpoint that returns 200 only when healthy.

So three options:

  • Use a command probe that uses the status endpoint to check if the service is healthy. We could add a healthcheck.sh script to the image for this.
  • Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change (see the sketch after this list).
  • Add a new /healthz endpoint that returns 200 only when the service is healthy.
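
For illustration, a hedged sketch of what the second or third option would amount to (the SelfMonitor interface and the state strings are assumptions, not fleet-server's actual types):

    package api

    import (
        "encoding/json"
        "net/http"
    )

    // SelfMonitor is illustrative; the real monitor reports the unit state.
    type SelfMonitor interface {
        State() string // e.g. "HEALTHY", "DEGRADED", "FAILED"
    }

    // handleStatus returns 200 only when the service is healthy and 503
    // otherwise, so a plain HTTP readiness probe can rely on the status
    // code alone instead of parsing the response body.
    func handleStatus(sm SelfMonitor) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            state := sm.State()
            code := http.StatusOK
            if state != "HEALTHY" {
                code = http.StatusServiceUnavailable
            }
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(code)
            _ = json.NewEncoder(w).Encode(map[string]string{"status": state})
        }
    }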

I will go for the first option for now, so we don't need to modify or extend our APIs.

Regarding liveness, I don't think we need to add a probe for this at the moment. I would add it if we knew that Fleet Server can reach some unrecoverable unhealthy state, which afaik is not the case. If we add a liveness probe now, it will always succeed unless the pod crashes, which is no different from not having any probe.

@michel-laterman @joshdover wdyt?

@joshdover (Contributor)

Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change.

I lean towards this option, as I see it as more of a bug fix than a breaking change, but I'm also open to starting with a command probe for now as we experiment, and only making the change on the existing endpoint once we have it working the way we want.

Regarding liveness, I don't think we need to add a probe for this at the moment. I would add it if we knew that Fleet Server can reach some unrecoverable unhealthy state, which afaik is not the case. If we add a liveness probe now, it will always succeed unless the pod crashes, which is no different from not having any probe.

Makes sense to me 👍

@jsoriano (Member Author)

Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change.

I lean towards this option, as I see it as more of a bug fix than a breaking change, but I'm also open to starting with a command probe for now as we experiment, and only making the change on the existing endpoint once we have it working the way we want.

Ok, I will give this a try before adding the script.

@jsoriano jsoriano force-pushed the retry-connections-startup branch 2 times, most recently from e87b188 to 194fdb2, on June 26, 2023 11:18
-	return m.updateState(client.UnitStateFailed, fmt.Sprintf("Failed to request policies: %s", err))
+	if errors.Is(err, es.ErrIndexNotFound) {
+		m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error())
+		message = "Running: Policies not available yet"
Member Author:

@michel-laterman should it actually be healthy if the policies index hasn't been created yet? Would requests succeed if Fleet Server cannot use this index? 🤔

Contributor:

IMO it should be unhealthy if the index does not exist; that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?
This PR also changes the status endpoint to return 200 only on a healthy status.

Member:

IMO it should be unhealthy if the index does not exist; that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?

Not sure; the index will not exist if the user does not create any agent policy, so it probably should be healthy even if the index does not exist.

Contributor:

Sounds good, let's add the behaviour to the description in the openapi doc.


@jlind23 (Contributor) commented Jul 6, 2023

Looks like the Buildkite failure comes from:

    go: cloud.google.com/go/pubsublite@v1.7.0: verifying go.mod: cloud.google.com/go/pubsublite@v1.7.0/go.mod: reading https://sum.golang.org/tile/8/1/252: 404 Not Found

@joshdover (Contributor)

Ran a rebuild and it seemed to get past that step 👍

@jlind23 (Contributor) commented Jul 6, 2023

🚢

@jsoriano (Member Author) commented Jul 6, 2023

/test

@jsoriano jsoriano merged commit b6dba34 into elastic:main Jul 6, 2023
18 checks passed
@jsoriano jsoriano deleted the retry-connections-startup branch July 6, 2023 17:28
Development

Successfully merging this pull request may close these issues:

  • Fleet Server should not crash on startup if the connection to Elasticsearch times out

6 participants