-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid crashing on startup if Elasticsearch is not available #2693
Conversation
This pull request is now in conflicts. Could you fix it @jsoriano? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a changelog fragment?
Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server
@@ -25,6 +25,7 @@ func TestStandAloneSelfMonitor(t *testing.T) { | |||
title string | |||
searchResult *es.ResultT | |||
searchErr error | |||
initialState client.UnitState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
err = m.ensureLeadership(ctx) | ||
if err != nil { | ||
return err | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coordinator and monitors get started separate goroutines ( in server/fleet.go
) so I don't think anything should go wrong with these changes.
The only difference should be that the API is "available", but should return a 503 in a non-status endpoint (like enroll) is called. Can you add that as an e2e test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference should be that the API is "available", but should return a 503 in a non-status endpoint (like enroll) is called. Can you add that as an e2e test?
Added a middleware that makes any non-status endpoint to return a 503 when the service is not available. And some assertions in the test to check this. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And removed after #2693 (comment) and the discussion about health checks.
Not sure then if there is an E2E test that we can add here, maybe in the controllers that use it?
110e3ba
to
79b7ae5
Compare
9be4d70
to
40194d8
Compare
f3d8816
to
317fb4c
Compare
internal/pkg/api/router.go
Outdated
// stand-alone mode. | ||
if _, isStandAlone := sm.(*policy.StandAloneSelfMonitor); isStandAlone { | ||
r.Use(statusChecker(sm)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I like to have this difference in behaviour, but I think we may need it, we don't have the same healthcheck needings when running as standalone and when running inside agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our alternative is to create a health endpoint and have the platform direct traffic if to fleet-server once ES is healthy. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, once we have healthchecks in place, the platform won't send traffic to us when we cannot reach ES. Should I remove this?
I started adding this after this comment about returning 503s when the healthcheck fails. But I see now that maybe you were referring to the platform and not directly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am removing the middleware I added, and we will rely on readiness probes.
Did we have this discussion? I think we need readiness not to be successful until Fleet Server can accept traffic (so it's connection to ES is working), while liveness should still pass to avoid unnecessary container restarts. |
Not really, thanks for the heads-up on this. TLDR; I would add a healthcheck script to the docker image to check readiness based on current status endpoint, and I wouldn't add a liveness probe at least at the moment. For readiness, we can check now if the status is healthy. We would need a command probe for this. So three options:
I will go for the first option by now so we don't need to modify or extend our APIs. Regarding liveness, I don't think we need to add a probe for this at the moment. I would add it if we know that the Fleet Server can reach some unrecoverable unhealthy state, that afaik is not the case. If we add a liveness probe now, it will be always succeeding unless the pod crashes, what is no different to not having any probe. @michel-laterman @joshdover wdyt? |
I lean towards this option as I see it more of a bug fix than a breaking change, but also open to start with a command probe for now as we experiment and only make the change on the existing endpoint once we have it working the way we want.
Makes sense to me 👍 |
Ok, I will give this a try before adding the script. |
e87b188
to
194fdb2
Compare
194fdb2
to
a089bd8
Compare
return m.updateState(client.UnitStateFailed, fmt.Sprintf("Failed to request policies: %s", err)) | ||
if errors.Is(err, es.ErrIndexNotFound) { | ||
m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error()) | ||
message = "Running: Policies not available yet" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michel-laterman should it be actually healthy if the policies index hasn't been created yet? Would requests succeed if Fleet Server cannot use this index? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it should be unhealthy if the index does not exist, that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?
This PR also changes the status endpoint to return 200 only on a healthy status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it should be unhealthy if the index does not exist, that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?
Not sure the index while not exists if the user do not create any agent policy, so it probably should be healthy even if the index do not exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, let's add the behaviour to the description in the openapi doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description updated in https://github.com/elastic/fleet-server/pull/2693/files#diff-5ad174233c27653aa08c5ce213ddfc13cbf7bcfc10cdf724a547da655e5743c0R830, let me know if this is fine.
return m.updateState(client.UnitStateFailed, fmt.Sprintf("Failed to request policies: %s", err)) | ||
if errors.Is(err, es.ErrIndexNotFound) { | ||
m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error()) | ||
message = "Running: Policies not available yet" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, let's add the behaviour to the description in the openapi doc
Looks like the buildkite failure comes from
|
Ran a rebuild and it seemed to get past that step 👍 |
🚢 |
/test |
What is the problem this PR solves?
Keep Fleet Server process running if Elasticsearch is not available on startup.
How does this PR solve the problem?
Info
request is removed, so the client can be eventually used even if during startup Elasticsearch is not available.How to test this PR locally
Locally I have simulated connectivity interruptions with
iptables
.Design Checklist
Checklist
./changelog/fragments
using the changelog toolRelated issues