
Avoid crashing on startup if Elasticsearch is not available #2693

Merged: 28 commits merged into elastic:main on Jul 6, 2023

Conversation

@jsoriano (Member) commented Jun 13, 2023

What is the problem this PR solves?

Keep Fleet Server process running if Elasticsearch is not available on startup.

How does this PR solve the problem?

  • The initial Info request is removed, so the client can eventually be used even if Elasticsearch is not available during startup.
  • Subsystems that need initialization keep retrying if they fail (see the sketch after this list).
  • The standalone monitor keeps running once the service is healthy. If it loses the connection with Elasticsearch, it moves to a degraded state.
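
A minimal sketch of the retry-until-ready pattern described above (illustrative only; the init callback, the logging, and the 5-second interval are assumptions, not the PR's actual code):

    package main

    import (
        "context"
        "log"
        "time"
    )

    // runWithRetry keeps retrying an initialization step until it succeeds or
    // the context is cancelled, instead of failing fatally on the first error.
    func runWithRetry(ctx context.Context, init func(context.Context) error) error {
        for {
            err := init(ctx)
            if err == nil {
                return nil
            }
            log.Printf("initialization failed, will retry: %v", err)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(5 * time.Second):
            }
        }
    }

    func main() {
        attempts := 0
        err := runWithRetry(context.Background(), func(ctx context.Context) error {
            attempts++
            if attempts < 3 {
                return context.DeadlineExceeded // stand-in for a connection error
            }
            return nil
        })
        if err == nil {
            log.Printf("initialized after %d attempts", attempts)
        }
    }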

How to test this PR locally

  • An e2e test is added that uses toxiproxy to simulate connectivity loss (see the sketch below).
  • In general, this can be tested by:
    • Starting Fleet Server without connectivity to Elasticsearch (or with Elasticsearch down); it should still start, but without reaching the healthy state.
    • Starting Fleet Server and, once it is healthy, interrupting connectivity with Elasticsearch.

Locally I have simulated connectivity interruptions with iptables.
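
For reference, the connectivity loss that the e2e test simulates can be driven from Go with the toxiproxy client. This is a hedged sketch, not the PR's test code: the listen/upstream ports, the proxy name, and the admin address are assumptions.

    package main

    import (
        "log"
        "time"

        toxiproxy "github.com/Shopify/toxiproxy/v2/client"
    )

    func main() {
        // Assumes a toxiproxy daemon running on its default admin port.
        client := toxiproxy.NewClient("localhost:8474")

        // Point Fleet Server at localhost:9201; toxiproxy forwards the
        // traffic to the real Elasticsearch on localhost:9200.
        proxy, err := client.CreateProxy("elasticsearch", "localhost:9201", "localhost:9200")
        if err != nil {
            log.Fatal(err)
        }

        // Cut connectivity: Fleet Server should move to a degraded state.
        if err := proxy.Disable(); err != nil {
            log.Fatal(err)
        }
        time.Sleep(30 * time.Second)

        // Restore connectivity: Fleet Server should become healthy again.
        if err := proxy.Enable(); err != nil {
            log.Fatal(err)
        }
    }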

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

  • Fleet Server should not crash on startup if the connection to Elasticsearch times out

@jsoriano jsoriano requested a review from a team as a code owner June 13, 2023 21:18
@elasticmachine (Collaborator) commented Jun 13, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-07-06T16:19:14.121+0000

  • Duration: 41 min 18 sec

Test stats 🧪

  • Failed: 0
  • Passed: 745
  • Skipped: 1
  • Total: 746

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@mergify[bot] (Contributor) commented Jun 14, 2023

This pull request now has conflicts. Could you fix it @jsoriano? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b retry-connections-startup upstream/retry-connections-startup
git merge upstream/main
git push upstream retry-connections-startup

@michel-laterman (Contributor) left a comment

Can you add a changelog fragment?

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

internal/pkg/es/client.go (comment thread resolved)
@@ -25,6 +25,7 @@ func TestStandAloneSelfMonitor(t *testing.T) {
 	title string
 	searchResult *es.ResultT
 	searchErr error
+	initialState client.UnitState
Contributor:

👍

-	err = m.ensureLeadership(ctx)
-	if err != nil {
-		return err
+	for {
Contributor:

The coordinator and monitors get started in separate goroutines (in server/fleet.go), so I don't think anything should go wrong with these changes.
The only difference should be that the API is "available" but should return a 503 if a non-status endpoint (like enroll) is called. Can you add that as an e2e test?

Member Author:

The only difference should be that the API is "available" but should return a 503 if a non-status endpoint (like enroll) is called. Can you add that as an e2e test?

Added a middleware that makes any non-status endpoint return a 503 when the service is not available, plus some assertions in the test to check this (see the sketch below). Let me know what you think.
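
A minimal sketch of what such a middleware could look like (the UnitStater interface and the /api/status path are assumptions; this is not the PR's actual code):

    package middleware

    import "net/http"

    // UnitStater is an illustrative stand-in for the self-monitor; the real
    // monitor in fleet-server exposes its unit state in a similar way.
    type UnitStater interface {
        Healthy() bool
    }

    // statusChecker returns a 503 for any non-status endpoint while the
    // service is not healthy, so clients cannot enroll or check in before
    // Elasticsearch is reachable.
    func statusChecker(sm UnitStater) func(http.Handler) http.Handler {
        return func(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                // Always let the status endpoint through so state stays observable.
                if r.URL.Path != "/api/status" && !sm.Healthy() {
                    http.Error(w, "service unavailable", http.StatusServiceUnavailable)
                    return
                }
                next.ServeHTTP(w, r)
            })
        }
    }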

Member Author:

And removed after #2693 (comment) and the discussion about health checks.

Not sure then if there is an e2e test that we can add here; maybe in the controllers that use it?

	// stand-alone mode.
	if _, isStandAlone := sm.(*policy.StandAloneSelfMonitor); isStandAlone {
		r.Use(statusChecker(sm))
	}
Member Author:

Not sure I like having this difference in behaviour, but I think we may need it; we don't have the same healthcheck needs when running standalone and when running inside the agent.

Contributor:

I think our alternative is to create a health endpoint and have the platform direct traffic to fleet-server only once ES is healthy. What do you think?

Member Author:

Yeah, once we have healthchecks in place, the platform won't send traffic to us when we cannot reach ES. Should I remove this?
I started adding this after this comment about returning 503s when the healthcheck fails. But I see now that maybe you were referring to the platform and not directly here?

Member Author:

I am removing the middleware I added, and we will rely on readiness probes.

@joshdover (Contributor)

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

Did we have this discussion? I think we need readiness not to be successful until Fleet Server can accept traffic (so its connection to ES is working), while liveness should still pass to avoid unnecessary container restarts.

@jsoriano (Member Author)

Also I think with these changes we should have a discussion about health vs readiness endpoints for the fleet-server

Did we have this discussion? I think we need readiness not to be successful until Fleet Server can accept traffic (so its connection to ES is working), while liveness should still pass to avoid unnecessary container restarts.

Not really, thanks for the heads-up on this.

TL;DR: I would add a healthcheck script to the Docker image to check readiness based on the current status endpoint, and I wouldn't add a liveness probe, at least for the moment.

For readiness, we can check now whether the status is healthy. We would need a command probe for this.
If we want to use a simpler HTTP probe and rely only on the status code, we would need to modify the current handler, because it now returns 200 when in a degraded state, or define a new /healthz endpoint that returns 200 only when healthy.

So three options:

  • Use a command probe that uses the status endpoint to check if the service is healthy. We could add a healthcheck.sh script to the image for this.
  • Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change (see the sketch after this list).
  • Add a new /healthz endpoint that returns 200 only when the service is healthy.
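
For illustration, a hedged sketch of what the second or third option would amount to (the SelfMonitor interface and the state strings are assumptions, not fleet-server's actual types):

    package api

    import (
        "encoding/json"
        "net/http"
    )

    // SelfMonitor is illustrative; the real monitor reports the unit state.
    type SelfMonitor interface {
        State() string // e.g. "HEALTHY", "DEGRADED", "FAILED"
    }

    // handleStatus returns 200 only when the service is healthy and 503
    // otherwise, so a plain HTTP readiness probe can rely on the status
    // code alone instead of parsing the response body.
    func handleStatus(sm SelfMonitor) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            state := sm.State()
            code := http.StatusOK
            if state != "HEALTHY" {
                code = http.StatusServiceUnavailable
            }
            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(code)
            _ = json.NewEncoder(w).Encode(map[string]string{"status": state})
        }
    }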

I will go for the first option for now, so we don't need to modify or extend our APIs.

Regarding liveness, I don't think we need to add a probe for this at the moment. I would add it if we knew that Fleet Server can reach some unrecoverable unhealthy state, which afaik is not the case. If we add a liveness probe now, it will always succeed unless the pod crashes, which is no different from not having any probe.

@michel-laterman @joshdover wdyt?

@joshdover (Contributor)

Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change.

I lean towards this option, as I see it as more of a bug fix than a breaking change, but I'm also open to starting with a command probe for now as we experiment, and only making the change on the existing endpoint once we have it working the way we want.

Regarding liveness, I don't think we need to add a probe for this at the moment. I would add it if we knew that Fleet Server can reach some unrecoverable unhealthy state, which afaik is not the case. If we add a liveness probe now, it will always succeed unless the pod crashes, which is no different from not having any probe.

Makes sense to me 👍

@jsoriano (Member Author)

Modify the current status endpoint to return 200 only when the service is healthy. This is potentially a breaking change.

I lean towards this option, as I see it as more of a bug fix than a breaking change, but I'm also open to starting with a command probe for now as we experiment, and only making the change on the existing endpoint once we have it working the way we want.

Ok, I will give this a try before adding the script.

@jsoriano jsoriano force-pushed the retry-connections-startup branch 2 times, most recently from e87b188 to 194fdb2, on June 26, 2023 11:18
-	return m.updateState(client.UnitStateFailed, fmt.Sprintf("Failed to request policies: %s", err))
+	if errors.Is(err, es.ErrIndexNotFound) {
+		m.log.Debug().Str("index", m.policiesIndex).Msg(es.ErrIndexNotFound.Error())
+		message = "Running: Policies not available yet"
Member Author:

@michel-laterman should it actually be healthy if the policies index hasn't been created yet? Would requests succeed if Fleet Server cannot use this index? 🤔

Contributor:

IMO it should be unhealthy if the index does not exist; that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?
This PR also changes the status endpoint to return 200 only on a healthy status.

Member:

IMO it should be unhealthy if the index does not exist; that way the fleet-controller/k8s can use it for health checks/traffic routing. What do you think @nchaulet?

Not sure; the index will not exist if the user does not create any agent policy, so it probably should be healthy even if the index does not exist.

Contributor:

Sounds good, let's add the behaviour to the description in the openapi doc.


@jlind23 (Contributor) commented Jul 6, 2023

Looks like the Buildkite failure comes from:

    go: cloud.google.com/go/pubsublite@v1.7.0: verifying go.mod: cloud.google.com/go/pubsublite@v1.7.0/go.mod: reading https://sum.golang.org/tile/8/1/252: 404 Not Found

@joshdover (Contributor)

Ran a rebuild and it seemed to get past that step 👍

@jlind23 (Contributor) commented Jul 6, 2023

🚢

@jsoriano (Member Author) commented Jul 6, 2023

/test

@jsoriano jsoriano merged commit b6dba34 into elastic:main Jul 6, 2023
18 checks passed
@jsoriano jsoriano deleted the retry-connections-startup branch July 6, 2023 17:28
Development

Successfully merging this pull request may close these issues:

  • Fleet Server should not crash on startup if the connection to Elasticsearch times out

6 participants