Use _cluster/health API during rolling upgrades #3195
Conversation
This is to make sure there are no pending or ongoing shard fetches, initializations or relocations when rolling the next node. Otherwise the cluster status might go red if we roll forward before the indices on the previous node have recovered. This applies to version upgrades only.
@@ -201,7 +201,7 @@ func (d *defaultDriver) Reconcile(ctx context.Context) *reconciler.Results {
 			events.EventReasonUnexpected,
 			fmt.Sprintf("Could not update cluster license: %s", err.Error()),
 		)
-		log.Error(err, "Could not update cluster license", "namespace", d.ES.Namespace, "es_name", d.ES.Name)
+		log.Info("Could not update cluster license", "err", err, "namespace", d.ES.Namespace, "es_name", d.ES.Name)
Happy to back this out; this change is unrelated, but the license errors were distracting during testing. And I think they are often a red herring when users report problems, so I'm trying to reduce their prominence a bit.
🤷♀️ I could go either way on this. It is indeed an error and normally indicates the cluster is fubar, but it misleads users a lot of the time, as you said.
-	return result, c.get(ctx, "/_cluster/health", &result)
+	err := c.get(ctx, "/_cluster/health?"+params.Encode(), &result)
+	if IsTimeout(err) {
+		// ignore timeout errors as they are communicated in the returned payload so we can reserve error handling
it's in the payload, but do we ever want to treat a timeout as not an error condition? I see some of the other callers we have don't check for timeouts but probably should? example: https://github.com/elastic/cloud-on-k8s/pull/3195/files#diff-972e1e28f90c08fd73595db4b03025d2L73
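For context on the check under discussion: a minimal, self-contained sketch of what an IsTimeout helper could look like is below. The APIError type and its field are hypothetical stand-ins; the actual client may use a different error type, but the idea is the same — detect an HTTP 408 response.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError is a hypothetical error type carrying the HTTP status code of a
// failed Elasticsearch request; the real client's error type may differ.
type APIError struct {
	StatusCode int
}

func (e *APIError) Error() string {
	return fmt.Sprintf("request failed with status %d", e.StatusCode)
}

// IsTimeout reports whether err represents an HTTP 408 Request Timeout,
// the condition discussed in this thread.
func IsTimeout(err error) bool {
	apiErr, ok := err.(*APIError)
	return ok && apiErr.StatusCode == http.StatusRequestTimeout
}

func main() {
	fmt.Println(IsTimeout(&APIError{StatusCode: 408})) // true
	fmt.Println(IsTimeout(&APIError{StatusCode: 500})) // false
	fmt.Println(IsTimeout(nil))                        // false
}
```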
Could there be situations where the timeout status 408 we get here is returned by e.g. a proxy and not Elasticsearch itself? In which case we would get no health payload in the response.
Also, should we only filter out a timeout error if timeout was passed in params? Which makes me wonder whether we should introduce a different function for health + wait with timeout 0, separate from the regular health() function. As a side benefit, it could keep the HTTP query params "internal" and not exposed to the caller, which usually works at a higher abstraction level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe those would happen on an idle connection in the pool and are handled by the Go http transport thingy IIUC https://go-review.googlesource.com/c/go/+/179457/4/src/net/http/transport.go#1931
But I like the idea of a separate API; my only concern is that we are mixing levels of abstraction. So far all client APIs have been direct representations of Elasticsearch REST APIs; this would be the first 'higher-level' API. My gut feeling is that we should do this in layers: have a low-level client that just exposes ES APIs, and then a higher-level client interface on top that offers these more convenient abstractions. But maybe I am over-complicating things. wdyt?
You're right about the abstraction level. ClusterBootstrappedForZen2() is another example of a function doing a bit more than API calls. Splitting lower- vs. higher-level functions attached to different structs may make things a bit cleaner, but I'm not 100% sure it would reduce complexity. Is getting the cluster health with additional params still a low-level API call?
I think I'd be fine with something like these 2 functions if it simplifies our life:
func (c *clientV6) GetClusterHealth(ctx context.Context) (Health, error) {
	var result Health
	return result, c.get(ctx, "/_cluster/health", &result)
}

func (c *clientV6) GetClusterHealthWaitForLanguid(ctx context.Context, timeout time.Duration) (Health, error) {
	var result Health
	// wait for all "languid" events (lowest priority) to be processed, or time out
	url := fmt.Sprintf("/_cluster/health?wait_for_events=languid&timeout=%ds", int(timeout.Seconds()))
	err := c.get(ctx, url, &result)
	if IsTimeout(err) && result.TimedOut {
		// timeout reached, which is not an unexpected error;
		// we still have a response body to work with
		err = nil
	}
	return result, err
}
@@ -43,7 +43,7 @@ func applyLinkedLicense(
 	defer cancel()
 	current, err := updater.GetLicense(ctx)
 	if err != nil {
-		return err
+		return fmt.Errorf("while getting current license level %w", err)
Nittiest of nits: personal preference for errors.Wrap here, but obviously this is very much personal pref. I was also wondering if it makes sense to wrap it in the GetLicense func itself (or in clientV6.get), but I think that might interfere with callers that are checking error types.
LGTM overall, this was easy to read.
@@ -21,6 +21,8 @@ type ESState interface {
 	ShardAllocationsEnabled() (bool, error)
 	// Health returns the health of the Elasticsearch cluster.
 	Health() (esv1.ElasticsearchHealth, error)
+	// SafeToRoll returns true if shards a not moving and primaries are allocated
are not moving
LGTM
	h.NumberOfInFlightFetch == 0 && // no shards being fetched
		h.InitializingShards == 0 && // no shards initializing
		h.RelocatingShards == 0 // no shards relocating
}
I think I would not have attached that function to the Health struct, and maybe instead moved it to a func IsSafeToRoll(h Health) close to the rolling upgrade code. It's more a rolling-upgrade predicate than something that belongs to the Health domain in my mind.
But that's really not a big deal, I'm fine with keeping the current code.
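The free-function variant suggested here could look like the sketch below, using only the Health fields shown in the diff above (the trimmed-down Health struct is a stand-in for the real client type):

```go
package main

import "fmt"

// Health mirrors the _cluster/health response fields used by the predicate
// under review; the real type has many more fields.
type Health struct {
	NumberOfInFlightFetch int
	InitializingShards    int
	RelocatingShards      int
}

// IsSafeToRoll is a sketch of the rolling-upgrade predicate as a free
// function living near the upgrade code, rather than a method on Health.
func IsSafeToRoll(h Health) bool {
	return h.NumberOfInFlightFetch == 0 && // no shards being fetched
		h.InitializingShards == 0 && // no shards initializing
		h.RelocatingShards == 0 // no shards relocating
}

func main() {
	fmt.Println(IsSafeToRoll(Health{}))                    // true
	fmt.Println(IsSafeToRoll(Health{RelocatingShards: 2})) // false
}
```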
health, err := h.esClient.GetClusterHealth(ctx)

// get cluster health but make sure we have no pending shard initializations
// by requiring the event queue to be empty
"by requiring the event queue to be empty" — this is slightly misleading, as I would think we return an error here if the event queue is not empty. Maybe:
// get cluster health and status of the event queue (are there pending events?)
LGTM
@@ -140,3 +147,11 @@ func (h *memoizingHealth) Health() (esv1.ElasticsearchHealth, error) {
 	}
 	return h.health, nil
 }
+
+// IsSafeToRoll returns true if shards a not moving and primaries are allocated
typo? should be "if shards are not moving"
LGTM
- SafeToRoll (naming open for discussion; I was also thinking: NoMovingShards or StableShards) method on Health results, which checks for the conditions mentioned above

Fixes #3070