Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use _cluster/health API during rolling upgrades #3195

Merged
merged 5 commits into from
Jun 12, 2020

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Jun 5, 2020

This is to make sure there are no pending or ongoing shard fetches,
initializations or relocations when rolling the next node. Otherwise
the cluster status might go red if we roll forward before the indices
on the previous node have recovered.

  • this applies to upgrades only (i.e. version upgrades) see Shards from _cat/shards may contain stale information #3070 for rationale
  • adds a parameterized Health API call to ESState cache to be used during predicate evaluation
  • adds a SafeToRoll (naming open for discussion, I was thinking also: NoMovingShards or StableShards ) method on Health results, which checks for the conditions mentioned above

Fixes #3070

pebrc added 2 commits June 5, 2020 15:17
This is to make sure there are no pending or ongoing shard fetches,
initializations or relocations when rolling the next node. Otherwise
the cluster status might go red if we roll forward before the indices
on the previous node have recovered. This applies to upgrades
only (i.e. version upgrades)
@pebrc pebrc added >enhancement Enhancement of existing functionality v1.2.0 labels Jun 5, 2020
@@ -201,7 +201,7 @@ func (d *defaultDriver) Reconcile(ctx context.Context) *reconciler.Results {
events.EventReasonUnexpected,
fmt.Sprintf("Could not update cluster license: %s", err.Error()),
)
log.Error(err, "Could not update cluster license", "namespace", d.ES.Namespace, "es_name", d.ES.Name)
log.Info("Could not update cluster license", "err", err, "namespace", d.ES.Namespace, "es_name", d.ES.Name)
Copy link
Collaborator Author

@pebrc pebrc Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to back this out, this change is unrelated, but the license errors were distracting during testing. And I think they are are red herring often when users report problems, so trying to reduce the prominence a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♀️ I could go either way on this, it is indeed an error and normally indicates the cluster is fubar. But is misleading to users a lot as you said

return result, c.get(ctx, "/_cluster/health", &result)
err := c.get(ctx, "/_cluster/health?"+params.Encode(), &result)
if IsTimeout(err) {
// ignore timeout errors as they are communicated in the returned payload so we can reserve error handling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's in the payload, but do we ever want to treat a timeout as not an error condition? I see some of the other callers we have don't check for timeouts but probably should? example: https://github.com/elastic/cloud-on-k8s/pull/3195/files#diff-972e1e28f90c08fd73595db4b03025d2L73

Copy link
Contributor

@sebgl sebgl Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be situations where the timeout status 408 we get here is returned by eg. a proxy and not Elasticsearch itself? In which case we get no response health payload.

Also should we only handle filtering a timeout error if timeout was passed in params? Which makes me wonder whether we should maybe introduce a different function for health + wait with timeout 0, different from the regular health() function. As a side benefit, it could keep the http query params "internal" and not exposed to the caller, which usually works at a higher abstraction level?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe those would happen on an idle connection in the pool and are handled by the Go http transport thingy IIUC https://go-review.googlesource.com/c/go/+/179457/4/src/net/http/transport.go#1931

But I like the idea of a separate API, my only concern is maybe that we are mixing levels of abstraction. So far all client APIs have been direct representations of Elasticsearch REST APIs this would be the first 'higher-level' API. My gut feeling is that we should do this in layers: have a low-level client that just exposes ES APIs and then a higher-level client interface on top that offers these more convenient abstractions. But maybe I am over-complicating things. wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about the abstraction level. ClusterBootstrappedForZen2() is another example of a function doing a bit more than API calls.

Splitting lower vs. higher level functions attached to different structs may make things a bit cleaner but I'm not 100% sure it would reduce complexity.

Getting the cluster health with additional params is still a low-level API call?
I think I'd be fine with something like these 2 functions if it simplifies our life:

func (c *clientV6) GetClusterHealth(ctx context.Context) (Health, error) {
	var result Health
	return result, c.get(ctx, "/_cluster/health", &result)
}

func (c *clientV6) GetClusterHealthWaitForLanguid(ctx context.Context, timeout time.Time) (Health, error) {
	var result Health
	// wait for all "languid" events (lowest priority) to be processed, or time out
	url := fmt.Sprintf("/_cluster/health?wait_for_events=languid&timeout=%ds", timeout.Second())
	err := c.get(ctx, url, &result)
	if IsTimeout(err) && result.TimedOut{
		// timeout reached, which is not an unexpected error
		// we still have a response body to work with
		err = nil
	}
	return result, err
}

@@ -43,7 +43,7 @@ func applyLinkedLicense(
defer cancel()
current, err := updater.GetLicense(ctx)
if err != nil {
return err
return fmt.Errorf("while getting current license level %w", err)
Copy link
Contributor

@anyasabo anyasabo Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nittiest of nits: personal preference for errors.wrap here but obviously this is very much personal pref.

I was also wondering if it makes sense to wrap it in the GetLicense func itself (or in clientV6.get but I think that might interfere with callers that are checking error types

@anyasabo
Copy link
Contributor

anyasabo commented Jun 5, 2020

LGTM overall, this was easy to read. SafeToRoll sounds fine. If you're unsure about it I also think we could go with the Is...() naming convention for things that return bools, so maybe IsSafeToRoll() or IsRollingUpgradeSafe() to be extra verbose

return result, c.get(ctx, "/_cluster/health", &result)
err := c.get(ctx, "/_cluster/health?"+params.Encode(), &result)
if IsTimeout(err) {
// ignore timeout errors as they are communicated in the returned payload so we can reserve error handling
Copy link
Contributor

@sebgl sebgl Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be situations where the timeout status 408 we get here is returned by eg. a proxy and not Elasticsearch itself? In which case we get no response health payload.

Also should we only handle filtering a timeout error if timeout was passed in params? Which makes me wonder whether we should maybe introduce a different function for health + wait with timeout 0, different from the regular health() function. As a side benefit, it could keep the http query params "internal" and not exposed to the caller, which usually works at a higher abstraction level?

@@ -21,6 +21,8 @@ type ESState interface {
ShardAllocationsEnabled() (bool, error)
// Health returns the health of the Elasticsearch cluster.
Health() (esv1.ElasticsearchHealth, error)
// SafeToRoll returns true if shards a not moving and primaries are allocated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are not moving

Copy link
Contributor

@sebgl sebgl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

h.NumberOfInFlightFetch == 0 && // no shards being fetched
h.InitializingShards == 0 && // no shards initializing
h.RelocatingShards == 0 // no shards relocating
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would not have attached that function to the Health struct, and maybe instead move it to a func IsSafeToRoll(h Health) close to the rolling upgrade code. It's more a rolling upgrade predicate than something that belongs to the Health domain in my mind.

But that's really not a big deal, I'm fine with keeping the current code.

health, err := h.esClient.GetClusterHealth(ctx)

// get cluster health but make sure we have no pending shard initializations
// by requiring the event queue to be empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by requiring the event queue to be empty

This is slightly misleading as I would think we return an error here if the event queue is not empty.
Maybe:

// get cluster health and status of the event queue (are there pending events?)

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -140,3 +147,11 @@ func (h *memoizingHealth) Health() (esv1.ElasticsearchHealth, error) {
}
return h.health, nil
}

// IsSafeToRoll returns true if shards a not moving and primaries are allocated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? should be "if shards are not moving"

@pebrc pebrc requested a review from sebgl June 11, 2020 14:55
Copy link
Contributor

@sebgl sebgl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pebrc pebrc merged commit 42fd9c6 into elastic:master Jun 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Enhancement of existing functionality v1.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shards from _cat/shards may contain stale information
5 participants