Use _cluster/health API during rolling upgrades #3195
Conversation
This is to make sure there are no pending or ongoing shard fetches, initializations or relocations when rolling the next node. Otherwise the cluster status might go red if we roll forward before the indices on the previous node have recovered. This applies to version upgrades only.
@@ -201,7 +201,7 @@ func (d *defaultDriver) Reconcile(ctx context.Context) *reconciler.Results {
 			events.EventReasonUnexpected,
 			fmt.Sprintf("Could not update cluster license: %s", err.Error()),
 		)
-		log.Error(err, "Could not update cluster license", "namespace", d.ES.Namespace, "es_name", d.ES.Name)
+		log.Info("Could not update cluster license", "err", err, "namespace", d.ES.Namespace, "es_name", d.ES.Name)
Happy to back this out; this change is unrelated, but the license errors were distracting during testing. And I think they are often a red herring when users report problems, so I'm trying to reduce their prominence a bit.
🤷♀️ I could go either way on this. It is indeed an error and normally indicates the cluster is fubar, but it misleads users a lot of the time, as you said.
-	return result, c.get(ctx, "/_cluster/health", &result)
+	err := c.get(ctx, "/_cluster/health?"+params.Encode(), &result)
+	if IsTimeout(err) {
+		// ignore timeout errors as they are communicated in the returned payload so we can reserve error handling
it's in the payload, but do we ever want to treat a timeout as not an error condition? I see some of the other callers we have don't check for timeouts but probably should? example: https://github.com/elastic/cloud-on-k8s/pull/3195/files#diff-972e1e28f90c08fd73595db4b03025d2L73
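For context on the check under discussion: a minimal, self-contained sketch of what an IsTimeout helper could look like is below. The APIError type and its field are hypothetical stand-ins; the actual client may use a different error type, but the idea is the same — detect an HTTP 408 response.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError is a hypothetical error type carrying the HTTP status code of a
// failed Elasticsearch request; the real client's error type may differ.
type APIError struct {
	StatusCode int
}

func (e *APIError) Error() string {
	return fmt.Sprintf("request failed with status %d", e.StatusCode)
}

// IsTimeout reports whether err represents an HTTP 408 Request Timeout,
// the condition discussed in this thread.
func IsTimeout(err error) bool {
	apiErr, ok := err.(*APIError)
	return ok && apiErr.StatusCode == http.StatusRequestTimeout
}

func main() {
	fmt.Println(IsTimeout(&APIError{StatusCode: 408})) // true
	fmt.Println(IsTimeout(&APIError{StatusCode: 500})) // false
	fmt.Println(IsTimeout(nil))                        // false
}
```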
Could there be situations where the timeout status 408 we get here is returned by e.g. a proxy and not Elasticsearch itself? In which case we would get no health payload in the response.
Also, should we only filter out a timeout error if timeout was passed in params? Which makes me wonder whether we should introduce a different function for health + wait with timeout 0, separate from the regular health() function. As a side benefit, it could keep the HTTP query params "internal" and not exposed to the caller, which usually works at a higher abstraction level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe those would happen on an idle connection in the pool and are handled by the Go http transport thingy IIUC https://go-review.googlesource.com/c/go/+/179457/4/src/net/http/transport.go#1931
But I like the idea of a separate API; my only concern is that we are mixing levels of abstraction. So far all client APIs have been direct representations of Elasticsearch REST APIs; this would be the first 'higher-level' API. My gut feeling is that we should do this in layers: have a low-level client that just exposes ES APIs, and then a higher-level client interface on top that offers these more convenient abstractions. But maybe I am over-complicating things. wdyt?
You're right about the abstraction level. ClusterBootstrappedForZen2() is another example of a function doing a bit more than API calls. Splitting lower- vs. higher-level functions attached to different structs may make things a bit cleaner, but I'm not 100% sure it would reduce complexity. Is getting the cluster health with additional params still a low-level API call?
I think I'd be fine with something like these 2 functions if it simplifies our life:
func (c *clientV6) GetClusterHealth(ctx context.Context) (Health, error) {
	var result Health
	return result, c.get(ctx, "/_cluster/health", &result)
}

func (c *clientV6) GetClusterHealthWaitForLanguid(ctx context.Context, timeout time.Duration) (Health, error) {
	var result Health
	// wait for all "languid" events (lowest priority) to be processed, or time out
	url := fmt.Sprintf("/_cluster/health?wait_for_events=languid&timeout=%ds", int(timeout.Seconds()))
	err := c.get(ctx, url, &result)
	if IsTimeout(err) && result.TimedOut {
		// timeout reached, which is not an unexpected error;
		// we still have a response body to work with
		err = nil
	}
	return result, err
}
@@ -43,7 +43,7 @@ func applyLinkedLicense(
 	defer cancel()
 	current, err := updater.GetLicense(ctx)
 	if err != nil {
-		return err
+		return fmt.Errorf("while getting current license level %w", err)
Nittiest of nits: personal preference for errors.Wrap here, but obviously this is very much personal pref. I was also wondering if it makes sense to wrap it in the GetLicense func itself (or in clientV6.get), but I think that might interfere with callers that are checking error types.
LGTM overall, this was easy to read.
@@ -21,6 +21,8 @@ type ESState interface {
 	ShardAllocationsEnabled() (bool, error)
 	// Health returns the health of the Elasticsearch cluster.
 	Health() (esv1.ElasticsearchHealth, error)
+	// SafeToRoll returns true if shards a not moving and primaries are allocated
are not moving
LGTM
	h.NumberOfInFlightFetch == 0 && // no shards being fetched
		h.InitializingShards == 0 && // no shards initializing
		h.RelocatingShards == 0 // no shards relocating
}
I think I would not have attached that function to the Health struct, and maybe instead moved it to a func IsSafeToRoll(h Health) close to the rolling upgrade code. It's more a rolling-upgrade predicate than something that belongs to the Health domain in my mind.
But that's really not a big deal, I'm fine with keeping the current code.
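The free-function variant suggested here could look like the sketch below, using only the Health fields shown in the diff above (the trimmed-down Health struct is a stand-in for the real client type):

```go
package main

import "fmt"

// Health mirrors the _cluster/health response fields used by the predicate
// under review; the real type has many more fields.
type Health struct {
	NumberOfInFlightFetch int
	InitializingShards    int
	RelocatingShards      int
}

// IsSafeToRoll is a sketch of the rolling-upgrade predicate as a free
// function living near the upgrade code, rather than a method on Health.
func IsSafeToRoll(h Health) bool {
	return h.NumberOfInFlightFetch == 0 && // no shards being fetched
		h.InitializingShards == 0 && // no shards initializing
		h.RelocatingShards == 0 // no shards relocating
}

func main() {
	fmt.Println(IsSafeToRoll(Health{}))                    // true
	fmt.Println(IsSafeToRoll(Health{RelocatingShards: 2})) // false
}
```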
health, err := h.esClient.GetClusterHealth(ctx)

// get cluster health but make sure we have no pending shard initializations
// by requiring the event queue to be empty
"by requiring the event queue to be empty" — this is slightly misleading, as I would think we return an error here if the event queue is not empty. Maybe:
// get cluster health and status of the event queue (are there pending events?)
LGTM
@@ -140,3 +147,11 @@ func (h *memoizingHealth) Health() (esv1.ElasticsearchHealth, error) {
 	}
 	return h.health, nil
 }
+
+// IsSafeToRoll returns true if shards a not moving and primaries are allocated
typo? should be "if shards are not moving"
LGTM
- SafeToRoll (naming open for discussion; I was also thinking: NoMovingShards or StableShards) method on Health results, which checks for the conditions mentioned above

Fixes #3070