Elasticsearch does not indicate retryability when flood stage is exceeded #49393

jasontedor · 2019-11-20T16:34:56Z

Today, if a node exceeds the disk flood stage watermark, the disk threshold monitor applies a special read-only index block to any index that has a shard allocated to that node. This block carries a forbidden status code, so a client that attempts to index into such an index receives an HTTP 403 status code.

Clients assume that a 403 status code is not retryable and they drop data.

This situation is retryable though, as once the disk threshold monitor observes the free disk space go above the appropriate threshold, the index block is automatically removed.

Rather than expecting our clients to all account for this situation (by inspecting the specifics of the exception that led to the 403 status code), we should indicate retryability by using HTTP status code 429. While 429 is often translated as "too many requests", the HTTP specification is liberal about what this means:

Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers.

By making this change, all of our clients can start retrying when faced with an index that was marked read-only due to a flood stage watermark exceeded event.
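To illustrate the difference for a client, here is a minimal sketch of a retry policy. It is not taken from any official Elasticsearch client. The names `is_retryable`, `index_with_retry`, and the injected `send` callable are hypothetical, and the 403 fallback assumes the error body carries a `cluster_block_exception` type and a reason containing "read-only / allow delete", which is an assumption about the payload rather than a documented contract.

```python
import json
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, raw response body)


def is_retryable(status: int, body: str) -> bool:
    """Decide whether an indexing request is worth retrying."""
    if status == 429:
        # With the proposed change, the status code alone is enough.
        return True
    if status == 403:
        # Fallback for servers that still answer 403: inspect the error.
        # Field names and the reason marker are assumptions (see lead-in).
        try:
            error = json.loads(body).get("error", {})
        except (ValueError, AttributeError):
            return False
        if not isinstance(error, dict):
            return False
        return (error.get("type") == "cluster_block_exception"
                and "read-only / allow delete" in str(error.get("reason", "")))
    return False


def index_with_retry(send: Callable[[], Response],
                     max_attempts: int = 5,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it succeeds, fails permanently, or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status, body):
            break
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

With a 429, the first branch is all a client needs. The 403 branch shows the kind of per-client special-casing this issue aims to make unnecessary.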

Similarly, the status codes of other cluster blocks should be reexamined in this context.

elasticmachine · 2019-11-20T16:34:58Z

Pinging @elastic/es-distributed (:Distributed/CRUD)

gaobinlong · 2019-12-02T09:32:58Z

Hi @jasontedor, I'm interested in this issue. Should we return a 429 status code if the cluster block is set manually rather than automatically when the flood stage is exceeded?

jasontedor · 2019-12-10T02:30:35Z

@gaobinlong I think it's fine to treat them the same. I wish we had an easy way to distinguish when it's automatically set versus when it's not, but we don't really, so let's proceed to treat them as the same.

gaobinlong · 2019-12-10T03:24:58Z

@jasontedor ok, I got it.

gaobinlong · 2019-12-13T08:53:45Z

Hi @jasontedor, I have made a PR for this issue. Could you help review the code change?

We consider index-level read_only_allow_delete blocks temporary, since the DiskThresholdMonitor can automatically release them when an index is no longer allocated on nodes above the high threshold. The REST status has therefore been changed to 429 when encountering this index block, to signal retryability to clients. Related to #49393
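The apply-and-release behaviour described here can be sketched as a toy model. This is not the actual DiskThresholdMonitor logic. The function `blocked_indices`, its parameters, and the threshold values 0.95 and 0.90 are illustrative assumptions, with disk usage expressed as a fraction of capacity.

```python
from typing import Dict, Set


def blocked_indices(node_usage: Dict[str, float],
                    shards: Dict[str, Set[str]],
                    currently_blocked: Set[str],
                    flood_stage: float = 0.95,  # illustrative value
                    high: float = 0.90) -> Set[str]:  # illustrative value
    """Return the indices that should carry the block after one monitor pass.

    node_usage maps node name to disk usage (0.0 to 1.0).
    shards maps index name to the set of nodes holding one of its shards.
    """
    result = set()
    for index, nodes in shards.items():
        usages = [node_usage[node] for node in nodes]
        if any(usage > flood_stage for usage in usages):
            # A node holding a shard exceeded flood stage: apply the block.
            result.add(index)
        elif index in currently_blocked and any(usage > high for usage in usages):
            # Already blocked and still on a node above the high threshold:
            # keep the block until usage drops further.
            result.add(index)
    return result
```

Because the block disappears on its own once disk usage falls, a write rejected now may succeed later, which is the reasoning behind reporting 429 rather than 403.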

zez3 · 2021-03-27T15:40:59Z

#50166

It has been brought to my attention that this PR applies from 7.7 onwards.

DaveCTurner · 2021-07-30T08:04:54Z

Closed by #50166.

jasontedor added the >bug and :Distributed/CRUD labels Nov 20, 2019

ywelsch added the help wanted adoptme label Nov 26, 2019

eedugon mentioned this issue Dec 4, 2019

Index in read-only mode not handled properly by Filebeat elastic/beats#13844

Closed

gaobinlong mentioned this issue Dec 13, 2019

Return 429 status code when there's a read_only cluster block #50166

Merged

henningandersen mentioned this issue Feb 22, 2020

Return 429 status code on read_only_allow_delete index block (#50166) #52672

Closed

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 3) elastic/elasticsearch-net#4534

Closed

rjernst added the Team:Distributed label May 4, 2020

DaveCTurner closed this as completed Jul 30, 2021
