Add safeguards to prevent simple user errors #11511

Closed
clintongormley opened this Issue Jun 5, 2015 · 24 comments

clintongormley commented Jun 5, 2015

There are a number of places where a naive user can break Elasticsearch very easily. We should add more (dynamically overridable) safeguards that prevent users from hurting themselves.

Note:

  • We are starting with high limits so that we don't suddenly break things that users already do today, while still giving sysadmins tools they can use to protect their clusters. We can revisit the limits later on.
  • All these settings should be prefixed by `policy.` to make them easier to document together and to understand their purpose.

Accepted limits:

  • #9311 Hard limit on from/size
  • #12149 Global default value for search timeouts (Could be ridiculously high like an hour and it would still help)
  • #17386 Disable fielddata-loading on analyzed text fields by default (Adrien)
  • #17396 Limit the max number of shards to 1000 (Adrien)
  • #17133 Limit the size of all in-flight requests (Daniel)
  • #17357 Limit the number of fields that can be added to a mapping to 1000
  • #17400 Add a maximum mapping depth of 20
  • Add sane limits for thread pool size and queue size (Jim)
  • #26423 Don't allow search requests greater than (eg) 10MB (Adrien)
  • #14983 Limit the number of nested fields per index to 50 (Yannick)
  • #17522 Limit window_size in rescore API (@nik9000)
  • #17558 Disable script access to _source fields by default
  • #18739 Limit the number of shards that can be rerouted at the same time
  • #26492 Hard limit on from/size in top hits and inner hits (much smaller than a normal query) (MVG)
  • #19694 Limit script compilation rate to avoid hard coding of params in scripts
  • #20705 Max number of shards per node (enforced as total shards per cluster)
  • #20760 Limit index creation rate
  • #23268 Add upper limit for scroll expiry time (Jim)
  • #26390 Add upper limit for the number of requested doc-values fields (Christoph)
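
For illustration, several of the accepted limits above ended up as dynamically updatable index and cluster settings. A minimal sketch, assuming the setting names that later releases actually use (names, defaults and the `logs` index are illustrative and may differ by version):

```sh
# Per-index safeguards (illustrative values; the "logs" index is hypothetical):
# hard limit on from + size, on the total number of mapped fields, and on mapping depth.
curl -X PUT "localhost:9200/logs/_settings" -H 'Content-Type: application/json' -d'
{
  "index.max_result_window": 10000,
  "index.mapping.total_fields.limit": 1000,
  "index.mapping.depth.limit": 20
}'

# Cluster-wide cap on the total number of shards allowed per data node.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.max_shards_per_node": 1000
  }
}'
```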

For discussion:

  • #29050 Disable certain query types, e.g. wildcard, span, etc.?
  • #14046 Limit on the number of buckets returned by aggs
  • #9310 Limit the size of the response (eg for very large doc bodies)
  • Kill slow scripts when the search timeout has lapsed, i.e. while(true) should not require a rolling restart to recover from; don't run a script a second time when the first execution takes longer than 1 second
  • #6470 Disable searching on all indices by default (handled by the max number of shards)
  • #26962 Limit the number of nested Lucene documents per document
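
Among the discussion items, the bucket limit (#14046) also later became a dynamic soft limit. A hedged sketch, assuming the `search.max_buckets` cluster setting from later releases (the value is illustrative):

```sh
# Reject search requests whose aggregations would create more than this many
# buckets in a single response (setting and default are from later releases;
# the value here is illustrative).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.max_buckets": 10000
  }
}'
```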

Any other ideas?

jpountz commented Jun 5, 2015

Limit the max number of shards

I'm wondering if we should do it per index or per cluster. If we do it per index, then we might also want to have a max number of indices per cluster.

Limit the size of a bulk request

I guess it would also apply to multi-get and multi-search.
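
A size limit across request types is what #17133 tracks; a hedged sketch of tightening it, assuming the in-flight requests circuit breaker setting that work eventually introduced (the value is illustrative, the default is a percentage of the JVM heap):

```sh
# Cap the combined byte size of all in-flight transport requests (this covers
# bulk, multi-get and multi-search bodies). The setting name is from the
# eventual implementation; the 75% value is illustrative and the default is
# 100% of the JVM heap.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "network.breaker.inflight_requests.limit": "75%"
  }
}'
```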

alexbrasetvik commented Jun 5, 2015

Some of this could go into a "sanity checker"-kind of plugin akin to the migration plugin that runs a bunch of tests as well.

That one could warn when e.g. minimum master nodes looks wrong, and when the number of shards/indexes/fields looks silly / approaches the above limits.

clintongormley commented Jun 5, 2015

@alexbrasetvik that requires the user to actually run the check. Often poor sysadmins are at the mercy of their users. What I'd like to do is prevent users from blowing things up by mistake.

alexbrasetvik commented Jun 7, 2015

@clintongormley Agreed! I still think there's room for both, though such a tool should be another issue.

For example, a high number of indexes with few documents and identical mappings can be a sign that the user is doing per-user index partitioning when he shouldn't. That will turn into a problem, even if the current values are far from hitting the above-mentioned limits.

pickypg commented Jul 15, 2015

Any other ideas?

  • Limit the max number of indices
    • It's effectively covered by limiting by shards, but touching too many indices may indicate more of a logical issue than the shard count (e.g., with daily indices, it's much easier to realize that sending a request to 5 indices represents five days rather than 25 shards with default counts).
  • Limit the concurrent request size
    • Request circuit breaker across all concurrent requests
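
On the circuit-breaker point, the existing request breaker can already be tightened cluster-wide; a rough sketch (the value is illustrative, and whether one breaker should span all concurrent requests is the open question here):

```sh
# Tighten the request circuit breaker, which bounds per-request memory such as
# aggregation buffers. The 40% value is illustrative; the default is a fraction
# of the JVM heap, and the setting can be updated dynamically.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}'
```
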
dakrone commented Jul 16, 2015

Limit the concurrent request size

This is already available with the thread pools and queue_sizes to limit the number of requests per-node and apply backpressure.

EDIT: I guess I am taking "size" as "count", is that what you mean?
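
For reference, a hedged sketch of the node-level knobs being referred to, using the `thread_pool.*` naming from later versions (values and file path are illustrative; these are static per-node settings, not dynamic):

```sh
# Bound the search thread pool and its queue in elasticsearch.yml so that
# excess requests are rejected (backpressure) instead of queuing without limit.
# The path and values are illustrative; settings apply per node and require a restart.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
thread_pool.search.size: 13
thread_pool.search.queue_size: 1000
EOF
```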

pickypg commented Jul 17, 2015

@dakrone Size of an actual request. For instance, if one request comes in with an aggregation that uses size: 0 at the same time as another, then maybe we should block the second one (or at least delay it).

jpountz commented Oct 30, 2015

Another protection to add: check mapping depth #14370

ppf2 commented Nov 2, 2015

Limit the max value that can be set for queue_size for our search, bulk, index, etc. thread pools, so users can't set them to unlimited, millions, etc.?

s1monw commented Mar 28, 2017

@clintongormley I think we missed one rather important aspect when it comes to soft limits. Today the user can override those limits via dynamic settings, which is OK most of the time, but in the case of a cloud hosting infrastructure, the org that runs the infrastructure needs to have full control over these limits. Should they be able to disable the dynamic property, or disable setting these settings entirely?

clintongormley added v6.0.0 and removed v6.0.0-alpha1 labels May 3, 2017

javanna removed the discuss label May 5, 2017

jimczi added a commit to jimczi/elasticsearch that referenced this issue Aug 30, 2017

Add upper limit for scroll expiry
This change adds a dynamic cluster setting named `search.max_keep_alive`.
It is used as an upper limit for scroll expiry time in scroll queries and defaults to 1 hour.
This change also ensures that the existing setting `search.default_keep_alive` is always smaller than `search.max_keep_alive`.

Relates #11511
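
A minimal usage sketch of the setting described above (the `logs` index and the values are illustrative):

```sh
# Bound scroll keep-alive cluster-wide; the default keep-alive must stay below it.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.max_keep_alive": "1h",
    "search.default_keep_alive": "5m"
  }
}'

# With the limit above in place, this scroll request is rejected because 2h > 1h.
curl -X POST "localhost:9200/logs/_search?scroll=2h" -H 'Content-Type: application/json' -d'
{ "size": 100, "query": { "match_all": {} } }'
```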

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Sep 5, 2017

martijnvg added a commit that referenced this issue Sep 5, 2017

jimczi added a commit that referenced this issue Sep 6, 2017

Add upper limit for scroll expiry (#26448)
This change adds a dynamic cluster setting named `search.max_keep_alive`.
It is used as an upper limit for scroll expiry time in scroll queries and defaults to 1 hour.
This change also ensures that the existing setting `search.default_keep_alive` is always smaller than `search.max_keep_alive`.

Relates #11511

* check style

* add skip for bwc

* iter

* Add a maximum throttle wait time of 1h for reindex

* review

* remove empty line

jimczi added a commit that referenced this issue Sep 7, 2017

martijnvg added a commit that referenced this issue Sep 21, 2017

lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017

lcawl added v6.0.2 and removed v6.0.1 labels Dec 6, 2017

jaymode added v6.0.3 and removed v6.0.2 labels Dec 13, 2017

jpountz commented Mar 14, 2018

Most of the work has been done, and items that have not been done have an assigned issue so I'll close this issue. Thanks everyone!

jpountz closed this Mar 14, 2018
