CompletionStats need only be recomputed on a refresh #51915

DaveCTurner · 2020-02-05T09:24:01Z

Computing the completion stats involves walking every field of every segment of every relevant shard, looking for completion fields. By default the seemingly-innocuous GET _stats API does this for every shard in the cluster. I've seen more than a few cases where an external monitoring system is hitting an overly-broad stats API hard enough that the cluster can't keep up. The consequence is that these stats requests pile up in the management threadpool and interfere with the other users of that threadpool.

As far as I can tell, these stats only change on a refresh. In most cases this means they do not change much at all, so I think we can improve the situation by caching these stats between refreshes.

I also note that in #33847 we changed the source of these stats from the external searcher to the internal one. I'm not sure why - external seems more appropriate to me, and would help with the caching since external refreshes may be very infrequent indeed.

Relates:

elasticmachine · 2020-02-05T09:24:04Z

Pinging @elastic/es-distributed (:Distributed/Engine)

rtkgjacobs · 2020-02-05T15:01:27Z

As an FYI, there are a number of external projects that hit this pinch point by default (listing just a few we've found in our audits so far):

https://github.com/vvanholl/elasticsearch-prometheus-exporter
- This one uses Elastic's own library https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/admin/indices/stats/IndexStats.java (it's not clear to me whether said lib will default to being 'nice' and not go after suggestion stats; we avoided its use so that we would not have to install and version-match plugins per node)
https://github.com/justwatchcom/elasticsearch_exporter
- Audit shows that when a user uses es.indices, it hits the /_all/_stats/ API - impact high
https://github.com/lmenezes/cerebro
- Audit shows this is only a minor hit when a user clicks for details on a specific index; otherwise it's not polling the greedy /stats endpoint (it only asks for docs, store) - impact low
- https://github.com/lmenezes/cerebro/blob/f2ed031fd89258c6ad85d40d846f5ff1f610fe73/app/elastic/HTTPElasticClient.scala

Having a default _stats call that is far less abusive / greedy, re the above, would be ideal. Just tossing in my vote (otherwise, for our own deployment, we may be forced to fork various assets and maintain private builds).

I'm sure other tools / fixtures out there can pinch others, as this has pinched our own ramp-up of a large-scale set of clusters.

dnhatn · 2020-02-05T17:33:49Z

+1 to cache the completion stats (especially when we have many fields). Another optimization is to not compute the completion stats if we don't have any suggest field in the mapping.
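The second optimization suggested here could look roughly like the sketch below. It is illustrative only: `CompletionStatsShortcut`, `sizeInBytes`, and the flattened field-name-to-type map are hypothetical names and shapes, not Elasticsearch's actual API. The only fact it relies on is that suggest fields are mapped with the `completion` type.

```java
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper: skip the segment walk when the mapping has no completion field. */
final class CompletionStatsShortcut {
    private CompletionStatsShortcut() {}

    /**
     * @param fieldTypes   assumed flattened view of the mapping: field name to type name
     * @param walkSegments the expensive computation that visits every field of every segment
     */
    static long sizeInBytes(Map<String, String> fieldTypes, LongSupplier walkSegments) {
        // No field of type "completion" means there is nothing to measure.
        if (!fieldTypes.containsValue("completion")) {
            return 0L;
        }
        return walkSegments.getAsLong();
    }
}
```

With a mapping that has no `completion` field, the expensive supplier is never invoked.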

rtkgjacobs · 2020-02-05T17:37:56Z

> +1 to cache the completion stats (especially when we have many fields). Another optimization is to not compute the completion stats if we don't have any suggest field in the mapping.

We have turned off everything that hits variants of the /stats APIs and are still seeing high management thread usage and queue backlogs. We confirmed that our mappings do not declare or enable any suggest fields.

Is ES7 set to always generate completion stats during ingestion? Can we turn this off? If so, it is possibly a blocker for upgrading our production clusters from prior (old) versions to ES7... In our latest tests, with the above fixtures removed, we are still seeing it stand out in our hot threads. Does x-pack monitoring hit or stimulate the gathering of completion stats?

DaveCTurner · 2020-02-05T17:57:44Z

> Is ES7 set to always generate completion stats during ingestion ?

It's not really anything to do with ingestion, nor is it specific to version 7. By default the completion stats are computed for each stats call and this seems to be true as far back as version 1.7 (I haven't checked further back).

> Can we turn this off?

Indeed you can; if you don't need to monitor completion stats so frequently then you should exclude them from these stats requests. I can't comment on how you might configure the third-party products linked above to do this, sorry, but Elasticsearch has had support for selecting specific stats for a long long time.
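As a rough illustration of selecting specific stats: the index stats API accepts a comma-separated metric list in the path, for example `GET /_stats/docs,store`, which leaves the completion stats out of the request. The helper below is only a sketch that builds such a path; `StatsPath` and `forMetrics` are made-up names, and `my-index` in the usage note is a placeholder.

```java
import java.util.List;

/** Hypothetical helper that builds a stats path limited to the named metrics. */
final class StatsPath {
    private StatsPath() {}

    /**
     * @param index   index name or pattern, or null/empty for all indices
     * @param metrics metric names such as "docs" or "store"
     */
    static String forMetrics(String index, List<String> metrics) {
        if (metrics.isEmpty()) {
            // An unqualified _stats call asks for every metric, completion included.
            throw new IllegalArgumentException("name at least one metric");
        }
        String prefix = (index == null || index.isEmpty()) ? "" : "/" + index;
        return prefix + "/_stats/" + String.join(",", metrics);
    }
}
```

For instance, `StatsPath.forMetrics("my-index", List.of("docs", "store"))` yields `/my-index/_stats/docs,store`.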

rtkkroland · 2020-02-05T19:00:07Z

> but Elasticsearch has had support for selecting specific stats for a long long time.

Is this true for /_cluster/stats as well? It doesn't mention it in the Guide https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-stats.html and it returns the full set of completion stats.

DaveCTurner · 2020-02-05T20:14:24Z

@rtkkroland oh dear, yes, I didn't realise we also compute completion stats for the cluster-level API but you're right I think we do:

elasticsearch/server/src/main/java/org/elasticsearch/action/admin/cluster/stats/TransportClusterStatsAction.java

Lines 57 to 59 in 84dbadb

    
    private static final CommonStatsFlags SHARD_STATS_FLAGS = new CommonStatsFlags(CommonStatsFlags.Flag.Docs, CommonStatsFlags.Flag.Store,
        CommonStatsFlags.Flag.FieldData, CommonStatsFlags.Flag.QueryCache,
        CommonStatsFlags.Flag.Completion, CommonStatsFlags.Flag.Segments);

It doesn't look like there are selectors for this API 😕

Computing the stats for completion fields may involve a significant amount of work since it walks every field of every segment looking for completion fields. Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for every shard in the cluster. This repeated work is unnecessary since these stats do not change between refreshes; in many indices they remain constant for a long time. This commit introduces a cache for these stats which is invalidated on a refresh, allowing most stats calls to bypass the work needed to compute them on most shards. Closes elastic#51915
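The idea in this commit message, a cached value that is thrown away on each refresh, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in #51991: `RefreshScopedStatsCache`, `get`, and `onRefresh` are hypothetical names, and the statistic is reduced to a single `long` for brevity.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: caches an expensive per-shard statistic and
 * recomputes it only after a refresh has invalidated the cached value.
 */
final class RefreshScopedStatsCache {
    // Stands in for the walk over every field of every segment.
    private final LongSupplier expensiveComputation;
    // null means "no valid value"; all access is guarded by this object's monitor.
    private Long cached;

    RefreshScopedStatsCache(LongSupplier expensiveComputation) {
        this.expensiveComputation = expensiveComputation;
    }

    /** Returns the cached value, computing it at most once between refreshes. */
    synchronized long get() {
        if (cached == null) {
            cached = expensiveComputation.getAsLong();
        }
        return cached;
    }

    /** Assumed to be called by whatever listens for refreshes on the shard. */
    synchronized void onRefresh() {
        cached = null;
    }
}
```

Holding the lock while computing keeps the sketch short, at the cost of making concurrent callers and the refresh hook wait for the computation; a production implementation would likely avoid that.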

rtkgjacobs · 2020-02-06T13:16:39Z

Some additional follow-up findings on our side. We had Netdata installed on all our cluster nodes; its default configuration hits _cluster/stats, and its default refresh rate seems too frequent. Removing it vastly improved our performance (by several orders of magnitude). We have slightly raised the maximum number of threads in the management thread pool to match our core counts, to offset the 30s Prometheus scrapes we run. We'll be watching closely as we balance the use of these APIs, but we're passing on a warning for others who might fall into the same trap.

We strongly encourage reducing the cost and risk of APIs like _cluster/stats and _all/_stats wherever possible. A few of us on this project have more grey hairs now.

Computing the stats for completion fields may involve a significant amount of work since it walks every field of every segment looking for completion fields. Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for every shard in the cluster. This repeated work is unnecessary since these stats do not change between refreshes; in many indices they remain constant for a long time. This commit introduces a cache for these stats which is invalidated on a refresh, allowing most stats calls to bypass the work needed to compute them on most shards. Closes elastic#51915 Backport of elastic#51991

Computing the stats for completion fields may involve a significant amount of work since it walks every field of every segment looking for completion fields. Innocuous-looking APIs like `GET _stats` or `GET _cluster/stats` do this for every shard in the cluster. This repeated work is unnecessary since these stats do not change between refreshes; in many indices they remain constant for a long time. This commit introduces a cache for these stats which is invalidated on a refresh, allowing most stats calls to bypass the work needed to compute them on most shards. Closes #51915 Backport of #51991

DaveCTurner added the >enhancement, :Distributed/Engine, and team-discuss labels Feb 5, 2020

This was referenced Feb 6, 2020:

- Cache completion stats between refreshes #51991 (Merged)
- Push back on excessive requests for stats #51992 (Open)

DaveCTurner removed the team-discuss label Feb 12, 2020

DaveCTurner closed this as completed in #51991 Feb 27, 2020

DaveCTurner mentioned this issue Feb 27, 2020:

- Cache completion stats between refreshes #52872 (Merged)

codebrain mentioned this issue Apr 1, 2020:

- 7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533 (Closed)
