
Runaway search thread #21611

Closed
xgwu opened this issue Nov 17, 2016 · 18 comments
Labels
:Core/Infra/Core (Core issues without another label), discuss

Comments

@xgwu

xgwu commented Nov 17, 2016

Elasticsearch version:
5.0.0
Plugins installed: []
None
JVM version:
1.8.0_77
OS version:
CentOS release 6.4 (Final)
Linux 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
On some data nodes handling expensive queries, search threads can stay active for hours even with a search timeout of 35 seconds. During that period, disk read IO peaks and search requests pile up in the queue, struggling to get served by the active threads.
Steps to reproduce:
This happens rather frequently, on a daily basis, when expensive search loads hit the cluster. Back in v2.4.0, this never happened.

Provide logs (if relevant):
During the occurrence of the issue, I observed from our monitoring system that the number of active search threads peaked at the maximum limit even though the number of open search contexts dropped to a few, and the search completed counter was hardly moving forward. The task management API showed active queries running for a very long time.
[screenshots from the monitoring system]
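A minimal sketch of the kind of task management API call that lists long-running search tasks like these (host and port are placeholders):

    # list currently running search tasks with descriptions and running times
    curl -s 'http://localhost:9200/_tasks?actions=*search*&detailed=true&pretty'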

Compared with similar stats from weeks ago, before upgrading to 5.0.0, the number of active search threads at any time was almost in line with the number of open search contexts.
[screenshots from the monitoring system]

It looks to me like search threads might not be freed after the search context has been terminated.

@jpountz
Contributor

jpountz commented Nov 17, 2016

This looks bad. When this happens, does the cluster eventually heal, or do you need to restart it to get back to normal?

Given that we do not seem to leak search contexts (the number remains reasonable), it looks as if, once some threshold of search load is reached, things get exponentially worse. I'd be very curious what the search threads are busy doing when this problem occurs; would you be able to get the node hot threads while your queue is filling up?
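A node hot threads capture along these lines should work while the queue is filling up (a sketch; host, thread count and output file are placeholders):

    # sample the hottest threads on every node and save the report
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3' > hot_threads_$(date +%H%M).txt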

In #21612 you suspect that our use of the mmap store is related to this. Since you mention the problem occurs every day, we could easily check by temporarily setting index.store.type=niofs in the index settings. Note that this setting requires the index to be reopened in order to take effect.
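Since index.store.type is a static setting, the switch has to be applied on a closed index; a minimal sketch of the sequence (the index name logs-2016.11.17 and the host are placeholders):

    # close the index, change the store type, then reopen it
    curl -XPOST 'http://localhost:9200/logs-2016.11.17/_close'
    curl -XPUT 'http://localhost:9200/logs-2016.11.17/_settings' -d '{"index.store.type": "niofs"}'
    curl -XPOST 'http://localhost:9200/logs-2016.11.17/_open'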

@xgwu
Author

xgwu commented Nov 17, 2016

@jpountz When this happens, I am not able to tell whether it could heal by itself, as this is a production cluster and there are constant queries from users. I had to restart the problematic nodes to try to bring the service back to normal ASAP.

I also tried to get hot threads when this happened, but the API call did not seem to return anything. Some other API calls, such as node stats, took a very long time to respond. I'll try to get hot threads again when it recurs.

I'll try the index.store.type=niofs index setting and see if this is related.

@jpountz
Contributor

jpountz commented Nov 17, 2016

The fact that the stats APIs take a long time to return is concerning too; these APIs are supposed to be cheap. Out of curiosity, is garbage collection activity still normal when your cluster is having trouble?

@xgwu
Author

xgwu commented Nov 17, 2016

Heap usage remained constant during that period. There was no old GC and even no young GC for some time. Below are the monitoring stats (I restarted the node at around 11:00):

[screenshots from the monitoring system]

@nik9000
Member

nik9000 commented Nov 17, 2016

Can you use jstat to get a snapshot of the running threads both when it is OK and when it is behaving badly? That will work regardless of the APIs.
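For thread snapshots the JDK tool is jstack (jstat reports GC and heap counters rather than threads); a sketch, with the Elasticsearch process id as a placeholder:

    # thread dump of the Elasticsearch JVM, named with the capture time
    jstack -l <es-pid> > threads_$(date +%H%M).txt
    # optional: sample GC utilisation every second, 30 samples, to confirm GC is quiet
    jstat -gcutil <es-pid> 1000 30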

@xgwu
Author

xgwu commented Nov 17, 2016

@nik9000 I've already updated the indices to use niofs and tried some very expensive queries with high concurrency, and the results seem positive.

[screenshots from the monitoring system]

The performance looks better, but I'll wait and see if this solves the issue tomorrow during business hours.

If the issue happens again, I'll capture a snapshot of the running threads with jstat as you suggested.

@nik9000
Member

nik9000 commented Nov 17, 2016

If the issue happens again, I'll capture a snapshot of the running threads with jstat as you suggested.

It'd probably be useful to have a before/after as well. I'd take a snapshot when you next have a chance, just to have a picture of it, and also so you are practiced at whatever hoops you have to go through to get it. I know that when things are going sideways I don't want to be trying new things.

@xgwu
Author

xgwu commented Nov 17, 2016

I'll closely monitor the cluster status today during business peak time to see if the change makes a big difference.

If the issue doesn't recur, I'll arrange some time to revert the index store back to mmapfs and stress test to trigger the problem, so I can capture thread info before/after.


@clintongormley added the :Core/Infra/Core, discuss and feedback_needed labels Nov 19, 2016
@xgwu
Author

xgwu commented Nov 20, 2016

Since I changed the index store to niofs last Thursday night, the cluster survived the Friday business peak.

I managed to find some time today (Sunday) to carry out some tests.

First I changed the index store of one type of big index back to the default, i.e. mmapfs, and sent some queries over 3 days of indices using Kibana. All the queries had a timeout of 35 seconds set. Not long after I sent these queries, a few nodes appeared to struggle with searching again, and it took quite a long time for active threads to drop down. I then changed the store to niofs and ran more expensive queries to compare the thread stats. The screenshot below shows the result (stats from all nodes were stacked).
[screenshot from the monitoring system]

The first batch of queries was sent at 13:00, and it took a long time for active threads to drop down. Several search threads had not terminated even after an hour, so I had to restart some nodes. The second batch was sent at about 14:40, with more days of indices included in the request. This time, search active peaked only for a short time and search completed was much higher than in the previous test.

For the first test, I captured more details on a particular problematic node. Here's what its thread pool stats looked like:
[screenshot of thread pool stats]
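A sketch of how per-node search thread pool stats like these can be pulled from the _cat API (host and column selection are illustrative):

    curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed'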

Initially, open contexts stayed at the same number as active threads. When I reverted the index settings by closing the index at 14:08, I noticed open contexts dropped to 0 but active threads remained at 2 (the time difference may not be obvious on the attached screenshot). I am not sure whether it would eventually have dropped to 0, but I was too impatient to wait, so I restarted the node at 14:18, after which the last 2 active threads were gone.

[screenshot of thread pool stats]

Following @nik9000's advice, I captured thread dumps with jstack at different points in time (suffixed with HH:MM) on the problematic node. Please refer to the attached file below.
threads_dump.zip

At 14:13, one search thread appears to be building global ordinals:
"elasticsearch[oy_data_weak_1_10.15.118.11_B][search][T#4]" #135 daemon prio=5 os_prio=0 tid=0x00007f1e1c093000 nid=0x34cf runnable [0x00007f1e45be8000] java.lang.Thread.State: RUNNABLE at org.apache.lucene.index.MultiDocValues$OrdinalMap.<init>(MultiDocValues.java:531) at org.apache.lucene.index.MultiDocValues$OrdinalMap.build(MultiDocValues.java:484) at org.apache.lucene.index.MultiDocValues$OrdinalMap.build(MultiDocValues.java:463) at org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build(GlobalOrdinalsBuilder.java:61) at org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.localGlobalDirect(SortedSetDVOrdinalsIndexFieldData.java:103) at org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.localGlobalDirect(SortedSetDVOrdinalsIndexFieldData.java:38) at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.lambda$load$1(IndicesFieldDataCache.java:159) at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$$Lambda$1280/790193266.load(Unknown Source) at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:385) at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:154) at org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.loadGlobal(SortedSetDVOrdinalsIndexFieldData.java:91) at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals$FieldData.globalOrdinalsValues(ValuesSource.java:150) at org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals.globalMaxOrd(ValuesSource.java:123) at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.doCreateInternal(TermsAggregatorFactory.java:120) at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:55) at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:225) at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:102) at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:61) at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:104) at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$17(IndicesService.java:1109) at org.elasticsearch.indices.IndicesService$$Lambda$1410/1877142574.load(Unknown Source) at org.elasticsearch.indices.AbstractIndexShardCacheEntity.loadValue(AbstractIndexShardCacheEntity.java:73) at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:148) at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:133) at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:385) at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116) at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1113) at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:236) at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:251) at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$6(SearchTransportService.java:276) at org.elasticsearch.action.search.SearchTransportService$$Lambda$1029/773765215.messageReceived(Unknown Source) at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) at 
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1348) at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:504) at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

@jpountz
Contributor

jpountz commented Nov 22, 2016

@xgwu Can you clarify which time range was using the default store and which time range was using niofs?

@jpountz
Contributor

jpountz commented Nov 22, 2016

Also, how many nodes, how much data (in GB), and how many shards do you have in total? And how much RAM do you give to the JVM and to the filesystem cache on each node?

@xgwu
Author

xgwu commented Nov 22, 2016

It was mmapfs between 13:00 and 14:15, and niofs since 14:30.

We have 120 data nodes, 17,000 shards, 800 billion docs, and 900TB of data in total, including one replica.

60 out of the 120 nodes are backup nodes that take no writes. They are hosted on physical servers with 32 cores, 128GB RAM, and 12 x 4TB spinning disks in RAID 0. Each physical host runs 3 ES instances with a 31GB heap each, i.e. 35GB is left for the filesystem cache.

The other nodes are mainly for writes and run only 1 instance on each host. They only hold

@xgwu xgwu closed this as completed Nov 22, 2016
@xgwu
Author

xgwu commented Nov 22, 2016

Sorry, I closed the issue by mistake. Reopening...

@xgwu xgwu reopened this Nov 22, 2016
@xgwu
Author

xgwu commented Nov 22, 2016

Write nodes only hold one day's worth of data. The problem seems to happen only on some of the read-only backup nodes.

@jpountz
Contributor

jpountz commented Nov 22, 2016

To clarify: when you say node, you mean an Elasticsearch process rather than a physical machine, right?

The most immediate issue here is the sizing of the filesystem cache. Out of 128GB per physical server, 93GB go to the JVM and only 35GB go to the filesystem cache. There should be at least 64GB for the filesystem, probably even more. Does your JVM make use of all the available memory? If yes, do you know why? For instance are you using the completion suggester? You might want to give less memory to your JVMs and/or start fewer nodes per physical machine in order to give more memory to the filesystem cache. Note that decreasing the amount of memory per JVM also has the side benefit of making garbage collection more efficient since smaller heaps can be collected more quickly.
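As a hedged illustration of that suggestion, lowering the per-instance heap is done in config/jvm.options; the values below are purely illustrative, not a recommendation for this cluster:

    # config/jvm.options -- a smaller heap leaves more RAM for the filesystem cache
    -Xms16g
    -Xmx16g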

The jstack traces show that an _optimize call is running. This is not a bad thing since you seem to have time-based data, but it might also explain why building global ordinals is so slow. Are all your read-only indices expected to be merged down to a single segment (optimized)? Since you seem to be running terms aggregations, this actually helps: global ordinals come for free once a shard has been merged down to a single segment.

Since your nodes spend a lot of time building global ordinals, you might also want to look into increasing the refresh_interval (eg maybe 10s or 30s would be fine for your use case?) as well as building global ordinals on refresh rather than in search requests. In order to do that, just set eager_global_ordinals: true on the fields that you use in terms aggregations.
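A minimal sketch of both suggestions, assuming an index named logs-2016.11.17, a mapping type named logs and a keyword field named clientip (all placeholders):

    # relax the refresh interval (dynamic index setting)
    curl -XPUT 'http://localhost:9200/logs-2016.11.17/_settings' -d '{"index.refresh_interval": "30s"}'
    # build global ordinals at refresh time for the aggregated field
    curl -XPUT 'http://localhost:9200/logs-2016.11.17/_mapping/logs' -d '{
      "properties": {
        "clientip": { "type": "keyword", "eager_global_ordinals": true }
      }
    }'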

@xgwu
Author

xgwu commented Nov 23, 2016

@jpountz Yes, by node I mean an Elasticsearch process.

One more point to clarify: this issue also happened on a write node that has only one Elasticsearch process with a 31GB JVM on the physical machine, so 97GB of RAM is available there for the filesystem cache. But it only happened once there, compared to several times a day on the cold data nodes.

This cluster is used for application/business log search and analysis. The ops/business side requires a fairly long history of data, for example the last 3 months, to be immediately available for search whenever they need it. Keeping such huge cold indices open in Elasticsearch requires a lot of JVM space for segment memory. In the 2.4 era, I found that putting 3 instances on 1 physical box was a good balance for this kind of use case, as it allows us to hold about 30TB of indices open on a single physical machine, and the search performance was acceptable even when leaving only 35GB of memory for the filesystem cache.

And you are right that we force merge each shard down to 1 segment every night on these cold data nodes to reduce the segment memory footprint. The refresh_interval is already 60s globally on our cluster and has worked well for a long time. I am not sure whether setting eager_global_ordinals: true would alleviate this particular problem, as this is cold data that is never written anymore.
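For reference, the nightly merge described here corresponds to the force merge API (formerly _optimize); a sketch with a placeholder index name:

    curl -XPOST 'http://localhost:9200/logs-2016.11.17/_forcemerge?max_num_segments=1'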

We don't use the completion suggester on this cluster. The most-used features are queries and filters, plus terms, date histogram and percentiles aggregations. For some terms aggregations the cardinality can be very high, such as on a client IP field or a URL stem field.

Per your comment, it feels like when every part of an index is read via mmapfs, the available filesystem cache becomes very critical: once the search workload crosses a certain threshold, performance degrades very quickly. One other thing worrying me is why the search timeout has no effect in this situation. Every query was sent with a 35 second timeout, but a search thread could last for an hour without terminating itself. That eventually saturated all the search threads, making the node unresponsive.

For now I'll stick with niofs, as it looks performant enough in our situation and I am no longer seeing long-running search threads.

@jpountz
Contributor

jpountz commented Nov 23, 2016

I am not sure whether setting eager_global_ordinals: true would alleviate this particular problem as these are cold data never been written anymore.

The reason that made me think about it is that the jstack traces show that some search threads are building global ordinals, which would never happen if global ordinals were built at refresh time.

One other thing worrying me is that why search timeout takes no effect in this situation?

The way the timeout is implemented, Elasticsearch has checkpoints during the execution of a search request at which it checks the timeout. Unfortunately, some parts of query execution do not check the timeout while they are running, such as rewriting multi-term queries (e.g. range, wildcard or fuzzy queries) or building global ordinals.
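In other words, a request like the sketch below (index name, field and query are placeholders) can still run far past its budget if it spends its time in a phase that never checks the timeout, such as building global ordinals; the timed_out flag in the response is only set at the checkpoints that do check it.

    curl -s 'http://localhost:9200/logs-2016.11.17/_search' -d '{
      "timeout": "35s",
      "query": { "match_all": {} },
      "aggs": { "top_ips": { "terms": { "field": "clientip", "size": 10 } } }
    }'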

For now I'll stick to niofs

This sounds like a good idea indeed since niofs seems to perform better in your case.

@jpountz
Contributor

jpountz commented Nov 28, 2016

Since it looks like using niofs fixes your issue, I'll close.

@jpountz jpountz closed this as completed Nov 28, 2016