
search.default_search_timeout NOT working #21595

Closed
xgwu opened this issue Nov 16, 2016 · 5 comments · Fixed by #21599

Comments

@xgwu

xgwu commented Nov 16, 2016

Elasticsearch version: 5.0.0
Plugins installed: [] (none)
JVM version: 1.8.0_77
OS version: CentOS release 6.4 (Final)
Linux 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Since upgrading from v2.4.0 to v5.0.0, the global search timeout has no effect any more.
Steps to reproduce:

  1. Set the global search timeout to a low value, for example 1s:
     PUT /_cluster/settings
     {
       "transient": {
         "search.default_search_timeout": "1s"
       }
     }
  2. Execute an expensive query that takes far longer than 1s.
  3. The query is not timed out as expected, and the "timed_out" value in the response body is "false".
  4. If the query is rerun with a query-level timeout set in the request body, it times out as expected (see the reproduction sketch after these steps).
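
A minimal reproduction sketch in Python with the requests library, assuming a local node at http://localhost:9200 and a placeholder logs-* index and query (none of these values come from the affected cluster):

# Reproduction sketch: apply the transient default search timeout, run an
# expensive query, then repeat it with a request-level timeout for comparison.
# Host, index name, and query body are placeholders; adjust for your cluster.
import requests

ES = "http://localhost:9200"

# 1. Set the global default search timeout to 1s (transient).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"search.default_search_timeout": "1s"}},
).raise_for_status()

# 2. Run a deliberately expensive query (leading wildcard plus a large terms
#    aggregation); the field names are illustrative only.
expensive_query = {
    "query": {"wildcard": {"message": "*error*"}},
    "size": 0,
    "aggs": {"by_host": {"terms": {"field": "host", "size": 10000}}},
}
r = requests.post(f"{ES}/logs-*/_search", json=expensive_query)
print("global timeout  -> timed_out:", r.json().get("timed_out"))

# 3. Same query with an explicit request-level timeout in the body.
r = requests.post(f"{ES}/logs-*/_search", json={**expensive_query, "timeout": "1s"})
print("request timeout -> timed_out:", r.json().get("timed_out"))

On 5.0.0 the first search returns "timed_out": false despite the 1s default, while the second search, with the request-level timeout, times out as expected.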


@jasontedor
Member

Thanks for reporting, I think I see the issue.

@xgwu
Author

xgwu commented Nov 16, 2016

@jasontedor thanks for the quick response!

Besides the global search timeout not working in v5.0.0, the query-body timeout does not seem to function as well as it did in v2.4.0 under heavy search load.

We have a large cluster with 120 data nodes running on 70 physical servers, used for search and analysis over trillions of log entries. The search timeout in v2.4.0 protected the cluster pretty well by terminating expensive queries from inexperienced users.

This has not been the case since we upgraded to v5.0.0 last weekend: some data nodes saturate their disk IO during peak usage time on a daily basis, and we have to restart nodes quite frequently.

When a node is wearing out, iostat shows sustained peak disk reads, and the task management API shows queries that seem to run endlessly even though the node should have timed them out. Even after we stopped sending queries to the node, it kept working hard on its active search threads and the completed-search counter showed no sign of increasing. We also tried other measures, such as reducing the search thread pool size and queue length in the hope of lowering search concurrency and rejecting queries early during peak time, but without any luck.
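
For reference, we poll the task management API for in-flight search tasks roughly like this (a minimal sketch in Python; the node address is a placeholder, not one of our hosts):

# Sketch: fetch the current task list from the task management API and count
# in-flight search tasks per node. Point the URL at any node in the cluster.
import requests

resp = requests.get("http://localhost:9200/_tasks")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    search_tasks = [t for t in node["tasks"].values()
                    if t["action"].startswith("indices:data/read/search")]
    print(f"{node['name']}: {len(search_tasks)} in-flight search task(s)")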

Below is one example of the tasks API output while the node was "stuck" searching.

{
  "nodes": {
    "GS-XNaqeTRq3-hAJepD4RA": {
      "name": "oy_data_weak_2_10.15.118.23_A",
      "transport_address": "10.15.118.23:9301",
      "host": "10.15.118.23",
      "ip": "10.15.118.23:9301",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "boxtype": "weak",
        "instance": "A",
        "group": "2"
      },
      "tasks": {
        "GS-XNaqeTRq3-hAJepD4RA:31750": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 31750,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301391951,
          "running_time_in_nanos": 1434167740809,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:30673": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 30673,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301064457,
          "running_time_in_nanos": 1761662192887,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:29467": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 29467,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300706805,
          "running_time_in_nanos": 2119314046172,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:33197": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 33197,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301815548,
          "running_time_in_nanos": 1010571268993,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:33198": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 33198,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301815548,
          "running_time_in_nanos": 1010570972118,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27695": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27695,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190778,
          "running_time_in_nanos": 2635341006636,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27697": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27697,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190778,
          "running_time_in_nanos": 2635340988791,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:34611": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 34611,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479302089801,
          "running_time_in_nanos": 736317917130,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27703": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27703,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190779,
          "running_time_in_nanos": 2635340096275,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27704": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27704,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190779,
          "running_time_in_nanos": 2635339708378,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:37049": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37049,
          "type": "netty",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1479302739424,
          "running_time_in_nanos": 86694868464,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:37051": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37051,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1479302739426,
          "running_time_in_nanos": 86692652830,
          "cancellable": false,
          "parent_task_id": "GS-XNaqeTRq3-hAJepD4RA:37049"
        },
        "GS-XNaqeTRq3-hAJepD4RA:37053": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37053,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1479302739430,
          "running_time_in_nanos": 86689387697,
          "cancellable": false,
          "parent_task_id": "GS-XNaqeTRq3-hAJepD4RA:37049"
        }
      }
    }
  }
}
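
To put the running_time_in_nanos values above in perspective, the query-phase tasks have been running for roughly 12 to 44 minutes. A small sketch of the conversion, run against the captured output saved locally as tasks.json (the file name is just an example):

# Sketch: convert running_time_in_nanos from the captured tasks output into
# minutes, for the search tasks only, to make the durations readable.
import json

with open("tasks.json") as f:
    nodes = json.load(f)["nodes"]

for node in nodes.values():
    for task_id, task in node["tasks"].items():
        if task["action"].startswith("indices:data/read/search"):
            minutes = task["running_time_in_nanos"] / 1e9 / 60
            print(f"{task_id}: {minutes:.1f} minutes, cancellable={task['cancellable']}")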

@jasontedor
Member

Okay, I've confirmed what I suspected; I'll open a PR with a fix soon.

@jasontedor
Member

I opened #21599.

@jasontedor
Member

This will be fixed in 5.0.2.
