
search.default_search_timeout NOT working #21595

Closed
xgwu opened this issue Nov 16, 2016 · 5 comments · Fixed by #21599

Comments

@xgwu

xgwu commented Nov 16, 2016

Elasticsearch version: 5.0.0
Plugins installed: [] (none)
JVM version: 1.8.0_77
OS version: CentOS release 6.4 (Final)
Linux 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Since upgrading from v2.4.0 to v5.0.0, the global search timeout has no effect any more.
Steps to reproduce:

  1. Set the global search timeout to a low value, for example 1s:
     PUT /_cluster/settings
     {
       "transient": {
         "search.default_search_timeout": "1s"
       }
     }
  2. Execute an expensive query that takes far longer than 1s.
  3. The query is not timed out as expected, and the "timed_out" value in the response body is "false".
  4. If the query is rerun with a query-level timeout set in the request body, it times out as expected (see the reproduction sketch after these steps).
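
A minimal reproduction sketch in Python with the requests library, assuming a local node at http://localhost:9200 and a placeholder logs-* index and query (none of these values come from the affected cluster):

# Reproduction sketch: apply the transient default search timeout, run an
# expensive query, then repeat it with a request-level timeout for comparison.
# Host, index name, and query body are placeholders; adjust for your cluster.
import requests

ES = "http://localhost:9200"

# 1. Set the global default search timeout to 1s (transient).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"search.default_search_timeout": "1s"}},
).raise_for_status()

# 2. Run a deliberately expensive query (leading wildcard plus a large terms
#    aggregation); the field names are illustrative only.
expensive_query = {
    "query": {"wildcard": {"message": "*error*"}},
    "size": 0,
    "aggs": {"by_host": {"terms": {"field": "host", "size": 10000}}},
}
r = requests.post(f"{ES}/logs-*/_search", json=expensive_query)
print("global timeout  -> timed_out:", r.json().get("timed_out"))

# 3. Same query with an explicit request-level timeout in the body.
r = requests.post(f"{ES}/logs-*/_search", json={**expensive_query, "timeout": "1s"})
print("request timeout -> timed_out:", r.json().get("timed_out"))

On 5.0.0 the first search returns "timed_out": false despite the 1s default, while the second search, with the request-level timeout, times out as expected.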


@jasontedor
Member

Thanks for reporting, I think I see the issue.

@xgwu
Author

xgwu commented Nov 16, 2016

@jasontedor thanks for the quick response!

Besides the global search timeout not working in v5.0.0, the query-body timeout does not seem to function as well as it did in v2.4.0 under heavy search load.

We have a large cluster with 120 data nodes running on 70 physical servers, used for search and analysis over trillions of log entries. The search timeout in v2.4.0 protected the cluster pretty well by terminating expensive queries from inexperienced users.

This has not been the case since we upgraded to v5.0.0 last weekend: some data nodes saturate their disk IO during peak usage time on a daily basis, and we have to restart nodes quite frequently.

When a node is wearing out, iostat shows sustained peak disk reads, and the task management API shows queries that seem to run endlessly even though the node should have timed them out. Even after we stopped sending queries to the node, it kept working hard on its active search threads and the completed-search counter showed no sign of increasing. We also tried other measures, such as reducing the search thread pool size and queue length in the hope of lowering search concurrency and rejecting queries early during peak time, but without any luck.
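
For reference, we poll the task management API for in-flight search tasks roughly like this (a minimal sketch in Python; the node address is a placeholder, not one of our hosts):

# Sketch: fetch the current task list from the task management API and count
# in-flight search tasks per node. Point the URL at any node in the cluster.
import requests

resp = requests.get("http://localhost:9200/_tasks")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    search_tasks = [t for t in node["tasks"].values()
                    if t["action"].startswith("indices:data/read/search")]
    print(f"{node['name']}: {len(search_tasks)} in-flight search task(s)")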

Below is one example of the tasks API output while the node was "stuck" searching.

{
  "nodes": {
    "GS-XNaqeTRq3-hAJepD4RA": {
      "name": "oy_data_weak_2_10.15.118.23_A",
      "transport_address": "10.15.118.23:9301",
      "host": "10.15.118.23",
      "ip": "10.15.118.23:9301",
      "roles": [
        "data",
        "ingest"
      ],
      "attributes": {
        "boxtype": "weak",
        "instance": "A",
        "group": "2"
      },
      "tasks": {
        "GS-XNaqeTRq3-hAJepD4RA:31750": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 31750,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301391951,
          "running_time_in_nanos": 1434167740809,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:30673": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 30673,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301064457,
          "running_time_in_nanos": 1761662192887,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:29467": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 29467,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300706805,
          "running_time_in_nanos": 2119314046172,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:33197": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 33197,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301815548,
          "running_time_in_nanos": 1010571268993,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:33198": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 33198,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479301815548,
          "running_time_in_nanos": 1010570972118,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27695": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27695,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190778,
          "running_time_in_nanos": 2635341006636,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27697": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27697,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190778,
          "running_time_in_nanos": 2635340988791,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:34611": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 34611,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479302089801,
          "running_time_in_nanos": 736317917130,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27703": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27703,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190779,
          "running_time_in_nanos": 2635340096275,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:27704": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 27704,
          "type": "netty",
          "action": "indices:data/read/search[phase/query]",
          "start_time_in_millis": 1479300190779,
          "running_time_in_nanos": 2635339708378,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:37049": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37049,
          "type": "netty",
          "action": "indices:data/write/bulk",
          "start_time_in_millis": 1479302739424,
          "running_time_in_nanos": 86694868464,
          "cancellable": false
        },
        "GS-XNaqeTRq3-hAJepD4RA:37051": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37051,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1479302739426,
          "running_time_in_nanos": 86692652830,
          "cancellable": false,
          "parent_task_id": "GS-XNaqeTRq3-hAJepD4RA:37049"
        },
        "GS-XNaqeTRq3-hAJepD4RA:37053": {
          "node": "GS-XNaqeTRq3-hAJepD4RA",
          "id": 37053,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "start_time_in_millis": 1479302739430,
          "running_time_in_nanos": 86689387697,
          "cancellable": false,
          "parent_task_id": "GS-XNaqeTRq3-hAJepD4RA:37049"
        }
      }
    }
  }
}
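
To put the running_time_in_nanos values above in perspective, the query-phase tasks have been running for roughly 12 to 44 minutes. A small sketch of the conversion, run against the captured output saved locally as tasks.json (the file name is just an example):

# Sketch: convert running_time_in_nanos from the captured tasks output into
# minutes, for the search tasks only, to make the durations readable.
import json

with open("tasks.json") as f:
    nodes = json.load(f)["nodes"]

for node in nodes.values():
    for task_id, task in node["tasks"].items():
        if task["action"].startswith("indices:data/read/search"):
            minutes = task["running_time_in_nanos"] / 1e9 / 60
            print(f"{task_id}: {minutes:.1f} minutes, cancellable={task['cancellable']}")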

@jasontedor
Member

Okay, I've confirmed what I suspected; I'll open a PR with a fix soon.

@jasontedor
Member

I opened #21599.

@jasontedor
Member

This will be fixed in 5.0.2.
