A new generic timeout handler and changes to existing search classes to use it #4586

Closed
wants to merge 6 commits

Conversation

markharwood
Contributor

This is a more effective approach to time-limiting activities such as search requests. Special runtime exceptions can now short-cut the execution of long-running calls into Lucene classes; they are caught and reported back, not as a fatal error, but via the existing “timedOut” flag in the results.

Phases like the FetchPhase can now exit early and so also have a timed-out status. The SearchPhaseController does its best to assemble whatever hits, aggregations and facets have been produced within the provided time limits rather than returning nothing and throwing an error.

ActivityTimeMonitor is the new central class for efficiently monitoring all forms of thread overrun in a JVM.
The SearchContext setup is modified to register the start and end of query tasks with ActivityTimeMonitor.
Store.java is modified to add timeout checks (via calls to ATM) in the low-level file access routines by using a delegating wrapper for Lucene's IndexInput and IndexInputSlicer.
ContextIndexSearcher is modified to catch and unwrap ActivityTimedOutExceptions that can now come out of the Lucene calls and report them as timeouts along with any partial results.
FetchPhase is similarly modified to deal with the possibility of timeout errors.
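
To make that flow concrete, here is a minimal sketch of the pattern described above. It is illustrative only: the class and method names below are invented stand-ins for ActivityTimeMonitor and ActivityTimedOutException, not the code in this PR. A task registers a per-thread deadline, hot code paths poll it cheaply, and an unchecked exception unwinds the stack so the search layer can flag a timeout and keep any partial results.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the ActivityTimeMonitor described above; not the PR's code.
final class TimeMonitorSketch {

    // Plays the role of ActivityTimedOutException: unchecked, so it can unwind deep
    // Lucene call stacks without changing method signatures.
    static final class TaskTimedOutException extends RuntimeException {
        TaskTimedOutException(String message) { super(message); }
    }

    // Absolute deadline (epoch millis) for each thread running a time-limited task.
    private static final Map<Thread, Long> DEADLINES = new ConcurrentHashMap<>();

    // Called when a query or fetch task starts (cf. the SearchContext setup above).
    static void start(long timeoutMillis) {
        DEADLINES.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    // Called when the task finishes, successfully or not.
    static void stop() {
        DEADLINES.remove(Thread.currentThread());
    }

    // Cheap poll intended for hot paths such as a delegating IndexInput's read methods.
    static void checkForTimeout() {
        Long deadline = DEADLINES.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline) {
            throw new TaskTimedOutException("task exceeded its time limit");
        }
    }
}

In the change itself, ContextIndexSearcher and FetchPhase catch the corresponding exception, set the timedOut flag, and hand back whatever partial results were gathered.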

synchronized (timeLimitedThreads) {
// find out which is the next candidate for failure

// TODO Not the fastest conceivable data structure to achieve this.
Member

Why not one of the priority queue style data structures?

Contributor Author

Happy to consider any concrete recommendations.
I need: 1) fast random lookups on an object key (Thread), 2) weak references on keys, and 3) sorted iteration over values (timestamps).
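
Purely to illustrate those three constraints (this is not code from the branch, and every name here is invented): a WeakHashMap keyed by Thread gives weak keys and constant-time lookup, and a copy-and-sort of the entries gives the ordered scan, at the cost the TODO above alludes to.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: weak keys, fast lookup by Thread, and a deadline-ordered scan.
final class TimeLimitedThreadsSketch {

    // WeakHashMap covers requirements 1 and 2: hash lookup on the Thread key,
    // and keys that are weakly referenced. All access synchronizes on the map.
    private final Map<Thread, Long> deadlines = new WeakHashMap<>();

    void register(Thread thread, long deadlineMillis) {
        synchronized (deadlines) {
            deadlines.put(thread, deadlineMillis);
        }
    }

    Long deadlineOf(Thread thread) {
        synchronized (deadlines) {
            return deadlines.get(thread);
        }
    }

    // Requirement 3: iterate entries ordered by deadline. Copy-and-sort is O(n log n)
    // per scan, which is the "not the fastest conceivable" trade-off noted in the TODO.
    List<Map.Entry<Thread, Long>> byDeadline() {
        synchronized (deadlines) {
            List<Map.Entry<Thread, Long>> copy = new ArrayList<>();
            for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
                copy.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
            }
            copy.sort(Map.Entry.comparingByValue());
            return copy;
        }
    }
}

A priority queue ordered by deadline would make the scan cheaper, but it needs extra bookkeeping to stay in sync with the weak-keyed lookup map, which is essentially the trade-off under discussion.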

Contributor

I think if we use message passing here and make all writes private to the timeout thread, we can just use multiple data structures and priority queues. I really think there should be only a single thread modifying all data structures here, and that thread should pull commands from a queue, also in a single-threaded fashion, sending results back via callbacks. This goes together with my suggestion above related to ThreadTimeoutState.
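
As a rough illustration of that single-writer idea (an interpretation, not code from this PR, and all names below are invented): one monitor thread owns both a plain HashMap for lookups and a PriorityQueue ordered by deadline, and every start/stop event reaches it as a command on a BlockingQueue, so none of the data structures needs its own locking.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative single-writer monitor: only the monitor thread touches the data structures.
final class SingleWriterTimeoutMonitorSketch implements Runnable {

    private static final class Command {
        final Thread thread;
        final long deadlineMillis;
        final boolean start; // true = register, false = deregister
        Command(Thread thread, long deadlineMillis, boolean start) {
            this.thread = thread;
            this.deadlineMillis = deadlineMillis;
            this.start = start;
        }
    }

    private final BlockingQueue<Command> commands = new LinkedBlockingQueue<>();
    // Owned exclusively by the monitor thread, so no synchronization is needed.
    private final Map<Thread, Long> deadlines = new HashMap<>();
    private final PriorityQueue<Command> byDeadline =
            new PriorityQueue<>((a, b) -> Long.compare(a.deadlineMillis, b.deadlineMillis));

    void register(long timeoutMillis) {
        commands.offer(new Command(Thread.currentThread(),
                System.currentTimeMillis() + timeoutMillis, true));
    }

    void deregister() {
        commands.offer(new Command(Thread.currentThread(), 0L, false));
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Wake for the next command or, at the latest, every 100 ms.
                Command cmd = commands.poll(100, TimeUnit.MILLISECONDS);
                if (cmd != null) {
                    if (cmd.start) {
                        deadlines.put(cmd.thread, cmd.deadlineMillis);
                        byDeadline.offer(cmd);
                    } else {
                        deadlines.remove(cmd.thread);
                    }
                }
                // Flag the earliest overrunning task, skipping stale queue entries.
                Command next;
                while ((next = byDeadline.peek()) != null
                        && next.deadlineMillis <= System.currentTimeMillis()) {
                    byDeadline.poll();
                    Long current = deadlines.get(next.thread);
                    if (current != null && current == next.deadlineMillis) {
                        deadlines.remove(next.thread);
                        next.thread.interrupt(); // one possible way to signal the overrun
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit cleanly on shutdown
        }
    }
}

Whether the signal is an interrupt or, as in this PR, a deadline that the worker thread itself polls, the point of the design is that no worker ever writes to the monitor's structures directly.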

@nik9000
Member

nik9000 commented Jan 7, 2014

So I just found a weird runaway highlighting process: a request that took 57 minutes. So +1 for this.

@s1monw
Contributor

s1monw commented Jan 7, 2014

@nik9000 I don't think we can prevent this unless we fix the highlighter ;/

@nik9000
Member

nik9000 commented Jan 7, 2014

Even though we've got a fix for the highlighter, I still think it'd be worth checking the timeout during the highlight process in case other bugs like this come up. No reason it has to be part of this pull request, though.

@clintongormley

@markharwood what's the status on this one?

@markharwood
Contributor Author

Need to gather confidence about the implementation's performance via benchmarking; following that, we should consider where to apply these timeout checks across the platform.

@avleen

avleen commented Aug 5, 2014

Big +1 for this!

@rmuir
Contributor

rmuir commented Aug 26, 2014

Can the benchmark include doc values access as well? Because the change wraps but doesn't override randomAccessSlice, I'm afraid it would effectively undo many of the performance improvements from Lucene 4.9.

@rmuir
Contributor

rmuir commented Aug 26, 2014

By wrapping all IndexInputs we also lose a good deal of the NIOFS optimizations (e.g. reduced bounds checks for readLong/readVInt and so on).

@markharwood
Contributor Author

@rmuir Thanks for taking a look. I wrapped the RandomAccessInputs and was re-running benchmarks with doc values. I got side-tracked though because the benchmarks revealed a stack overflow issue. It took a while to track down but the misbehaving query is this one with multiple scripts: https://gist.github.com/markharwood/0747e741b6fed9bbb32b#file-stack-trace
I'll submit an update on this PR tomorrow.

@markharwood
Contributor Author

Updated benchmarks using an index with Doc Values for one of the fields used in aggregations.

[benchmark chart attachment: pr4586dv]

@clintongormley

I talked to @rmuir about the query killer and his concerns are as follows (at least as I understand them):

  • This wraps everything in Lucene, which incurs overhead whether it is going to be used or not
  • The overhead may not be so apparent with searches (because of the way Lucene reads in bulk), but
  • It will have a significant impact on things like doc values, because reads are not bulked. He has gone through the doc values code with a fine-tooth comb to reduce the number of CPU instructions in order to make them as fast as they are, and adding a check (even if disabled) for every doc value read will be costly.

Who needs this feature? Really, it's for the user who is running Elasticsearch as a service and who doesn't have control over the types of queries run against the cluster. The query killer is intended as a safety net. However, enabling this feature by default means that the user who does have control over their queries (or who doesn't want to use timeouts) will pay the cost of poorer performance regardless.

The query killer will be useful functionality for a subset of users, but it should be disabled by default and only enabled by those who need it. That raises the question of how to implement it.

It could be a config setting that is applied only at startup, or it could be a per-index setting (via an alternate directory implementation) that can be enabled only on specific indices. Changing the setting would probably require closing and reopening the index. The latter would be my preference.

Thoughts?

@avleen

avleen commented Sep 25, 2014

I would be really happy with any implementation. I like the per-index idea too. If I can just add the setting in the mapping template, that's an easy win.

As for who needs it: Yesterday one of my users ran a simple query against the message field of 1 day of logstash data, and got 71 results back. The field is analyzed with the standard analyser, and the index is about 4TB, so quite a number of tokens in that field. It ran really fast!
Then they did a terms aggregation on the same field. After 3 hours of constant GC as all 24 Elasticsearch nodes tried to work on the query, I had to restart the cluster. This took 45 minutes.
After that, there was ~8 hours of lag as logs slowly filtered back in. Replicas didn't finish rebuilding until this morning, almost 24 hours later.
There are a number of us who don't have control over the queries run, but nor should we need to. We use Elasticsearch to enable others to work faster. Adding a layer in between to approve every possible query that could be run... it's just not feasible.

@clintongormley

@avleen agreed; for those who need it (like yourself) this functionality would be very, very helpful. Just as a side note: you know that you can disable the loading of field data on certain fields? That would at least help you prevent problems like the one you experienced:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/fielddata-formats.html#_disabling_field_data_loading

@markharwood
Contributor Author

Coding something up with this new index setting (it must be set on a closed index):

index.thorough_timeout_checks : true

Default is false
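
Assuming that proposed setting name (this PR was never merged, so the snippet below is purely illustrative and the setting does not exist in released Elasticsearch), enabling it on an existing index through the 1.x-era Java admin client would look roughly like this: close the index, apply the setting, reopen.

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

final class ThoroughTimeoutToggleSketch {

    // Illustrative only: "index.thorough_timeout_checks" is the setting proposed in this PR.
    static void enable(Client client, String index) {
        client.admin().indices().prepareClose(index).execute().actionGet();
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.thorough_timeout_checks", true)
                        .build())
                .execute().actionGet();
        client.admin().indices().prepareOpen(index).execute().actionGet();
    }
}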

…ing requests “thorough” timeout checking. Avoids paying the performance penalty of timeout checks if not required
…o doesn’t trigger the Lucene disk accesses that contain timeout checks
@avleen

avleen commented Sep 26, 2014

Thanks Clinton! I'm turning that on :-)

@hazzadous

Is there a target release for this? Or is there any way I could help? This is one of my all-time desired features. Has anyone given this a whirl in production?

@clintongormley

Closing based on this comment: #9168 (comment)

Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories

8 participants