A new generic timeout handler and changes to existing search classes to use it #4586

Closed
wants to merge 6 commits

Conversation

markharwood
Contributor

This is a more effective approach to time-limiting activities such as search requests. Special runtime exceptions can now short-cut the execution of long-running calls into Lucene classes; they are caught and reported back, not as a fatal error, but via the existing “timedOut” flag in the results.

Phases like the FetchPhase can now exit early and so also have a timed-out status. The SearchPhaseController does its best to assemble whatever hits, aggregations and facets have been produced within the provided time limits rather than returning nothing and throwing an error.

ActivityTimeMonitor is the new central class for efficiently monitoring all forms of thread overrun in a JVM.
The SearchContext setup is modified to register the start and end of query tasks with ActivityTimeMonitor.
Store.java is modified to add timeout checks (via calls to ATM) in the low-level file access routines by using a delegating wrapper for Lucene's IndexInput and IndexInputSlicer.
ContextIndexSearcher is modified to catch and unwrap ActivityTimedOutExceptions that can now come out of the Lucene calls and report them as timeouts along with any partial results.
FetchPhase is similarly modified to deal with the possibility of timeout errors.
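
To make that flow concrete, here is a minimal sketch of the pattern described above. It is illustrative only: the class and method names below are invented stand-ins for ActivityTimeMonitor and ActivityTimedOutException, not the code in this PR. A task registers a per-thread deadline, hot code paths poll it cheaply, and an unchecked exception unwinds the stack so the search layer can flag a timeout and keep any partial results.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the ActivityTimeMonitor described above; not the PR's code.
final class TimeMonitorSketch {

    // Plays the role of ActivityTimedOutException: unchecked, so it can unwind deep
    // Lucene call stacks without changing method signatures.
    static final class TaskTimedOutException extends RuntimeException {
        TaskTimedOutException(String message) { super(message); }
    }

    // Absolute deadline (epoch millis) for each thread running a time-limited task.
    private static final Map<Thread, Long> DEADLINES = new ConcurrentHashMap<>();

    // Called when a query or fetch task starts (cf. the SearchContext setup above).
    static void start(long timeoutMillis) {
        DEADLINES.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    // Called when the task finishes, successfully or not.
    static void stop() {
        DEADLINES.remove(Thread.currentThread());
    }

    // Cheap poll intended for hot paths such as a delegating IndexInput's read methods.
    static void checkForTimeout() {
        Long deadline = DEADLINES.get(Thread.currentThread());
        if (deadline != null && System.currentTimeMillis() > deadline) {
            throw new TaskTimedOutException("task exceeded its time limit");
        }
    }
}

In the change itself, ContextIndexSearcher and FetchPhase catch the corresponding exception, set the timedOut flag, and hand back whatever partial results were gathered.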

synchronized (timeLimitedThreads) {
// find out which is the next candidate for failure

// TODO Not the fastest conceivable data structure to achieve this.
Member

Why not one of the priority queue style data structures?

Contributor Author

Happy to consider any concrete recommendations.
I need: 1) fast random lookups on an object key (Thread), 2) weak references on keys, and 3) sorted iteration over values (timestamps).
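
Purely to illustrate those three constraints (this is not code from the branch, and every name here is invented): a WeakHashMap keyed by Thread gives weak keys and constant-time lookup, and a copy-and-sort of the entries gives the ordered scan, at the cost the TODO above alludes to.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: weak keys, fast lookup by Thread, and a deadline-ordered scan.
final class TimeLimitedThreadsSketch {

    // WeakHashMap covers requirements 1 and 2: hash lookup on the Thread key,
    // and keys that are weakly referenced. All access synchronizes on the map.
    private final Map<Thread, Long> deadlines = new WeakHashMap<>();

    void register(Thread thread, long deadlineMillis) {
        synchronized (deadlines) {
            deadlines.put(thread, deadlineMillis);
        }
    }

    Long deadlineOf(Thread thread) {
        synchronized (deadlines) {
            return deadlines.get(thread);
        }
    }

    // Requirement 3: iterate entries ordered by deadline. Copy-and-sort is O(n log n)
    // per scan, which is the "not the fastest conceivable" trade-off noted in the TODO.
    List<Map.Entry<Thread, Long>> byDeadline() {
        synchronized (deadlines) {
            List<Map.Entry<Thread, Long>> copy = new ArrayList<>();
            for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
                copy.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
            }
            copy.sort(Map.Entry.comparingByValue());
            return copy;
        }
    }
}

A priority queue ordered by deadline would make the scan cheaper, but it needs extra bookkeeping to stay in sync with the weak-keyed lookup map, which is essentially the trade-off under discussion.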

Contributor

I think if we use message passing here and make all writes private to the timeout thread, we can just use multiple data structures and priority queues. I really think there should be only a single thread modifying all data structures here, and that thread should pull commands from a queue, also in a single-threaded fashion, sending results back via callbacks. This goes together with my suggestion above related to ThreadTimeoutState.
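
As a rough illustration of that single-writer idea (an interpretation, not code from this PR, and all names below are invented): one monitor thread owns both a plain HashMap for lookups and a PriorityQueue ordered by deadline, and every start/stop event reaches it as a command on a BlockingQueue, so none of the data structures needs its own locking.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative single-writer monitor: only the monitor thread touches the data structures.
final class SingleWriterTimeoutMonitorSketch implements Runnable {

    private static final class Command {
        final Thread thread;
        final long deadlineMillis;
        final boolean start; // true = register, false = deregister
        Command(Thread thread, long deadlineMillis, boolean start) {
            this.thread = thread;
            this.deadlineMillis = deadlineMillis;
            this.start = start;
        }
    }

    private final BlockingQueue<Command> commands = new LinkedBlockingQueue<>();
    // Owned exclusively by the monitor thread, so no synchronization is needed.
    private final Map<Thread, Long> deadlines = new HashMap<>();
    private final PriorityQueue<Command> byDeadline =
            new PriorityQueue<>((a, b) -> Long.compare(a.deadlineMillis, b.deadlineMillis));

    void register(long timeoutMillis) {
        commands.offer(new Command(Thread.currentThread(),
                System.currentTimeMillis() + timeoutMillis, true));
    }

    void deregister() {
        commands.offer(new Command(Thread.currentThread(), 0L, false));
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Wake for the next command or, at the latest, every 100 ms.
                Command cmd = commands.poll(100, TimeUnit.MILLISECONDS);
                if (cmd != null) {
                    if (cmd.start) {
                        deadlines.put(cmd.thread, cmd.deadlineMillis);
                        byDeadline.offer(cmd);
                    } else {
                        deadlines.remove(cmd.thread);
                    }
                }
                // Flag the earliest overrunning task, skipping stale queue entries.
                Command next;
                while ((next = byDeadline.peek()) != null
                        && next.deadlineMillis <= System.currentTimeMillis()) {
                    byDeadline.poll();
                    Long current = deadlines.get(next.thread);
                    if (current != null && current == next.deadlineMillis) {
                        deadlines.remove(next.thread);
                        next.thread.interrupt(); // one possible way to signal the overrun
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit cleanly on shutdown
        }
    }
}

Whether the signal is an interrupt or, as in this PR, a deadline that the worker thread itself polls, the point of the design is that no worker ever writes to the monitor's structures directly.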

@nik9000
Member

nik9000 commented Jan 7, 2014

So I just found a weird runaway highlighting process: a request that took 57 minutes. So +1 for this.

@s1monw
Contributor

s1monw commented Jan 7, 2014

@nik9000 I don't think we can prevent this unless we fix the highlighter ;/

@nik9000
Member

nik9000 commented Jan 7, 2014

Even though we've got a fix for the highlighter, I still think it'd be worth checking the timeout during the highlight process in case other bugs like this come up. No reason it has to be part of this pull request, though.

@clintongormley

@markharwood what's the status on this one?

@markharwood
Contributor Author

Need to gather confidence about the implementation's performance via benchmarking; following that, we should consider where to apply these timeout checks across the platform.

@avleen

avleen commented Aug 5, 2014

Big +1 for this!

@rmuir
Contributor

rmuir commented Aug 26, 2014

Can the benchmark include doc values access as well? Because the change wraps but doesn't override randomAccessSlice, I'm afraid it would effectively undo many of the performance improvements from Lucene 4.9.

@rmuir
Contributor

rmuir commented Aug 26, 2014

By wrapping all IndexInputs we also lose a good deal of the NIOFS optimizations (e.g. reduced bounds checks for readLong/readVInt and so on).

@markharwood
Contributor Author

@rmuir Thanks for taking a look. I wrapped the RandomAccessInputs and was re-running benchmarks with doc values. I got side-tracked though because the benchmarks revealed a stack overflow issue. It took a while to track down but the misbehaving query is this one with multiple scripts: https://gist.github.com/markharwood/0747e741b6fed9bbb32b#file-stack-trace
I'll submit an update on this PR tomorrow.

@markharwood
Contributor Author

Updated benchmarks using an index with Doc Values for one of the fields used in aggregations.

[benchmark chart attachment: pr4586dv]

@clintongormley

I talked to @rmuir about the query killer and his concerns are as follows (at least as I understand them):

  • This wraps everything in Lucene, which incurs overhead whether it is going to be used or not
  • The overhead may not be so apparent with searches (because of the way Lucene reads in bulk), but
  • It will have a significant impact on things like doc values, because reads are not bulked. He has gone through the doc values code with a fine-tooth comb to reduce the number of CPU instructions in order to make them as fast as they are, and adding a check (even if disabled) for every doc value read will be costly.

Who needs this feature? Really, it's for the user who is running Elasticsearch as a service and who doesn't have control over the types of queries run against the cluster. The query killer is intended as a safety net. However, enabling this feature by default means that the user who does have control over their queries (or who doesn't want to use timeouts) will pay the cost of poorer performance regardless.

The query killer will be useful functionality for a subset of users, but it should be disabled by default and only enabled by those who need it. That raises the question of how to implement it.

It could be a config setting that is applied only at startup, or it could be a per-index setting (via an alternate directory implementation) that can be enabled only on specific indices. Changing the setting would probably require closing and reopening the index. The latter would be my preference.

Thoughts?

@avleen

avleen commented Sep 25, 2014

I would be really happy with any implementation. I like the per-index idea too. If I can just add the setting in the mapping template, that's an easy win.

As for who needs it: Yesterday one of my users ran a simple query against the message field of 1 day of logstash data, and got 71 results back. The field is analyzed with the standard analyser, and the index is about 4TB, so quite a number of tokens in that field. It ran really fast!
Then they did a terms aggregation on the same field. After 3 hours of constant GC as all 24 Elasticsearch nodes tried to work on the query, I had to restart the cluster. This took 45 minutes.
After that, there was ~8 hours of lag as logs slowly filtered back in. Replicas didn't finish rebuilding until this morning, almost 24 hours later.
There are a number of us who don't have control over the queries run, but nor should we need to. We use Elasticsearch to enable others to work faster. Adding a layer in between to approve every possible query that could be run... it's just not feasible.

@clintongormley

@avleen agreed; for those who need it (like yourself) this functionality would be very, very helpful. Just as a side note: you know that you can disable the loading of field data on certain fields? That would at least help you prevent problems like the one you experienced:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/fielddata-formats.html#_disabling_field_data_loading

@markharwood
Contributor Author

Coding something up with this new index setting (it must be set on a closed index):

index.thorough_timeout_checks : true

Default is false
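
Assuming that proposed setting name (this PR was never merged, so the snippet below is purely illustrative and the setting does not exist in released Elasticsearch), enabling it on an existing index through the 1.x-era Java admin client would look roughly like this: close the index, apply the setting, reopen.

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

final class ThoroughTimeoutToggleSketch {

    // Illustrative only: "index.thorough_timeout_checks" is the setting proposed in this PR.
    static void enable(Client client, String index) {
        client.admin().indices().prepareClose(index).execute().actionGet();
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.thorough_timeout_checks", true)
                        .build())
                .execute().actionGet();
        client.admin().indices().prepareOpen(index).execute().actionGet();
    }
}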

…ing requests “thorough” timeout checking. Avoids paying the performance penalty of timeout checks if not required
…o doesn’t trigger the Lucene disk accesses that contain timeout checks
@avleen

avleen commented Sep 26, 2014

Thanks Clinton! I'm turning that on :-)

@hazzadous

Is there a target release for this? Or is there any way I could help? This is one of my all-time desired features. Has anyone given this a whirl in production?

@clintongormley

Closing based on this comment: #9168 (comment)

Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories

8 participants