Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Cancellable DirectoryReader #52822

Merged
merged 32 commits into from
Mar 5, 2020
Merged

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Feb 26, 2020

Implement a Cancellable DirectoryReader that wraps the original
DirectoryReader so that when a search task is cancelled the
DirectoryReaders also stop their work fast. This is usuful for
expensive operations like wilcard/prefix queries where the
DirectoryReaders can spend lots of time and consume resources,
as previously their work wouldn't stop even though the original
search task was cancelled (e.g. because of timeout or dropped client
connection).

Implement a Cancellable DirectoryReader that wraps the original
DirectoryReader so that when a search task is cancelled the
DirectoryReaders also stop their work fast. This is usuful for
expensive operations like wilcard/prefix queries where the
DirectoryReaders can spend lots of time and consume resources,
as previously their work wouldn't stop even though the original
search task was cancelled (e.g. because of timeout or dropped client
connection).
@matriv matriv added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Feb 26, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jpountz jpountz self-requested a review February 26, 2020 15:02
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some suggestions but this looks like it's on the right track to me.

@@ -708,6 +708,8 @@ private DefaultSearchContext createSearchContext(SearchRewriteContext rewriteCon
searcher, clusterService, indexService, indexShard, bigArrays, threadPool::relativeTimeInMillis, timeout, fetchPhase);
success = true;
return searchContext;
} catch (IOException e) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this leniency looks dangerous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I simply propagate the exception? or any other suggestion?

}
cancellable.get().run();
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation feels a bit awkward, I'd rather like to fork ExitableDirectoryReader entirely to not inherit from its QueryTimeout abstraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that in the 1st approach but this means we have to copy the whole ExitablePointValues to wrap the point values and therefore the ExitableIntersectVisitor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind copying it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to copy

IndexSearcher searcher = new IndexSearcher(reader);
PercentileRanks ranks = search(searcher, new MatchAllDocsQuery(), aggBuilder, fieldType);
Percentile rank = ranks.iterator().next();
assertEquals(Double.NaN, rank.getPercent(), 0d);
assertEquals(0.5, rank.getValue(), 0d);
assertFalse(AggregationInspectionHelper.hasValue(((InternalTDigestPercentileRanks)ranks)));
} finally {
unmappedIndexWriter.close();
directory.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use try-with-resources?

public ContextIndexSearcher(IndexReader reader, Similarity similarity,
QueryCache queryCache, QueryCachingPolicy queryCachingPolicy,
boolean shouldWrap) throws IOException {
super(shouldWrap? new CancellableIndexReader((DirectoryReader) reader, new Cancellable()) : reader);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CancellableIndexReader shouldn't have any overhead, so it might be simpler to wrap all the time here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This boolean was added together with a second constructor because of the AggregatorTestCase and hundreds of tests that derive from that. If we wrap the IndexReader we get:

java.lang.AssertionError: The top-reader used to create Weight is not the same as the current reader's top-reader (org.apache.lucene.index.CompositeReaderContext@382edaaa

which I tried to fix by changing the AggregatorTestCase to receive IndexReader and not IndexSearcher as an argument. and all the tests to use the derived IndexSearcher from the context created. But even with this there were a few more tests failing that didn't manage to fix, so after discussion with @jimczi we decided to make this workaround for the moment and address the issue in a separate PR afterwards.

I can add a TODO though to not miss it.

/**
* Wraps an {@link IndexReader} with a cancellation Runnable task.
*/
private static class CancellableIndexReader extends FilterDirectoryReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it CancellableDirectoryReader if it extends DirectoryReader.

@matriv
Copy link
Contributor Author

matriv commented Feb 27, 2020

@jpountz @jimczi I had to fork the ExitableDirectoryReader because of this: https://github.com/elastic/elasticsearch/pull/52822/files#diff-913ad694a7d744ee93ae1dac48d67b0eR415

The CompletionTerms is a final class and the code in CompletionWeight here: https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/suggest/document/CompletionWeight.java#L76 expects CompletionTerms, so if they are wrapped into ExitableTerms the if leads to an IllegalArguementException.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forking sounds totally fine to me.

Can you explain why we need to unset the timeout at some point? I'm guessing that it is to allow aggregations to build their results without errors? A comment would help clarify.

Also I'm not sure why some tests need to become integration tests? These tests used to be integration tests that we migrated to unit test.

@@ -310,6 +297,8 @@ static boolean executeInternal(SearchContext searchContext) throws QueryPhaseExe
queryResult.nodeQueueSize(rExecutor.getCurrentQueueSize());
queryResult.serviceTimeEWMA((long) rExecutor.getTaskExecutionEWMA());
}
// Search phase has finished, no longer need to check for timeout
searcher.unsetCheckTimeout();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we don't experience a timeout during the aggregation phase. (there were integ tests that were failing because of this). I changed the comment to mention it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do this on the cancellable object directly. The searcher shouldn't be able to unset this way. See my comment on the context searcher.

@matriv
Copy link
Contributor Author

matriv commented Feb 28, 2020

Forking sounds totally fine to me.

Thx!

Also I'm not sure why some tests need to become integration tests? These tests used to be integration tests that we migrated to unit test.

@jimczi Could you please help explaining?
Since we have at the moment that 2nd constructor that's necessary for the AggregatorTestCase tests maybe we can use it also for those so we don't attempt to wrap the MultiReader.

@jimczi
Copy link
Contributor

jimczi commented Feb 28, 2020

Also I'm not sure why some tests need to become integration tests? These tests used to be integration tests that we migrated to unit test.

We can probably fix these tests rather than moving them again into an integration tests. We've made some workaround in the AggregatorTestCase that are now hitting us badly. For instance we allow to use a MultiReader instead of a DirectoryReader so a lot of new tests are using this shortcut even though we always expect a DirectoryReader when used outside of tests. We also have a unit test that simulates the mixing of an unmapped field with a mapped one but the base class AggregatorTestCase only exposes a single mapping. I think we should cleanup those tests and ensure that we're making the same assumption in tests than when deploying, however this is out of scope for this pr so I told Marios that we should tackle this in a separate one. To unblock his work we have the workaround here so that we don't try to wrap the reader if it's not a standard directory reader.

@matriv matriv marked this pull request as ready for review February 28, 2020 15:14
@matriv matriv requested a review from jimczi February 28, 2020 15:14
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me overall. I left some comments to simplify the logic in the ContextIndexSearcher and to move the cancellation impl in the QueryPhase directly.

void unsetCheckTimeout();
}

public static class CancellableImpl implements Cancellable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation should live on the QueryPhase imo. This way you don't need to expose the checkDirReaderCancelled and checkDirReaderCancelled in the searcher.


boolean isEnabled();
void checkCancelled();
default void checkDirReaderCancelled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions should not be part of the interface, checkCancelled should be enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that much either, I exposed the checkDirReaderCancelled() so that I we can have tests that differentiate between exit in the searcher and exit from the dirReader: https://github.com/elastic/elasticsearch/pull/52822/files#diff-2abce48f3a52657ce3740afade6d5f8fR132
But as discussed I could just unit test the Cancellable reader instead.

/**
* iFace which implements the query timeout / cancellation logic
*/
public interface Cancellable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you call QueryCancellable ?

@@ -77,10 +89,27 @@

private AggregatedDfs aggregatedDfs;
private QueryProfiler profiler;
private Runnable checkCancelled;
private Holder<Cancellable> cancellable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to use a custom Holder ? Now that cancellable directory reader is forked, you should be able to set the cancellable lazily ? So instead of passing the Cancellable here you'd set it in ContextIndexSearcher#setCancellable ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that we need a reference to the cancellable (the holder actually) during construction of the Reader: https://github.com/elastic/elasticsearch/pull/52822/files#diff-913ad694a7d744ee93ae1dac48d67b0eR403
So that later on with the setCancellable we can set it to the holder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alreanatively I changed so that the ContextIndexSearcher doesn't know of the holder, but then a cast to CancellableDirectoryReader is required to call the setCancellable.

}
cancellable.get().run();
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to copy

@@ -310,6 +297,8 @@ static boolean executeInternal(SearchContext searchContext) throws QueryPhaseExe
queryResult.nodeQueueSize(rExecutor.getCurrentQueueSize());
queryResult.serviceTimeEWMA((long) rExecutor.getTaskExecutionEWMA());
}
// Search phase has finished, no longer need to check for timeout
searcher.unsetCheckTimeout();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do this on the cancellable object directly. The searcher shouldn't be able to unset this way. See my comment on the context searcher.

@@ -611,14 +611,8 @@ public void testCacheAggregation() throws IOException {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this change ? It should work without this modification so I'd like to keep this for a different pr since the issue is not related to the exitable directory reader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

@@ -920,14 +920,8 @@ public void testCacheAggregation() throws IOException {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this shouldn't be needed with the extra constructor.

@@ -22,7 +22,6 @@
import org.apache.lucene.document.Document;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too ;)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional to @jimczi 's.

public void setCheckCancelled(Runnable checkCancelled) {
this.checkCancelled = checkCancelled;
public void setCancellable(Cancellable cancellable) {
this.cancellable.set(cancellable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check that the argument is not null since we seem to assume it's never null in some places?

@@ -244,7 +270,7 @@ public Scorer scorer(LeafReaderContext context) throws IOException {
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
BulkScorer in = weight.bulkScorer(context);
if (in != null) {
return new CancellableBulkScorer(in, checkCancelled);
return new CancellableBulkScorer(in, () -> cancellable.get().checkCancelled());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do the get() call once for all instead of forcing the bulk scorer to do it every time it wants to check whether the request has been cancelled?

Suggested change
return new CancellableBulkScorer(in, () -> cancellable.get().checkCancelled());
Runnable checkCancelled = cancellable.get();
return new CancellableBulkScorer(in, checkCancelled);


@Override
protected DirectoryReader doWrapDirectoryReader(DirectoryReader in) {
return in;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's throw an UnsupportedOperationException? (this is only used when asking a DirectoryReader to take into account some new changes in a directory, which should never happen with this impl)

}

private void checkAndThrowWithSampling() {
if (calls++ % MAX_CALLS_BEFORE_QUERY_TIMEOUT_CHECK == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to reduce the overhead to a minimum, we could make MAX_CALLS_BEFORE_QUERY_TIMEOUT_CHECK a power of two minus 1 (e.g. (1 << 4) - 1, ie. 15, and use a mask instead of a remainder: (calls++ & MAX_CALLS_BEFORE_QUERY_TIMEOUT_CHECK) == 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why power of two - 1 is better?

Copy link
Contributor

@jpountz jpountz Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masks are typically cheaper than remainders. And a%b is the same as a & (b-1) when a is positive and b is a power of two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the iterations @matriv, LGTM!

}

public void setProfiler(QueryProfiler profiler) {
this.profiler = profiler;
}

/**
* Set a {@link Runnable} that will be run on a regular basis while
* collecting documents.
* Add a {@link Runnable} that will be run on a regular basis while fetching document from the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/while accessing/while fetching document/ this is confusing since we don't check the cancellation when visiting the stored fields ?

@matriv
Copy link
Contributor Author

matriv commented Mar 5, 2020

@jimczi Thx you too for you patience and all the help and guidance you provided.

@matriv matriv merged commit 67acaf6 into elastic:master Mar 5, 2020
@matriv matriv deleted the impl-cancellable branch March 5, 2020 11:50
matriv added a commit that referenced this pull request Mar 5, 2020
Implement an Exitable DirectoryReader that wraps the original
DirectoryReader so that when a search task is cancelled the
DirectoryReaders also stop their work fast. This is usuful for
expensive operations like wilcard/prefix queries where the
DirectoryReaders can spend lots of time and consume resources,
as previously their work wouldn't stop even though the original
search task was cancelled (e.g. because of timeout or dropped client
connection).

(cherry picked from commit 67acaf6)
matriv added a commit to matriv/elasticsearch that referenced this pull request Mar 5, 2020
With ExitableDirectoryReader in place, check for query cancellation
during QueryPhase#preProcess where the query rewriting takes place.

Follows: elastic#52822
matriv added a commit that referenced this pull request Mar 5, 2020
With ExitableDirectoryReader in place, check for query cancellation
during QueryPhase#preProcess where the query rewriting takes place.

Follows: #52822
matriv added a commit that referenced this pull request Mar 6, 2020
With ExitableDirectoryReader in place, check for query cancellation
during QueryPhase#preProcess where the query rewriting takes place.

Follows: #52822

(cherry picked from commit 0d38626)
matriv added a commit that referenced this pull request Mar 18, 2020
The introduction of the ExitableDirectoryReader showed increase of
latencies for range queries using pointvalues.

Check for cancellation every 1024 docs instead of every 15 to lower
the impact of the check in query's performance.

Follows: #52822
Fixes: #53496
matriv added a commit that referenced this pull request Mar 18, 2020
The introduction of the ExitableDirectoryReader showed increase of
latencies for range queries using pointvalues.

Check for cancellation every 1024 docs instead of every 15 to lower
the impact of the check in query's performance.

Follows: #52822
Fixes: #53496
(cherry picked from commit 6b5fc35)
matriv added a commit to matriv/elasticsearch that referenced this pull request Mar 23, 2020
Benchmarking showed that the effect of the ExitableDirectoryReader
is reduced considerably when checking every 4095 docs. Moreover,
set the cancellable task before calling QueryPhase.preProcess()
and make sure we don't wrap with an ExitableDirectoryReader at all
when lowLevelCancellation() is set to false to avoid completely any
performance impact.

Follows: elastic#52822
Follows: elastic#53166
Follows: elastic#53496
matriv added a commit that referenced this pull request Mar 23, 2020
Benchmarking showed that the effect of the ExitableDirectoryReader
is reduced considerably when checking every 8191 docs. Moreover,
set the cancellable task before calling QueryPhase#preProcess()
and make sure we don't wrap with an ExitableDirectoryReader at all
when lowLevelCancellation is set to false to avoid completely any
performance impact.

Follows: #52822
Follows: #53166
Follows: #53496
matriv added a commit that referenced this pull request Mar 23, 2020
Benchmarking showed that the effect of the ExitableDirectoryReader
is reduced considerably when checking every 8191 docs. Moreover,
set the cancellable task before calling QueryPhase#preProcess()
and make sure we don't wrap with an ExitableDirectoryReader at all
when lowLevelCancellation is set to false to avoid completely any
performance impact.

Follows: #52822
Follows: #53166
Follows: #53496

(cherry picked from commit cdc377e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v7.7.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants