
Remove percolator cache #18434

Closed
wants to merge 1 commit

Conversation

Projects
None yet
4 participants
@martijnvg
Member

commented May 18, 2016

Before 5.0 it was required that the percolator queries were cached in JVM heap as Lucene queries, for two reasons:

  • Performance. The percolator evaluated every percolator query on every request; there was no pre-selection optimization that narrows execution down to candidate matches like we have today.
  • Updates made to percolator queries were visible in real time. Today these changes are visible in near real time, so updating no longer requires the percolator to keep the queries in JVM heap.
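For illustration, the pre-selection idea can be sketched as follows (toy code with made-up names, not the Elasticsearch implementation): terms are extracted from each registered query up front, and at percolate time only queries that share at least one term with the document become candidates for verification.

```java
import java.util.*;

// Toy sketch of candidate pre-selection (illustrative only; the names here
// are invented and this is not the Elasticsearch code). Terms are extracted
// from each registered query at index time; at percolate time only queries
// sharing a term with the document become candidates, and only those would
// be verified against an in-memory index.
public class PreselectSketch {
    static List<String> candidates(Map<String, Set<String>> extractedTerms,
                                   Set<String> docTerms) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : extractedTerms.entrySet()) {
            for (String term : e.getValue()) {
                if (docTerms.contains(term)) {
                    result.add(e.getKey()); // candidate: still needs verification
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> queries = new LinkedHashMap<>();
        queries.put("q1", Set.of("error", "disk"));
        queries.put("q2", Set.of("network"));
        System.out.println(candidates(queries, Set.of("disk", "full"))); // prints [q1]
    }
}
```

In the real implementation the surviving candidates would then be evaluated against a MemoryIndex holding the percolated document.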

So holding the percolator queries in JVM heap via the percolator cache is now much less attractive; when there are many percolator queries, the cache can consume many GBs of heap. Removing the percolator cache does make the percolate query slower than it was in 5.0.0-alpha1 and alpha2, but it is still significantly faster than in 2.x and before.

I think removing the percolator cache is the right tradeoff. I've been running tests with an index that contains ~35M queries: with this change the used heap space is ~250MB, without it 24GB. The query-time slowdown introduced by this change depends on how many candidate queries need to be evaluated to check whether they actually match; with some test queries I have seen a slowdown of up to 5x. Compared with how things work in 2.x and before, however, query time is still significantly better.

Also, if this gets merged we can improve other things so that the slowdown introduced by this change gets smaller:

  • For percolator queries that are a pure disjunction we know that they will match if they are a candidate match, so there is no need to evaluate these queries against the MemoryIndex.
  • Add range query extraction logic, so that we can intelligently pre-select percolator queries that contain range queries.
  • Right now we store the xcontent representation of the percolator queries. We could instead store the binary representation, which would improve query time (parsing takes less time) at the cost of bwc for stored percolator queries. The binary format of query builders has bwc within a major version, as opposed to across major versions for the xcontent format. That isn't too bad, and with the reindex API updating the percolator queries is now much more pleasant, so this could be explored as well.
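The first bullet can be illustrated with a toy sketch (invented types, not the Elasticsearch query classes): for a pure disjunction of term queries whose terms were all extracted, the candidate test and the real match test coincide, so MemoryIndex verification would be redundant.

```java
import java.util.*;

// Hedged sketch of the disjunction shortcut (toy types only). A pure
// disjunction of term queries matches a document exactly when any of its
// terms occurs in the document -- which is the same condition candidate
// selection already checks, so verification could be skipped.
public class DisjunctionShortcut {
    record OrOfTerms(Set<String> terms) {
        // the real match test: any term present in the document
        boolean matches(Set<String> docTerms) {
            return terms.stream().anyMatch(docTerms::contains);
        }
        // candidate selection uses exactly the same test for this query shape,
        // so candidate == match and no MemoryIndex pass is needed
        boolean isCandidate(Set<String> docTerms) {
            return terms.stream().anyMatch(docTerms::contains);
        }
    }

    public static void main(String[] args) {
        OrOfTerms q = new OrOfTerms(Set.of("error", "warning"));
        Set<String> doc = Set.of("warning", "disk");
        System.out.println(q.isCandidate(doc) == q.matches(doc)); // prints true
    }
}
```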

Note that the breaking part is the removal of the percolator-related statistics, which are no longer needed, from the stats and cat APIs.

@martijnvg martijnvg force-pushed the martijnvg:remove_percolator_cache branch May 18, 2016

@clintongormley

Member

commented May 18, 2016

With this change the used heap space is ~250MB and without it 24GB.

Wow!

@jpountz

Contributor

commented May 18, 2016

For percolator queries that are a conjunction query we know that they will match if they are are a candidate match.

I think you meant disjunction?

I just realized something else that we could easily improve: if the query is for instance +name:martijn #age:[20 TO 40], then we currently fail the extraction since range queries are unsupported. However, it would be correct to extract name:martijn. Maybe we should improve term extraction to catch UnsupportedQueryExceptions in the case of conjunctions, and only rethrow if no clause could be extracted?
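That fallback could look roughly like this (toy query types standing in for the Lucene queries; purely illustrative). A document matching a conjunction must match every clause, so terms extracted from the supported clauses alone still select it as a candidate; extraction only gives up if no clause at all was usable.

```java
import java.util.*;

// Sketch of the suggested conjunction fallback with made-up toy types (the
// real extraction works on Lucene queries). Unsupported clauses such as
// range queries are skipped; we only rethrow when nothing was extracted.
public class ConjunctionExtraction {
    static class UnsupportedQueryException extends RuntimeException {}

    sealed interface Query permits TermQuery, RangeQuery, AndQuery {}
    record TermQuery(String term) implements Query {}
    record RangeQuery(String field, int from, int to) implements Query {}
    record AndQuery(List<Query> clauses) implements Query {}

    static Set<String> extract(Query q) {
        if (q instanceof TermQuery t) {
            return Set.of(t.term());
        } else if (q instanceof AndQuery a) {
            Set<String> terms = new HashSet<>();
            for (Query clause : a.clauses()) {
                try {
                    terms.addAll(extract(clause));
                } catch (UnsupportedQueryException e) {
                    // skip the unsupported clause instead of failing outright
                }
            }
            if (terms.isEmpty()) {
                throw new UnsupportedQueryException(); // no clause was usable
            }
            return terms;
        }
        throw new UnsupportedQueryException(); // e.g. range queries
    }

    public static void main(String[] args) {
        // +name:martijn #age:[20 TO 40] -> extraction succeeds via the term clause
        Query q = new AndQuery(List.of(
            new TermQuery("martijn"), new RangeQuery("age", 20, 40)));
        System.out.println(extract(q)); // prints [martijn]
    }
}
```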

Back to this PR: I agree this is the right direction and we should work on improving query pre-selection rather than making verification faster at the expense of memory usage. Maybe index sorting could also help if visiting all matches is not required.

Using the binary representation of queries is indeed appealing. Hopefully this will come for free if we ever support running clusters with mixed major versions.

@nik9000

Contributor

commented May 18, 2016

I've been running some tests with an index that contains ~35M queries. With this change the used heap space is ~250MB and without it 24GB. The query-time slowdown introduced by this change depends on how many candidate queries need to be evaluated to check whether they actually match. With some test queries I have seen a slowdown of up to 5x.

These are both huge swings.

I wonder if it makes sense to use a cache with a limited size rather than pre-building all the queries? You wouldn't pay the cost of building queries that are frequently candidates. It might just be an exercise in needless bit shuffling if the hit rate isn't good though.
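Such a bounded cache could be sketched with an access-ordered LinkedHashMap (hypothetical, not code from this PR): the least recently used parsed query is evicted once the limit is hit, and a miss would re-parse the stored query source on the fly.

```java
import java.util.*;

// Hypothetical bounded cache of parsed queries, as floated above (not code
// from this PR). accessOrder=true makes iteration order least-recently-used
// first, and removeEldestEntry evicts once the configured limit is exceeded.
public class BoundedQueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedQueryCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once over the limit
    }

    public static void main(String[] args) {
        BoundedQueryCache<String, String> cache = new BoundedQueryCache<>(2);
        cache.put("q1", "parsed-q1");
        cache.put("q2", "parsed-q2");
        cache.put("q3", "parsed-q3"); // evicts q1, the least recently used
        System.out.println(cache.keySet()); // prints [q2, q3]
    }
}
```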

@martijnvg

Member Author

commented May 18, 2016

I think you meant disjunction?

Oops, yes. Fixed the description.

maybe we should improve term extraction to catch UnsupportedQueryExceptions in the case of conjunctions, and only rethrow if no clause could be extracted?

Good point. This enhancement can be added via this PR, since it is a small change.

Using the binary representation of queries is indeed appealing. Hopefully this will come for free if we ever support running clusters with mixed major versions.

I'm hesitant to change to the binary format, because we don't support this yet. So I think we should stick with the xcontent format for now. I think bigger wins can be made by improving query pre-selection than by changing to the binary format for storage.

@martijnvg

Member Author

commented May 18, 2016

I wonder if it makes sense to use a cache with a limited size rather than pre-building all the queries? You wouldn't pay the cost of building queries that are frequently candidates. It might just be an exercise in needless bit shuffling if the hit rate isn't good though.

I think a limited-size cache won't help that much, since there will always be misses (let's say we limit this cache to 5% of the available heap space). If we look at the entire percolator refactoring effort, the execution-time improvements compared to 2.x are so large that this tradeoff, using the percolator with far less memory, is justified.

@nik9000

Contributor

commented May 18, 2016

since there will always be misses

Makes sense to me.

@jpountz

Contributor

commented May 18, 2016

Good point. This enhancement can be added via this PR, since it is a small change.

Maybe a different PR would be better since this PR might be controversial, while the discussed improvement should not be?

I'm hesitant to change to the binary format, because we don't support this yet. So I think we should stick with the xcontent format for now. I think bigger wins can be made by improving query pre-selection than by changing to the binary format for storage.

+1

@martijnvg

Member Author

commented May 18, 2016

Maybe a different PR would be better since this PR might be controversial, while the discussed improvement should not be?

True, let's make this change in a different PR. This improvement doesn't depend on this PR at all.

@martijnvg martijnvg added review and removed discuss labels May 18, 2016

@jpountz

Contributor

commented May 18, 2016

I can handle the review, but I would like @s1monw and @clintongormley to confirm this is the direction we want to take as this change has important implications.

@martijnvg martijnvg added discuss and removed review labels May 19, 2016

@clintongormley

Member

commented May 19, 2016

I think removing the cache in favour of improved execution is the right direction to go.

I'm hesitant to change to the binary format, because we don't support this yet. So I think we should stick with the xcontent format for now. I think bigger wins can be made by improving query pre-selection than by changing to the binary format for storage.

I agree - this part is trickier. It did occur to me that we could (a) store the binary format and (b) fall back to rebuilding the query from source if it is not parseable on a later version (where a reindex would rebuild it in the right format). But agreed, this is more controversial and should be discussed on a different ticket.

boolean mapUnmappedFieldsAsString) {
    return ctx -> {
        LeafReader leafReader = ctx.reader();
        BinaryDocValues binaryDocValues = leafReader.getBinaryDocValues(fieldType.getQueryBuilderFieldName());

@jpountz

jpountz May 19, 2016

Contributor

Do we need to check for null? It could be null if no queries were indexed so far I believe?

@martijnvg

martijnvg May 19, 2016

Author Member

In theory it is possible, so I'll add a null check.

But I think that in practice this will not happen, since we only verify docs that have the extracted query terms fields indexed, and those docs also have a query builder doc-values field.

@jpountz

jpountz May 19, 2016

Contributor

ok

@jpountz


core/src/main/java/org/elasticsearch/index/query/PercolateQueryBuilder.java Outdated
return docId -> {
    LegacyQueryFieldVisitor visitor = new LegacyQueryFieldVisitor();
    leafReader.document(docId, visitor);
    try (XContentParser sourceParser = XContentHelper.createParser(visitor.source)) {

@jpountz

jpountz May 19, 2016

Contributor

can visitor.source be null? I guess this is not possible on a legacy index, but maybe we should check and throw e.g. an IllegalStateException with a helpful message if it is null?

@martijnvg

martijnvg May 19, 2016

Author Member

+1

@jpountz


core/src/test/java/org/elasticsearch/index/query/PercolateQueryBuilderTests.java Outdated
@@ -35,6 +41,9 @@
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.get.GetResult;
import org.elasticsearch.index.mapper.MapperService;
import org.elasticsearch.index.mapper.internal.SourceFieldMapper;
import org.elasticsearch.index.mapper.internal.TypeFieldMapper;
import org.elasticsearch.index.percolator.PercolatorFieldMapper;

@jpountz

jpountz May 19, 2016

Contributor

these look like unused imports?

@jpountz


docs/reference/query-dsl/percolate-query.asciidoc Outdated
temporary Lucene index. This in-memory index can just hold this one document and it is optimized for that. After this
a special query is build based on the terms in the in-memory index that select candidate percolator queries based on the
their indexed query terms. These queries are then eveluated by the in-memory index if they actually match.

@jpountz

jpountz May 19, 2016

Contributor

s/the their/their/
s/eveluated/evaluated/

@jpountz

Contributor

commented May 19, 2016

I left some comments about corner cases but otherwise LGTM

@martijnvg martijnvg added >enhancement and removed discuss labels May 20, 2016

@martijnvg martijnvg force-pushed the martijnvg:remove_percolator_cache branch May 20, 2016

percolator: Removed percolator cache
Before 5.0 it was required that the percolator queries were cached in JVM heap as Lucene queries, for two reasons:
1) Performance. The percolator evaluated all percolator queries all the time. There was no pre-selecting of queries that are likely to match like we have today.
2) Updates made to percolator queries were visible in real time. Today these changes are visible in near real time, so updating no longer requires the percolator to have the queries in JVM heap.

So having the percolator queries in JVM heap via the percolator cache is now less attractive. Especially when there are many percolator queries, these queries can consume many GBs of JVM heap.
Removing the percolator cache does make the percolate query slower than it was in 5.0.0-alpha1 and alpha2, but it is still faster than in 2.x and before.

@martijnvg martijnvg force-pushed the martijnvg:remove_percolator_cache branch to 064a591 May 20, 2016

@martijnvg

Member Author

commented May 20, 2016

Pushed via: 80fee86
