Cut over to the Lucene filter cache #10897

Merged
merged 1 commit into from May 4, 2015

Projects

None yet

4 participants

@jpountz
Contributor
jpountz commented Apr 30, 2015

This removes Elasticsearch's filter cache and uses Lucene's instead. It has some
implications:

  • custom cache keys (_cache_key) are unsupported
  • decisions are made internally and can't be overridden by users ('_cache`)
  • not only filters can be cached but also all queries that do not need scores
  • parent/child queries can now be cached, however cached entries are only
    valid for the current top-level reader so in practice it will likely only
    be used in practice on read-only indices
  • the cache deduplicates filters, which plays nicer with large keys (eg. terms)
  • better stats: we already had ram usage and evictions, but now also hit count,
    miss count, lookup count, number of cached doc id sets and current number of
    doc id sets in the cache
  • dynamically changing the filter cache size is not supported anymore

Internally, an important change is that it removes the NoCacheFilter infrastructure
in favour of making Query.rewrite specializing the query for the current reader so
that it will only be cached on this reader (look for IndexCacheableQuery).

Note that consuming filters with the query API (createWeight/scorer) instead of
the filter API (getDocIdSet) is important for parent/child queries because
otherwise a QueryWrapperFilter(ParentQuery) would run the wrapped query per
segment while relations might be cross segments.

@jpountz jpountz added the breaking label Apr 30, 2015
@jpountz
Contributor
jpountz commented Apr 30, 2015

The PR is large so feel free to ping me if you would like to discuss it.

@rmuir rmuir commented on the diff Apr 30, 2015
.../elasticsearch/common/lucene/IndexCacheableQuery.java
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.Weight;
+
+import java.io.IOException;
+import java.util.Objects;
+
+/**
+ * Base implementation for a query which is cacheable at the index level but
+ * not the segment level as usually expected.
+ */
+public abstract class IndexCacheableQuery extends Query {
+
+ private Object readerCacheKey;
+
+ @Override
+ public Query rewrite(IndexReader reader) throws IOException {
@rmuir
rmuir Apr 30, 2015 Contributor

Can this be final? If a subclass overrides this but does not call super, its caching will break.

@jpountz
jpountz Apr 30, 2015 Contributor

Some parent/child queries need to plug in other rewrite rules. I can try to make it final and add a protected doRewrite method for these queries to do what they need.

@rmuir
rmuir Apr 30, 2015 Contributor

I see. Not sure if its worth it, if doRewrite makes things complicated in its own way, given that createWeight is final and it will catch the issue.

@rmuir rmuir commented on an outdated diff Apr 30, 2015
.../org/elasticsearch/common/lucene/ShardCoreKeyMap.java
+
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReader.CoreClosedListener;
+import org.elasticsearch.index.shard.ShardId;
+import org.elasticsearch.index.shard.ShardUtils;
+
+import java.io.IOException;
+import java.util.IdentityHashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * A map between segment core cache keys and the shard that these segments
+ * belong to.
+ */
+public final class ShardCoreKeyMap {
@rmuir
rmuir Apr 30, 2015 Contributor

what is this class used for? Can we elaborate in javadocs header?

@rmuir rmuir commented on the diff Apr 30, 2015
...icsearch/indices/cache/filter/IndicesFilterCache.java
- public static final String INDICES_CACHE_FILTER_SIZE = "indices.cache.filter.size";
- public static final String INDICES_CACHE_FILTER_EXPIRE = "indices.cache.filter.expire";
- public static final String INDICES_CACHE_FILTER_CONCURRENCY_LEVEL = "indices.cache.filter.concurrency_level";
- public static final String INDICES_CACHE_FILTER_CLEAN_INTERVAL = "indices.cache.filter.clean_interval";
- public static final String INDICES_CACHE_FILTER_MINIMUM_ENTRY_WEIGHT = "indices.cache.filter.minimum_entry_weight";
+ // It's ok to not protect these callbacks by a lock since it is
+ // done in LRUQueryCache
@rmuir
rmuir Apr 30, 2015 Contributor

can we add assert Thread.holdsLock?

@rmuir
Contributor
rmuir commented Apr 30, 2015

look good.

I pushed a few cleanups.

I'm not happy with the complexity surrounding ShardCoreKeyMap etc: it is clearly not worth it.
I am happy for this to be simplified in a follow-up, but I think in general we have to stop doing
stuff like this in the name of back compat / inertia / etc. The codebase is just too large and we should
apply software engineering and keep things simple when its the right thing to do.

Related to that, I hit a test fail while playing:
mvn test -Dtests.class="*ShardCoreKeyMapTests" -Dtests.method=testBasics -Dtests.seed=D176C7AD20084F10

I think the 10 places where we do 'new IndexSearcher' should be reviewed and always have an explicit cache set
or not set in a followup as well (its rather unrelated to the change at hand).

@jpountz
Contributor
jpountz commented Apr 30, 2015

@rmuir I pushed more commits to address your comments.

I tend to agree with the complexity of ShardCoreKeyMap, it needs to be balanced with the benefit of having per-shard stats. The reason why I tried to hard to keep having per-shard stats is that doing otherwise would require an important refactoring of our monitoring APIs.

@rmuir
Contributor
rmuir commented Apr 30, 2015

Changes look great, +1

Yeah, as i said about the map, lets defer it to another issue. This one makes so much progress on its own.

@jpountz jpountz Core: Cut over to the Lucene filter cache.
This removes Elasticsearch's filter cache and uses Lucene's instead. It has some
implications:
 - custom cache keys (`_cache_key`) are unsupported
 - decisions are made internally and can't be overridden by users ('_cache`)
 - not only filters can be cached but also all queries that do not need scores
 - parent/child queries can now be cached, however cached entries are only
   valid for the current top-level reader so in practice it will likely only
   be used on read-only indices
 - the cache deduplicates filters, which plays nicer with large keys (eg. `terms`)
 - better stats: we already had ram usage and evictions, but now also hit count,
   miss count, lookup count, number of cached doc id sets and current number of
   doc id sets in the cache
 - dynamically changing the filter cache size is not supported anymore

Internally, an important change is that it removes the NoCacheFilter infrastructure
in favour of making Query.rewrite specializing the query for the current reader so
that it will only be cached on this reader (look for IndexCacheableQuery).

Note that consuming filters with the query API (createWeight/scorer) instead of
the filter API (getDocIdSet) is important for parent/child queries because
otherwise a QueryWrapperFilter(ParentQuery) would run the wrapped query per
segment while relations might be cross segments.
b72f27a
@jpountz jpountz merged commit 3409994 into elastic:master May 4, 2015

1 check passed

CLA Commit author is a member of Elasticsearch
Details
@kevinkluge kevinkluge removed the review label May 4, 2015
@jpountz jpountz deleted the jpountz:fix/nocache branch May 4, 2015
@clintongormley clintongormley changed the title from Core: Cut over to the Lucene filter cache. to Cut over to the Lucene filter cache Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment