Freq Terms Enum #5489
@@ -0,0 +1,265 @@

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.common.lucene.index;

import com.google.common.collect.Lists;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.lease.Releasable;
import org.elasticsearch.common.lease.Releasables;
import org.elasticsearch.common.lucene.docset.DocIdSets;
import org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter;
import org.elasticsearch.common.lucene.search.Queries;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.common.util.BytesRefHash;
import org.elasticsearch.common.util.IntArray;
import org.elasticsearch.common.util.LongArray;

import java.io.IOException;
import java.util.Comparator;
import java.util.List;

/**
 * A frequency terms enum that maintains a cache of docFreq, totalTermFreq, or both for repeated term lookup. It also
 * allows to provide a filter to explicitly compute frequencies only for docs that match the filter (heavier!).
 */
public class FreqTermsEnum extends TermsEnum implements Releasable {

    static class Holder {
        final TermsEnum termsEnum;
        @Nullable
        DocsEnum docsEnum;
        @Nullable
        final Bits bits;

        Holder(TermsEnum termsEnum, Bits bits) {
            this.termsEnum = termsEnum;
            this.bits = bits;
        }
    }

    private final static int NOT_FOUND = -2;

    private static final int INITIAL_NUM_TERM_FREQS_CACHED = 512;

    private final boolean docFreq;
    private final boolean totalTermFreq;
    private final Holder[] enums;

    private final BigArrays bigArrays;
    private IntArray termDocFreqs;
> I think we should have a baseclass that only handles DocFreq and then subclass it if we need TTF; that should remove a lot of branches here. I don't like the long methods that basically repeat code because of the TTF / DF switches. I mean, it makes sense to split it since they have a higher cost if TTF is needed though.

> That feels like a complication? I wonder what this base class would look like; it would need callbacks in several places to allow the deriving class to compute the TTF. Seems to me it would be more complicated to do so.

> I will provide you with a PR showing what I mean!

> @s1monw if you're proposing we use inheritance and you assume the base class will always be caching DF, then we could just remove all the "if (this.docFreq)" checks in the existing code as a simple way to clean things up? That would leave us with just the "if (this.totalTermFreq)" checks.

> Scrub last comment - looks like the PhraseSuggester default approach is to use TTF but not DF, so we can't assume DF is always required to be cached.

> @markharwood I thought about this and I wonder if we should get rid of all the ifs and cache both by default. We can make the decision by either supplying

> We have two modes - filtered and unfiltered. DocsEnum is only relevant in the less common filtered mode, and it looks like we already pass that choice of FLAG_FREQS/FLAG_NONE, so this would just save the if statement around the call to docsEnum.freq().

> I'm not sure about the trade-off between removing ifs and caching counts unnecessarily? Signif-terms does a lot of DF lookups and I wouldn't want to add a cache of TTF counts for each unique term when it is of no interest to me.
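The "two modes" mentioned in the thread above can be sketched outside Lucene. This is an illustrative stand-in, not the PR's code: `Posting` and the `accept` predicate are hypothetical substitutes for a Lucene `DocsEnum` and `Bits`. In the unfiltered mode the counts come straight from precomputed segment statistics, while the filtered mode must walk the postings and count only accepted docs, which is why the javadoc calls it "heavier".

```java
import java.util.List;
import java.util.function.IntPredicate;

class FreqModes {
    // Hypothetical stand-in for one posting of a term: (docId, within-doc frequency).
    record Posting(int docId, int freq) {}

    // Unfiltered mode: the per-segment statistics are already known; nothing to iterate.
    static long[] unfiltered(int segmentDocFreq, long segmentTotalTermFreq) {
        return new long[]{segmentDocFreq, segmentTotalTermFreq};
    }

    // Filtered mode (heavier): walk the postings and count only accepted docs,
    // mirroring the DocsEnum loop in FreqTermsEnum.seekExact.
    static long[] filtered(List<Posting> postings, IntPredicate accept, boolean needTotalTermFreq) {
        long docFreq = 0, totalTermFreq = 0;
        for (Posting p : postings) {
            if (!accept.test(p.docId())) {
                continue; // doc rejected by the filter, contributes nothing
            }
            docFreq++;
            if (needTotalTermFreq) {
                totalTermFreq += p.freq();
            }
        }
        return new long[]{docFreq, totalTermFreq};
    }
}
```

The `needTotalTermFreq` flag plays the role of the FLAG_FREQS/FLAG_NONE choice: when false, per-doc frequencies are never read.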
    private LongArray termsTotalFreqs;
    private BytesRefHash cachedTermOrds;

    private int currentDocFreq = 0;
    private long currentTotalTermFreq = 0;

    private BytesRef current;

    public FreqTermsEnum(IndexReader reader, String field, boolean docFreq, boolean totalTermFreq, @Nullable Filter filter, BigArrays bigArrays) throws IOException {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
        if (!docFreq && !totalTermFreq) {
            throw new ElasticsearchIllegalArgumentException("either docFreq or totalTermFreq must be true");
        }
        List<AtomicReaderContext> leaves = reader.leaves();
        List<Holder> enums = Lists.newArrayListWithExpectedSize(leaves.size());
        for (AtomicReaderContext context : leaves) {
            Terms terms = context.reader().terms(field);
            if (terms == null) {
                continue;
            }
            TermsEnum termsEnum = terms.iterator(null);
            if (termsEnum == null) {
                continue;
            }
            Bits bits = null;
            if (filter != null) {
                if (filter == Queries.MATCH_ALL_FILTER) {
                    bits = context.reader().getLiveDocs();
                } else {
                    // we want to force apply deleted docs
> I think it is confusing that deleted docs are going to be ignored if no filter is present and taken into account otherwise?

> This maintains the existing behavior and keeps the optimization of not iterating over the docs. It's more on the consumers of this class to decide if they want it or not, by providing a filter.

> Actually I was not thinking of always applying deleted docs but rather leaving this responsibility to consumers of this class: if they just want to apply their filter, they pass it as-is, and if they want to apply deleted docs, they can just wrap it with
                    filter = new ApplyAcceptedDocsFilter(filter);
                    DocIdSet docIdSet = filter.getDocIdSet(context, context.reader().getLiveDocs());
                    if (DocIdSets.isEmpty(docIdSet)) {
                        // fully filtered, none matching, no need to iterate on this
                        continue;
                    }
                    bits = DocIdSets.toSafeBits(context.reader(), docIdSet);
                }
            }
            enums.add(new Holder(termsEnum, bits));
        }
        this.bigArrays = bigArrays;

        this.enums = enums.toArray(new Holder[enums.size()]);

        if (docFreq) {
            termDocFreqs = bigArrays.newIntArray(INITIAL_NUM_TERM_FREQS_CACHED, false);
        } else {
            termDocFreqs = null;
        }
        if (totalTermFreq) {
            termsTotalFreqs = bigArrays.newLongArray(INITIAL_NUM_TERM_FREQS_CACHED, false);
        } else {
            termsTotalFreqs = null;
        }
        cachedTermOrds = new BytesRefHash(INITIAL_NUM_TERM_FREQS_CACHED, bigArrays);
    }

    @Override
    public BytesRef term() throws IOException {
        return current;
    }

    @Override
    public boolean seekExact(BytesRef text) throws IOException {
        long currentTermOrd = cachedTermOrds.add(text);
        if (currentTermOrd < 0) { // already seen, initialize instance data with the cached frequencies
            currentTermOrd = -1 - currentTermOrd;
            boolean found = true;
            if (docFreq) {
                currentDocFreq = termDocFreqs.get(currentTermOrd);
                if (currentDocFreq == NOT_FOUND) {
                    found = false;
                }
            }
            if (totalTermFreq) {
                currentTotalTermFreq = termsTotalFreqs.get(currentTermOrd);
                if (currentTotalTermFreq == NOT_FOUND) {
                    found = false;
                }
            }
            current = found ? text : null;
            return found;
        }

        boolean found = false;
        int docFreq = 0;
        long totalTermFreq = 0;
        for (Holder anEnum : enums) {
            if (!anEnum.termsEnum.seekExact(text)) {
                continue;
            }
            found = true;
            if (anEnum.bits == null) {
                docFreq += anEnum.termsEnum.docFreq();
                if (this.totalTermFreq) {
                    totalTermFreq += anEnum.termsEnum.totalTermFreq();
                }
            } else {
                DocsEnum docsEnum = anEnum.docsEnum = anEnum.termsEnum.docs(anEnum.bits, anEnum.docsEnum, this.totalTermFreq ? DocsEnum.FLAG_FREQS : DocsEnum.FLAG_NONE);
                for (int docId = docsEnum.nextDoc(); docId != DocIdSetIterator.NO_MORE_DOCS; docId = docsEnum.nextDoc()) {
                    docFreq++;
                    if (this.totalTermFreq) {
                        totalTermFreq += docsEnum.freq();
                    }
                }
            }
        }

        current = found ? text : null;
        if (this.docFreq) {
            if (!found) {
                docFreq = NOT_FOUND;
            }
            currentDocFreq = docFreq;
            termDocFreqs = bigArrays.grow(termDocFreqs, currentTermOrd + 1);
            termDocFreqs.set(currentTermOrd, docFreq);
        }
        if (this.totalTermFreq) {
            if (!found) {
                totalTermFreq = NOT_FOUND;
            } else if (totalTermFreq < 0) {
                // no freqs really..., blast
                totalTermFreq = -1;
> Different leaves might use different codecs, and some of them might support

> +1 there is a good example in BlendedTermsQuery though
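The concern raised above is that a codec which does not store term frequencies reports totalTermFreq() as -1, so when different leaves use different codecs a single frequency-less leaf should make the combined value -1 rather than silently corrupting the sum. A minimal sketch of that merge rule (a hypothetical helper, not part of the PR):

```java
class TermFreqMerge {
    // Combine per-leaf totalTermFreq values. A leaf reporting -1 means
    // "frequencies not available" for that leaf, which poisons the whole result.
    static long mergeTotalTermFreq(long[] perLeaf) {
        long total = 0;
        for (long ttf : perLeaf) {
            if (ttf == -1) {
                return -1; // at least one leaf's codec omits frequencies
            }
            total += ttf;
        }
        return total;
    }
}
```

The PR's code instead checks the summed total for negativity after the loop, which only catches the all-leaves case the reviewer is pointing at.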
            }
            currentTotalTermFreq = totalTermFreq;
            termsTotalFreqs = bigArrays.grow(termsTotalFreqs, currentTermOrd + 1);
            termsTotalFreqs.set(currentTermOrd, totalTermFreq);
        }
> For the significant terms use-case, it is not a problem to cache missing entries since it is never supposed to look up missing terms, but for more generic usage I'm wondering if we should actually cache missing entries. In addition to the fact that it might use a lot of memory if this terms enum is queried with lots of random terms, I think the current default terms index should do a good job at not looking up on disk terms that don't exist?

> I personally think we should; the use-case is the current phrase suggester, which doesn't go crazy in terms of terms but looks up missing terms over and over again. Maybe we can make it optional and default to false?

> I was thinking about it, and decided to cache missing terms per @s1monw's decision. Was thinking of making this optional, but wanted to keep the initial code simple and add more features later on.
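The cached path in seekExact relies on the BytesRefHash convention that add() returns the new ord for an unseen key and -1 - ord for a key already present, plus the NOT_FOUND (-2) sentinel so that misses are cached too, which is what the thread above is weighing. A self-contained sketch of that convention, using a plain HashMap as a stand-in for BytesRefHash and the BigArrays-backed frequency table:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrdCache {
    static final int NOT_FOUND = -2;

    private final Map<String, Integer> ords = new HashMap<>();
    private final List<Integer> docFreqs = new ArrayList<>();

    // Mirrors BytesRefHash.add: >= 0 for a newly added key, -1 - ord if already present.
    int add(String term) {
        Integer existing = ords.get(term);
        if (existing != null) {
            return -1 - existing;
        }
        int ord = ords.size();
        ords.put(term, ord);
        docFreqs.add(0); // slot grown on demand, like bigArrays.grow
        return ord;
    }

    void store(int ord, int docFreq) {
        docFreqs.set(ord, docFreq);
    }

    int get(int ord) {
        return docFreqs.get(ord);
    }
}
```

On a second lookup, `add` returns a negative encoding; decoding it with `ord = -1 - returned` recovers the slot, and a stored `NOT_FOUND` tells the caller the term was previously looked up and absent, so the segments are never walked twice for the same term.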
        return found;
    }

    @Override
    public int docFreq() throws IOException {
        return currentDocFreq;
    }

    @Override
    public long totalTermFreq() throws IOException {
        return currentTotalTermFreq;
    }

    @Override
    public boolean release() throws ElasticsearchException {
        try {
            Releasables.release(cachedTermOrds, termDocFreqs, termsTotalFreqs);
        } finally {
            cachedTermOrds = null;
            termDocFreqs = null;
            termsTotalFreqs = null;
        }
        return true;
    }

    @Override
    public void seekExact(long ord) throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public SeekStatus seekCeil(BytesRef text) throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public long ord() throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public DocsAndPositionsEnum docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse, int flags) throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public BytesRef next() throws IOException {
        throw new UnsupportedOperationException("freq terms enum");
    }

    @Override
    public Comparator<BytesRef> getComparator() {
        throw new UnsupportedOperationException("freq terms enum");
    }
}
@@ -79,6 +79,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
        for (int i = 0; i < bucketOrds.size(); i++) {
            if (spare == null) {
                spare = new SignificantStringTerms.Bucket(new BytesRef(), 0, 0, 0, 0, null);
                termsAggFactory.buildTermsEnum(context);
> Should it be called in front of the for-loop instead?

> I wanted to only call it when we create the spare, otherwise we don't need it. Same logic applies to creating the spare lazily, no?

> Maybe the code is a bit confusing, but because insertion into the priority queue is done with
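The placement question above is about lazy initialization: building the terms enum only when the first spare bucket is created means loops that never run never pay the cost. The general pattern, sketched with a hypothetical generic wrapper rather than the PR's aggregator code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class LazyOnFirstUse<T> {
    private final Supplier<T> factory;
    private T value;

    LazyOnFirstUse(Supplier<T> factory) {
        this.factory = factory;
    }

    // Build on first use only; subsequent calls reuse the cached instance.
    T get() {
        if (value == null) {
            value = factory.get();
        }
        return value;
    }
}
```

Hoisting the call in front of the loop would be simpler to read, but would construct the enum even when the loop body never executes; the trade-off is the same one made for the lazily created spare.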
            }

            bucketOrds.get(i, spare.termBytes);
> Should the class name reflect that this class does some caching?