Freq Terms Enum #5489

Closed · wants to merge 2 commits
265 changes: 265 additions & 0 deletions src/main/java/org/elasticsearch/common/lucene/index/FreqTermsEnum.java
@@ -0,0 +1,265 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.common.lucene.index;

import com.google.common.collect.Lists;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.lease.Releasable;
import org.elasticsearch.common.lease.Releasables;
import org.elasticsearch.common.lucene.docset.DocIdSets;
import org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter;
import org.elasticsearch.common.lucene.search.Queries;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.common.util.BytesRefHash;
import org.elasticsearch.common.util.IntArray;
import org.elasticsearch.common.util.LongArray;

import java.io.IOException;
import java.util.Comparator;
import java.util.List;

/**
* A frequency terms enum that maintains a cache of docFreq, totalTermFreq, or both for repeated term lookups. It also
* allows providing a filter to explicitly compute frequencies only for docs that match the filter (heavier!).
*/
public class FreqTermsEnum extends TermsEnum implements Releasable {
Contributor:
Should the class name reflect that this class does some caching?


static class Holder {
final TermsEnum termsEnum;
@Nullable
DocsEnum docsEnum;
@Nullable
final Bits bits;

Holder(TermsEnum termsEnum, Bits bits) {
this.termsEnum = termsEnum;
this.bits = bits;
}
}

private final static int NOT_FOUND = -2;

private static final int INITIAL_NUM_TERM_FREQS_CACHED = 512;

private final boolean docFreq;
private final boolean totalTermFreq;
private final Holder[] enums;

private final BigArrays bigArrays;
private IntArray termDocFreqs;
Contributor:
I think we should have a base class that only handles docFreq and then subclass it if we need TTF; that should remove a lot of the branches here. I don't like the long methods that basically repeat code because of the TTF/DF switches. I mean, it makes sense to split it, since the lookups have a higher cost if TTF is needed.

Member Author:
That feels like a complication? I wonder what this base class would look like; it would need callbacks in several places to let the deriving class compute the TTF. It seems to me it would be more complicated to do it that way.

Contributor:
I will provide you with a PR showing what I mean!

Contributor:
@s1monw if you're proposing we use inheritance and you assume the base class will always be caching DF, then we could just remove all the "if (this.docFreq)" checks in the existing code as a simple way to clean things up? That would leave us with just the "if (this.totalTermFreq)" checks.

Contributor:
Scrub my last comment - it looks like the PhraseSuggester's default approach is to use TTF but not DF, so we can't assume DF is always required to be cached.

Contributor:
@markharwood I thought about this, and I wonder if we should get rid of all the ifs and cache both by default. We can make the decision by supplying DocsEnum.FLAG_FREQS if we need the TTF or DocsEnum.FLAG_NONE if we don't, then just sum it up anyway without special-casing. That way this becomes much simpler?

Contributor:
We have two modes - filtered and unfiltered. DocsEnum is only relevant in the less-common filtered mode, and it looks like we already pass that choice of FLAG_FREQS/FLAG_NONE, so this would just save the if statement around the call to docsEnum.freq().

> I wonder if we should get rid of all the if's and cache both by default.

I'm not sure about the trade-off between removing ifs and caching counts unnecessarily. Signif-terms does a lot of DF lookups, and I wouldn't want to add a cached TTF count for each unique term when it is of no interest to me.

private LongArray termsTotalFreqs;
private BytesRefHash cachedTermOrds;

private int currentDocFreq = 0;
private long currentTotalTermFreq = 0;

private BytesRef current;

public FreqTermsEnum(IndexReader reader, String field, boolean docFreq, boolean totalTermFreq, @Nullable Filter filter, BigArrays bigArrays) throws IOException {
this.docFreq = docFreq;
this.totalTermFreq = totalTermFreq;
if (!docFreq && !totalTermFreq) {
throw new ElasticsearchIllegalArgumentException("either docFreq or totalTermFreq must be true");
}
List<AtomicReaderContext> leaves = reader.leaves();
List<Holder> enums = Lists.newArrayListWithExpectedSize(leaves.size());
for (AtomicReaderContext context : leaves) {
Terms terms = context.reader().terms(field);
if (terms == null) {
continue;
}
TermsEnum termsEnum = terms.iterator(null);
if (termsEnum == null) {
continue;
}
Bits bits = null;
if (filter != null) {
if (filter == Queries.MATCH_ALL_FILTER) {
bits = context.reader().getLiveDocs();
} else {
// we want to force apply deleted docs
Contributor:
I think it is confusing that deleted docs are ignored if no filter is present but taken into account otherwise?

Member Author:
This maintains the existing behavior, and keeps the optimization of not iterating over the docs. It's up to the consumers of this class to decide whether they want it, by providing a filter.

Contributor:
Actually, I was not thinking of always applying deleted docs but rather of leaving this responsibility to consumers of this class: if they just want to apply their filter, they pass it as-is, and if they want to apply deleted docs, they can wrap it with ApplyAcceptedDocsFilter.

filter = new ApplyAcceptedDocsFilter(filter);
DocIdSet docIdSet = filter.getDocIdSet(context, context.reader().getLiveDocs());
if (DocIdSets.isEmpty(docIdSet)) {
// fully filtered, none matching, no need to iterate on this
continue;
}
bits = DocIdSets.toSafeBits(context.reader(), docIdSet);
}
}
enums.add(new Holder(termsEnum, bits));
}
this.bigArrays = bigArrays;

this.enums = enums.toArray(new Holder[enums.size()]);

if (docFreq) {
termDocFreqs = bigArrays.newIntArray(INITIAL_NUM_TERM_FREQS_CACHED, false);
} else {
termDocFreqs = null;
}
if (totalTermFreq) {
termsTotalFreqs = bigArrays.newLongArray(INITIAL_NUM_TERM_FREQS_CACHED, false);
} else {
termsTotalFreqs = null;
}
cachedTermOrds = new BytesRefHash(INITIAL_NUM_TERM_FREQS_CACHED, bigArrays);
}

@Override
public BytesRef term() throws IOException {
return current;
}

@Override
public boolean seekExact(BytesRef text) throws IOException {
long currentTermOrd = cachedTermOrds.add(text);
if (currentTermOrd < 0) { // already seen, initialize instance data with the cached frequencies
currentTermOrd = -1 - currentTermOrd;
boolean found = true;
if (docFreq) {
currentDocFreq = termDocFreqs.get(currentTermOrd);
if (currentDocFreq == NOT_FOUND) {
found = false;
}
}
if (totalTermFreq) {
currentTotalTermFreq = termsTotalFreqs.get(currentTermOrd);
if (currentTotalTermFreq == NOT_FOUND) {
found = false;
}
}
current = found ? text : null;
return found;
}

boolean found = false;
int docFreq = 0;
long totalTermFreq = 0;
for (Holder anEnum : enums) {
if (!anEnum.termsEnum.seekExact(text)) {
continue;
}
found = true;
if (anEnum.bits == null) {
docFreq += anEnum.termsEnum.docFreq();
if (this.totalTermFreq) {
totalTermFreq += anEnum.termsEnum.totalTermFreq();
}
} else {
DocsEnum docsEnum = anEnum.docsEnum = anEnum.termsEnum.docs(anEnum.bits, anEnum.docsEnum, this.totalTermFreq ? DocsEnum.FLAG_FREQS : DocsEnum.FLAG_NONE);
for (int docId = docsEnum.nextDoc(); docId != DocIdSetIterator.NO_MORE_DOCS; docId = docsEnum.nextDoc()) {
docFreq++;
if (this.totalTermFreq) {
totalTermFreq += docsEnum.freq();
}
}
}
}

current = found ? text : null;
if (this.docFreq) {
if (!found) {
docFreq = NOT_FOUND;
}
currentDocFreq = docFreq;
termDocFreqs = bigArrays.grow(termDocFreqs, currentTermOrd + 1);
termDocFreqs.set(currentTermOrd, docFreq);
}
if (this.totalTermFreq) {
if (!found) {
totalTermFreq = NOT_FOUND;
} else if (totalTermFreq < 0) {
// no freqs really..., blast
totalTermFreq = -1;
Contributor:
Different leaves might use different codecs, and some of them might support totalTermFreq while other leaves might not. So this should return -1 if any of the leaves returned -1.

Contributor:
+1 there is a good example in BlendedTermsQuery though

}
currentTotalTermFreq = totalTermFreq;
termsTotalFreqs = bigArrays.grow(termsTotalFreqs, currentTermOrd + 1);
termsTotalFreqs.set(currentTermOrd, totalTermFreq);
}
Contributor:
For the significant terms use-case, it is not a problem to cache missing entries since it is never supposed to look up missing terms, but for more generic usage I'm wondering whether we should actually cache missing entries. Besides the fact that it might use a lot of memory if this terms enum is queried with lots of random terms, I think the current default terms index should already do a good job of not looking up terms on disk that don't exist?

Contributor:
I personally think we should; the use case is the current phrase suggester, which doesn't go crazy in the number of terms but looks up missing terms over and over again. Maybe we can make it optional and default to false?

Member Author:
I was thinking about it, and decided to cache missing terms per @s1monw's reasoning. I was considering making this optional, but wanted to keep the initial code simple and add more features later on.


return found;
}

@Override
public int docFreq() throws IOException {
return currentDocFreq;
}

@Override
public long totalTermFreq() throws IOException {
return currentTotalTermFreq;
}

@Override
public boolean release() throws ElasticsearchException {
try {
Releasables.release(cachedTermOrds, termDocFreqs, termsTotalFreqs);
} finally {
cachedTermOrds = null;
termDocFreqs = null;
termsTotalFreqs = null;
}
return true;
}

@Override
public void seekExact(long ord) throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public SeekStatus seekCeil(BytesRef text) throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public long ord() throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public DocsAndPositionsEnum docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse, int flags) throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public BytesRef next() throws IOException {
throw new UnsupportedOperationException("freq terms enum");
}

@Override
public Comparator<BytesRef> getComparator() {
throw new UnsupportedOperationException("freq terms enum");
}
}
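For readers following along, the caching strategy in seekExact above can be reduced to a small standalone sketch. This is not code from the PR: TermFreqCache and its Map-backed "segments" are illustrative stand-ins for the BytesRefHash/BigArrays and per-leaf TermsEnum machinery. The idea is the same: sum a term's frequency across per-segment structures once, then serve repeated lookups, including misses (via a NOT_FOUND sentinel), straight from the cache.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of the FreqTermsEnum caching idea: compute a term's
// docFreq by summing over per-segment data once, then serve repeated
// lookups (including misses, via a NOT_FOUND sentinel) from the cache.
class TermFreqCache {
    static final int NOT_FOUND = -2; // same sentinel value the PR uses

    private final List<Map<String, Integer>> segments; // per-segment term -> docFreq
    private final Map<String, Integer> cache = new HashMap<>();

    TermFreqCache(List<Map<String, Integer>> segments) {
        this.segments = segments;
    }

    /** Sum of the term's docFreq across segments, or NOT_FOUND if absent everywhere. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached; // repeat lookup: no per-segment work at all
        }
        int df = 0;
        boolean found = false;
        for (Map<String, Integer> segment : segments) {
            Integer f = segment.get(term);
            if (f != null) {
                found = true;
                df += f;
            }
        }
        int result = found ? df : NOT_FOUND;
        cache.put(term, result); // misses are cached too, as discussed above
        return result;
    }
}
```

Whether caching misses is worthwhile is exactly the trade-off debated in the review comments: it helps a consumer like the phrase suggester that looks up the same missing terms repeatedly, but it costs memory when the enum is queried with many random terms.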
@@ -75,6 +75,7 @@ public SignificantLongTerms buildAggregation(long owningBucketOrdinal) {

if (spare == null) {
spare = new SignificantLongTerms.Bucket(0, 0, 0, 0, 0, null);
termsAggFactory.buildTermsEnum(context);
}
spare.term = bucketOrds.key(i);
spare.subsetDf = bucketDocCount(ord);
@@ -79,6 +79,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
for (int i = 0; i < bucketOrds.size(); i++) {
if (spare == null) {
spare = new SignificantStringTerms.Bucket(new BytesRef(), 0, 0, 0, 0, null);
termsAggFactory.buildTermsEnum(context);
Contributor:
Should it be called in front of the for-loop instead?

Member Author:
I wanted to only call it when we create the spare; otherwise we don't need it. The same logic applies to creating the spare lazily, no?

Contributor:
Maybe the code is a bit confusing, but because insertion into the priority queue is done with insertWithOverflow, spare == null is going to be true on the first ordered.size() iterations of the loop and false afterwards. So here termsAggFactory.buildTermsEnum(context); would not be called only once but ordered.size() times.

}

bucketOrds.get(i, spare.termBytes);
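The insertWithOverflow point raised above can be illustrated with a minimal bounded top-N queue. TopN and the int[] payload below are hypothetical stand-ins (Lucene's actual org.apache.lucene.util.PriorityQueue is not shown): the method returns null until the queue fills, so any per-spare setup guarded by spare == null runs once per queue slot, not once overall.

```java
import java.util.PriorityQueue;

// Minimal bounded top-N queue illustrating the "spare" reuse pattern:
// insertWithOverflow returns the displaced element (to be reused as the
// next spare), or null while the queue is still filling up. Code guarded
// by "spare == null" therefore runs once per queue slot, not once total.
class TopN {
    private final int capacity;
    // min-heap on score, so the weakest entry is evicted first
    private final PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));

    TopN(int capacity) {
        this.capacity = capacity;
    }

    /** Insert a candidate; returns the displaced element, or null if none. */
    int[] insertWithOverflow(int[] candidate) {
        if (queue.size() < capacity) {
            queue.add(candidate);
            return null; // still filling: caller must allocate a fresh spare
        }
        if (candidate[0] <= queue.peek()[0]) {
            return candidate; // rejected: the candidate itself becomes the spare
        }
        int[] evicted = queue.poll();
        queue.add(candidate);
        return evicted; // evicted entry becomes the reusable spare
    }
}
```

This is why, as noted in the last comment, a call placed inside the spare == null branch executes on each of the first ordered.size() iterations rather than exactly once.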