Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRTSuggester: Support returning StoredFields with suggestions & Optional output duplication #7650

Closed
wants to merge 3 commits into from

Conversation

Projects
None yet
3 participants
@areek
Copy link
Contributor

areek commented Sep 9, 2014

NOTE: This is a PR against feature/nrt_suggester and not the master!

The ability to return field values for given storedField(s), along with returned suggestions makes the suggesters more flexible, allowing to return associated field values for every suggestion without bloating up the in-memory FST (using payload increases the FST size) or having to specify the fields to return at index-time.

Implementation details:

  • Support returning associated StoredFields of suggestions
    • Lucene docid is extracted from suggestion and used to get named StoredField value(s) at query-time
    • User can specify arbitrary number of StoredFields to be returned
    • Returned StoredField(s) can be accessed through the new XLookupResult
    • Currently supports StoredField values of type: String, BytesRef and Number
  • Optional deduplication (query time option)
  • Remove exactFirst option - simplify lookup code
  • Minor changes to TopNSearcher & how suggestions are collected
  • Introduce XLookup interface
  • Benchmarks

Benchmark Results:

  • data set: LineFileDocs
  • AnalyzingSuggester : vanilla Lucene suggester built from InputIterator, uses payload to store field value.
  • XAnalyzingSuggester : current ES suggester built with custom postings format, uses payload to store field value
  • XNRTSuggester : new NRT suggester built with custom postings format, retrieves field value from provided stored field name.
  • Summary
    • number of results returned is inversely proportional to QPS
    • longer prefix length yields to higher QPS
    • Retrieving storedField is slower than using payload, at the cost of more flexibility
    • Due to the relatively small size of the dataset used, the benchmark assumes StoredFields to be in RAM
....
-- Stored Field Retrieval Performance
  -- prefixes: 2-4, num: 2
    AnalyzingSuggester queries: 3472, time[ms]: 83 [+- 8.64], ~kQPS: 42
    XAnalyzingSuggester queries: 3472, time[ms]: 62 [+- 1.48], ~kQPS: 56
    XNRTSuggester queries: 3472, time[ms]: 174 [+- 5.04], ~kQPS: 20
  -- prefixes: 2-4, num: 4
    AnalyzingSuggester queries: 3472, time[ms]: 122 [+- 3.34], ~kQPS: 29
    XAnalyzingSuggester queries: 3472, time[ms]: 96 [+- 2.35], ~kQPS: 36
    XNRTSuggester queries: 3472, time[ms]: 307 [+- 6.29], ~kQPS: 11
  -- prefixes: 2-4, num: 6
    AnalyzingSuggester queries: 3472, time[ms]: 172 [+- 5.38], ~kQPS: 20
    XAnalyzingSuggester queries: 3472, time[ms]: 135 [+- 4.58], ~kQPS: 26
    XNRTSuggester queries: 3472, time[ms]: 425 [+- 6.98], ~kQPS: 8
  -- prefixes: 3-6, num: 2
    AnalyzingSuggester queries: 3472, time[ms]: 72 [+- 2.84], ~kQPS: 48
    XAnalyzingSuggester queries: 3472, time[ms]: 60 [+- 3.02], ~kQPS: 58
    XNRTSuggester queries: 3472, time[ms]: 169 [+- 4.06], ~kQPS: 20
  -- prefixes: 3-6, num: 4
    AnalyzingSuggester queries: 3472, time[ms]: 107 [+- 4.29], ~kQPS: 32
    XAnalyzingSuggester queries: 3472, time[ms]: 85 [+- 3.98], ~kQPS: 41
    XNRTSuggester queries: 3472, time[ms]: 266 [+- 5.58], ~kQPS: 13
  -- prefixes: 3-6, num: 6
    AnalyzingSuggester queries: 3472, time[ms]: 139 [+- 2.76], ~kQPS: 25
    XAnalyzingSuggester queries: 3472, time[ms]: 112 [+- 3.71], ~kQPS: 31
    XNRTSuggester queries: 3472, time[ms]: 350 [+- 5.82], ~kQPS: 10
  -- prefixes: 100-200, num: 2
    AnalyzingSuggester queries: 3472, time[ms]: 134 [+- 3.11], ~kQPS: 26
    XAnalyzingSuggester queries: 3472, time[ms]: 122 [+- 3.85], ~kQPS: 28
    XNRTSuggester queries: 3472, time[ms]: 191 [+- 7.18], ~kQPS: 18
  -- prefixes: 100-200, num: 4
    AnalyzingSuggester queries: 3472, time[ms]: 136 [+- 2.72], ~kQPS: 26
    XAnalyzingSuggester queries: 3472, time[ms]: 123 [+- 3.85], ~kQPS: 28
    XNRTSuggester queries: 3472, time[ms]: 191 [+- 4.50], ~kQPS: 18
  -- prefixes: 100-200, num: 6
    AnalyzingSuggester queries: 3472, time[ms]: 136 [+- 4.07], ~kQPS: 26
    XAnalyzingSuggester queries: 3472, time[ms]: 122 [+- 4.15], ~kQPS: 28
    XNRTSuggester queries: 3472, time[ms]: 191 [+- 3.81], ~kQPS: 18

 - Storage benchmark
    AnalyzingSuggester (with payload) size[B]:    1,198,856
    XAnalyzingSuggester (with payload) size[B]:      762,592
    XNRTSuggester size[B]:      774,504
...

related to #7353

areek added some commits Sep 8, 2014

NRTSuggester: Support near real-time deleted document filtering for s…
…uggestions

 - additionally adds suggester benchmark code & data

@areek areek added the review label Sep 9, 2014

@areek areek self-assigned this Sep 9, 2014

*/
public XNRTSuggester(Analyzer analyzer) {
this(analyzer, null, analyzer, EXACT_FIRST | PRESERVE_SEP, 256, -1, true, null, false, 0, SEP_LABEL, PAYLOAD_SEP, END_BYTE, HOLE_CHARACTER);

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

at some point we should think of a SuggesterConfig class ala IndexWrtierConfig these ctors are a nightmare

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

I already started playing with the idea, thanks for the pointer.

final int num = lookupOptions.num;
final AtomicReader reader = lookupOptions.reader;
final Set<String> payloadFields = lookupOptions.payloadFields;
/* DEBUG

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

unneeded ?

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

removed (reverted the XLookupOptions class)

liveDocRatio = (double) reader.numDocs() / reader.maxDoc();
} else {
*/
assert num > 0;

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

it's usually a good idea to print the actual value as the assert message it can be very helpful if it doesn't reproduce

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

added printing the actual value, thanks for suggesting

return key.bytesEquals(spare);
}
return key.bytesEquals(output2);
public List<XLookupResult> lookup(final XLookupOptions lookupOptions) {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

IMO we can name this calss just Result since it's already an inner class in XLookup no? so it has the namespace twice?! :)

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

makes sense. changed XLookupResult to Result.

num - results.size(),
getMaxTopNSearcherQueueSize(num, liveDocsRatio),
weightComparator) {
XUtil.TopNSearcher<Pair<Long,BytesRef>> searcher;

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

I love that change :) all unneeded stuff IMO

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

It simplifies the lookup method a lot!

} else {
return true;
}
} catch (IOException e) {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

IMO we should allow the interface to throw IOException that just makes sense here

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

Added IOException to collectResult (renamed from accetResult) in TopNSearcher

@@ -627,16 +484,6 @@ private int getMaxTopNSearcherQueueSize(int num, final double liveDocRatio) {
return Math.min(MAX_TOP_N_QUEUE_SIZE, (int) (maxQueueSize / liveDocRatio));
}

@Override
public boolean store(DataOutput output) throws IOException {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

nice!

public final BytesRef payload;

/** list of returned stored field values */
public final List<XStoredField> storedFields;

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

I wonder if we want to actually name this more generic. Now we have payload, stored field and maybe tomorrow we will have docValueField? I wonder if it makes sense to make this more generic and hide all of them behind the same abstraction. We also might want to parse the source in ES since we don't necessarily store the field as a seperate field. I also think that payloads might be better stored as DocValues for fast access instead of in the FST.

Thinking out loud it might be useful to just abstract this as a DocValueLookup class like this

interface DocValueLookup<T> {
  public <T> lookup(int docId, AtomicReader reader);
}

which can then be reflected as

public static final class XLookupResult<T> implements Comparable<XLookupResult> {
  ....

   public T fieldValue;
}

We can have several implementations which can be app specific etc. Maybe I am missing something here but that might work no?

This comment has been minimized.

Copy link
@areek

areek Sep 11, 2014

Author Contributor

I agree that this should be more generic. I am thinking about how to do so though, I have a few ideas. The problem with the proposed way is that one suggestion can have more than one return fields (stored/docVal/payload). And I think it makes sense to lazily lookup the return fields, by just storing the docID (maybe even the reader) for each entry in the LookupResult. I will update once I figure out a way that will work.

/**
* Encapsulates all options that are accepted by the lookup method
*/
public final static class XLookupOptions {

This comment has been minimized.

Copy link
@s1monw

s1monw Sep 10, 2014

Contributor

hmm not sure if I like this one.. To me this should be something like an enum and things like reader, num, key I'd like to treat as we treat it in IndexSearcher as method args since we need them all the time? The payloadFields can be maybe omitted if we do the DocValueLookup right?

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Sep 10, 2014

I left some comments - I like where this is going!

@s1monw s1monw removed the review label Sep 10, 2014

@areek areek force-pushed the elastic:feature/nrt_suggester branch from b1eef02 to 00b5c64 Feb 11, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.