Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate UnifiedHighlighter #21621

Merged
merged 4 commits into from Jan 31, 2017

Conversation

@jimczi
Copy link
Member

commented Nov 17, 2016

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:

  • plain: a mode that analyzes the plain text directly
  • postings: a mode that uses the postings offsets to perform the highlight
  • fvh: a mode that uses the term vectors to perform the highlighting

Since this is a "unified" highlighter here is the complete list of highlighting features supported or not by this integration:

  • works on fields with term_vectors, offsets or simplystored(or extracted from _source).
  • force_source
  • pre_tags and post_tags
    Multiple tags are not allowed
  • encoder (html and default)
  • number_of_fragments
  • no_match_size
  • require_field_match
    Requires http://issues.apache.org/jira/browse/LUCENE-7575. Note that LUCENE-7575 is more than just require_field_match since it can select which fields to highlight
  • matched_fields
  • fragment_size
    Requires http://issues.apache.org/jira/browse/LUCENE-7565
  • boundary_chars
  • boundary_max_scan
  • Queries with rewrite that needs an index reader are ignored (MultiPhrasePrefixQuery, CommonTermsQuery, AllFieldQuery, ...).
  • Collapse contiguous highlights: <b>a</b><b>b</b><b>c</b> => <b>abc</b>

Fixes #21376

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Nov 17, 2016

Documentation is missing (which is why this is a WIP ;) )

@jpountz
Copy link
Contributor

left a comment

I left some questions/comments but this looks good to me overall.

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java Outdated
return snippets;
}

protected void append(StringBuilder dest, String content, int start, int end) {

This comment has been minimized.

Copy link
@jpountz

jpountz Nov 17, 2016

Contributor

Can you write some javadoc as to why a custom impl would want to override this?

core/src/main/java/org/elasticsearch/index/search/MatchQuery.java Outdated
if (bq.getQuery() instanceof TermQuery) {
prefixQuery.add(((TermQuery) bq.getQuery()).getTerm());
return prefixQuery;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Nov 17, 2016

Contributor

this looks like a different change?

core/src/main/java/org/elasticsearch/search/SearchModule.java Outdated
highlighters.register("unified", uh);
highlighters.register("unified_plain", uh);
highlighters.register("unified_postings", uh);
highlighters.register("unified_fvh", uh);

This comment has been minimized.

Copy link
@jpountz

jpountz Nov 17, 2016

Contributor

why do we need to register the same instance under multiple names?

This comment has been minimized.

Copy link
@jpountz

jpountz Nov 17, 2016

Contributor

oh I see then we have a switch on the highlighter type in the impl, maybe write a comment about it here?

This comment has been minimized.

Copy link
@nik9000

nik9000 Nov 17, 2016

Contributor

The experimental highlighter used another field if you were going to override. It called it "hit_source" I think. I like that kind of thing better because it makes it more clear that you are only changing where the hits are calculated from not the actual highlighter implementation.

core/src/main/java/org/elasticsearch/search/fetch/subphase/highlight/UnifiedHighlighter.java Outdated

if (field.fieldOptions().scoreOrdered()) {
//let's sort the snippets by score if needed
CollectionUtil.introSort(snippets, (o1, o2) -> (int) Math.signum(o2.getScore() - o1.getScore()));

This comment has been minimized.

Copy link
@jpountz

jpountz Nov 17, 2016

Contributor

maybe use Double.compare rather than Math.signum?

@@ -22,6 +22,7 @@
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Snippet;

This comment has been minimized.

Copy link
@javanna

javanna Nov 17, 2016

Member

leftover?

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java Outdated
1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
2) use the {@link Encoder} implementations that are already used with the other highlighters
*/
public class CustomPassageFormatter extends PassageFormatter {

This comment has been minimized.

Copy link
@javanna

javanna Nov 17, 2016

Member

is there no way to re-use the already existing CustomPassageFormatter class?

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

Looks like they have to extend different classes with the same name....

* under the License.
*/

package org.apache.lucene.search.uhighlight;

This comment has been minimized.

Copy link
@javanna

javanna Nov 17, 2016

Member

I am not a big fan of uhighlight as a package name, maybe we can simply use "unifiedhighlight", since we already have "postingshighlight"

This comment has been minimized.

Copy link
@Timothy055

Timothy055 Nov 28, 2016

We had similar thoughts (but were instead considering unifiedhighlighter as a package). Still it seemed too long.

This comment has been minimized.

Copy link
@dsmiley

dsmiley Nov 30, 2016

Contributor

Feel free to file an issue to LUCENE to rename the package to "unifiedhighlight". It's labelled "@lucene.experimental" and was just released.

@@ -31,6 +31,7 @@
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Snippet;

This comment has been minimized.

Copy link
@javanna

javanna Nov 17, 2016

Member

leftover?

@@ -19,6 +19,7 @@

package org.apache.lucene.search.postingshighlight;

import org.apache.lucene.search.highlight.Snippet;

This comment has been minimized.

Copy link
@javanna

javanna Nov 17, 2016

Member

leftover?

@javanna

This comment has been minimized.

Copy link
Member

commented Nov 17, 2016

I think docs are super important. We already have 3 highlighters and it is hard enough for users to figure out which one to use. Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

@nik9000

This comment has been minimized.

Copy link
Contributor

commented Nov 17, 2016

Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Nov 17, 2016

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also works with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice faster than the original plain highlighter on wikipedia text) but the drawbacks are the same than for the original Postings Highlighter.

@nik9000

This comment has been minimized.

Copy link
Contributor

commented Nov 17, 2016

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also works with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice faster than the original plain highlighter on wikipedia text) but the drawbacks are the same than for the original Postings Highlighter.

Ok then.....

@javanna

This comment has been minimized.

Copy link
Member

commented Nov 17, 2016

@jimczi were your benchmarks based on lucene code or elasticsearch code from this PR?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Nov 17, 2016

@javanna elasticsearch code from this PR. Well not exactly a benchmark though, only me playing with some queries ;)

@clintongormley

This comment has been minimized.

Copy link
Member

commented Nov 24, 2016

Does the unified highlighter support fragment length? I tried it out on #9442 and it gives better results than the existing highlighter, but does not appear to limit the length.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Nov 24, 2016

Does this issue apply to the unified highlighter too? #11223

@clintongormley

This comment has been minimized.

Copy link
Member

commented Nov 24, 2016

The unified highlighter doesn't seem to work with boundary characters but doesn't complain about them not being supported either. See #11777

@jimczi jimczi added v5.2.0 and removed WIP labels Dec 12, 2016

@dsmiley

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2016

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it. AFAIK merge_fields would consume the positions from multiple fields and highlight one stored field. A use-case is analyzing the same text differently, perhaps with stemming in one field but not another, and then only storing the text once.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2016

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it.

That's correct. What I meant is that LUCENE-7575 is more than require_field_match and that it allows to select which field to extract from the query. I realized now that I mixed up merged fields and matched_fields that does what you describe. I'll update the description, thanks.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 13, 2016

@elasticmachine retest this please

@clintongormley clintongormley added v5.3.0 and removed v5.2.0 labels Jan 24, 2017

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 24, 2017

I pushed another iteration now that we upgraded to the latest Lucene release. The unified highlighter now supports require_field_match. I think it is enough to merge this in 5.x.
@nik9000 can you take a look ?

@jimczi jimczi added the review label Jan 24, 2017

@jimczi jimczi requested a review from nik9000 Jan 24, 2017

@nik9000
Copy link
Contributor

left a comment

I left some comments. I'm not sure about the "rewrites with empty reader" thing. The _all query looks like it should be fine. I'm not sure about the other one. I'll have to do some more digging.

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java Outdated
import org.elasticsearch.search.fetch.subphase.highlight.HighlightUtils;

/**
Custom passage formatter that allows us to:

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

This isn't valid javadoc. I think we should also document why this has to be in the org.apache.lucene package.

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 31, 2017

Author Member

We could move it to another package, the only reason to put it here is because we already have the custom postings highlighter in the org.apache.lucene package.

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java Outdated
1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
2) use the {@link Encoder} implementations that are already used with the other highlighters
*/
public class CustomPassageFormatter extends PassageFormatter {

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

Looks like they have to extend different classes with the same name....

core/src/main/java/org/apache/lucene/search/uhighlight/CustomUnifiedHighlighter.java Outdated

/**
* Subclass of the {@link UnifiedHighlighter} that works for a single field in a single document.
* Uses a custom {@link org.apache.lucene.search.uhighlight.PassageFormatter}. Accepts field content as a constructor

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

I'd just import PassageFormatter and remove the package from the {@link bit.

core/src/main/java/org/apache/lucene/search/uhighlight/CustomUnifiedHighlighter.java Outdated
* Creates a new instance of {@link CustomUnifiedHighlighter}
*
* @param analyzer the analyzer used for the field at index time, used for multi term queries internally
* @param passageFormatter our own {@link org.apache.lucene.search.uhighlight.CustomPassageFormatter}

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

Another spot I wouldn't include the whole package for brevity.

return new MatchNoDocsQuery();
}
// if the terms does not exist we could return a MatchNoDocsQuery but this would break the unified highlighter
// which rewrites query with an empty reader.

This comment has been minimized.

Copy link
@nik9000

nik9000 Jan 31, 2017

Contributor

Is this just going to make queries again _all a little less efficient what _all is disabled? If so I think that is ok.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2017

Thanks @nik9000
I pushed some change to address your comments.
Regarding the "rewrites with empty reader" thing, this is required because the unified highlighter uses the rewritten query to build the automaton for the re-analysis strategy.

@nik9000
Copy link
Contributor

left a comment

Yeah. I think it is time we get this in so folks can experiment with it.

jimczi added some commits Nov 17, 2016

Integrate UnifiedHighlighter
This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:
  * plain: a mode that analyzes the plain text directly
  * postings: a mode that uses the postings offsets to perform the highlight
  * fvh: a mode that uses the term vectors to perform the highlighting
By default the mode is choosen automatically depending on the type of the field.
Currently it supports the following options:
* `force_source`
* `encoder`
* `highlight_query`
* `pre_tags and `post_tags`

@jimczi jimczi merged commit f6d38d4 into elastic:master Jan 31, 2017

1 check was pending

elasticsearch-ci Build started sha1 is merged.
Details

@jimczi jimczi deleted the jimczi:unified_highlighter branch Jan 31, 2017

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2017

Thanks @nik9000 !
Now merging to 5.x

jimczi added a commit that referenced this pull request Jan 31, 2017

Integrate UnifiedHighlighter (#21621)
* Integrate UnifiedHighlighter

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text.
The best strategy is picked automatically at query time and depends on the field and the query to highlight.
@intrafindBreno

This comment has been minimized.

Copy link

commented Feb 21, 2017

The current integration doesn't seem to support setting the maxLength field in UnifiedHighlighter. Is this correct, or did I miss something?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Feb 22, 2017

That's correct @fariaintrafind , the UnifiedHighlighter uses a sentence break iterator that does not take the maxFragmentLength in account. Though we plan to support this setting soon, it's on my todos ;)

@lami02

This comment has been minimized.

Copy link

commented Apr 4, 2017

I have tested the new highlighter and it works great so far. The only setting that does not seem to work (except from the settings that are not supported yet) is no_match_size. The number_of_fragments setting doesn't seem to do anything either, but I suspect thats because it is not possible to set a fragment_size yet.

@bleskes

This comment has been minimized.

Copy link
Member

commented Apr 4, 2017

@lami02 thanks for trying it out! do you mind opening a new issue with the errors you found? a concise reproduction with API calls will be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.