Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New highlighter based on lucene postings highlighter #3704

Closed
javanna opened this Issue Sep 16, 2013 · 2 comments

Comments

Projects
None yet
3 participants
@javanna
Copy link
Member

javanna commented Sep 16, 2013

Lucene includes the postings highlighter for some time now. The following are its main advantages:

  • faster than plain highlighter and requires less disk space than term_vectors, needed for the fast vector highlighter
  • outputs nice sentences as snippets
  • scores snippets based on BM25 algorithm

It's worth to try and add it as our third highlighter implementation.

@nik9000

This comment has been minimized.

Copy link
Contributor

nik9000 commented Sep 16, 2013

Neat! I'll give this a shot when I have a chance.

@javanna javanna closed this in 48ac974 Oct 24, 2013

javanna added a commit that referenced this issue Oct 24, 2013

Added third highlighter type based on lucene postings highlighter
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.

Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same

The lucene postings highlighter api is  quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.

Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.

Closes #3704
@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Oct 25, 2013

NO WAY! ;)

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Added third highlighter type based on lucene postings highlighter
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.

Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same

The lucene postings highlighter api is  quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.

Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.

Closes elastic#3704
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.