
elasticsearch highlights entire word instead of just the query when ngram filter is used #3137

Closed
ike-bloomfire opened this issue Jun 5, 2013 · 18 comments

Comments

@ike-bloomfire

When using an nGram filter on a field (or on a field in an index that has an nGram filter defined on it), if you try to highlight that field in search results, elasticsearch highlights the entire word instead of just the query.

So if I have the text "American" and I search for "rican",
highlighting should look like this ----> Ame<em>rican</em>
but instead it does this ---> <em>American</em>

To see this in action, just follow the instructions here: http://stackoverflow.com/a/15005321/141822
You get this output, which is clearly wrong:

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.71231794,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "0KyaIB8xRmqE-g0hl0ky6g",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy i7500"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy i7500</em>" ],
        "code" : [ "<em>Samsung Galaxy i7500</em>" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "vZwpcBu0QAyGmP9LHz1hUA",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy 5 Europa"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy 5 Europa</em>" ],
        "code" : [ "<em>Samsung Galaxy 5 Europa</em>" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "7sNkZAlxSlmuLZA9S68bvg",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy Mini"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy Mini</em>" ],
        "code" : [ "<em>Samsung Galaxy Mini</em>" ]
      }
    } ]
  }
}

With the whitespace tokenizer (vs. the keyword tokenizer in this case), it highlights just the word containing the match, which is still not the expected behavior.
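
For anyone who does not want to follow the link, a minimal sketch of the kind of setup involved is below; the analyzer, mapping, and query values are illustrative rather than copied from the linked answer:

  # Create an index whose code.ngram sub-field is analyzed with an nGram token filter (illustrative settings)
  curl -XPUT 'localhost:9200/myindex' -d '{
    "settings": {
      "analysis": {
        "filter": {
          "my_ngram": { "type": "nGram", "min_gram": 3, "max_gram": 8 }
        },
        "analyzer": {
          "ngram_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [ "lowercase", "my_ngram" ]
          }
        }
      }
    },
    "mappings": {
      "product": {
        "properties": {
          "code": {
            "type": "multi_field",
            "fields": {
              "code": { "type": "string" },
              "ngram": { "type": "string", "index_analyzer": "ngram_analyzer", "search_analyzer": "standard" }
            }
          }
        }
      }
    }
  }'

  # Search for a partial term and ask for highlighting on both sub-fields
  curl -XGET 'localhost:9200/myindex/product/_search?pretty' -d '{
    "query": { "match": { "code.ngram": "galaxy" } },
    "fields": [ "code" ],
    "highlight": { "fields": { "code": {}, "code.ngram": {} } }
  }'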

@ike-bloomfire
Author

This problem won't exhibit itself if you do an upgrade and continue using an old index. It only happens on new indexes created in 0.90.1 or 0.90.0 ...

@s1monw
Contributor

s1monw commented Jun 6, 2013

I am afraid this is a feature, not a bug. We had to drop the behaviour in the standard ngram filter since it produces broken offsets that can lead to massive offset indices and exceptions in the highlighter. If you want to have the same behaviour as 0.20 etc., you need to map the token filter like this:

  "old_ngram": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 2,
               "version": "4.1"
            }
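
For readers wondering where that snippet goes: the versioned filter is declared under analysis.filter and referenced from a custom analyzer in the index settings. A minimal sketch, with the analyzer name and tokenizer chosen purely for illustration:

  "settings": {
    "analysis": {
      "filter": {
        "old_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "version": "4.1"
        }
      },
      "analyzer": {
        "legacy_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "old_ngram" ]
        }
      }
    }
  }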

We are currently working on a fix in Lucene that makes this possible without tricks.

simon

@s1monw s1monw closed this as completed Jun 6, 2013
@ike-bloomfire
Author

Thanks for the explanation @s1monw! This is good to know.

I'm a little confused by the alternative you provided, specifically the ngram settings. If the min_gram and max_gram are the same number how will elasticsearch know what the real min_gram and max_gram are?

@ike-bloomfire
Author

I just figured out that it was a typo, and that you only have to specify the "version" in there. I appreciate having this fallback. Thank you very much.

So I imagine when I'm using version 4.1 of this ngram filter, all the problems of "massive offset indices and exceptions in the highlighter" will still remain?

@s1monw
Contributor

s1monw commented Jun 6, 2013

So I imagine when I'm using version 4.1 of this ngram filter, all the problems of "massive offset indices and exceptions in the highlighter" will still remain?

Kind of. We try to prevent them with some internal reordering, but ideally you should only use the deprecated stuff if you really need to. I think it's OK at this point, but once we have the improved ngram support we are working on, you should likely switch. BTW, the word delimiter filter has similar problems.

I'm a little confused by the alternative you provided, specifically the ngram settings. If the min_gram and max_gram are the same number how will elasticsearch know what the real min_gram and max_gram are?

Oh sorry, I was only pointing you towards the version. :) I'd personally use min == max if I were using it, but it's up to you.

@ike-bloomfire
Author

Kind of. We try to prevent them with some internal reordering, but ideally you should only use the deprecated stuff if you really need to.

Yeah. I tried using the version 4.1 stuff and I was still running into the out of bounds exceptions that I had with 19.2 :\

I think it's OK at this point, but once we have the improved ngram support we are working on, you should likely switch.

I was curious about this. Are you saying that the old way of highlighting will make a comeback in a future version of elasticsearch?

@s1monw
Contributor

s1monw commented Jun 11, 2013

Yeah. I tried using the version 4.1 stuff and I was still running into the out of bounds exceptions that I had with 19.2 :\

Do you use stored fields instead of _source? I fixed this recently and the fix will be in the upcoming release. Can you check the 0.90 branch on GitHub to see whether that fixes your problem, so we can make sure we have it fixed?
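
(For context, "stored fields" here means enabling store on the highlighted field in the mapping, so the highlighter reads the stored value rather than extracting it from _source. A minimal, illustrative 0.90-style mapping:)

  "mappings": {
    "product": {
      "properties": {
        "code": { "type": "string", "store": "yes" }
      }
    }
  }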

I was curious about this. Are you saying that the old way of highlighting will make a comeback in a future version of elasticsearch?

Yes, it will likely come in the next release. Given the fact that this all relies on a bug / broken behaviour this was the safest choice.

@ike-bloomfire
Author

Do you use stored fields instead of _source?

So just basically store every field that I highlight with the version 4.1 stuff, yes?

I fixed this recently and the fix will be in the upcoming release. Can you check the 0.90 branch on GitHub to see whether that fixes your problem, so we can make sure we have it fixed?

okay. So this is stuff that will be in 0.90.2 correct?

Given the fact that this all relies on a bug / broken behaviour this was the safest choice.

Can you explain this a bit for me? I wasn't sure what you were talking about here.

Yes, it will likely come in the next release

We rely heavily on elasticsearch highlighting for our typeahead stuff, so this is a big relief. I thought the current behavior was going to be permanent.

@s1monw
Contributor

s1monw commented Jun 11, 2013

So just basically store every field that I highlight with the version 4.1 stuff, yes?

Yeah, so this should be fixed.

okay. So this is stuff that will be in 0.90.2 correct?

YES!

Can you explain this a bit for me? I wasn't sure what you were talking about here.

Well, the NGram filter and tokenizer were entirely broken, causing all these StringIndexOutOfBounds exceptions, bloated term vectors, etc., so fixing the broken behaviour was the only way to go here; we deprecated the old behaviour and basically only expose it via the version setting.

We rely heavily on elasticsearch highlighting for our typeahead stuff, so this is a big relief. I thought the current behavior was going to be permanent.

We will allow this with NGramTokenizer but not with NGramTokenFilter; we are working on a pre-tokenization step that splits on whitespace, for instance. The problem here is that only a tokenizer can reliably modify token offsets without breaking the TokenStream contract in Lucene and producing broken offsets, so we can only allow this with tokenizers. If you need to have this in a filter, you need to pull it in via version, which I discourage once NGramTokenizer is fixed. I hope this helps you moving forward.
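
As a rough sketch of what the tokenizer-based setup looks like once the improved NGramTokenizer is available (names and gram sizes below are illustrative; token_chars plays the role of the character-class splitting mentioned above):

  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  }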

@BowlingX

BowlingX commented Aug 5, 2013

I ran into the same issue. Is there something else I can or should use (instead of setting the version parameter)?

@s1monw
Contributor

s1monw commented Aug 5, 2013

@BowlingX you should use NGramTokenizer instead; that should give you the partial highlighting capabilities you need. See http://www.elasticsearch.org/guide/reference/index-modules/analysis/ngram-tokenizer/
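
The difference is easy to see with the _analyze API: each gram emitted by the tokenizer carries its own start_offset/end_offset (for example, the gram "ri" in "American" reports offsets 3 and 5), whereas an ngram token filter keeps the offsets of the whole original word. A quick check against the built-in ngram tokenizer:

  curl -XGET 'localhost:9200/_analyze?tokenizer=ngram&pretty' -d 'American'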

@ariasdelrio

@s1monw I was using the 4.1 stuff for edge_ngram, but now I want to change to an NGramTokenizer. My problem is that we were using an ICU tokenizer before. Is there a way to keep this? The only option that the NGramTokenizer seems to allow is to separate words based on character classes.

@s1monw
Contributor

s1monw commented Feb 25, 2014

No, you can't really do the ICU tokenization together with the NGramTokenizer.

@mcuelenaere

@s1monw what's the status on partial highlighting with NGramTokenFilter (not NGramTokenizer)? Is there an issue to track this?

@wkiser

wkiser commented Mar 8, 2016

+1, would also like to know the status of partial highlighting using an (edge) NGramTokenFilter, or whether there are any workarounds for 2.2.0.

@clintongormley

@mcuelenaere @wkiser The contract of token filters is that they can't change positions or offsets, so an ngram (or edge ngram) token filter will always use the offsets of the whole original word.

For partial highlighting, look at using the ngram and edge ngram tokenizers instead.
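
For the typeahead case, the usual pattern is an index-time analyzer built on the edge ngram tokenizer and a plain analyzer at query time. A minimal sketch in 2.x mapping syntax, assuming a custom analyzer named "autocomplete" that uses an edge_ngram tokenizer (names are illustrative):

  "mappings": {
    "item": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }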

@wkiser

wkiser commented Mar 8, 2016

@clintongormley one of the requirements I am working with is supporting typeahead search for name synonyms. I have this working by chaining synonym and edge ngram token filters. Switching to the ngram tokenizer seems like it would mean losing typeahead on all the synonyms.

Am I out of luck with accurate highlighting here? Is there no slower, non-offset-based highlighter?

@clintongormley

@wkiser you are out of luck, as far as I'm aware.
