
elasticsearch highlights entire word instead of just the query when ngram filter is used #3137

Closed
ike-bloomfire opened this issue Jun 5, 2013 · 18 comments

Comments

@ike-bloomfire

When using an nGram filter on a field (or on a field in an index that has an nGram filter defined on it), if you try to highlight that field in search results, elasticsearch highlights the entire word instead of just the query.

So if I have the text "American" and I search for "rican",
highlighting should look like this ----> Ame<em>rican</em>
but instead it does this ---> <em>American</em>

To see this in action, just follow the instructions here: http://stackoverflow.com/a/15005321/141822
You get this output, which is clearly wrong:

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.71231794,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "0KyaIB8xRmqE-g0hl0ky6g",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy i7500"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy i7500</em>" ],
        "code" : [ "<em>Samsung Galaxy i7500</em>" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "vZwpcBu0QAyGmP9LHz1hUA",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy 5 Europa"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy 5 Europa</em>" ],
        "code" : [ "<em>Samsung Galaxy 5 Europa</em>" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "7sNkZAlxSlmuLZA9S68bvg",
      "_score" : 0.71231794,
      "fields" : {
        "code" : "Samsung Galaxy Mini"
      },
      "highlight" : {
        "code.ngram" : [ "<em>Samsung Galaxy Mini</em>" ],
        "code" : [ "<em>Samsung Galaxy Mini</em>" ]
      }
    } ]
  }
}

With the whitespace tokenizer (vs. the keyword tokenizer in this case), it highlights just the word containing the match, which is still not the expected behavior.
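
For anyone who does not want to follow the link, a minimal sketch of the kind of setup involved is below; the analyzer, mapping, and query values are illustrative rather than copied from the linked answer:

  # Create an index whose code.ngram sub-field is analyzed with an nGram token filter (illustrative settings)
  curl -XPUT 'localhost:9200/myindex' -d '{
    "settings": {
      "analysis": {
        "filter": {
          "my_ngram": { "type": "nGram", "min_gram": 3, "max_gram": 8 }
        },
        "analyzer": {
          "ngram_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [ "lowercase", "my_ngram" ]
          }
        }
      }
    },
    "mappings": {
      "product": {
        "properties": {
          "code": {
            "type": "multi_field",
            "fields": {
              "code": { "type": "string" },
              "ngram": { "type": "string", "index_analyzer": "ngram_analyzer", "search_analyzer": "standard" }
            }
          }
        }
      }
    }
  }'

  # Search for a partial term and ask for highlighting on both sub-fields
  curl -XGET 'localhost:9200/myindex/product/_search?pretty' -d '{
    "query": { "match": { "code.ngram": "galaxy" } },
    "fields": [ "code" ],
    "highlight": { "fields": { "code": {}, "code.ngram": {} } }
  }'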

@ike-bloomfire
Author

This problem won't exhibit itself if you do an upgrade and continue using an old index. It only happens on new indexes created in 0.90.1 or 0.90.0 ...

@s1monw
Contributor

s1monw commented Jun 6, 2013

I am afraid this is a feature, not a bug. We had to drop the behaviour in the standard ngram filter since it produces broken offsets that can lead to massive offset indices and exceptions in the highlighter. If you want to have the same behaviour as 0.20 etc., you need to map the token filter like this:

  "old_ngram": {
               "type": "ngram",
               "min_gram": 2,
               "max_gram": 2,
               "version": "4.1"
            }
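
For readers wondering where that snippet goes: the versioned filter is declared under analysis.filter and referenced from a custom analyzer in the index settings. A minimal sketch, with the analyzer name and tokenizer chosen purely for illustration:

  "settings": {
    "analysis": {
      "filter": {
        "old_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "version": "4.1"
        }
      },
      "analyzer": {
        "legacy_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "old_ngram" ]
        }
      }
    }
  }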

We are currently working on a fix in Lucene that makes this possible without tricks.

simon

@s1monw s1monw closed this as completed Jun 6, 2013
@ike-bloomfire
Author

Thanks for the explanation @s1monw! This is good to know.

I'm a little confused by the alternative you provided, specifically the ngram settings. If the min_gram and max_gram are the same number how will elasticsearch know what the real min_gram and max_gram are?

@ike-bloomfire
Author

I just figured out that it was a typo, and that you only have to specify the "version" in there. I appreciate having this fallback. Thank you very much.

So I imagine when I'm using version 4.1 of this ngram filter, all the problems of "massive offset indices and exceptions in the highlighter" will still remain?

@s1monw
Contributor

s1monw commented Jun 6, 2013

So I imagine when I'm using version 4.1 of this ngram filter, all the problems of "massive offset indices and exceptions in the highlighter" will still remain?

Kind of. We try to prevent them with some internal reordering, but ideally you should only use the deprecated stuff if you really need to. I think it's OK at this point, but once we have the improved ngram support we are working on, you should likely switch. BTW, the word delimiter filter has similar problems.

I'm a little confused by the alternative you provided, specifically the ngram settings. If the min_gram and max_gram are the same number how will elasticsearch know what the real min_gram and max_gram are?

Oh sorry, I was only pointing you towards the version. :) I'd personally use min == max if I were using it, but it's up to you.

@ike-bloomfire
Author

Kind of. We try to prevent them with some internal reordering, but ideally you should only use the deprecated stuff if you really need to.

Yeah. I tried using the version 4.1 stuff and I was still running into the out of bounds exceptions that I had with 19.2 :\

I think it's OK at this point, but once we have the improved ngram support we are working on, you should likely switch.

I was curious about this. Are you saying that the old way of highlighting will make a comeback in a future version of elasticsearch?

@s1monw
Contributor

s1monw commented Jun 11, 2013

Yeah. I tried using the version 4.1 stuff and I was still running into the out of bounds exceptions that I had with 19.2 :\

Do you use stored fields instead of _source? I fixed this recently and the fix will be in the upcoming release. Can you check the 0.90 branch on GitHub to see whether that fixes your problem, so we can make sure we have it fixed?
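
(For context, "stored fields" here means enabling store on the highlighted field in the mapping, so the highlighter reads the stored value rather than extracting it from _source. A minimal, illustrative 0.90-style mapping:)

  "mappings": {
    "product": {
      "properties": {
        "code": { "type": "string", "store": "yes" }
      }
    }
  }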

I was curious about this. Are you saying that the old way of highlighting will make a comeback in a future version of elasticsearch?

Yes, it will likely come in the next release. Given the fact that this all relies on a bug / broken behaviour this was the safest choice.

@ike-bloomfire
Author

Do you use stored fields instead of _source?

So just basically store every field that I highlight with the version 4.1 stuff, yes?

I fixed this recently and the fix will be in the upcoming release. Can you check the 0.90 branch on GitHub to see whether that fixes your problem, so we can make sure we have it fixed?

okay. So this is stuff that will be in 0.90.2 correct?

Given the fact that this all relies on a bug / broken behaviour this was the safest choice.

Can you explain this a bit for me? I wasn't sure what you were talking about here.

Yes, it will likely come in the next release

We rely heavily on elasticsearch highlighting for our typeahead stuff, so this is a big relief. I thought the current behavior was going to be permanent.

@s1monw
Contributor

s1monw commented Jun 11, 2013

So just basically store every field that I highlight with the version 4.1 stuff, yes?

Yeah, so this should be fixed.

okay. So this is stuff that will be in 0.90.2 correct?

YES!

Can you explain this a bit for me? I wasn't sure what you were talking about here.

Well, the NGram filter and tokenizer were entirely broken, causing all these StringIndexOutOfBounds exceptions, bloated term vectors, etc., so fixing the broken behaviour was the only way to go here; we deprecated the old behaviour and basically only expose it via the version setting.

We rely heavily on elasticsearch highlighting for our typeahead stuff, so this is a big relief. I thought the current behavior was going to be permanent.

We will allow this with NGramTokenizer but not with NGramTokenFilter; we are working on a pre-tokenization step that splits on whitespace, for instance. The problem here is that only a tokenizer can reliably modify token offsets without breaking the TokenStream contract in Lucene and producing broken offsets, so we can only allow this with tokenizers. If you need to have this in a filter, you need to pull it in via version, which I discourage once NGramTokenizer is fixed. I hope this helps you moving forward.
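
As a rough sketch of what the tokenizer-based setup looks like once the improved NGramTokenizer is available (names and gram sizes below are illustrative; token_chars plays the role of the character-class splitting mentioned above):

  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  }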

@BowlingX

BowlingX commented Aug 5, 2013

I ran into the same issue. Is there something else I can or should use (instead of setting the version parameter)?

@s1monw
Contributor

s1monw commented Aug 5, 2013

@BowlingX you should use NGramTokenizer instead; that should give you the partial highlighting capabilities you need. See http://www.elasticsearch.org/guide/reference/index-modules/analysis/ngram-tokenizer/
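
The difference is easy to see with the _analyze API: each gram emitted by the tokenizer carries its own start_offset/end_offset (for example, the gram "ri" in "American" reports offsets 3 and 5), whereas an ngram token filter keeps the offsets of the whole original word. A quick check against the built-in ngram tokenizer:

  curl -XGET 'localhost:9200/_analyze?tokenizer=ngram&pretty' -d 'American'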

@ariasdelrio

@s1monw I was using the 4.1 stuff for edge_ngram, but now I want to change to an NGramTokenizer. My problem is that we were using an ICU tokenizer before. Is there a way to keep this? The only option that the NGramTokenizer seems to allow is to separate words based on character classes.

@s1monw
Contributor

s1monw commented Feb 25, 2014

No, you can't really do the ICU tokenization together with the NGramTokenizer.

@mcuelenaere

@s1monw what's the status on partial highlighting with NGramTokenFilter (not NGramTokenizer)? Is there an issue to track this?

@wkiser

wkiser commented Mar 8, 2016

+1, would also like to know the status of partial highlighting using an (edge) NGramTokenFilter, or whether there are any workarounds for 2.2.0.

@clintongormley

@mcuelenaere @wkiser The contract of token filters is that they can't change positions or offsets, so an ngram (or edge ngram) token filter will always use the offsets of the whole original word.

For partial highlighting, look at using the ngram and edge ngram tokenizers instead.
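
For the typeahead case, the usual pattern is an index-time analyzer built on the edge ngram tokenizer and a plain analyzer at query time. A minimal sketch in 2.x mapping syntax, assuming a custom analyzer named "autocomplete" that uses an edge_ngram tokenizer (names are illustrative):

  "mappings": {
    "item": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }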

@wkiser

wkiser commented Mar 8, 2016

@clintongormley one of the requirements I am working with is supporting typeahead search for name synonyms. I have this working by chaining synonym and edge ngram token filters. Switching to the ngram tokenizer seems like it would mean losing typeahead on all the synonyms.

Am I out of luck with accurate highlighting here? Is there no slower, non-offset-based highlighter?

@clintongormley

@wkiser you are out of luck, as far as I'm aware.
