elasticsearch highlights entire word instead of just the query when ngram filter is used #3137
This problem won't exhibit itself if you do an upgrade and continue using an old index. It only happens on new indexes created in 0.90.1 or 0.90.0 ...
I am afraid this is a feature, not a bug. We had to drop the behaviour in the standard ngram filter since it produces broken offsets that can lead to massive offset indices and exceptions in the highlighter. If you want the same behaviour as 0.20 etc., you need to map the token filter like this:
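The settings snippet from this comment was not preserved in the thread. Based on the follow-up comments (pinning the Lucene "version" on the filter, and using min_gram == max_gram), a reconstructed sketch of such index settings might have looked like this; the filter and analyzer names are illustrative:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "nGram",
          "version": "4.1",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram"]
        }
      }
    }
  }
}
```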
we are currently working on a fix in Lucene that makes this possible without tricks. simon
Thanks for the explanation @s1monw! This is good to know. I'm a little confused by the alternative you provided, specifically the ngram settings. If min_gram and max_gram are the same number, how will elasticsearch know what the real min_gram and max_gram are?
I just figured out that it was a typo, and that you only have to specify the "version" in there. I appreciate having this fallback. Thank you very much. So I imagine that when I'm using version 4.1 of this ngram filter, all the problems of "massive offset indices and exceptions in the highlighter" will still remain?
Kind of. We try to prevent them with some internal reordering, but ideally you should only use the deprecated stuff if you really need to. I think it's OK at this point, but once we have the improved ngram support we are working on, you should likely switch. BTW, the word delimiter filter has similar problems.
Oh sorry, I was only pointing you towards the version. :) I'd personally use min == max if I were using it, but it's up to you.
Yeah. I tried using the version 4.1 stuff and I was still running into the out-of-bounds exceptions that I had with 19.2 :\
I was curious about this. Are you saying that the old way of highlighting will make a comeback in a future version of elasticsearch?
Do you use stored fields instead of source? I fixed this lately and it will be in the upcoming release. Can you check the 0.90 branch on GitHub to see if that fixes your problem, so we can make sure we have it fixed?
Yes, it will likely come in the next release. Given the fact that this all relies on a bug / broken behaviour, this was the safest choice.
So just basically store every field that I highlight with the version 4.1 stuff, yes?
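For reference, storing a field in a 0.90-era mapping was a per-field setting. A minimal sketch, assuming a hypothetical type and field name (the `store` flag is the only part this comment is about):

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "store": "yes"
        }
      }
    }
  }
}
```

With `store: yes`, the highlighter reads the stored field value rather than re-extracting the field from `_source`, which is the code path the fix above applies to.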
Okay. So this is stuff that will be in 0.90.2, correct?
Can you explain this a bit for me? I wasn't sure what you were talking about here.
We rely heavily on elasticsearch highlighting for our typeahead stuff, so this is a big relief. I thought the current behavior was going to be permanent.
yeah so this should be fixed
YES!
Well, NGram Filter and Tokenizer were entirely broken, causing all these StringIndexOOB exceptions, bloated term vectors etc., so fixing the broken behaviour was the only way to go here, and deprecating the old behaviour, basically only exposing it via the version setting.
We will allow this with NGramTokenizer but not with NGramTokenFilter; we are working on a fix.
I ran into the same issue. Is there something else I can or should use (instead of setting the version parameter)?
@BowlingX you should use NGramTokenizer instead; that should give you the partial highlighting capabilities you need. See http://www.elasticsearch.org/guide/reference/index-modules/analysis/ngram-tokenizer/
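A minimal sketch of what such tokenizer-based settings could look like (names are illustrative; `token_chars` is the tokenizer option that splits on character classes, mentioned in the next comment):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Because the grams are produced by the tokenizer itself, each gram carries its own offsets into the original text, which is what makes partial highlighting possible.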
@s1monw I was using the 4.1 stuff for edge_ngram, but now I want to change to an NGramTokenizer. My problem is that we were using an ICU tokenizer before. Is there a way to keep this? The only option that the NGramTokenizer seems to allow is to separate words based on character classes. |
No, you can't really do the ICU tokenization together with the NGramTokenizer.
@s1monw what's the status on partial highlighting with NGramTokenFilter (not NGramTokenizer)? Is there an issue to track this?
+1, would also like to know the status of partial highlighting using an (edge) NGramTokenFilter, or if there are any workarounds for 2.2.0.
@mcuelenaere @wkiser The contract of token filters is that they can't change positions or offsets, so an ngram (or edge ngram) token filter will always use the offsets of the whole original word. For partial highlighting, look at using the ngram and edge ngram tokenizers instead. |
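The offset contract described above can be illustrated with a toy simulation. This is not elasticsearch or Lucene code; the function names and tuple shapes are purely illustrative, but the offset behaviour matches the explanation: a filter must reuse the whole token's offsets, while a tokenizer assigns per-gram offsets.

```python
def ngram_tokenizer(text, n):
    """Tokenizer: each gram gets its own (start, end) offsets into the text."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]

def ngram_token_filter(tokens, n):
    """Token filter: grams inherit the offsets of the whole original token,
    because token filters are not allowed to change positions or offsets."""
    out = []
    for term, start, end in tokens:
        for i in range(len(term) - n + 1):
            out.append((term[i:i + n], start, end))
    return out

# One token ("american", offsets 0-8) as produced by an upstream tokenizer.
word = [("american", 0, 8)]

# Every gram spans (0, 8), so a match on "rican" highlights the whole word.
print(ngram_token_filter(word, 5))

# Here "rican" carries its own offsets (3, 8), enabling partial highlighting.
print(ngram_tokenizer("american", 5))
```

This is why the recommendation in this thread is to move ngram generation from the filter into the tokenizer when partial highlighting matters.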
@clintongormley one of the requirements that I am working with is supporting typeahead search for name synonyms. I have this working by chaining synonym and edge ngram token filters. Switching to NGramTokenizer seems like I would lose typeahead on all the synonyms. Am I out of luck with accurate highlighting here? Is there no slower, non-offset-based highlighter?
@wkiser you are out of luck, as far as I'm aware.
When using an nGram filter on a field (or on a field in an index that has an nGram filter defined on it), if you try to highlight that field in search results, elasticsearch highlights the entire word instead of just the query.
So if I have the text "American" and I search for "rican",

highlighting should look like this ----> Ame**rican**

but instead it does this ---> **American**
To see this in action, just follow the instructions here: http://stackoverflow.com/a/15005321/141822 — the output you get is clearly wrong.
With the whitespace tokenizer (vs. the keyword tokenizer in this case), it highlights just the word containing the match, which is still not the expected behavior.