Token offsets are incorrect #25
Comments
Thanks for raising this issue @Trey314159. The position value plays an important role in phrase searches; I'll look into it. By the way, the output tokens should keep their accent marks: that is the goal of the plugin.
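For readers unfamiliar with why positions matter here: phrase matching consumes position values, so wrong positions silently break or loosen phrase hits. A minimal, self-contained Lucene sketch illustrating this (generic Lucene classes, not the plugin's code; `WhitespaceAnalyzer` and `MemoryIndex` are just convenient stand-ins):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.PhraseQuery;

public class PhrasePositions {
    public static void main(String[] args) {
        // Index a three-token text; WhitespaceAnalyzer assigns
        // positions 0, 1, 2 to "quick", "brown", "fox".
        MemoryIndex index = new MemoryIndex();
        index.addField("body", "quick brown fox", new WhitespaceAnalyzer());

        // An exact phrase requires the terms at adjacent positions,
        // so "quick fox" does not match...
        PhraseQuery exact = new PhraseQuery("body", "quick", "fox");
        System.out.println(index.search(exact) > 0);   // false

        // ...unless we allow one position of slop.
        PhraseQuery sloppy = new PhraseQuery(1, "body", "quick", "fox");
        System.out.println(index.search(sloppy) > 0);  // true
    }
}
```

If the analyzer emits positions that do not reflect the real token order and gaps, the slop arithmetic above is computed against the wrong distances.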
Oops, I agree that the tokens should keep their accent marks! I ran the text through a different analyzer to get the correct offsets in my sample output. I changed the types to match, but not the tokens.
FYI, I think the incorrect offsets or position values are causing StringIndexOutOfBoundsExceptions in my re-indexing script. I've posted a stack trace for one instance if you want to take a look.
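For context, this is roughly how a bad offset surfaces as that exception. The offset values below are hypothetical, not taken from the actual plugin output or the re-indexing script:

```java
public class OffsetCrash {
    public static void main(String[] args) {
        String text = "sách .. sách ; 東京都";   // 18 characters

        // Hypothetical offsets of the kind being reported: once an
        // end_offset drifts past text.length(), any consumer that
        // slices the source text with it will throw.
        int startOffset = 15;
        int endOffset = 21;   // > text.length() == 18

        // Throws StringIndexOutOfBoundsException here.
        System.out.println(text.substring(startOffset, endOffset));
    }
}
```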
I'm closing this issue; it will be fixed in the new tokenizer, #37.
I had to build the plugin myself for Elasticsearch v5.3.2.
If I analyze this string:

sách .. sách ; 東京都

I get output whose start and end offsets are incorrect and do not map back into the original string: each start_offset is just one more than the end_offset of the previous token, which is wrong whenever the gap between tokens is anything other than exactly one character.
I believe the correct output would map each token's offsets back onto its exact span in the original string. Note that the position values are also not correct.
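A quick way to verify the claim locally is to assert that each token's offsets slice back to the token itself. Here is a minimal sketch against Lucene's token-stream API, with `WhitespaceAnalyzer` standing in for the plugin's analyzer (whose registered name isn't shown in this thread); the invariant being checked is the same for any tokenizer:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetCheck {
    public static void main(String[] args) throws IOException {
        String text = "sách .. sách ; 東京都";

        try (Analyzer analyzer = new WhitespaceAnalyzer();
             TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);

            stream.reset();
            while (stream.incrementToken()) {
                int start = offsets.startOffset();
                int end = offsets.endOffset();
                String slice = text.substring(start, end);
                // With correct offsets, the slice equals the emitted token
                // (modulo any char-filter normalization).
                System.out.printf("%-6s [%2d,%2d) %s%n", term, start, end,
                        slice.contentEquals(term) ? "ok" : "MISMATCH: " + slice);
            }
            stream.end();
        }
    }
}
```

Run with the plugin's analyzer instead, any token whose offsets were derived by adding one to the previous token's end_offset, rather than from the source text, should show up as a MISMATCH row.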