
Token offsets are incorrect #25

Closed

Trey314159 opened this issue Jul 18, 2017 · 4 comments

Comments

@Trey314159

I had to build the plugin myself for Elasticsearch v5.3.2.

If I analyze the string "sách .. sách ; 東京都", I get the following:

{
  "tokens" : [
    {
      "token" : "sách",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sách",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "phrase",
      "position" : 3
    },
    {
      "token" : "東",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "residual",
      "position" : 7
    },
    {
      "token" : "京",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "residual",
      "position" : 11
    },
    {
      "token" : "都",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "residual",
      "position" : 15
    }
  ]
}

The start and end offsets are incorrect and do not map back into the original string. Each start_offset is just one more than the end_offset of the previous token, which is wrong whenever the tokens are separated by anything other than exactly one character.
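
A quick way to see the mismatch is to slice the original string at the reported offsets; a minimal plain-Java check (independent of the plugin):

public class OffsetCheck {
    public static void main(String[] args) {
        String text = "sách .. sách ; 東京都";

        // Slice the text at the offsets reported for the second "sách" token (5..9)
        System.out.println(text.substring(5, 9));    // ".. s", not "sách"

        // ... and at the offsets reported for the first residual token (10..11)
        System.out.println(text.substring(10, 11));  // "c", not "東"

        // Where those tokens actually sit in the original string
        System.out.println(text.indexOf("sách", 1)); // 8  -> should span 8..12
        System.out.println(text.indexOf("東"));       // 15 -> should span 15..16
    }
}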

I believe that this would be the correct output:

{
  "tokens" : [
    {
      "token" : "sach",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sach",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "phrase",
      "position" : 1
    },
    {
      "token" : "東",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "residual",
      "position" : 2
    },
    {
      "token" : "京",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "residual",
      "position" : 3
    },
    {
      "token" : "都",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "residual",
      "position" : 4
    }
  ]
}

Note that the position values are also not correct.
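
For reference, those offsets and positions can be reproduced by scanning the original string token by token and bumping the position by one per token; a minimal plain-Java sketch of that bookkeeping (using the accented token forms so the lookup matches the original text):

public class ExpectedOffsets {
    public static void main(String[] args) {
        String text = "sách .. sách ; 東京都";
        String[] tokens = {"sách", "sách", "東", "京", "都"};

        int searchFrom = 0;
        for (int position = 0; position < tokens.length; position++) {
            String token = tokens[position];
            int start = text.indexOf(token, searchFrom); // locate the token in the original text
            int end = start + token.length();
            System.out.printf("%s start=%d end=%d position=%d%n", token, start, end, position);
            searchFrom = end; // keep scanning after the current token
        }
    }
}

This prints start, end and position values matching the expected output above.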

@duydo
Owner

duydo commented Jul 19, 2017

Thanks for raising this issue, @Trey314159. The position value plays an important role in phrase searching; I'll look into it.

By the way, the output tokens should keep their accent marks; that is the goal of the plugin.
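
To illustrate why positions matter: a phrase query encodes the expected relative positions of its terms, so two adjacent tokens indexed with positions 0 and 3 will not match an exact phrase. A small Lucene sketch (the field name "content" is only an example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhrasePositions {
    public static void main(String[] args) {
        // This query expects the two terms at consecutive positions (0 and 1).
        // If the analyzer emits positions 0 and 3 for adjacent tokens, the query
        // only matches once a slop of at least 2 is allowed.
        PhraseQuery query = new PhraseQuery.Builder()
                .add(new Term("content", "sách"), 0)
                .add(new Term("content", "sách"), 1)
                .build();
        System.out.println(query);
    }
}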

@Trey314159
Author

Oops—I agree that the tokens should keep their accent marks! I ran the text through a different analyzer to get the correct offsets in my sample output. I changed the types to match, but not the tokens.

@Trey314159
Author

FYI, I think the incorrect offsets or position values are causing StringIndexOutOfBoundsExceptions in my re-indexing script. I've posted a stack trace for one instance if you want to take a look.
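
For what it's worth, a plausible mechanism, assuming the script slices the source text by the reported offsets: because each start_offset is derived from the previous end_offset rather than from the original text, end offsets can run past the end of the string when tokens are adjacent. A minimal plain-Java illustration:

public class OffsetOverflow {
    public static void main(String[] args) {
        String text = "東京都"; // length 3: three adjacent single-character tokens

        // With the "previous end_offset + 1" pattern the three tokens end up with
        // offsets 0..1, 2..3 and 4..5 -- the last pair points past the end of the string.
        System.out.println(text.substring(0, 1));
        System.out.println(text.substring(2, 3));
        System.out.println(text.substring(4, 5)); // throws StringIndexOutOfBoundsException
    }
}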

@duydo
Owner

duydo commented Aug 16, 2017

I'm closing this issue; it will be fixed in the new tokenizer (#37).

@duydo closed this as completed Aug 16, 2017