
Token offsets are incorrect #25

Closed

Trey314159 opened this issue Jul 18, 2017 · 4 comments

Comments

@Trey314159

I had to build the plugin myself for Elasticsearch v5.3.2.

If I analyze the string "sách .. sách ; 東京都", I get the following:

{
  "tokens" : [
    {
      "token" : "sách",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sách",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "phrase",
      "position" : 3
    },
    {
      "token" : "東",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "residual",
      "position" : 7
    },
    {
      "token" : "京",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "residual",
      "position" : 11
    },
    {
      "token" : "都",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "residual",
      "position" : 15
    }
  ]
}

The start and end offsets are incorrect and do not map back into the original string. Each start_offset is just one more than the end_offset of the previous token, which is wrong whenever the tokens are separated by anything other than exactly one character.
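
A quick way to see the mismatch is to slice the original string at the reported offsets; a minimal plain-Java check (independent of the plugin):

public class OffsetCheck {
    public static void main(String[] args) {
        String text = "sách .. sách ; 東京都";

        // Slice the text at the offsets reported for the second "sách" token (5..9)
        System.out.println(text.substring(5, 9));    // ".. s", not "sách"

        // ... and at the offsets reported for the first residual token (10..11)
        System.out.println(text.substring(10, 11));  // "c", not "東"

        // Where those tokens actually sit in the original string
        System.out.println(text.indexOf("sách", 1)); // 8  -> should span 8..12
        System.out.println(text.indexOf("東"));       // 15 -> should span 15..16
    }
}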

I believe that this would be the correct output:

{
  "tokens" : [
    {
      "token" : "sach",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "phrase",
      "position" : 0
    },
    {
      "token" : "sach",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "phrase",
      "position" : 1
    },
    {
      "token" : "東",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "residual",
      "position" : 2
    },
    {
      "token" : "京",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "residual",
      "position" : 3
    },
    {
      "token" : "都",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "residual",
      "position" : 4
    }
  ]
}

Note that the position values are also not correct.
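
For reference, those offsets and positions can be reproduced by scanning the original string token by token and bumping the position by one per token; a minimal plain-Java sketch of that bookkeeping (using the accented token forms so the lookup matches the original text):

public class ExpectedOffsets {
    public static void main(String[] args) {
        String text = "sách .. sách ; 東京都";
        String[] tokens = {"sách", "sách", "東", "京", "都"};

        int searchFrom = 0;
        for (int position = 0; position < tokens.length; position++) {
            String token = tokens[position];
            int start = text.indexOf(token, searchFrom); // locate the token in the original text
            int end = start + token.length();
            System.out.printf("%s start=%d end=%d position=%d%n", token, start, end, position);
            searchFrom = end; // keep scanning after the current token
        }
    }
}

This prints start, end and position values matching the expected output above.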

@duydo
Owner

duydo commented Jul 19, 2017

Thanks for raising this issue, @Trey314159. The position value plays an important role in phrase searching; I'll look into it.

By the way, the output tokens should keep their accent marks; that is the goal of the plugin.
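
To illustrate why positions matter: a phrase query encodes the expected relative positions of its terms, so two adjacent tokens indexed with positions 0 and 3 will not match an exact phrase. A small Lucene sketch (the field name "content" is only an example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhrasePositions {
    public static void main(String[] args) {
        // This query expects the two terms at consecutive positions (0 and 1).
        // If the analyzer emits positions 0 and 3 for adjacent tokens, the query
        // only matches once a slop of at least 2 is allowed.
        PhraseQuery query = new PhraseQuery.Builder()
                .add(new Term("content", "sách"), 0)
                .add(new Term("content", "sách"), 1)
                .build();
        System.out.println(query);
    }
}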

@Trey314159
Author

Oops—I agree that the tokens should keep their accent marks! I ran the text through a different analyzer to get the correct offsets in my sample output. I changed the types to match, but not the tokens.

@Trey314159
Author

FYI, I think the incorrect offsets or position values are causing StringIndexOutOfBoundsExceptions in my re-indexing script. I've posted a stack trace for one instance if you want to take a look.
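
For what it's worth, a plausible mechanism, assuming the script slices the source text by the reported offsets: because each start_offset is derived from the previous end_offset rather than from the original text, end offsets can run past the end of the string when tokens are adjacent. A minimal plain-Java illustration:

public class OffsetOverflow {
    public static void main(String[] args) {
        String text = "東京都"; // length 3: three adjacent single-character tokens

        // With the "previous end_offset + 1" pattern the three tokens end up with
        // offsets 0..1, 2..3 and 4..5 -- the last pair points past the end of the string.
        System.out.println(text.substring(0, 1));
        System.out.println(text.substring(2, 3));
        System.out.println(text.substring(4, 5)); // throws StringIndexOutOfBoundsException
    }
}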

@duydo
Owner

duydo commented Aug 16, 2017

I'm closing this issue; it will be fixed in the new tokenizer (#37).

@duydo closed this as completed Aug 16, 2017