Repeated Tokens Get Incorrect Offsets #29

Trey314159 · 2017-08-04T14:39:17Z

Repeated tokens get incorrect offsets, especially in the presence of extra whitespace and whitespace-like characters. I found examples with spaces and with a right-to-left mark followed by a space. I left out the right-to-left mark example because the character is invisible, but you can recreate the behavior by changing the first space to a right-to-left mark in the second or third example below. I haven't tested any other space-like characters.

No extra spaces; everything is indexed as expected:

curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 0, "end_offset" : 1, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 6, "end_offset" : 7, "type" : "<PHRASE>", "position" : 3 }
  ]
}

One extra leading space; both "b" tokens have offsets 3-4:

curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : " a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 1, "end_offset" : 2, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 3, "end_offset" : 4, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 3, "end_offset" : 4, "type" : "<PHRASE>", "position" : 3 }
  ]
}

Two extra leading spaces; both "a" tokens have offsets 2-3, and both "b" tokens have offsets 4-5:

curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "  a b a b"}'

{
  "tokens" : [
    { "token" : "a", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "b", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "a", "start_offset" : 2, "end_offset" : 3, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "b", "start_offset" : 4, "end_offset" : 5, "type" : "<PHRASE>", "position" : 3 }
  ]
}

A real-life example from Vietnamese Wikipedia, showing more long-distance duplicates. Sorry if the text doesn't make sense. I edited out a bit of Arabic script, which also contained an invisible right-to-left mark. I've added an asterisk (*) before the lines with incorrect offsets.

curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "  ULY: Lop Nahiyisi, UPNY: Lop Nah̡iyisi ? ; giản thể: 洛浦县; bính âm: Luòpǔ xiàn, Hán  Abraxas friedrichi là một loài bướm đêm trong họ Geometridae. Dữ liệu liên quan tới Abraxas friedrichi tại Wikispecies"}'

{
  "tokens" : [
    { "token" : "uly",         "start_offset" : 2, "end_offset" : 5, "type" : "<PHRASE>", "position" : 0 },
    { "token" : "lop",         "start_offset" : 7, "end_offset" : 10, "type" : "<PHRASE>", "position" : 1 },
    { "token" : "nahiyisi",    "start_offset" : 11, "end_offset" : 19, "type" : "<PHRASE>", "position" : 2 },
    { "token" : "upny",        "start_offset" : 21, "end_offset" : 25, "type" : "<PHRASE>", "position" : 3 },
*   { "token" : "lop",         "start_offset" : 7, "end_offset" : 10, "type" : "<PHRASE>", "position" : 4 },
*   { "token" : "nah",         "start_offset" : 11, "end_offset" : 14, "type" : "<FOREIGN>", "position" : 5 },
    { "token" : "̡",           "start_offset" : 34, "end_offset" : 35, "type" : "<OTHER>", "position" : 6 },
*   { "token" : "iyisi",       "start_offset" : 14, "end_offset" : 19, "type" : "<PHRASE>", "position" : 7 },
    { "token" : "giản",        "start_offset" : 45, "end_offset" : 49, "type" : "<PHRASE>", "position" : 8 },
    { "token" : "thể",         "start_offset" : 50, "end_offset" : 53, "type" : "<PHRASE>", "position" : 9 },
    { "token" : "洛浦县",        "start_offset" : 55, "end_offset" : 58, "type" : "<FOREIGN>", "position" : 10 },
    { "token" : "bính",        "start_offset" : 60, "end_offset" : 64, "type" : "<PHRASE>", "position" : 11 },
    { "token" : "âm",          "start_offset" : 65, "end_offset" : 67, "type" : "<PHRASE>", "position" : 12 },
    { "token" : "luòpǔ",       "start_offset" : 69, "end_offset" : 74, "type" : "<PHRASE>", "position" : 13 },
    { "token" : "xiàn",        "start_offset" : 75, "end_offset" : 79, "type" : "<PHRASE>", "position" : 14 },
    { "token" : "hán",         "start_offset" : 81, "end_offset" : 84, "type" : "<PHRASE>", "position" : 15 },
    { "token" : "abraxas",     "start_offset" : 86, "end_offset" : 93, "type" : "<PHRASE>", "position" : 16 },
    { "token" : "friedrichi",  "start_offset" : 94, "end_offset" : 104, "type" : "<PHRASE>", "position" : 17 },
    { "token" : "một",         "start_offset" : 108, "end_offset" : 111, "type" : "<PHRASE>", "position" : 19 },
    { "token" : "loài",        "start_offset" : 112, "end_offset" : 116, "type" : "<PHRASE>", "position" : 20 },
    { "token" : "bướm",        "start_offset" : 117, "end_offset" : 121, "type" : "<PHRASE>", "position" : 21 },
    { "token" : "đêm",         "start_offset" : 122, "end_offset" : 125, "type" : "<PHRASE>", "position" : 22 },
    { "token" : "trong",       "start_offset" : 126, "end_offset" : 131, "type" : "<PHRASE>", "position" : 23 },
    { "token" : "họ",          "start_offset" : 132, "end_offset" : 134, "type" : "<PHRASE>", "position" : 24 },
    { "token" : "geometridae", "start_offset" : 135, "end_offset" : 146, "type" : "<PHRASE>", "position" : 25 },
    { "token" : "dữ liệu",     "start_offset" : 148, "end_offset" : 155, "type" : "<PHRASE>", "position" : 26 },
    { "token" : "liên quan",   "start_offset" : 156, "end_offset" : 165, "type" : "<PHRASE>", "position" : 27 },
    { "token" : "tới",         "start_offset" : 166, "end_offset" : 169, "type" : "<PHRASE>", "position" : 28 },
*   { "token" : "abraxas",     "start_offset" : 86, "end_offset" : 93, "type" : "<PHRASE>", "position" : 29 },
*   { "token" : "friedrichi",  "start_offset" : 94, "end_offset" : 104, "type" : "<PHRASE>", "position" : 30 },
    { "token" : "wikispecies", "start_offset" : 193, "end_offset" : 204, "type" : "<PHRASE>", "position" : 32 }
  ]
}
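
For anyone who wants to check other inputs, here is a minimal Python sketch that flags suspicious offsets in an `_analyze` response. It is only an illustration: the function name `find_offset_problems` is hypothetical and not part of Elasticsearch or the plugin, and it assumes that tokens should appear in text order without overlapping and that the analyzer does nothing to the text beyond lowercasing. The sample data is copied from the example with two leading spaces above.

```python
def find_offset_problems(text, tokens):
    """Return (position, token, reason) tuples for tokens with suspicious offsets.

    Assumptions (not taken from the plugin): tokens are emitted in text order
    and never overlap, and the analyzer only lowercases the input.
    Note: Elasticsearch offsets count UTF-16 code units while Python indexes
    code points; the two agree only for text without astral characters.
    """
    problems = []
    furthest_end = 0  # largest end_offset seen so far
    for tok in tokens:
        start, end = tok["start_offset"], tok["end_offset"]
        if text[start:end].lower() != tok["token"]:
            problems.append((tok["position"], tok["token"],
                             "offsets do not cover the token text"))
        elif start < furthest_end:
            # The span matches the token, but it lies before text that was
            # already consumed, so it points at an earlier occurrence.
            problems.append((tok["position"], tok["token"],
                             "offsets point at an earlier occurrence"))
        furthest_end = max(furthest_end, end)
    return problems


# Response for the text with two leading spaces, reduced to the fields used here.
text = "  a b a b"
tokens = [
    {"token": "a", "start_offset": 2, "end_offset": 3, "position": 0},
    {"token": "b", "start_offset": 4, "end_offset": 5, "position": 1},
    {"token": "a", "start_offset": 2, "end_offset": 3, "position": 2},
    {"token": "b", "start_offset": 4, "end_offset": 5, "position": 3},
]

for position, token, reason in find_offset_problems(text, tokens):
    print(position, token, reason)
```

With this input the check reports the tokens at positions 2 and 3, which are the two repeated tokens described above. Running it on the response for the text without leading spaces reports nothing.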


duydo · 2017-08-16T10:22:00Z

I'm closing this issue; it will be fixed in the new tokenizer (#37).

duydo closed this as completed Aug 16, 2017
