Possible Issue with tokenization when English+Japanese are adjacent in text #116

Open

bbguitar77 opened this issue Jun 26, 2017 · 0 comments

@bbguitar77
Text => Dior化粧品等の輸入総代理店で, which, when indexed with the default Kuromoji analyzer, produces the following tokens:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品等
start: 6 end: 8 pos: 2
輸入
start: 9 end: 11 pos: 4
総
start: 11 end: 12 pos: 5
代理
start: 12 end: 14 pos: 6
店
start: 14 end: 15 pos: 7

However, we noticed that when a user searched for the term Dior化粧品, it did not produce a match (using the same analyzer settings). The reason is that the search term is tokenized as follows:

dior
start: 0 end: 4 pos: 0
化粧
start: 4 end: 6 pos: 1
品  
start: 6 end: 7 pos: 2

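For reference, the two token dumps above can be reproduced outside the index with a small Lucene program. This is only a sketch: it assumes the "default Kuromoji analyzer" corresponds to Lucene's `JapaneseAnalyzer` (which, like the Elasticsearch `kuromoji` analyzer, runs the Japanese tokenizer in search mode with default stop tags); the class name `TokenDump` and the field name `"field"` are placeholders.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TokenDump {

    // Print each token with its character offsets and token position,
    // mirroring the "start / end / pos" output quoted above.
    static void dump(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int pos = -1;
            while (ts.incrementToken()) {
                // Position increments > 1 account for tokens removed by the stop filter,
                // which is why the indexed text above jumps from pos 2 to pos 4.
                pos += posIncr.getPositionIncrement();
                System.out.printf("%s start: %d end: %d pos: %d%n",
                        term.toString(), offsets.startOffset(), offsets.endOffset(), pos);
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new JapaneseAnalyzer(); // search-mode defaults
        dump(analyzer, "Dior化粧品等の輸入総代理店で"); // indexed text
        dump(analyzer, "Dior化粧品");                  // search term
    }
}
```
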
Since the word "cosmetics" is the Japanese term 化粧品, it seems the search term was analyzed correctly, but the indexed text produced an unexpected split of 化粧 and 品等, so the query token 品 has nothing to match against.

I'm not sure whether this is a valid issue caused by the mix of English and Japanese in the text, or whether my Japanese fundamentals are off here.
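
One way to narrow this down (a diagnostic sketch, not a fix) is to tokenize both strings with Kuromoji's normal and search segmentation modes and compare the splits around 化粧品. It assumes direct access to Lucene's `JapaneseTokenizer`; the class name `ModeCheck` is a placeholder, and only the raw tokenizer is used (no lowercase or stop filters), which is enough for comparing segmentation. If both modes split the longer sentence into 化粧 + 品等, the difference would come from the surrounding context changing the best segmentation path rather than from the adjacent English token.

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ModeCheck {

    // Tokenize one string with the given segmentation mode and print the surface forms.
    static void dump(String text, Mode mode) throws Exception {
        // No user dictionary, discard punctuation.
        JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, mode);
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        StringBuilder out = new StringBuilder(mode.toString()).append(':');
        while (tokenizer.incrementToken()) {
            out.append(' ').append(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
        System.out.println(out);
    }

    public static void main(String[] args) throws Exception {
        for (Mode mode : new Mode[] { Mode.NORMAL, Mode.SEARCH }) {
            dump("Dior化粧品等の輸入総代理店で", mode); // indexed text
            dump("Dior化粧品", mode);                  // search term
        }
    }
}
```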
