Confusing behaviour in PhraseMatcher for Japanese model #4262
Comments
Hi,
Thanks for the report! I can only replicate the missing space, not the missing match, but I definitely think something is going on, because the Japanese tokenization tests are failing too. In a clean virtual environment with spaCy 2.1.8 and mecab-python3, I get this output:
My environment:
@polm, do you have any ideas?
Can you try it with mecab-python3==0.7? I think that's the version in the optional requirements; newer versions have bugs or dictionary issues.
Hi, thanks for your quick response!
Sorry, I was just going by the error message (I guess the error message should be more specific?): spacy/lang/ja/__init__.py, lines 27 to 29 at 4d4b3b0.
However, with mecab-python3==0.7 I get the same results: 4 matches and no space, and nearly all the Japanese tokenization tests fail.
@adrianeboyd Are you sure you have Unidic installed? If you're using ipadic, that would cause the tests to fail. To check, show the output of your Mecab dictionary info.

@lautel With a clean install of spaCy 2.1.8 and mecab-python3 0.7 on Linux, I get this output:
So the space issue is there, but I don't have the other mismatch issue. I'm not sure why you aren't getting the first match.

As to why the space is missing: the way Mecab handles half-width spaces is weird. I thought I'd handled it correctly, but it's possible I missed something, so I'll look over that code in more detail.
@polm, I've re-run the example to double-check the output, and yes, I'm missing the first match. Anyway, this seems to be okay with mecab-python3==0.996.2.
Thank you both!
@lautel OK, you're using an IPADic-based Neologd dictionary, which won't work. spaCy uses Universal Dependencies, whose Japanese support is based on (and only supports) Unidic. Please install Unidic and configure Mecab to use it. I would also suggest you avoid Neologd.

This is a reminder that I should probably write a dictionary sniffer to check that Unidic is being used...

The missing space was a real issue; I pushed a fix in #4284. Thanks for finding it!
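The "dictionary sniffer" idea could be as simple as classifying the dictionary path reported by Mecab's dictionary info. Here's a minimal sketch in plain Python; `dictionary_flavor` is a hypothetical helper, not part of spaCy or mecab-python3:

```python
# Hypothetical "dictionary sniffer" sketch: guess whether a Mecab
# dictionary is Unidic or IPADic-based (e.g. Neologd) from its path.
# Not spaCy code; purely an illustration of the idea.

def dictionary_flavor(dictionary_filename: str) -> str:
    """Classify a Mecab dictionary by its path or filename."""
    name = dictionary_filename.lower()
    if "unidic" in name:
        return "unidic"   # what spaCy's Japanese support expects
    if "ipadic" in name or "neologd" in name:
        return "ipadic"   # incompatible with the UD-based mapping
    return "unknown"

print(dictionary_flavor("/usr/lib/mecab/dic/unidic/sys.dic"))        # unidic
print(dictionary_flavor("/var/lib/mecab/dic/mecab-ipadic-neologd"))  # ipadic
```

In a real check you would feed this the dictionary filename reported by Mecab rather than a hard-coded path.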
How to reproduce the behaviour
0. Load libraries
1. Define a PhraseMatcher to find custom entities in Japanese text:
3. Call the function
4. Result
Output:
3 matches found!
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan
Expected output:
4 matches found!
ENTITY: Nakagawa Jun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan
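The missing first match is consistent with a tokenization mismatch: a phrase pattern can only match where the document's tokenizer produces exactly the same token sequence as the pattern. Here's a toy illustration of that principle in plain Python (not spaCy's actual PhraseMatcher implementation):

```python
# Toy phrase matcher: patterns and docs are token lists, and a pattern
# matches only where the doc contains exactly that token sequence.
# Illustrative only; spaCy's PhraseMatcher works on hashed token attrs.

def find_matches(pattern, doc_tokens):
    n = len(pattern)
    return [i for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == pattern]

pattern = ["Nakagawa", "Jun"]

# If the Japanese tokenizer segments the running text the same way as
# the pattern, the match is found...
doc_same = ["Name", ":", "Nakagawa", "Jun"]
# ...but if it segments differently (e.g. the space is lost and the
# latin text is fused), the sequences no longer line up.
doc_diff = ["Name", ":", "NakagawaJun"]

print(find_matches(pattern, doc_same))  # [2]
print(find_matches(pattern, doc_diff))  # []
```

This is why a tokenizer bug (like the space handling discussed above) can silently drop matches even when the surface text looks right.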
5. Additional info
If
it almost works as expected (note the lowercased 'nakagawa jun'; 'Japan' keeps its case). By almost, I mean it matches the name but deletes the space between the given name and the surname. See the following output:
Output:
ENTITY: nakagawajun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan
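The glued-together 'nakagawajun' is what you would see if the tokenizer discards whitespace between tokens, so the span's text can no longer be rebuilt with the original space. A minimal sketch of that failure mode (hypothetical pure Python, not the actual spaCy/Mecab code):

```python
# Illustration of the space-loss symptom: if tokens are stored without
# their trailing-whitespace info, reconstructing a span's text glues
# the tokens together. Hypothetical sketch, not real spaCy internals.

def span_text(tokens, keep_whitespace):
    if keep_whitespace:
        # Each token carries its trailing whitespace, so the original
        # surface text can be reconstructed exactly.
        return "".join(text + ws for text, ws in tokens)
    # Buggy path: whitespace between tokens is silently dropped.
    return "".join(text for text, _ in tokens)

tokens = [("nakagawa", " "), ("jun", "")]
print(span_text(tokens, keep_whitespace=True))   # nakagawa jun
print(span_text(tokens, keep_whitespace=False))  # nakagawajun
```

spaCy normally preserves trailing whitespace per token, which is why the maintainers treated the missing space as a real tokenizer bug (fixed in #4284).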
Your Environment