
Confusing behaviour in Phrase Matcher for Japanese model #4262

Closed
lautel opened this issue Sep 9, 2019 · 9 comments
Labels
bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer) · lang / ja (Japanese language data and models) · third-party (Third-party packages and services)

Comments

@lautel

lautel commented Sep 9, 2019

How to reproduce the behaviour

1. Load libraries

from spacy.matcher import PhraseMatcher
from spacy.lang.ja import Japanese

2. Define a Phrase Matcher to find custom entities in Japanese text:

def phrase_matcher_test(text):

    nlp = Japanese()

    # attr="LOWER" matches on the lowercased form of each token
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns = ["nakagawa jun", "中川 潤", "nakagawa@xxxx.jp", "japan"]
    patterns_doc = list(nlp.pipe(patterns))
    matcher.add("ENTITY", None, *patterns_doc)

    doc = nlp(text)

    matches = matcher(doc)
    print(f'\n{len(matches)} matches found!')

    for match_id, start, end in matches:
        print(doc.vocab.strings[match_id]+': ', doc[start:end].text)

3. Call the function

input_text = "Nakagawa Jun (中川潤) のメールはnakagawa@xxxx.jpです. 彼はJapanで働いています"
phrase_matcher_test(input_text)

4. Result

Output:

3 matches found!
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan

Expected output:

4 matches found!
ENTITY: Nakagawa Jun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan

5. Additional info

If

input_text = "nakagawa jun (中川潤) のメールはnakagawa@xxxx.jpです. 彼はJapanで働いています"

it almost works as expected (note the lowercase 'nakagawa jun', while 'Japan' keeps its capital). By almost I mean that it matches the name but drops the blank space between the first and last name. See the following output.

Output:

ENTITY: nakagawajun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan
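For context on the casing: `attr="LOWER"` makes the PhraseMatcher compare the lowercased text of each token, which is why the all-lowercase pattern "japan" still matches the token "Japan", and the span text keeps the document's original casing. A minimal pure-Python sketch of that idea (token lists stand in for tokenized Docs; this is an illustration, not spaCy's implementation):

```python
def match_phrases_lower(doc_tokens, patterns):
    """Find token-level pattern occurrences, comparing lowercased text.

    doc_tokens: list of token strings; patterns: list of token-string lists.
    Returns (start, end) spans, mirroring how attr="LOWER" matching behaves.
    """
    lowered = [t.lower() for t in doc_tokens]
    matches = []
    for pattern in patterns:
        pat = [t.lower() for t in pattern]
        n = len(pat)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == pat:
                matches.append((i, i + n))
    return sorted(matches)

# Hypothetical tokenization of the example sentence
doc = ["Nakagawa", "Jun", "(", "中川潤", ")", "の", "メール", "は",
       "nakagawa@xxxx.jp", "です", "。", "彼", "は", "Japan", "で", "働いています"]
patterns = [["nakagawa", "jun"], ["中川潤"], ["nakagawa@xxxx.jp"], ["japan"]]
for start, end in match_phrases_lower(doc, patterns):
    # Prints the four expected ENTITY spans, including "Nakagawa Jun"
    print("ENTITY:", " ".join(doc[start:end]))
```

Comparison is on lowercased text, so the original case of the document tokens never prevents a match.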

Your Environment

  • spaCy version: 2.1.8
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.7.3
@lautel changed the title from "Bad behaviour in Phrase Matcher for Japanese model" to "Confussing behaviour in Phrase Matcher for Japanese model" on Sep 9, 2019
@ines added the labels bug, feat / tokenizer, lang / ja, and third-party on Sep 9, 2019
@lautel
Author

lautel commented Sep 11, 2019

Hi,
We've kept running tests, and the given example works fine in spaCy version 2.1.4. Hope you find this helpful.

@adrianeboyd
Contributor

Thanks for the report! I can only replicate the missing space, not the missing match, but I definitely think there is something going on because the Japanese tokenization tests are failing, too.

With a clean virtual environment with spacy 2.1.8 and mecab-python3, I get the output:

4 matches found!
ENTITY:  NakagawaJun
ENTITY:  中川潤
ENTITY:  nakagawa@xxxx.jp
ENTITY:  Japan

My environment:

blis==0.2.4
certifi==2019.9.11
chardet==3.0.4
cymem==2.0.2
idna==2.8
mecab-python3==0.996.2
murmurhash==1.0.2
numpy==1.17.2
pkg-resources==0.0.0
plac==0.9.6
preshed==2.0.1
requests==2.22.0
spacy==2.1.8
srsly==0.1.0
thinc==7.0.8
tqdm==4.35.0
urllib3==1.25.3
wasabi==0.2.2

@polm, do you have any ideas?

@polm
Contributor

polm commented Sep 12, 2019

Can you try it with mecab-python3==0.7? I think that should be the version in optional requires, newer versions have bugs or dictionary issues.

@lautel
Author

lautel commented Sep 12, 2019

Hi, thanks for your quick response!
So, working with mecab-python3==0.996.2 I get the same output as @adrianeboyd, in both Windows and Linux environments. However, I previously had mecab-python3==0.7 installed, which is why the first match (NakagawaJun) was missing...

@adrianeboyd
Contributor

Sorry, I was just going by the error message (I guess the error message should be more specific?):

raise ImportError(
    "Japanese support requires MeCab: "
    "https://github.com/SamuraiT/mecab-python3"
)

However, I get the same results with 4 matches and no space with mecab-python3==0.7 and nearly all the tests in tests/lang/ja fail. A few examples are tokenized correctly, some aren't, and I think all the POS and lemma tests fail.

@polm
Contributor

polm commented Sep 13, 2019

@adrianeboyd Are you sure you have Unidic installed? If you're using ipadic that would cause the tests to fail. To check, show the output of echo "図書館" | mecab and mecab -D.

@lautel With a clean install of spaCy 2.1.8 and mecab-python3 0.7 on Linux I get this output:

4 matches found!
ENTITY:  NakagawaJun
ENTITY:  中川潤
ENTITY:  nakagawa@xxxx.jp
ENTITY:  Japan

So the space issue is there, but I don't have the other mismatch issue. I'm not sure why you aren't getting Nakagawa Jun as an entity, can you figure out what POS tags it's getting?

As to why the space is missing, the way Mecab handles half-width spaces is weird. I thought I'd handled it correctly but it's possible I missed something, so I'll look over that code in more detail.
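For background on why a lost space changes the matched text: a spaCy Doc records, for each token, whether it is followed by whitespace, and span text is rebuilt from those flags. A hypothetical sketch of the failure mode (`span_text` is an illustrative helper, not spaCy code): if the MeCab wrapper discards the whitespace information, the reconstructed span loses the space.

```python
def span_text(words, spaces, start, end):
    """Rebuild the surface text of a span from token strings plus
    trailing-space flags, the way a spaCy Doc reconstructs span text."""
    parts = []
    for word, has_space in zip(words[start:end], spaces[start:end]):
        parts.append(word + (" " if has_space else ""))
    return "".join(parts).rstrip()

words = ["Nakagawa", "Jun", "(", "中川潤", ")"]
correct_spaces = [True, True, False, False, False]  # half-width space preserved
buggy_spaces = [False, False, False, False, False]  # whitespace info dropped

print(span_text(words, correct_spaces, 0, 2))  # Nakagawa Jun
print(span_text(words, buggy_spaces, 0, 2))    # NakagawaJun
```

The tokens themselves are identical in both cases; only the per-token space flags differ, which matches the symptom in the outputs above.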

@lautel
Author

lautel commented Sep 13, 2019

@polm, I've re-run the example to double-check the output and yes, I'm missing the first match. Anyway, this seems to be fine with mecab-python3==0.996.2.
In case you find it useful, see below further information regarding my environment (Linux):

python -m spacy info --markdown

  • spaCy version: 2.1.8
  • Platform: Linux-4.9.184-linuxkit-x86_64-with-debian-9.9
  • Python version: 3.7.4

echo "図書館" | mecab

図書館  名詞,一般,*,*,*,*,図書館,トショカン,トショカン
EOS

mecab -D

filename:       /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic
version:        102
charset:        UTF8
type:   0
size:   4587310
left size:      1316
right size:     1316

pip list

Package       Version
------------- ---------
blis          0.2.4
certifi       2019.9.11
chardet       3.0.4
cymem         2.0.2
idna          2.8
mecab-python3 0.7
mojimoji      0.0.9
murmurhash    1.0.2
numpy         1.17.2
pip           19.1.1
plac          0.9.6
preshed       2.0.1
requests      2.22.0
setuptools    41.0.1
spacy         2.1.8
srsly         0.1.0
thinc         7.0.8
tqdm          4.35.0
urllib3       1.25.3
wasabi        0.2.2
wheel         0.33.4

Thank you both!

@polm
Contributor

polm commented Sep 13, 2019

@lautel OK, you're using an IPADic based Neologd dictionary, which won't work. spaCy uses Universal Dependencies, which is based on / only supports Unidic. Please install Unidic and configure Mecab to use it. I would also suggest you avoid using Neologd. This is a reminder that I should probably write a dictionary sniffer to check that Unidic is being used...
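Such a sniffer could, for instance, parse the `mecab -D` output for the dictionary path. A hypothetical sketch (the function name and the substring heuristic are assumptions, not anything shipped with spaCy):

```python
import subprocess

def detect_mecab_dictionary(dump=None):
    """Guess the MeCab dictionary family from `mecab -D` output.

    If `dump` is None, run `mecab -D`; otherwise parse the given text.
    Returns "unidic", "ipadic", or "unknown" based on the dictionary path.
    """
    if dump is None:
        dump = subprocess.run(["mecab", "-D"], capture_output=True,
                              text=True, check=True).stdout
    for line in dump.splitlines():
        if line.startswith("filename:"):
            path = line.split(":", 1)[1].strip().lower()
            if "unidic" in path:
                return "unidic"
            if "ipadic" in path:
                return "ipadic"
    return "unknown"

# The `mecab -D` dump from this thread would be flagged as IPADic-based:
sample = "filename:\t/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic"
print(detect_mecab_dictionary(sample))  # ipadic
```

A check like this, run at tokenizer initialization, could warn early instead of letting tests fail with confusing mismatches.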

The missing spaces were a real issue; I pushed a fix in #4284. Thanks for finding it!

ines pushed a commit that referenced this issue Sep 13, 2019
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
@ines ines closed this as completed Sep 13, 2019