
Confusing behaviour in Phrase Matcher for Japanese model #4262

Closed
lautel opened this issue Sep 9, 2019 · 9 comments
Labels
bug (Bugs and behaviour differing from documentation) · feat / tokenizer (Feature: Tokenizer) · lang / ja (Japanese language data and models) · third-party (Third-party packages and services)

Comments

@lautel

lautel commented Sep 9, 2019

How to reproduce the behaviour

1. Load libraries

from spacy.matcher import PhraseMatcher
from spacy.lang.ja import Japanese

2. Define a Phrase Matcher to find custom entities in Japanese text:

def phrase_matcher_test(text):

    nlp = Japanese()

    # attr="LOWER" matches on the lowercased form of each token
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns = ["nakagawa jun", "中川 潤", "nakagawa@xxxx.jp", "japan"]
    patterns_doc = list(nlp.pipe(patterns))
    matcher.add("ENTITY", None, *patterns_doc)

    doc = nlp(text)

    matches = matcher(doc)
    print(f'\n{len(matches)} matches found!')

    for match_id, start, end in matches:
        print(doc.vocab.strings[match_id]+': ', doc[start:end].text)

3. Call the function

input_text = "Nakagawa Jun (中川潤) のメールはnakagawa@xxxx.jpです. 彼はJapanで働いています"
phrase_matcher_test(input_text)

4. Result

Output:

3 matches found!
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan

Expected output:

4 matches found!
ENTITY: Nakagawa Jun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan

5. Additional info

If

input_text = "nakagawa jun (中川潤) のメールはnakagawa@xxxx.jpです. 彼はJapanで働いています"

it almost works as expected (note the lowercase 'nakagawa jun', while 'Japan' keeps its capital). By almost I mean that it matches the name but drops the blank space between the first and last name. See the following output.

Output:

ENTITY: nakagawajun
ENTITY: 中川潤
ENTITY: nakagawa@xxxx.jp
ENTITY: Japan
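For context on the casing: `attr="LOWER"` makes the PhraseMatcher compare the lowercased text of each token, which is why the all-lowercase pattern "japan" still matches the token "Japan", and the span text keeps the document's original casing. A minimal pure-Python sketch of that idea (token lists stand in for tokenized Docs; this is an illustration, not spaCy's implementation):

```python
def match_phrases_lower(doc_tokens, patterns):
    """Find token-level pattern occurrences, comparing lowercased text.

    doc_tokens: list of token strings; patterns: list of token-string lists.
    Returns (start, end) spans, mirroring how attr="LOWER" matching behaves.
    """
    lowered = [t.lower() for t in doc_tokens]
    matches = []
    for pattern in patterns:
        pat = [t.lower() for t in pattern]
        n = len(pat)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == pat:
                matches.append((i, i + n))
    return sorted(matches)

# Hypothetical tokenization of the example sentence
doc = ["Nakagawa", "Jun", "(", "中川潤", ")", "の", "メール", "は",
       "nakagawa@xxxx.jp", "です", "。", "彼", "は", "Japan", "で", "働いています"]
patterns = [["nakagawa", "jun"], ["中川潤"], ["nakagawa@xxxx.jp"], ["japan"]]
for start, end in match_phrases_lower(doc, patterns):
    # Prints the four expected ENTITY spans, including "Nakagawa Jun"
    print("ENTITY:", " ".join(doc[start:end]))
```

Comparison is on lowercased text, so the original case of the document tokens never prevents a match.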

Your Environment

  • spaCy version: 2.1.8
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.7.3
@lautel changed the title from "Bad behaviour in Phrase Matcher for Japanese model" to "Confussing behaviour in Phrase Matcher for Japanese model" on Sep 9, 2019
@ines added the labels bug, feat / tokenizer, lang / ja, and third-party on Sep 9, 2019
@lautel
Author

lautel commented Sep 11, 2019

Hi,
We've kept running tests, and the given example works fine in spaCy version 2.1.4. Hope you find this helpful.

@adrianeboyd
Contributor

Thanks for the report! I can only replicate the missing space, not the missing match, but I definitely think there is something going on because the Japanese tokenization tests are failing, too.

With a clean virtual environment with spacy 2.1.8 and mecab-python3, I get the output:

4 matches found!
ENTITY:  NakagawaJun
ENTITY:  中川潤
ENTITY:  nakagawa@xxxx.jp
ENTITY:  Japan

My environment:

blis==0.2.4
certifi==2019.9.11
chardet==3.0.4
cymem==2.0.2
idna==2.8
mecab-python3==0.996.2
murmurhash==1.0.2
numpy==1.17.2
pkg-resources==0.0.0
plac==0.9.6
preshed==2.0.1
requests==2.22.0
spacy==2.1.8
srsly==0.1.0
thinc==7.0.8
tqdm==4.35.0
urllib3==1.25.3
wasabi==0.2.2

@polm, do you have any ideas?

@polm
Contributor

polm commented Sep 12, 2019

Can you try it with mecab-python3==0.7? I think that should be the version in optional requires, newer versions have bugs or dictionary issues.

@lautel
Author

lautel commented Sep 12, 2019

Hi, thanks for your quick response!
So, working with mecab-python3==0.996.2 I get the same output as @adrianeboyd, in both Windows and Linux environments. However, I previously had mecab-python3==0.7 installed, which is why the first match (NakagawaJun) was missing...

@adrianeboyd
Contributor

Sorry, I was just going by the error message (I guess the error message should be more specific?):

raise ImportError(
    "Japanese support requires MeCab: "
    "https://github.com/SamuraiT/mecab-python3"
)

However, I get the same results with 4 matches and no space with mecab-python3==0.7 and nearly all the tests in tests/lang/ja fail. A few examples are tokenized correctly, some aren't, and I think all the POS and lemma tests fail.

@polm
Contributor

polm commented Sep 13, 2019

@adrianeboyd Are you sure you have Unidic installed? If you're using ipadic that would cause the tests to fail. To check, show the output of echo "図書館" | mecab and mecab -D.

@lautel With a clean install of spaCy 2.1.8 and mecab-python3 0.7 on Linux I get this output:

4 matches found!
ENTITY:  NakagawaJun
ENTITY:  中川潤
ENTITY:  nakagawa@xxxx.jp
ENTITY:  Japan

So the space issue is there, but I don't have the other mismatch issue. I'm not sure why you aren't getting Nakagawa Jun as an entity, can you figure out what POS tags it's getting?

As to why the space is missing, the way Mecab handles half-width spaces is weird. I thought I'd handled it correctly but it's possible I missed something, so I'll look over that code in more detail.
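For background on why a lost space changes the matched text: a spaCy Doc records, for each token, whether it is followed by whitespace, and span text is rebuilt from those flags. A hypothetical sketch of the failure mode (`span_text` is an illustrative helper, not spaCy code): if the MeCab wrapper discards the whitespace information, the reconstructed span loses the space.

```python
def span_text(words, spaces, start, end):
    """Rebuild the surface text of a span from token strings plus
    trailing-space flags, the way a spaCy Doc reconstructs span text."""
    parts = []
    for word, has_space in zip(words[start:end], spaces[start:end]):
        parts.append(word + (" " if has_space else ""))
    return "".join(parts).rstrip()

words = ["Nakagawa", "Jun", "(", "中川潤", ")"]
correct_spaces = [True, True, False, False, False]  # half-width space preserved
buggy_spaces = [False, False, False, False, False]  # whitespace info dropped

print(span_text(words, correct_spaces, 0, 2))  # Nakagawa Jun
print(span_text(words, buggy_spaces, 0, 2))    # NakagawaJun
```

The tokens themselves are identical in both cases; only the per-token space flags differ, which matches the symptom in the outputs above.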

@lautel
Author

lautel commented Sep 13, 2019

@polm, I've re-run the example to double-check the output and yes, I'm missing the first match. Anyway, this seems to be fine with mecab-python3==0.996.2.
In case you find it useful, see below further information regarding my environment (Linux):

python -m spacy info --markdown

  • spaCy version: 2.1.8
  • Platform: Linux-4.9.184-linuxkit-x86_64-with-debian-9.9
  • Python version: 3.7.4

echo "図書館" | mecab

図書館  名詞,一般,*,*,*,*,図書館,トショカン,トショカン
EOS

mecab -D

filename:       /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic
version:        102
charset:        UTF8
type:   0
size:   4587310
left size:      1316
right size:     1316

pip list

Package       Version
------------- ---------
blis          0.2.4
certifi       2019.9.11
chardet       3.0.4
cymem         2.0.2
idna          2.8
mecab-python3 0.7
mojimoji      0.0.9
murmurhash    1.0.2
numpy         1.17.2
pip           19.1.1
plac          0.9.6
preshed       2.0.1
requests      2.22.0
setuptools    41.0.1
spacy         2.1.8
srsly         0.1.0
thinc         7.0.8
tqdm          4.35.0
urllib3       1.25.3
wasabi        0.2.2
wheel         0.33.4

Thank you both!

@polm
Contributor

polm commented Sep 13, 2019

@lautel OK, you're using an IPADic based Neologd dictionary, which won't work. spaCy uses Universal Dependencies, which is based on / only supports Unidic. Please install Unidic and configure Mecab to use it. I would also suggest you avoid using Neologd. This is a reminder that I should probably write a dictionary sniffer to check that Unidic is being used...
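Such a sniffer could, for instance, parse the `mecab -D` output for the dictionary path. A hypothetical sketch (the function name and the substring heuristic are assumptions, not anything shipped with spaCy):

```python
import subprocess

def detect_mecab_dictionary(dump=None):
    """Guess the MeCab dictionary family from `mecab -D` output.

    If `dump` is None, run `mecab -D`; otherwise parse the given text.
    Returns "unidic", "ipadic", or "unknown" based on the dictionary path.
    """
    if dump is None:
        dump = subprocess.run(["mecab", "-D"], capture_output=True,
                              text=True, check=True).stdout
    for line in dump.splitlines():
        if line.startswith("filename:"):
            path = line.split(":", 1)[1].strip().lower()
            if "unidic" in path:
                return "unidic"
            if "ipadic" in path:
                return "ipadic"
    return "unknown"

# The `mecab -D` dump from this thread would be flagged as IPADic-based:
sample = "filename:\t/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/sys.dic"
print(detect_mecab_dictionary(sample))  # ipadic
```

A check like this, run at tokenizer initialization, could warn early instead of letting tests fail with confusing mismatches.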

The missing spaces were a real issue; I pushed a fix in #4284. Thanks for finding it!

ines pushed a commit that referenced this issue Sep 13, 2019
Before this patch, half-width spaces between words were simply lost in
Japanese text. This wasn't immediately noticeable because much Japanese
text never uses spaces at all.
@ines ines closed this as completed Sep 13, 2019