Replace python-mecab3 with fugashi for Japanese #4621

Merged
merged 5 commits into from Nov 23, 2019

Conversation

@polm
Contributor

polm commented Nov 10, 2019

Description

Currently, Japanese uses mecab-python3 as a wrapper around MeCab for tokenization. However, we use an old version of mecab-python3 because the latest versions make using Unidic (required by Universal Dependencies) tricky and also have an unresolved memory usage issue.

This change replaces mecab-python3 with fugashi, a Cython wrapper for MeCab I wrote recently, which doesn't have the above issues. The API was also written with spaCy in mind, which allowed simplifying the tokenizer code here a bit.

Types of change

This is an enhancement.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

polm added 5 commits Nov 6, 2019

mecab-python3 has been the best MeCab binding for a long time but it's
not very actively maintained, and since it's based on old SWIG code
distributed with MeCab there's a limit to how effectively it can be
maintained.

Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not
based on the old SWIG code it's easier to keep it current and make small
deviations from the MeCab C/C++ API where that makes sense.

The tags come from MeCab, but the tag schema is specified by Unidic, so
it's more proper to refer to it that way.

@adrianeboyd

Collaborator

adrianeboyd commented Nov 10, 2019

I was looking at Japanese as an example for how to better set up the Chinese tokenizer and I noticed that the handling of whitespace is inconsistent with other languages. With the default tokenizer, consecutive whitespace tokens are merged into one token, so "I [three spaces] like cheese" is 'I', '[two spaces]', 'like', 'cheese', with the two extra spaces as one token.

Would your library make a difference for Korean? (And do you know if there's an easy way to set spacy up so that both Japanese and Korean work with mecab at the same time? The only way I could figure out was to set the environment variable MECABRC in __init__.py to point to the right configuration file for each, but hard-coded paths are obviously not a general solution. I mainly want to be able to run all the unit tests locally (especially since they're not run in the CI tests) and also do things like process all UD corpora in one batch.)

@polm

Contributor Author

polm commented Nov 10, 2019

I was looking at Japanese as an example for how to better set up the Chinese tokenizer and I noticed that the handling of whitespace is inconsistent with other languages. With the default tokenizer, consecutive whitespace tokens are merged into one token, so "I [three spaces] like cheese" is 'I', '[two spaces]', 'like', 'cheese', with the two extra spaces as one token.

The way MeCab handles whitespace is weird and has caused issues before. I think we have a test and "I [three spaces] like cheese" should result in five tokens, two of which are spaces. Either way this PR shouldn't change the behavior of spaces, and if there's a problem or inconsistency it might be best to make a new issue.
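
To make the two behaviors in this exchange concrete, here is a minimal pure-Python sketch. It is not spaCy's or MeCab's actual tokenizer code; the function name `tokenize` and the `merge_spaces` flag are hypothetical, and the sketch assumes that the first space after a word is treated as that word's trailing whitespace rather than as a token.

```python
import re


def tokenize(text, merge_spaces=True):
    """Illustrative whitespace handling only (hypothetical helper).

    merge_spaces=True  -> extra spaces in a run become a single token
    merge_spaces=False -> every extra space becomes its own token
    """
    tokens = []
    # Alternate between runs of non-whitespace and runs of whitespace.
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if not chunk.isspace():
            tokens.append(chunk)
            continue
        # Assumption: the first space belongs to the preceding word,
        # so only the remainder of the run is tokenized.
        extra = chunk[1:] if tokens else chunk
        if not extra:
            continue
        if merge_spaces:
            tokens.append(extra)
        else:
            tokens.extend(extra)  # one token per whitespace character
    return tokens


text = "I" + " " * 3 + "like cheese"
print(len(tokenize(text, merge_spaces=True)))   # merged variant
print(len(tokenize(text, merge_spaces=False)))  # one-token-per-space variant
```

With three spaces after "I", the merged variant yields four tokens and the other variant yields five, two of which are single spaces, which matches the two descriptions above.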

Would your library make a difference for Korean?

Based on a benchmark I made when creating my library, the natto-py wrapper Korean uses is roughly four times slower than fugashi, presumably because it uses the dynamic cffi library for bindings. Currently fugashi makes some assumptions about the dictionary format but I would be glad to change that to support Korean if someone could give me a guide to the format (I don't speak Korean).

And do you know if there's an easy way to set spacy up so that both Japanese and Korean work with mecab at the same time?

I don't think there's a very easy way... using the MECABRC env var is probably one of the better options. Another option would be allowing users to specify a dictionary path when initializing the language. That's desirable in any case for use with custom tokenizer dictionaries, which is pretty common.
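
As a rough sketch of the MECABRC approach, a small context manager can set the variable temporarily and restore it afterwards, which avoids hard-coding a path in `__init__.py`. The helper name `mecabrc` and the path below are hypothetical, and the sketch assumes MeCab reads MECABRC at the moment a tagger is created, so the tagger would have to be constructed inside the `with` block.

```python
import os
from contextlib import contextmanager


@contextmanager
def mecabrc(path):
    """Temporarily point the MECABRC environment variable at `path`.

    Hypothetical helper: it only manages the environment variable and
    does not import or call MeCab itself.
    """
    previous = os.environ.get("MECABRC")
    os.environ["MECABRC"] = path
    try:
        yield
    finally:
        # Restore the earlier state, including "not set at all".
        if previous is None:
            os.environ.pop("MECABRC", None)
        else:
            os.environ["MECABRC"] = previous


# Hypothetical path; each language would use its own configuration file.
with mecabrc("/path/to/japanese/mecabrc"):
    # Construct the Japanese tokenizer here (not shown).
    print(os.environ["MECABRC"])
```

This only helps if each tokenizer is created inside its own block; a dictionary path argument on the language, as suggested above, would avoid relying on process-wide state.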

Another option that's more work is to distribute a MeCab wrapper with a built-in dictionary. mecab-python3 did some work on this but it's using the obsolete IPADic dictionary.

@adrianeboyd

Collaborator

adrianeboyd commented Nov 11, 2019

Thanks for the info! I was hoping there might be a nicer method I didn't know about, but no such luck.

Since the CI testing doesn't help much here, I ran the unit tests and the UD eval scripts and didn't run into any issues.

@honnibal

Member

honnibal commented Nov 23, 2019

Thanks! Sorry it took a bit of time to get to this.

@honnibal honnibal merged commit f0e3e60 into explosion:master Nov 23, 2019
12 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
explosion.spaCy Build #20191110.1 succeeded
explosion.spaCy (Test Python35Linux) Test Python35Linux succeeded
explosion.spaCy (Test Python35Mac) Test Python35Mac succeeded
explosion.spaCy (Test Python35Windows) Test Python35Windows succeeded
explosion.spaCy (Test Python36Linux) Test Python36Linux succeeded
explosion.spaCy (Test Python36Mac) Test Python36Mac succeeded
explosion.spaCy (Test Python36Windows) Test Python36Windows succeeded
explosion.spaCy (Test Python37Linux) Test Python37Linux succeeded
explosion.spaCy (Test Python37Mac) Test Python37Mac succeeded
explosion.spaCy (Test Python37Windows) Test Python37Windows succeeded
explosion.spaCy (Validate) Validate succeeded
4 participants