Replace python-mecab3 with fugashi for Japanese #4621
Currently the Japanese tokenizer uses mecab-python3 as a wrapper around MeCab. However, we use an old version of mecab-python3 because the latest versions make using Unidic (required by Universal Dependencies) tricky and also have an unresolved memory usage issue.
This change replaces mecab-python3 with fugashi, a Cython wrapper for MeCab I wrote recently, which doesn't have the above issues. The API was also written with spaCy in mind, which made it possible to simplify the tokenizer code here a bit.
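For readers unfamiliar with fugashi, here is a minimal sketch of what tokenizing with it might look like. It is not the tokenizer code from this PR. The helper name `tag_japanese` and its `tagger` parameter are hypothetical; the sketch assumes fugashi's `Tagger` is callable on a string and yields nodes with `surface` and `pos` attributes, and that a UniDic dictionary (for example the `unidic-lite` package) is installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def tag_japanese(
    text: str,
    tagger: Optional[Callable[[str], Iterable]] = None,
) -> List[Tuple[str, str]]:
    """Return (surface, pos) pairs for each token in `text`.

    `tagger` is a hypothetical injection point: any callable that maps a
    string to objects with `surface` and `pos` attributes will work.
    """
    if tagger is None:
        # Imported lazily so this module loads even without fugashi installed.
        # Assumption: fugashi and a UniDic dictionary are available.
        from fugashi import Tagger

        tagger = Tagger()
    # Assumption: each node exposes the surface form and a UniDic POS string.
    return [(word.surface, word.pos) for word in tagger(text)]
```

With fugashi installed, `tag_japanese("麩菓子は、麩を主材料とした日本の菓子。")` would return one pair per token. Passing a custom `tagger` keeps the helper usable in tests without a dictionary.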
Types of change
This is an enhancement.
mecab-python3 has been the best MeCab binding for a long time, but it's not very actively maintained, and since it's based on the old SWIG code distributed with MeCab, there's a limit to how effectively it can be maintained. Fugashi is a new Cython-based MeCab wrapper I wrote. Since it's not based on the old SWIG code, it's easier to keep it current and to make small deviations from the MeCab C/C++ API where that makes sense.
The tags come from MeCab, but the tag schema is specified by Unidic, so it's more accurate to refer to them as Unidic tags.
I was looking at Japanese as an example for how to better set up the Chinese tokenizer and I noticed that the handling of whitespace is inconsistent with other languages. With the default tokenizer, consecutive whitespace tokens are merged into one token, so
Would your library make a difference for Korean? (And do you know if there's an easy way to set spacy up so that both Japanese and Korean work with mecab at the same time? The only way I could figure out was to set the environment variable
The way MeCab handles whitespace is weird and has caused issues before. I think we have a test and "I [three spaces] like cheese" should result in five tokens, two of which are spaces. Either way this PR shouldn't change the behavior of spaces, and if there's a problem or inconsistency it might be best to make a new issue.
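To make the two behaviors under discussion concrete, here is a toy model in plain Python. It is neither spaCy's nor MeCab's actual code. The function name `split_with_whitespace` and the `merge_runs` flag are made up for illustration, and the model assumes that a single space directly after a word is absorbed as that word's trailing whitespace, so only the extra spaces become tokens.

```python
import re
from typing import List


def split_with_whitespace(text: str, merge_runs: bool) -> List[str]:
    """Split `text` into word tokens and extra-whitespace tokens.

    Toy model only: the first whitespace character after a word is treated
    as that word's trailing space and does not become a token.
    """
    tokens: List[str] = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if not chunk.isspace():
            tokens.append(chunk)
            continue
        # Drop the one space owned by the preceding word, if there is one.
        extra = chunk[1:] if tokens else chunk
        if not extra:
            continue
        if merge_runs:
            tokens.append(extra)  # one token for the whole run
        else:
            tokens.extend(extra)  # one token per whitespace character
    return tokens


text = "I   like cheese"  # three spaces after "I"
print(split_with_whitespace(text, merge_runs=True))   # ['I', '  ', 'like', 'cheese']
print(split_with_whitespace(text, merge_runs=False))  # ['I', ' ', ' ', 'like', 'cheese']
```

Under this model, merging gives four tokens and splitting per character gives the five tokens (two of them spaces) mentioned above.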
Based on a benchmark I ran when creating my library, the natto-py wrapper that Korean uses is roughly four times slower than fugashi, presumably because it uses the dynamic cffi library for bindings. Currently fugashi makes some assumptions about the dictionary format, but I would be glad to change that to support Korean if someone could give me a guide to the format (I don't speak Korean).
I don't think there's a very easy way... using the
Another option that's more work is to distribute a MeCab wrapper with a built-in dictionary.