Switch to mecab-ko as default Korean tokenizer #11294
Conversation
Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`.
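For users who want to stay on the previous tokenizer, the registered name above suggests a config override along these lines (a sketch based only on the name given in this PR; the block layout follows spaCy's usual `[nlp.tokenizer]` convention):

```ini
[nlp.tokenizer]
@tokenizers = "spacy.KoreanNattoTokenizer.v1"
```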
It is too difficult to parameterize tests over fixtures, so I just duplicated the existing texts. In my local tests this was also quite a bit faster than …
@polm What do you think about …? And I guess also how to specify custom dictionaries. We'd have to add a setting for the options to pass in on init, and I'm not sure whether you can override it if you have …. And maybe this should just be v4 instead.
It looks like …
On further examination, I confirmed this works on Windows (which is one tricky part) and that the basic functionality seems fine. The packaging code is not just a one-time copy of mecab-python3, but is actually designed to be updated by pulling mecab-python3 and doing string replacement on it. This might break builds depending on how I update mecab-python3, but shouldn't affect releases or anything like that.
I think the way it defaults to … And I don't really understand what mecab does when you provide multiple … But if we write our own init for this instead of …, the settings might look something like:

```
mecab_dic_package: Optional[str] = "mecab_ko_dic"
mecabrc_path: Optional[Path] = null
mecab_dic_path: Optional[Path] = null
```

If it's not worth the effort to keep the default (I don't really see how), then it would at least be fine for v4.
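As a rough illustration of how optional settings like those above could be mapped onto MeCab's command-line flags (`-d` for the dictionary directory, `-r` for the `mecabrc` file), here is a hypothetical helper; the setting names are taken from the sketch in the comment and none of this is actual spaCy API:

```python
from pathlib import Path
from typing import List, Optional


def build_mecab_args(
    mecab_dic_path: Optional[Path] = None,
    mecabrc_path: Optional[Path] = None,
) -> str:
    """Assemble a MeCab argument string from optional settings.

    Hypothetical helper sketched from the discussion above, not spaCy code.
    If nothing is given, MeCab is left to its own defaults.
    """
    args: List[str] = []
    if mecab_dic_path is not None:
        args.extend(["-d", str(mecab_dic_path)])  # dictionary directory
    if mecabrc_path is not None:
        args.extend(["-r", str(mecabrc_path)])  # mecabrc config file
    return " ".join(args)
```

With no arguments this returns an empty string, which would leave the wrapper's default dictionary resolution untouched.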
Why would that be a problem? I think the way it works is just that if no dictionary is specified, then that one is used by default. This is the same behavior as mecab-python3 / fugashi. To clarify the MeCab arguments a little: …

So the arguments would be more like this: …
Note also that this assumes the dictionary is provided as a Python package, which is traditionally not the case; that's just something that I started with. It might be preferable to just let them provide a MeCab argument string, which kind of just punts on the typing, but it does mean we don't have to worry about accidentally shutting off a feature. It's hard to decide what the right thing to do with this is without much community input…
My main concern was that if you use … In terms of the configurability in general, if you can override the defaults with later … I think I'd probably just go for the raw string config here. I think you can potentially screw up the output enough that the annotation parsing fails, but that's on you.
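A raw-string setting could look something like this in the config (a sketch: the `mecab_args` name matches the setting this PR eventually adds per the commit list, but the default tokenizer's registered name is not shown in this thread, so it is a placeholder here):

```ini
[nlp.tokenizer]
@tokenizers = "spacy.KoreanTokenizer.v1"
mecab_args = "-d /path/to/mecab-ko-dic"
```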
Ah, I see what you mean about changing the defaults by ignoring the …. Given we don't know much about our users, I think it's best to go with the default behaviour of …

If we want to be more careful, we could implement the custom Tagger to not do the auto-loading, use it to check whether someone already has a working installation, and warn them if so. But then we would have to provide some way to disable that warning, and given we're unsure how many people it would affect, it may not be worth the effort. (Is there some way we could let users disable a warning like that without leaving an extra flag in the config, an env var, or something similar?)

Also, about the option to provide the name of other dictionaries: that seems like a reasonable option, but I've never heard of an alternative dictionary for mecab-ko, and since there are none on PyPI at the moment, I think we can omit it for now.
Hmm, I still don't 100% understand what the mecab wrapper is going to do here in the presence of a system installation. Does it work to basically have …? I think we can live with not having any built-in way to disable this warning other than your own warnings filters.
That sounds fine. In our pretrained pipelines I expect we'd just configure it to use …
Good point. About what a default system installation does: if a … If we could check a single location for the …
The trained pipelines don't use … I was afraid that the paths would work like that. This is too complicated, so let's just switch to this in v4 with the existing …
The Korean-related changes look like they're ready to go.

One thing I checked is whether mecab-ko-dic has quoted fields in the CSV features; I confirmed it does not. (Some versions of UniDic have this, and it means you can't just use `split(",")`.)
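To see why quoted fields would break the naive approach: a comma inside a quoted feature field is one field to a real CSV parser but two to `split(",")`. A small self-contained illustration (the feature values here are made up, not real mecab-ko-dic output):

```python
import csv
import io

# A feature string where one hypothetical field contains a comma
# and is therefore quoted, as in some UniDic versions.
line = 'NNG,*,"foo,bar",baz'

naive = line.split(",")                       # splits inside the quotes
parsed = next(csv.reader(io.StringIO(line)))  # respects the quoting

print(naive)   # ['NNG', '*', '"foo', 'bar"', 'baz']
print(parsed)  # ['NNG', '*', 'foo,bar', 'baz']
```

Since mecab-ko-dic has no quoted fields, the simpler `split(",")` is safe there.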
However, this PR seems to have a lot of unrelated changes in it, I guess because it was changed from being based on master to v4? I think it needs to be rebased or recreated.
Ah, it's just waiting for …
Looks like the history is fixed, so I think this is good to go?
I noticed one place where it looks like an error message needs updating; otherwise it looks fine.
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Thanks, I think this is ready to go! It's still in draft state though, is that just a leftover from before the merge or is there something else to add?
* Switch to mecab-ko as default Korean tokenizer

  Switch to the (confusingly-named) `mecab-ko` python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`.
* Temporarily run tests with mecab-ko tokenizer
* Fix types
* Fix duplicate test names
* Update requirements test
* Revert "Temporarily run tests with mecab-ko tokenizer" (reverts commit d2083e7)
* Add mecab_args setting, fix pickle for KoreanNattoTokenizer
* Fix length check
* Update docs
* Formatting
* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Description

Switch to the (confusingly-named) `mecab-ko` python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`.

Types of change

Enhancement.

Checklist