Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to mecab-ko as default Korean tokenizer #11294

Merged
merged 13 commits into from
Aug 26, 2022

Conversation

adrianeboyd
Copy link
Contributor

@adrianeboyd adrianeboyd commented Aug 11, 2022

Description

Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization.

Maintain the previous natto-py tokenizer as spacy.KoreanNattoTokenizer.v1.

Types of change

Enhancement.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.
@adrianeboyd adrianeboyd added lang / ko Korean language data and models v3.5 Related to v3.5 labels Aug 11, 2022
@adrianeboyd adrianeboyd marked this pull request as draft August 11, 2022 11:52
@adrianeboyd
Copy link
Contributor Author

It is too difficult to parameterize tests over fixtures, so I just duplicated the existing texts.

In my local tests this was also quite a bit faster than natto-py (~1.5x). We'll have to see how the compatibility looks in the tests, but switching to mecab-ko would remove the last remaining default tokenizer that requires separate system libraries so we could run all the language-specific tests in the slow buildkite tests.

@adrianeboyd
Copy link
Contributor Author

adrianeboyd commented Aug 11, 2022

@polm What do you think about mecab-ko (mainly the packaging)?

And I guess also how to specify custom dictionaries. We'd have to add a setting for the options to pass in on init, and I'm not sure whether you can override it if you have mecab_ko_dic installed because of how it sets the options internally.

And maybe this should just be v4 instead.

@polm
Copy link
Contributor

polm commented Aug 15, 2022

It looks like mecab-ko is just a repackaging of mecab-python3, which would be great. Let me take a closer look at it and give it a quick test on Windows...

@polm
Copy link
Contributor

polm commented Aug 15, 2022

On further examination I confirmed this works on Windows (which is one tricky part) and that the basic functionality seems fine. The packaging code is not just a one-time copy of mecab-python3 but actually designed to be updated by pulling mecab-python3 and doing string replacement on it. This might break builds depending on how I update mecab-python3, but shouldn't affect releases or anything like that.

@adrianeboyd
Copy link
Contributor Author

I think the way it defaults to mecab_ko_dic is going to be a problem in terms of the configurability.

And I don't really understand what mecab does when you provide multiple -d and -r options.

But if we write our own init for this instead of Tagger then we can have the settings for mecab_ko_dic (or another package) vs. a path to a system mecabrc. And I'm not sure how to preserve the current default, which is some built-in system mecab default.

mecab_dic_package: Optional[str] = "mecab_ko_dic"
mecabrc_path: Optional[Path] = null
mecab_dic_path: Optional[Path] = null

If it's not worth the effort to keep the default (I don't really see how), then it would at least be fine for v4.

@polm
Copy link
Contributor

polm commented Aug 16, 2022

I think the way it defaults to mecab_ko_dic is going to be a problem in terms of the configurability.

Why would that be a problem? I think the way it works is just that if no dictionary is specified then that one is used by default. This is the same behavior as mecab-python3 / fugashi.

To clarify the MeCab arguments a little:

  • -d is a "system dictionary". You can only have one, and if multiple arguments are provided only the last one is used (it overwrites earlier ones). Packages like unidic-lite or mecab-ko-dic provide this.
  • -r is the config path and, like the system dictionary, only the last one given is used. In my packages I specify /dev/null (or Windows equivalent) since there's no need for it to contain anything useful.
  • -u is for user dictionaries, and any number (up to some large limit) can be used.

So the arguments would be more like this:

mecab_system_dic: Optional[str] = "mecab_ko_dic"
mecabrc_path: Optional[Path] = null
mecab_user_dics: Optional[List[Path]] = null # or make it non-optional and []

Note that also this assumes that the dictionary is provided as a Python package, which is traditionally not the case, that's just something that I started with unidic-lite etc. It's possible that they have a dictionary somewhere on disk they want to use that needs to be specified with a path. (In Japanese that would be typical for NEologd, which is a large dictionary with no pip distribution, though I've not aware of anything similar for Korean.)

It might be preferable to just let them provide a MeCab argument string, which kind of just punts on the typing, but it does mean we don't have to worry about accidentally shutting off a feature.

It's hard to decide what the right thing to do with this is without much community input...

@adrianeboyd
Copy link
Contributor Author

My main concern was that if you use spacy.blank("ko") currently you get whatever your system/environment mecabrc has configured. With this PR (as it is now) you get mecab_ko_dic instead, so we couldn't drop it in in v3.5.0 without causing problems. Possibly we could write our own Tagger init so that it doesn't use mecab_ko_dic at all right now, at least not without additional non-default settings.

In terms of the configurability in general, if you can override the defaults with later -d -r settings, then I guess that's okay.

I think I'd probably just go for the raw string config here. I think you can potentially screw up the output enough that the annotation parsing fails, but that's on you.

@polm
Copy link
Contributor

polm commented Aug 18, 2022

Ah, I see what you mean about changing the defaults by ignoring the mecabrc, that's a good point. The thing is, if we respect people's settings by default, then for new users the dictionary autoloading also won't work, which isn't great.

Given we don't know much about our users, I think it's best to go with the default behaviour of mecab-ko (Python) here, where it loads the dictionary, even if it risks breaking some installations. If we provide access to the string config on top of that they can reconfigure it if they need to.

If we want to be more careful, we could implement the custom Tagger to not do the auto-loading, use it to check if someone has a working installation already, and warn them if so. But then we would have to provide some way to disable that warning, and given we're unsure how many people it would affect it may not be worth the effort. (Is there some way we could have users disable a warning like that that wouldn't leave an extra flag in the config or an env var or something?)

Also, about the option to provide the name of other dictionaries - that seems like a reasonable option, but I've never heard of an alternative dictionary for mecab-ko, and since there's none on PyPI at the moment, I think we can omit it for now.

@adrianeboyd
Copy link
Contributor Author

adrianeboyd commented Aug 18, 2022

Hmm, I still don't 100% understand what the mecab wrapper is going to do here in the presence of a system installation.

Does it work to basically have mecab_dic_package = null in v3 and add a warning that we'll switch to mecab_dic_package = "mecab_ko_doc" in v4, so if you want to keep your system defaults, you'll have to override this in your config in v4, plus you can go ahead and switch now and then not need the system installation for the defaults?

I think we can live with not having any built-in way to disable this warning other than your own warnings filters.

@polm
Copy link
Contributor

polm commented Aug 18, 2022

Does it work to basically have mecab_dic_package = null in v3 and add a warning that we'll switch to mecab_dic_package = "mecab_ko_doc" in v4, so if you want to keep your system defaults, you'll have to override this in your config in v4, plus you can go ahead and switch now and then not need the system installation for the defaults?

That sounds fine. In our pretrained pipelines I expect we'd just configure it to use mecab_ko_dic.

I think we can live with not having any built-in way to disable this warning other than your own warnings filters.

Good point.


About what a default system installation does - if a mecabrc file can be found, settings will be read from it. In theory you can set most options in a mecabrc, such as configuring user dictionaries, output format, and so on, but in practice, the only thing I've ever seen anyone use it for is configuring the system dictionary. Because the rc file for the system dictionary (dicrc) can have all the same settings, and because many settings depend on dictionary fields, any configuration is normally done there instead.

If we could check a single location for the mecabrc file that would be one thing, but the path checked for the rc file is configurable at compilation time in MeCab, and an issue I ran into before is that the location when building from official sources is different than the one used in Debian, for example. This means that the path mecab-ko checks may not be the path the user's binary would check. (Checking both common paths is an option, but in mecab-python3/fugashi I decided it was too much trouble and just didn't do it. I also was able to make a major version release with the change though.) So there is a non-zero chance that even without autoloading mecab-ko-dic, the binary included in the Python mecab-ko may not behave the same by default. (With the proposed setting where we don't use mecab-ko-dic by default, if they have an install, and the mecabrc path Python mecab-ko checks is different from their system binary, they'll get an error that mecabrc wasn't found and should be able to figure out what happened. That's much better than a subtle dictionary change so I think it's fine.)

@adrianeboyd
Copy link
Contributor Author

The trained pipelines don't use mecab-ko at all, so that isn't a factor.

I was afraid that the paths would work like that. This is too complicated, so let's just switch to this in v4 with the existing mecab_ko_dic default + raw config string, and provide the previous Natto tokenizer for people who want to work with the system defaults without any further configuration.

@adrianeboyd adrianeboyd changed the base branch from develop to v4 August 18, 2022 09:13
@polm polm added 🔜 v4.0 Related to upcoming v4.0 and removed v3.5 Related to v3.5 labels Aug 23, 2022
@adrianeboyd adrianeboyd marked this pull request as ready for review August 23, 2022 12:07
Copy link
Contributor

@polm polm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Korean related changes look like they're ready to go.

One thing I checked is whether mecab-ko-dic has quoted fields in the CSV features, I confirmed it does not. (Some versions of UniDic have this and it means you can't just use split(",").)

However, this PR seems to have a lot of unrelated changes in it, I guess because it was changed from being based on master to v4? I think it needs to be rebased or recreated.

@adrianeboyd
Copy link
Contributor Author

Ah, it's just waiting for master to be merged into v4. I'll mark it as a draft again.

@adrianeboyd adrianeboyd marked this pull request as draft August 24, 2022 05:10
Copy link
Contributor

@polm polm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the history is fixed, so I think this is good to go?

I noticed one place where it looks like an error message needs updating, otherwise looks fine though.

spacy/lang/ko/__init__.py Outdated Show resolved Hide resolved
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Copy link
Contributor

@polm polm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this is ready to go! It's still in draft state though, is that just a leftover from before the merge or is there something else to add?

@adrianeboyd adrianeboyd marked this pull request as ready for review August 26, 2022 07:55
@adrianeboyd adrianeboyd merged commit 2a558a7 into explosion:v4 Aug 26, 2022
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request Mar 14, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request Mar 14, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request Mar 29, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request Apr 17, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request May 10, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request May 10, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
jordankanter pushed a commit to jordankanter/spaCy that referenced this pull request May 21, 2024
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang / ko Korean language data and models 🔜 v4.0 Related to upcoming v4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants