Switch to mecab-ko as default Korean tokenizer #11294

adrianeboyd · 2022-08-11T11:52:10Z

Description

Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization.

Maintain the previous natto-py tokenizer as spacy.KoreanNattoTokenizer.v1.

Types of change

Enhancement.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`.

adrianeboyd · 2022-08-11T11:56:00Z

It is too difficult to parameterize tests over fixtures, so I just duplicated the existing texts.

In my local tests this was also quite a bit faster than natto-py (~1.5x). We'll have to see how the compatibility looks in the tests, but switching to mecab-ko would remove the last remaining default tokenizer that requires separate system libraries so we could run all the language-specific tests in the slow buildkite tests.

This reverts commit d2083e7.

adrianeboyd · 2022-08-11T18:57:45Z

@polm What do you think about mecab-ko (mainly the packaging)?

And I guess also how to specify custom dictionaries. We'd have to add a setting for the options to pass in on init, and I'm not sure whether you can override it if you have mecab_ko_dic installed because of how it sets the options internally.

And maybe this should just be v4 instead.

polm · 2022-08-15T03:30:11Z

It looks like mecab-ko is just a repackaging of mecab-python3, which would be great. Let me take a closer look at it and give it a quick test on Windows...

polm · 2022-08-15T10:57:48Z

On further examination I confirmed this works on Windows (which is one tricky part) and that the basic functionality seems fine. The packaging code is not just a one-time copy of mecab-python3 but actually designed to be updated by pulling mecab-python3 and doing string replacement on it. This might break builds depending on how I update mecab-python3, but shouldn't affect releases or anything like that.

adrianeboyd · 2022-08-15T11:46:23Z

I think the way it defaults to mecab_ko_dic is going to be a problem in terms of the configurability.

And I don't really understand what mecab does when you provide multiple -d and -r options.

But if we write our own init for this instead of Tagger then we can have the settings for mecab_ko_dic (or another package) vs. a path to a system mecabrc. And I'm not sure how to preserve the current default, which is some built-in system mecab default.

mecab_dic_package: Optional[str] = "mecab_ko_dic"
mecabrc_path: Optional[Path] = null
mecab_dic_path: Optional[Path] = null

If it's not worth the effort to keep the default (I don't really see how), then it would at least be fine for v4.

polm · 2022-08-16T09:46:18Z

I think the way it defaults to mecab_ko_dic is going to be a problem in terms of the configurability.

Why would that be a problem? I think the way it works is just that if no dictionary is specified then that one is used by default. This is the same behavior as mecab-python3 / fugashi.

To clarify the MeCab arguments a little:

-d is a "system dictionary". You can only have one, and if multiple arguments are provided only the last one is used (it overwrites earlier ones). Packages like unidic-lite or mecab-ko-dic provide this.
-r is the config path and, like the system dictionary, only the last one given is used. In my packages I specify /dev/null (or Windows equivalent) since there's no need for it to contain anything useful.
-u is for user dictionaries, and any number (up to some large limit) can be used.

So the arguments would be more like this:

mecab_system_dic: Optional[str] = "mecab_ko_dic"
mecabrc_path: Optional[Path] = null
mecab_user_dics: Optional[List[Path]] = null # or make it non-optional and []

Note that also this assumes that the dictionary is provided as a Python package, which is traditionally not the case, that's just something that I started with unidic-lite etc. It's possible that they have a dictionary somewhere on disk they want to use that needs to be specified with a path. (In Japanese that would be typical for NEologd, which is a large dictionary with no pip distribution, though I've not aware of anything similar for Korean.)

It might be preferable to just let them provide a MeCab argument string, which kind of just punts on the typing, but it does mean we don't have to worry about accidentally shutting off a feature.

It's hard to decide what the right thing to do with this is without much community input...

adrianeboyd · 2022-08-18T06:08:49Z

My main concern was that if you use spacy.blank("ko") currently you get whatever your system/environment mecabrc has configured. With this PR (as it is now) you get mecab_ko_dic instead, so we couldn't drop it in in v3.5.0 without causing problems. Possibly we could write our own Tagger init so that it doesn't use mecab_ko_dic at all right now, at least not without additional non-default settings.

In terms of the configurability in general, if you can override the defaults with later -d -r settings, then I guess that's okay.

I think I'd probably just go for the raw string config here. I think you can potentially screw up the output enough that the annotation parsing fails, but that's on you.

polm · 2022-08-18T07:16:21Z

Ah, I see what you mean about changing the defaults by ignoring the mecabrc, that's a good point. The thing is, if we respect people's settings by default, then for new users the dictionary autoloading also won't work, which isn't great.

Given we don't know much about our users, I think it's best to go with the default behaviour of mecab-ko (Python) here, where it loads the dictionary, even if it risks breaking some installations. If we provide access to the string config on top of that they can reconfigure it if they need to.

If we want to be more careful, we could implement the custom Tagger to not do the auto-loading, use it to check if someone has a working installation already, and warn them if so. But then we would have to provide some way to disable that warning, and given we're unsure how many people it would affect it may not be worth the effort. (Is there some way we could have users disable a warning like that that wouldn't leave an extra flag in the config or an env var or something?)

Also, about the option to provide the name of other dictionaries - that seems like a reasonable option, but I've never heard of an alternative dictionary for mecab-ko, and since there's none on PyPI at the moment, I think we can omit it for now.

adrianeboyd · 2022-08-18T08:02:36Z

Hmm, I still don't 100% understand what the mecab wrapper is going to do here in the presence of a system installation.

Does it work to basically have mecab_dic_package = null in v3 and add a warning that we'll switch to mecab_dic_package = "mecab_ko_doc" in v4, so if you want to keep your system defaults, you'll have to override this in your config in v4, plus you can go ahead and switch now and then not need the system installation for the defaults?

I think we can live with not having any built-in way to disable this warning other than your own warnings filters.

polm · 2022-08-18T08:52:32Z

Does it work to basically have mecab_dic_package = null in v3 and add a warning that we'll switch to mecab_dic_package = "mecab_ko_doc" in v4, so if you want to keep your system defaults, you'll have to override this in your config in v4, plus you can go ahead and switch now and then not need the system installation for the defaults?

That sounds fine. In our pretrained pipelines I expect we'd just configure it to use mecab_ko_dic.

I think we can live with not having any built-in way to disable this warning other than your own warnings filters.

Good point.

About what a default system installation does - if a mecabrc file can be found, settings will be read from it. In theory you can set most options in a mecabrc, such as configuring user dictionaries, output format, and so on, but in practice, the only thing I've ever seen anyone use it for is configuring the system dictionary. Because the rc file for the system dictionary (dicrc) can have all the same settings, and because many settings depend on dictionary fields, any configuration is normally done there instead.

If we could check a single location for the mecabrc file that would be one thing, but the path checked for the rc file is configurable at compilation time in MeCab, and an issue I ran into before is that the location when building from official sources is different than the one used in Debian, for example. This means that the path mecab-ko checks may not be the path the user's binary would check. (Checking both common paths is an option, but in mecab-python3/fugashi I decided it was too much trouble and just didn't do it. I also was able to make a major version release with the change though.) So there is a non-zero chance that even without autoloading mecab-ko-dic, the binary included in the Python mecab-ko may not behave the same by default. (With the proposed setting where we don't use mecab-ko-dic by default, if they have an install, and the mecabrc path Python mecab-ko checks is different from their system binary, they'll get an error that mecabrc wasn't found and should be able to figure out what happened. That's much better than a subtle dictionary change so I think it's fine.)

adrianeboyd · 2022-08-18T09:11:43Z

The trained pipelines don't use mecab-ko at all, so that isn't a factor.

I was afraid that the paths would work like that. This is too complicated, so let's just switch to this in v4 with the existing mecab_ko_dic default + raw config string, and provide the previous Natto tokenizer for people who want to work with the system defaults without any further configuration.

polm

The Korean related changes look like they're ready to go.

One thing I checked is whether mecab-ko-dic has quoted fields in the CSV features, I confirmed it does not. (Some versions of UniDic have this and it means you can't just use split(",").)

However, this PR seems to have a lot of unrelated changes in it, I guess because it was changed from being based on master to v4? I think it needs to be rebased or recreated.

adrianeboyd · 2022-08-24T05:10:43Z

Ah, it's just waiting for master to be merged into v4. I'll mark it as a draft again.

polm

Looks like the history is fixed, so I think this is good to go?

I noticed one place where it looks like an error message needs updating, otherwise looks fine though.

spacy/lang/ko/__init__.py

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

polm

Thanks, I think this is ready to go! It's still in draft state though, is that just a leftover from before the merge or is there something else to add?

* Switch to mecab-ko as default Korean tokenizer Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`. * Temporarily run tests with mecab-ko tokenizer * Fix types * Fix duplicate test names * Update requirements test * Revert "Temporarily run tests with mecab-ko tokenizer" This reverts commit d2083e7. * Add mecab_args setting, fix pickle for KoreanNattoTokenizer * Fix length check * Update docs * Formatting * Update natto-py error message Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

adrianeboyd added 2 commits August 11, 2022 13:49

Switch to mecab-ko as default Korean tokenizer

a2a0e1a

Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`.

Temporarily run tests with mecab-ko tokenizer

d2083e7

adrianeboyd added lang / ko Korean language data and models v3.5 Related to v3.5 labels Aug 11, 2022

adrianeboyd marked this pull request as draft August 11, 2022 11:52

adrianeboyd added 4 commits August 11, 2022 14:12

Fix types

a19253b

Fix duplicate test names

b3a63b9

Update requirements test

498dddc

Revert "Temporarily run tests with mecab-ko tokenizer"

4d281f7

This reverts commit d2083e7.

polm mentioned this pull request Aug 18, 2022

Packaging mecab-ko for easier use konlpy/konlpy#398

Open

adrianeboyd changed the base branch from develop to v4 August 18, 2022 09:13

Add mecab_args setting, fix pickle for KoreanNattoTokenizer

3f0eed7

polm added 🔜 v4.0 Related to upcoming v4.0 and removed v3.5 Related to v3.5 labels Aug 23, 2022

adrianeboyd added 2 commits August 23, 2022 14:06

Fix length check

75473fe

Update docs

66c0643

adrianeboyd marked this pull request as ready for review August 23, 2022 12:07

polm reviewed Aug 24, 2022

View reviewed changes

adrianeboyd marked this pull request as draft August 24, 2022 05:10

adrianeboyd added 3 commits August 24, 2022 09:50

Merge remote-tracking branch 'upstream/v4' into feature/ko-tok-alt

0789efa

Merge remote-tracking branch 'upstream/v4' into feature/ko-tok-alt

fba5df9

Formatting

c452df3

polm reviewed Aug 25, 2022

View reviewed changes

spacy/lang/ko/__init__.py Outdated Show resolved Hide resolved

Update natto-py error message

49f1bcf

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

polm approved these changes Aug 26, 2022

View reviewed changes

adrianeboyd marked this pull request as ready for review August 26, 2022 07:55

adrianeboyd merged commit 2a558a7 into explosion:v4 Aug 26, 2022

adrianeboyd mentioned this pull request Feb 27, 2023

Make Korean tokenizer easier to use #12328

Closed

3 tasks

adrianeboyd mentioned this pull request Mar 20, 2023

Korean blank model crashes if mecab-ko not installed, but not during initialization #12449

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to mecab-ko as default Korean tokenizer #11294

Switch to mecab-ko as default Korean tokenizer #11294

adrianeboyd commented Aug 11, 2022 •

edited

Loading

adrianeboyd commented Aug 11, 2022

adrianeboyd commented Aug 11, 2022 •

edited

Loading

polm commented Aug 15, 2022

polm commented Aug 15, 2022

adrianeboyd commented Aug 15, 2022

polm commented Aug 16, 2022

adrianeboyd commented Aug 18, 2022

polm commented Aug 18, 2022

adrianeboyd commented Aug 18, 2022 •

edited

Loading

polm commented Aug 18, 2022

adrianeboyd commented Aug 18, 2022

polm left a comment

adrianeboyd commented Aug 24, 2022

polm left a comment

polm left a comment

Switch to mecab-ko as default Korean tokenizer #11294

Switch to mecab-ko as default Korean tokenizer #11294

Conversation

adrianeboyd commented Aug 11, 2022 • edited Loading

Description

Types of change

Checklist

adrianeboyd commented Aug 11, 2022

adrianeboyd commented Aug 11, 2022 • edited Loading

polm commented Aug 15, 2022

polm commented Aug 15, 2022

adrianeboyd commented Aug 15, 2022

polm commented Aug 16, 2022

adrianeboyd commented Aug 18, 2022

polm commented Aug 18, 2022

adrianeboyd commented Aug 18, 2022 • edited Loading

polm commented Aug 18, 2022

adrianeboyd commented Aug 18, 2022

polm left a comment

Choose a reason for hiding this comment

adrianeboyd commented Aug 24, 2022

polm left a comment

Choose a reason for hiding this comment

polm left a comment

Choose a reason for hiding this comment

adrianeboyd commented Aug 11, 2022 •

edited

Loading

adrianeboyd commented Aug 11, 2022 •

edited

Loading

adrianeboyd commented Aug 18, 2022 •

edited

Loading