Japanese model: add user_dict entries and small refactor #5573

Merged · 8 commits into explosion:master from hiroshi-matsuda-rit:feature/japanese_reading_forms_and_refactors · Jun 22, 2020

Conversation

hiroshi-matsuda-rit
Contributor

Add user_dict entries (reading_forms, inflections, sub_tokens) and improve code readability around the token alignment procedure.

Description

  • I added and reorganized the user_dict fields that are commonly used in Japanese NLP applications (see the usage sketch after this list).
  • I deleted separate_sentences() because sentence boundaries should come from the parser or the rule-based sentencizer instead.
  • I added test cases covering the new user_dict entries, ran py.test spacy/tests/lang/ja/test_tokenizer.py, and found no errors.
  • I refactored the code and added comments around _get_dtokens() and get_dtokens_and_spaces() because they were hard to follow.
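
A minimal usage sketch (not from the PR diff itself): it assumes the new fields are exposed on doc.user_data under the keys "reading_forms", "inflections", and "sub_tokens", and that SudachiPy is installed so the Japanese tokenizer can load.

```python
import spacy

# Hedged sketch for spaCy v2.x: the user_data key names below are assumed
# from this PR's description; SudachiPy must be installed for "ja".
nlp = spacy.blank("ja")

# With separate_sentences() removed, sentence boundaries come from the
# parser or, as here, the rule-based sentencizer.
nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp("日本語の文章です。形態素解析の結果を確認します。")
for key in ("reading_forms", "inflections", "sub_tokens"):
    # Each entry is expected to hold one value per token.
    print(key, doc.user_data.get(key))

print([sent.text for sent in doc.sents])
```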

Types of change

  • enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd added the labels enhancement (Feature requests and improvements) and lang / ja (Japanese language data and models) on Jun 11, 2020
@adrianeboyd
Contributor

Thanks for this contribution! In case there are any minor differences between these updates and the current version (which we've already used to train models), I would like to hold off on merging these updates for 2.3.0, but they can be included in the next releases (which would be 2.3.1 and 3.0.0).

@hiroshi-matsuda-rit
Contributor Author

@adrianeboyd Fortunately, there is no difference in the Doc contents except for user_data, so we do not need to retrain the model. I think we can merge this into v2.3.0 so that users get these features from the very first official Japanese model. Could you please reconsider merging this for v2.3.0?

@hiroshi-matsuda-rit
Contributor Author

hiroshi-matsuda-rit commented Jun 11, 2020

In addition, @polm-san is still improving SudachiPy.
I'd like to ask him whether we would need to retrain the model if we upgrade SudachiPy to v0.4.6.

@polm
Contributor

polm commented Jun 11, 2020

Thanks for the ping.

My work on Sudachi shouldn't change the output in any way, so it shouldn't require retraining the model. However, the ellipsis-related issue that came up while we were working on the Japanese model had a fix released around the same time. I think the fix won't change the actual tokens spaCy sees (since it discards empty tokens), but I'm not sure whether tokenization changes are possible. If there were changes, I would assume they would only affect rare or unusual tokens.

@sorami might know more about the possible issues with the ellipsis change. What version of SudachiPy was used to train the model?

@sorami

sorami commented Jun 11, 2020

The "ellipsis change" was included in SudachiPy v.0.4.5 (released Tuesday last week).

There will be no change in the tokenization analysis result; only the order of empty tokens will change.
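
To illustrate what this means for spaCy (a hypothetical sketch, not spaCy's actual code): empty surface forms are dropped before tokens reach the Doc, so a reordering of empty tokens leaves the result unchanged.

```python
# Made-up analyzer output containing empty tokens; spaCy discards these,
# so their order (the only thing the v0.4.5 ellipsis fix changes) has no
# effect on the tokens spaCy sees.
raw_surfaces = ["", "…", "", "です"]
kept = [s for s in raw_surfaces if s]
print(kept)  # ['…', 'です']
```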

@adrianeboyd
Contributor

The models were trained with SudachiPy v0.4.5, which is the minimum required version listed in the spacy[ja] requirements.
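
(A quick way to confirm the installed SudachiPy version; a minimal sketch assuming the PyPI distribution name "SudachiPy":)

```python
import pkg_resources

# The models discussed here were trained with SudachiPy 0.4.5, the minimum
# version pinned in the spacy[ja] extras.
print(pkg_resources.get_distribution("SudachiPy").version)
```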

This refactor is too large for me to feel comfortable including it in spaCy v2.3.0. The models have already been trained and released, and we're in the final stages of preparing the spaCy library release, so we don't want to merge any more significant changes at this point.

I would expect the first patch release to be relatively soon as bugs are fixed, so I don't think you'll have to wait too long for v2.3.1.

@hiroshi-matsuda-rit
Contributor Author

@adrianeboyd Okay, I understand where you're coming from and agree to wait for the v2.3.1 release. Thanks for taking the time to release the Japanese model!

@adrianeboyd merged commit 150a39c into explosion:master on Jun 22, 2020
@hiroshi-matsuda-rit deleted the feature/japanese_reading_forms_and_refactors branch on June 22, 2020 at 12:33
@hiroshi-matsuda-rit
Contributor Author

hiroshi-matsuda-rit commented Jun 22, 2020

@adrianeboyd Thanks a lot!

honnibal pushed a commit that referenced this pull request Jun 29, 2020
* Fix typos and auto-format [ci skip]

* Add pkuseg warnings and auto-format [ci skip]

* Update Binder URL [ci skip]

* Update Binder version [ci skip]

* Update alignment example for new gold.align

* Update POS in tagging example

* Fix numpy.zeros() dtype for Doc.from_array

* Change example title to Dr.

Change the example title to Dr. so the current model excludes the title
in the initial example.

* Fix spacy convert argument

* Warning for sudachipy 0.4.5 (#5611)

* Create myavrum.md (#5612)

* Update lex_attrs.py (#5608)

* Create mahnerak.md (#5615)

* Some changes for Armenian (#5616)

* Fixing numericals

* We need an Armenian question sign to make the sentence a question

* Add Nepali Language  (#5622)

* added support for nepali lang

* added examples and test files

* added spacy contributor agreement

* Japanese model: add user_dict entries and small refactor (#5573)

* user_dict fields: adding inflections, reading_forms, sub_tokens
deleting: unidic_tags
improve code readability around the token alignment procedure

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0

* Add warnings example in v2.3 migration guide (#5627)

* contribute (#5632)

* Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)

Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.

* Extend what's new in v2.3 with vocab / is_oov (#5635)

* Skip vocab in component config overrides (#5624)

* Fix backslashes in warnings config diff (#5640)

Fix backslashes in warnings config diff in v2.3 migration section.

* Disregard special tag _SP in check for new tag map (#5641)

* Skip special tag _SP in check for new tag map

In `Tagger.begin_training()` check for new tags aside from `_SP` in the
new tag map initialized from the provided gold tuples when determining
whether to reinitialize the morphology with the new tag map.

* Simplify _SP check

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Rameshh <30867740+rameshhpathak@users.noreply.github.com>
Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>