
Japanese Model #3756

Open
polm opened this issue May 17, 2019 · 46 comments


@polm
Contributor

commented May 17, 2019

Feature description

I'd like to add a Japanese model to spaCy. (Let me know if this should be discussed in #3056 instead - I thought it best to just tag it in for now.)

The Ginza project exists, but currently it's a repackaging of spaCy rather than a model to use with normal spaCy, and I think some of the resources it uses may be tricky to integrate from a licensing perspective.

My understanding is that the main parts of a model now are 1. the dependency model, 2. NER, and 3. word vectors. Notes on each of those:

  1. Dependencies. For dependency info we can use UD Japanese GSD. UD BCCWJ is bigger but the corpus has licensing issues. GSD is rather small but probably enough to be usable (8k sentences). I have trained it with spaCy and there were no conversion issues.

  2. NER. I don't know of a good dataset for this; Christopher Manning mentioned the same problem two years ago. I guess I could make one based on Wikipedia - I think some other spaCy models use data produced by Nothman et al's method, which skipped Japanese to avoid dealing with segmentation, so that might be one approach. (A reasonable question here is: what do people use for NER in Japanese? Most tokenizer dictionaries, including Unidic, have entity-like information and make it easy to add your own entries, so that's probably the most common approach.)

  3. Vectors. Using JA Wikipedia is no problem. I haven't worked with the Common Crawl before and I'm not sure I have the hardware for it, but if I could get some help on it that's also an option. (A rough sketch of loading vectors into a spaCy model follows at the end of this comment.)

So, how does that sound? If there are no issues with that I'll look into creating an NER dataset.
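As a point of reference for item 3 above: vectors trained on a dump like JA Wikipedia are usually loaded into a fresh model with spaCy's init-model command, roughly like this (the file and directory names here are made up):

python -m spacy init-model ja ./ja_vectors_model --vectors-loc ja_wiki_vectors.txt.gz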

@hiroshi-matsuda-rit


commented May 31, 2019

Hi! I have the same opinion. I think there is a lot of demand for integrating a Japanese model into spaCy. In fact, I'm preparing to integrate GiNZA into spaCy v2.1 and plan to create some pull requests for this integration, with a parser model trained on the (formally licensed) UD-Japanese BCCWJ and an NER model trained on the Kyoto University Web Document Leads Corpus (KWDLC), by the middle of June.

Unfortunately, the UD-Japanese GSD corpus is effectively obsolete after UD v2.4 and will not be maintained by the UD community anymore; they are focusing on developing the BCCWJ version from v2.3 onward. Our lab (Megagon Labs) has a mission to develop and publish models trained on UD-Japanese BCCWJ together with the licenser (NINJAL).

Because of licensing problems (mostly from the newspaper text) and the differences among word segmentation policies tied to POS systems, it is difficult to train and publish commercially usable NER models. But there is a good solution which I've used for GiNZA: the Kyoto University Web Document Leads Corpus has high-quality gold-standard annotations containing NER spans with the common NER categories. I used the part of KWDLC's NER annotations that matches the token boundaries of the Sudachi morphological analyzer (mode C) to train our models.
http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KWDLC

What do you think of this approach? @polm

@honnibal

Member

commented Jun 1, 2019

If you're able to prepare the dataset files, I'd definitely be glad to add a Japanese model to the training pipeline.

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

@honnibal added the enhancement label Jun 1, 2019
@polm

Contributor Author

commented Jun 1, 2019

@hiroshi-matsuda-rit I think that sounds great! Let me know if there's anything I can help with.

Unfortunately, the UD-Japanese GSD corpus is almost obsoleted after UD v2.4 and will not be maintained by the UD community anymore.

I had not realized this, that's really unfortunate.

The Kyoto University Web Document Leads Corpus has high quality gold-standard annotations containing NER spans with popular NER categories. I used a part of KWDLC's NER annotations which meet to the boundaries of Sudachi morphological analyzer (mode C) to train our models.

I was not aware this had good NER annotations, so that sounds like a great option. My only concern is that the license is somewhat unusual, so I'm not sure how the spaCy team or downstream users would feel about using a model based on the data.

The thing I thought might be a problem with BCCWJ and similar data is that based on #3056 my understanding is that the spaCy maintainers want to have access to the training data, not just compiled models, so that the models can be added to the training pipeline and updated whenever spaCy's code is updated (see here). @honnibal Are you open to adding data to the training pipeline that requires a paid license, like the BCCWJ? I didn't see a mention of any cases like that in #3056 so I wasn't sure...

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

That makes sense, though it seems easier to use the Crawl for word vectors. Is there a good non-manual way to make NER training data from the Common Crawl? I thought of starting with Wikipedia since the approach in Nothman et al (which I think was/is used for some spaCy training data?) seemed relatively quick to implement; I've posted a little progress on that here. I guess with the Common Crawl a similar strategy could be used, looking for links to Wikipedia or something? With a quick search all I found was DepCC, which seems to just use the output of the Stanford NER models for NER data.

@honnibal

Member

commented Jun 1, 2019

Is there a good non-manual way to make NER training data from the Common Crawl?

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

Are you open to adding data to the training pipeline that requires a paid license, like the BCCWJ?

We're open to buying licenses to corpora, sure, so long as once we buy the license we can train and distribute models for free. We've actually paid quite a lot in license fees for the English and German models. If the license would require users to buy the corpus, it's still a maybe, but I'm less enthusiastic.

@polm

Contributor Author

commented Jun 1, 2019

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

If 20 hours is enough then I'd be glad to give it a shot! I figured it'd take much longer than that.

@hiroshi-matsuda-rit


commented Jun 1, 2019

@honnibal

If you're able to prepare the dataset files, I'd definitely be glad to add a Japanese model to the training pipeline.

Sure. I think it would be better if the training datasets were published under an OSS license, even if only a small part of BCCWJ. I'm going to raise this dataset-publishing question with the licenser. Please wait just a few days.

For the NER model, I would hope that better text than Wikipedia could be found. Wikipedia is a fairly specific genre, so models trained on Wikipedia text aren't always broadly useful. Perhaps text from the common crawl could be used?

If we could use Common Crawl texts without worrying about license problems, I think KWDLC would be a good option for the NER training corpus, because KWDLC's annotation data has no limitations on commercial use and its text is retrieved from the web in much the same way as Common Crawl.

@polm

I think that sounds great! Let me know if there's anything I can help with.

I'm very happy to hear that and would appreciate your help. Thanks a lot! Please review my branches later.

I had not realized this, that's really unfortunate.

Kanayama-san, one of the founding members of the UD-Japanese community, updated the UD-Japanese GSD master branch a few weeks ago.
I'd like to ask him about the future of GSD at the UD-Japanese meeting scheduled for June 17 in Kyoto.

My only concern is the license is somewhat unusual so I'm not sure how the spaCy team or downstream users would feel about using a model based on the data.

I think we should use datasets with a well-known license such as MIT or GPL to train the language models that spaCy supports officially.
To expand the possibilities, I'd like to discuss re-publishing the KWDLC corpus under the MIT license with our legal department.

Thanks!

@hiroshi-matsuda-rit


commented Jun 4, 2019

@polm @honnibal Good news for us!
I discussed the Japanese UD open dataset issue with Asahara-san, the leader of the UD-Japanese community, this morning.
They are planning to maintain the GSD open dataset continuously (applying the same refinement methods to both BCCWJ and GSD) and will decide this at the UD-Japanese meeting on June 17.
If this plan is approved by the committee, I think UD-Japanese GSD will be a strong choice as the training dataset for spaCy's standard model.

I'd like to prepare a new spaCy model trained on UD-Japanese GSD (parser) and KWDLC (NER) in a few days. I'll use GiNZA's logic, but it will be much cleaner than the previous version of GiNZA. I'm refactoring the pipeline (I dropped the dummy root token).

@hiroshi-matsuda-rit


commented Jun 6, 2019

@polm @honnibal
I've just uploaded a trial version of ja_ginza_gsd to the GiNZA GitHub repository.

https://github.com/megagonlabs/ginza/releases/tag/v1.1.0-gsd_preview-1

I used the latest UD-Japanese GSD dataset and KWDLC to train that model.

Parsing accuracy (using a separate test dataset):

sentence=550, gold_token=12371, result_token=12471
sentence:LAS=0.1673,UAS=0.2418,POS=0.2909,boundary=0.6182,root=0.9473
tkn_recall:LAS=0.8413,UAS=0.8705,POS=0.9270,boundary=0.9669
tkn_precision:LAS=0.8346,UAS=0.8635,POS=0.9196,boundary=0.9591

NER accuracy (using the same dataset for both training and test, sorry):

labels: <ALL>, sentence=14742, gold_ent=7377, result_ent=7216
 overlap
  recall: 0.9191 (label=0.8501), precision: 0.9426 (label=0.8713)
 include
  recall: 0.9183 (label=0.8499), precision: 0.9417 (label=0.8710)
 one side border
  recall: 0.9176 (label=0.8493), precision: 0.9407 (label=0.8701)
 both borders
  recall: 0.8650 (label=0.8090), precision: 0.8843 (label=0.8271)

label confusion matrix
       |  DATE |  LOC  | MONEY |  ORG  |PERCENT| PERSON|PRODUCT|  TIME | {NONE}
  DATE |   1435|      0|      0|      1|      0|      0|      2|      1|     71
  LOC  |      1|   2378|      0|     65|      0|      9|     13|      0|    139
 MONEY |      2|      0|    109|      0|      1|      0|      0|      0|      1
  ORG  |      1|     83|      1|    826|      0|     22|     91|      0|    120
PERCENT|      6|      0|      0|      0|     82|      0|      0|      0|      8
 PERSON|      1|      4|      0|     26|      0|    852|      9|      0|     40
PRODUCT|      4|     27|      0|    110|      0|     18|    555|      0|    193
  TIME |     17|      0|      0|      0|      0|      0|      0|     50|     25
 {NONE}|    121|     87|      3|     31|      7|     46|     99|     20|      0

I'm still refactoring the source code to resolve the two important issues below:

  1. Use "entry-point" method to make GiNZA as a custom language class
    https://spacy.io/usage/saving-loading#entry-points-components

  2. Find a way to disambiguate the POS of the root token (mostly NOUN or VERB) using the parse result
    I use a kind of dependency label expansion technique both to merge over-segmented tokens and to correct POS errors.
    https://github.com/megagonlabs/ginza/blob/v1.1.0-gsd_preview-1/ja_ginza/parse_tree.py#L440
    I read the note below, which might be related to this issue, but I could not find the APIs to do it.

Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.

@hiroshi-matsuda-rit


commented Jun 7, 2019

  1. Use "entry-point" method to make GiNZA as a custom language class

I've done the refactoring to use entry points.
GiNZA's source code and package structure are now much cleaner than before.
https://github.com/megagonlabs/ginza/releases/tag/v1.1.1-gsd_preview-2
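For anyone following along, the entry-point registration looks roughly like this: a minimal setup.py sketch exposing a custom language class under the "spacy_languages" group (the package and class names here are hypothetical, not GiNZA's actual setup):

from setuptools import setup

setup(
    name="ja_custom_lang",        # hypothetical package name
    version="0.1.0",
    packages=["ja_custom_lang"],
    entry_points={
        # registers the class so spacy.util.get_lang_class("ja_custom")
        # can resolve it once the package is installed
        "spacy_languages": ["ja_custom = ja_custom_lang:JapaneseCustom"],
    },
)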

I'll add one more improvement around the accuracy of the root token's POS tonight.

@hiroshi-matsuda-rit


commented Jun 7, 2019

  1. Find a way to disambiguate the POS of root token, mostly NOUN or VERB, with the parse result

I've implemented rule-based logic for "Sa-hen" noun POS disambiguation and released the final alpha version.

https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3

I'm going to add some documentation and tests, and then create a PR to integrate the new ja language class and models into the spaCy master branch.

Please review the code above and give me some feedback.
Thanks a lot! @polm @honnibal

@KoichiYasuoka


commented Jun 8, 2019

Thank you, @hiroshi-matsuda-rit san, for your work on

https://github.com/megagonlabs/ginza/releases/tag/v1.1.2-gsd_preview-3

but I could not find ja_gsd-1.1.2.tar.gz in the Assets. Umm...

@hiroshi-matsuda-rit


commented Jun 8, 2019

I'm so sorry for that. I added it. Please try downloading again. @KoichiYasuoka

@KoichiYasuoka


commented Jun 8, 2019

Thank you again, @hiroshi-matsuda-rit san, and I've checked some attributes in Token https://spacy.io/api/token as follows:

for t,bi,typ in zip(doc,doc._.bunsetu_bi_label,doc._.bunsetu_position_type):
  print(t.i,t.orth_,t.lemma_,t.pos_,t.tag_,"_",t.head.i,t.dep_,"_",bi,typ)
0 旅 旅 VERB VERB _ 2 acl _ B SEM_HEAD
1 する 為る AUX AUX _ 0 aux _ I SYN_HEAD
2 時 時 NOUN NOUN _ 6 nsubj _ B SEM_HEAD
3 は は ADP 助詞,係助詞,*,* _ 2 case _ I SYN_HEAD
4 旅 旅 NOUN NOUN _ 6 obj _ B SEM_HEAD
5 を を ADP 助詞,格助詞,*,* _ 4 case _ I SYN_HEAD
6 する 為る AUX AUX _ 6 ROOT _ B ROOT

and I found that several t.tag_ values were overwritten by t.pos_. Yes, I can use t._.pos_detail instead of t.tag_ in the Japanese model only, but that becomes rather complicated when I use it alongside other language models.

@hiroshi-matsuda-rit


commented Jun 8, 2019

I previously tried to store the detailed part-of-speech information in tag_, but I could not do so without modifying the data structure of the spaCy token, and unfortunately that modification needs C-level recompilation and reduces compatibility with other languages.

I'd like to read spaCy's source code again and report whether we can store a language-specific tag in token.tag_ or not. @KoichiYasuoka

@hiroshi-matsuda-rit


commented Jun 8, 2019

I've found that recent spaCy versions can change token.pos_.
I'm still testing this, but I'd like to release the trial version below.
https://github.com/megagonlabs/ginza/releases/tag/v1.2.0-gsd_preview-4
Thanks, @KoichiYasuoka
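(For reference, the writable pos_ can be checked with a tiny snippet like the one below; this assumes a spaCy version where Token.pos_ accepts assignment, as reported above, and uses a blank pipeline just for illustration.)

import spacy

nlp = spacy.blank("en")          # any blank language works for this check
doc = nlp("test")
doc[0].pos_ = "NOUN"             # assign the coarse-grained POS directly
print(doc[0].pos_, doc[0].pos)   # prints the name and its integer ID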

@KoichiYasuoka


commented Jun 8, 2019

Thank you again and again, @hiroshi-matsuda-rit san. I'm now trying
https://github.com/megagonlabs/ginza/releases/tag/v1.2.0-gsd_preview-4
and t.tag_ works well, but 「名詞である」 now raises an error.

>>> import spacy
>>> ja=spacy.load("ja_gsd")
>>> s=ja("名詞である")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yasuoka/.local/lib/python3.7/site-packages/spacy/language.py", line 390, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/yasuoka/.local/lib/python3.7/site-packages/ginza/japanese_corrector.py", line 21, in __call__
    set_bunsetu_bi_type(doc)
  File "/home/yasuoka/.local/lib/python3.7/site-packages/ginza/japanese_corrector.py", line 83, in set_bunsetu_bi_type
    t.pos_ in FUNC_POS or
  File "token.pyx", line 864, in spacy.tokens.token.Token.pos_.__get__
KeyError: 405

The same happens with 「猫である」, 「人である」, and so on. 「である」 seems to produce an invalid key for t.pos_.

>>> s=ja("である")
>>> for t in s:
...   dir(t)
...
['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']
>>> print(t.tag_)
接続詞,*,*,*+連体詞,*,*,*
>>> print(t.pos_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "token.pyx", line 864, in spacy.tokens.token.Token.pos_.__get__
KeyError: 405
>>> print(t.pos)
405
@KoichiYasuoka


commented Jun 9, 2019

One more thing, @hiroshi-matsuda-rit san: what do you think about changing t._.bunsetu_bi_label to t._.chunk_iob? If you agree (by analogy with t.ent_iob_ for NER), how about changing t._.bunsetu_position_type to t._.chunk_pos? "bunsetu" is rather long for me, and I often misspell it as "bunsetsu"...
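For context, both attributes are custom extensions, so the rename would just be a different registration call; a minimal sketch using the proposed (hypothetical) names:

from spacy.tokens import Token

# register the suggested attribute names with defaults; a pipeline component
# would then fill them in per token (e.g. t._.chunk_iob = "B")
Token.set_extension("chunk_iob", default="O")
Token.set_extension("chunk_pos", default="")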

@hiroshi-matsuda-rit


commented Jun 9, 2019

@KoichiYasuoka Thank you!
I just opened an issue on GiNZA's repository.
megagonlabs/ginza#26
Let's continue the discussion there!

Actually, I've already improved GiNZA's code to address the problems you reported above, and have just started refactoring the training procedure.
I'll report my progress tomorrow.

@hiroshi-matsuda-rit


commented Jun 11, 2019

@honnibal I'd like to move the discussion from #3818 to here, because this is a Japanese-model-specific issue.

I'm trying to use spaCy's train command with SudachiTokenizer to create a Japanese model, as I mentioned in #3818.
Unfortunately, due to my misunderstanding, it doesn't work yet.
I used the UD-Japanese GSD dataset converted to JSON, and the train command works well with it when I add the -G option.
But if I drop the -G option:

python -m spacy train ja ja_gsd-ud ja_gsd-ud-train.json ja_gsd-ud-dev.json -p tagger,parser -ne 2 -V 1.2.2 -pt dep,tag -v models/ja_gsd-1.2.1/ -VV
...
✔ Saved model to output directory                                                                                                                                                         
ja_gsd-ud/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/matsuda/.pyenv/versions/3.7.2/lib/python3.7/site-packages/spacy/language.py", line 457, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 413, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 519, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 86, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "arc_eager.pyx", line 592, in spacy.syntax.arc_eager.ArcEager.set_costs
ValueError: [E020] Could not find a gold-standard action to supervise the dependency parser. The tree is non-projective (i.e. it has crossing arcs - see spacy/syntax/nonproj.pyx for definitions). The ArcEager transition system only supports projective trees. To learn non-projective representations, transform the data before training and after parsing. Either pass `make_projective=True` to the GoldParse class, or use spacy.syntax.nonproj.preprocess_training_data.

I added probe code to arc_eager.set_costs() like below, and found that some fragmented content may be produced by the Levenshtein alignment procedure.

        # debug probe: dump the aligned gold annotations when no gold action is available
        if n_gold < 1:
            for t in zip(gold.words, gold.tags, gold.heads, gold.labels):
                print(t)

The printed content for the sentence "高橋の地元、盛岡にある「いわてアートサポートセンター」にある風のスタジオでは、地域文化芸術振興プランと題して「語りの芸術祭inいわて盛岡」と呼ばれる朗読会が過去に上演されてお り、高橋の作品も何度か上演されていたことがある。" is:

('高橋', 'NNP', 2, 'nmod')
('の', 'PN', 0, 'case')
('地元', 'NN', 4, 'nmod')
('、', 'SYM', 2, 'punct')
('盛岡', 'NNP', 6, 'iobj')
('に', 'PS', 4, 'case')
('ある', 'VV', None, 'advcl')
('「', 'SYM', None, 'punct')
(None, None, None, None)
('アート', 'NN', 10, 'compound')
('サポートセンター', 'NN', 13, 'iobj')
('」', 'SYM', 10, 'punct')
('に', 'PS', 10, 'case')
('ある', 'VV', 16, 'acl')
('風', 'NN', 16, 'nmod')
('の', 'PN', 14, 'case')
('スタジオ', 'NN', 44, 'obl')
('で', 'PS', 16, 'case')
('は', 'PK', 16, 'case')
('、', 'SYM', 16, 'punct')
('地域', 'NN', 24, 'compound')
('文化', 'NN', 24, 'compound')
('芸術', 'NN', 24, 'compound')
('振興', 'NN', 24, 'compound')
('プラン', 'NN', None, 'obl')
('と', 'PS', 24, 'case')
(None, None, None, None)
('て', 'PC', None, 'mark')
('「', 'SYM', 29, 'punct')
('語り', 'NN', 34, 'nmod')
('の', 'PN', 29, 'case')
('芸術祭', 'NN', 32, 'compound')
('in', 'NNP', 34, 'nmod')
(None, None, None, None)
('盛岡', 'NNP', 37, 'obl')
('」', 'SYM', 34, 'punct')
('と', 'PQ', 34, 'case')
('呼ぶ', 'VV', 40, 'acl')
('れる', 'AV', 37, 'aux')
('朗読', 'NN', 40, 'compound')
('会', 'XS', 44, 'nsubj')
('が', 'PS', 40, 'case')
('過去', 'NN', 44, 'iobj')
('に', 'PS', 42, 'case')
('上演', 'VV', 65, 'advcl')
(None, None, None, None)
(None, None, None, None)
('て', 'PC', 44, 'mark')
('おる', 'AV', 44, 'aux')
('、', 'SYM', 44, 'punct')
('高橋', 'NNP', 52, 'nmod')
('の', 'PN', 50, 'case')
('作品', 'NN', 57, 'obl')
('も', 'PK', 52, 'case')
(None, None, None, None)
(None, None, None, None)
('か', 'PF', 59, 'mark')
('上演', 'VV', 65, 'csubj')
('何度', 'NN', 59, 'subtok')
('何度', 'NN', 57, 'obl')
('て', 'PC', 57, 'mark')
(None, None, None, None)
('た', 'AV', 57, 'aux')
('こと', 'PNB', 57, 'mark')
('が', 'PS', 57, 'case')
('ある', 'VV', 65, 'ROOT')
('。', 'SYM', 65, 'punct')

The missing words are "いわて", "題し", "さ", "れ", and "何度" is aligned twice, after the following word.

When I used only the earlier part of the training dataset (before that sentence), the error did not occur, but the UAS value decreased over 5 epochs.
It seems there is some confusion in the gold-retokenization procedure.
I'm investigating the cause of this behavior.

@honnibal

Member

commented Jun 11, 2019

@hiroshi-matsuda-rit Thanks, it's very possible my alignment code could be wrong, as I struggled a little to develop it. You might find it easier to develop a test case with the problem if you're calling it directly. You can find the tests here: https://github.com/explosion/spaCy/blob/master/spacy/tests/test_align.py

It may be that there's a bug which only gets triggered for non-Latin text, as maybe I'm doing something wrong with respect to methods like .lower() or .startswith(). Are you able to develop a test-case using Latin characters? If so, it'd be very preferable, as I would find it a lot easier to work with.

@hiroshi-matsuda-rit


commented Jun 11, 2019

@honnibal Thanks! I'd like to try making some test cases, including both Latin and non-Latin ones.

@honnibal

Member

commented Jun 11, 2019

The non-manual approaches are pretty limited, really. If you're able to put a little time into it, we could give you a copy of Prodigy? You should be able to get a decent training corpus with about 20 hours of annotation.

If 20 hours is enough then I'd be glad to give it a shot! I figured it'd take much longer than that.

@polm : Of course your mileage may vary, but the annotation tool is quite quick, especially if you do the common entities one at a time, and possibly use things like pattern rules. Even a small corpus can start to get useful accuracy, and once the initial model is produced, it can be used to bootstrap.

If you want to try it, send us an email? contact@explosion.ai
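As an illustration of the pattern rules mentioned above: seed patterns are typically a JSONL file of label/pattern pairs in spaCy's match-pattern format. A made-up example, written from Python (the labels and entries are only placeholders):

import json

patterns = [
    {"label": "LOC", "pattern": "盛岡"},                                # exact string match
    {"label": "ORG", "pattern": [{"ORTH": "京都"}, {"ORTH": "大学"}]},  # token pattern
    {"label": "PERSON", "pattern": "高橋"},
]

with open("ner_patterns.jsonl", "w", encoding="utf-8") as f:
    for p in patterns:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")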

@hiroshi-matsuda-rit


commented Jun 11, 2019

@honnibal: I've tested your alignment function for several hours and found that the current implementation works correctly, even for non-Latin characters. The output I showed above is in fact a valid alignment, so there is no need to add test cases to test_align.py after all.
Also, I fixed some bugs in my JSON generation code, but I'm still facing another type of error. I'll report the details tomorrow morning. Thanks again for your suggestions!

@hiroshi-matsuda-rit


commented Jun 11, 2019

@honnibal Finally, I've solved all the problems, and the train command now works well with SudachiTokenizer. The accuracy improves as the epochs proceed.

I found three issues which prevented the subtok unification from working (a usage sketch of merge_subtokens in a pipeline follows this list):

  1. In spacy.pipeline.functions.merge_subtokens(), we have to merge overlapping spans, as below:
    spans = [(start, end + 1) for _, start, end in matches]
    widest = []
    for start, end in spans:
        for i, (s, e) in enumerate(widest):
            if start <= s and e <= end:
                del widest[i]
            elif s <= start and end <= e:
                break
        else:
            widest.append((start, end))
    spans = [doc[start:end] for start, end in widest]
  2. spacy.pipeline.functions.merge_subtokens() receives additional arguments from the pipe:
def merge_subtokens(doc, label="subtok", batch_size=None, verbose=None):
  3. In GoldParse(), we have to avoid creating a dependency loop while adding subtok arcs, like below (but I'm not sure about the appropriate condition for this):
                    if not is_last and i != self.gold_to_cand[heads[i2j_multi[i+1]]]:
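For completeness, here is a sketch of how merge_subtokens is normally attached to a pipeline, so that parser-predicted "subtok" arcs get merged back into single tokens (the model name is the preview model from this thread, so treat it as an assumption):

import spacy
from spacy.pipeline.functions import merge_subtokens

nlp = spacy.load("ja_gsd")  # assumed: the preview model discussed above
# run the merge after the parser so "subtok"-labelled spans become one token
nlp.add_pipe(merge_subtokens, name="merge_subtokens", after="parser")
doc = nlp("高橋の地元、盛岡にある。")
print([t.text for t in doc])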
@hiroshi-matsuda-rit


commented Jun 18, 2019

@honnibal @polm I'd like to report the results of the UD-Japanese meeting held in Kyoto yesterday.
I asked the committee about the two issues below and got useful feedback on publishing a commercially usable dataset.

Q1. Is the UD_Japanese-GSD dataset suitable as training data for the official Japanese model of spaCy?
A1. Probably not.
The license of the GSD dataset is CC-BY-NC-SA. Under Japanese law it is a 'gray' area if someone uses probabilistic models trained on the 'NC' dataset for commercial purposes (though in other jurisdictions it might be allowed).
https://github.com/UniversalDependencies/UD_Japanese-GSD/blob/master/LICENSE.txt

We think it's safer to use UD_Japanese-PUD, which is under the CC-BY-SA license.
https://github.com/UniversalDependencies/UD_Japanese-PUD/blob/master/LICENSE.txt

My opinions:

  • Use the UD_Japanese-PUD dataset instead of GSD (PUD is small but enough to learn basic dependency structure) for the early releases of the spaCy Japanese language model
  • If it is not necessary to publish the JSON data, we can publish models trained on UD-Japanese BCCWJ as spaCy's Japanese model
  • We'd like to create a new dataset containing over 10,000 sentences with gold dependency and named-entity annotations, under an OSS license that allows commercial use, by the end of this year

I've published a PUD-based Japanese language model as a preview version. It was trained on only 900 sentences, but the accuracy is not bad.
https://github.com/megagonlabs/ginza/releases/tag/v1.3.1-pud.preview-8
tkn_recall:LAS=0.8736,UAS=0.8930,UPOS=0.9393,boundary=0.9647
precision:LAS=0.8672,UAS=0.8864,UPOS=0.9324,boundary=0.9576

Shall I publish these PUD-based JSON files and send a PR with my custom tokenizer and component?

Q2. Are there any commercially usable Japanese NE datasets?
A2. Unfortunately, KWDLC is the only option.

My opinion:
According to the KWDLC license, we are obliged to remove sentences when the copyright holder requests it. That may cause problems in the future.

I'm going to add NE annotations to UD_Japanese-PUD over the next two weeks.
Can I use Prodigy as an annotation tool for that?

Thanks,

@polm

Contributor Author

commented Jun 18, 2019

Thanks for the update!

We'd like to create a new data-set which contains over 10,000 sentences with dependency and named-entity gold annotations under OSS license which allows commercial use, by the end of this year

This is great news!

The license of GSD data-set is CC-BY-NC-SA. It is 'gray' in Japanese law if someone uses the probabilistic models trained on the 'NC' data-set for commercial purposes (but in other jurisdictions, it might be allowed).

This is surprising to me - I was under the impression that trained models were fine to distribute after the recent Japanese copyright law changes.

Anyway, I think getting a full pipeline working with data that exists now sounds good, and it's wonderful to hear data with a clear license will be available later this year. If there's anything I can help with please feel free to @ me.

@hiroshi-matsuda-rit


commented Jun 18, 2019

@polm Sure. There were big changes to copyright law in Japan at the beginning of this year.

In my humble opinion, the new copyright law allows us to publish datasets extracted from the public web (except ready-made datasets with a no-republish clause), and everyone in Japan can train and use machine learning models on those open datasets for internal use, even for commercial purposes.

But it is still a gray zone to publish models trained on datasets with no-republish or non-commercial-use clauses under a model license that allows commercial use. This is why using GSD carries risks and why I recommended using PUD for the early versions of the spaCy Japanese model.

Thanks a lot!

@hiroshi-matsuda-rit


commented Jul 2, 2019

I've just created a PR for a Japanese model using UD_Japanese-PUD with my NE annotations.
#3899
It seems I'll have to make some adjustments to meet spaCy's guidelines.
Of course, I'll work on them over the next few days! @honnibal

Best,
Hiroshi

@hiroshi-matsuda-rit


commented Jul 3, 2019

from #3899

The model file is trained on UD_Japanese-PUD v2.4-NE, which is a fork of the official PUD with additional OntoNotes 5 based gold NE labels in BI-CATEGORY format.
https://github.com/megagonlabs/UD_Japanese-PUD/blob/v2.4-NE-spacy/ja_pud-ud-test.train.json
https://github.com/megagonlabs/UD_Japanese-PUD/blob/v2.4-NE-spacy/ja_pud-ud-test.test.json

These JSON files are created by my conllu-to-json converter, which I implemented to handle my custom NE BI labels and also to apply a Japanese-specific zenkaku/hankaku character data-augmentation method.
https://github.com/megagonlabs/ginza/blob/develop/ginza_util/conllu_to_json.py
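For readers who haven't used spaCy v2's JSON training format, its general shape is roughly the following; this is a hand-written sketch, not an excerpt from the linked files, and the token values are invented ("head" is the offset of the head relative to the token, "ner" holds BILUO labels):

TRAIN_DOC = {
    "id": 0,
    "paragraphs": [
        {
            "raw": "盛岡にある。",
            "sentences": [
                {
                    "tokens": [
                        {"id": 0, "orth": "盛岡", "tag": "NNP", "head": 2, "dep": "obl", "ner": "U-LOC"},
                        {"id": 1, "orth": "に", "tag": "PS", "head": -1, "dep": "case", "ner": "O"},
                        {"id": 2, "orth": "ある", "tag": "VV", "head": 0, "dep": "ROOT", "ner": "O"},
                        {"id": 3, "orth": "。", "tag": "SYM", "head": -1, "dep": "punct", "ner": "O"},
                    ]
                }
            ]
        }
    ]
}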

I used these JSON files as the arguments of the spacy train command. You can see all the arguments of spacy train in the training shell script below.
https://github.com/megagonlabs/ginza/blob/develop/shell/train_pipeline.sh

@hiroshi-matsuda-rit


commented Jul 10, 2019

In #3899, I suppose it looks a little strange to add a POS and dependency label rewriter like JapaneseCorrector at the end of the pipeline.

I just developed another version of the SudachiPy integration.
#3945

This simpler version has a standard pipeline and shows good accuracy with the model trained on UD_Japanese-PUD+NE.
https://github.com/megagonlabs/UD_Japanese-PUD

I used ja_pud-ud-test.simple.(train|test).json to train a spaCy model as below:

python -m spacy train ja temp/ja_core_web_md-2.1.0.`date +%Y%m%d%H%M` ja_pud-ud-test.simple.train.json ja_pud-ud-test.simple.test.json -v models/ja_pud -p tagger,parser,ner -pt dep,tag -et dep,tag -ne 5 -V 2.1.0 -VV -T

https://github.com/megagonlabs/UD_Japanese-PUD/releases/download/v2.4-NE-simple/ja_core_web_md-2.1.0.tar.gz

I changed one line to fix an error raised when executing the train command with the -T option on the above corpus. @honnibal
https://github.com/explosion/spaCy/pull/3945/files#diff-7883b2dbfb4f962d7965c71f6dd321d5L39

@polm

Contributor Author

commented Jul 14, 2019

Hey, thanks so much for your work on this and sorry for my delayed reply! I was travelling for spaCy IRL but am back home now, so I'll take a look at this over the next few days.

@polm

Contributor Author

commented Jul 29, 2019

Thanks again for your work on this and sorry again for taking so long to respond.

I think the work in the PR for this model is good but there are some issues.

The biggest issue is that SudachiPy is slow. I was expecting it to be slower than MeCab, but in rough tests it's 40x slower or more. While it's not noticeable for individual sentences, for longer documents I don't feel the speed is reasonable.

Are there plans to make changes to SudachiPy to make it faster? If not I could look into rewriting the relevant parts of it in Cython or otherwise working on that.

Other, less important issues:

The README for SudachiPy indicates it is still in early development. Is this true? It seems stable but I'm unclear on whether it's been used in production anywhere, since the Java version seems to be the main one, and looking at the issues it does look like basic features such as user dictionaries are still undergoing adjustment.

From the README at the time of writing:

Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.

SUDACHI_DEFAULT_MODE seems redundant and should be removed.

What are the read_sudachi functions in sudachi_tokenizer.py there for? Nothing else seems to call them.

use_sentence_separator is saved as a boolean on the SudachiTokenizer object, and it's checked during tokenization, but there's no way to change it without directly setting the attribute. I'm also not sure when it would be desirable to make it false (documents with unusual punctuation?).

I see there's code to handle 空白 tokens specially, but a sentence with multiple half-width spaces still treats some spaces as "words" and gives them a position in the dependency tree, which feels wrong. Is that a bug? Example sentence:

panda   fishというバンドを見に行った。

It would be preferable to have the dictionary distributed as a normal PyPI package to make versioning clearer. I think the dictionary is over the default 60MB PyPI limit, but you should be able to get an exception from the policy by asking for it. Are there plans to do that, or is there another reason the dictionary is hosted off PyPI for now?

I believe that SudachiPy and the dictionary should only be specified as optional dependencies in setup.py, not in requirements.txt, though I am not sure about that. I understand that only having it in setup.py makes using automated testing more difficult.


Sorry for listing a lot of smaller issues. I really appreciate the work on this PR, and I think there are parts of it that can be used right away - for example, if you want to add the handling for サ変名詞 and verbs to the existing code that would be great.

Thanks again, and let me know if you have any feedback about the above, particularly about plans for improving the speed of SudachiPy. It'd be great to have a fast tokenizer that's easier to install and configure than MeCab.

@hiroshi-matsuda-rit


commented Jul 30, 2019

Thanks for your help in making the new Japanese models better! > @polm

I'd like to resolve the issues you listed above, and I hope to ask the SudachiPy development team to explain the current development status and future plans.

use_sentence_separator is saved as a boolean on the SudachiTokenizer object

This boolean flag is used in GiNZA's cli.py to read already-sentencized text from stdin, but it is currently not used in spacy.lang.ja.
I'd like to add accessor methods for that field (or remove it and set the default mode to C).

I believe that SudachiPy and the dictionary should only be specified as optional dependencies in setup.py, not in requirements.txt

Sure. I added SudachiPy's dependencies to requirements.txt only to make the CI tests succeed. These dependencies are used to set up the testing environments for Travis and Azure, I believe.

  • Are there plans to make changes to SudachiPy to make it faster?
  • The README for SudachiPy indicates it is still in early development. Is this true?
  • I see there's code to handle 空白 tokens specially, but a sentence with multiple half-width spaces still treats some spaces as "words" and gives them a position in the dependency tree, which feels wrong.
  • I think the dictionary is over the default 60MB PyPI limit, but you should be able to get an exception from the policy by asking for it.

Could you please share your opinions on these issues? > @kazuma-t and @izziiyt

@izziiyt


commented Jul 31, 2019

@polm @hiroshi-matsuda-rit

Thank you for the good review and feedback to help SudachiPy grow.
As far as possible, we will make SudachiPy suitable as a Japanese tokenizer for spaCy.

Are there plans to make changes to SudachiPy to make it faster?

Yes. I'm currently experimenting with Cython, but it hasn't been a priority. If you say it's the main concern for integrating SudachiPy into spaCy, I'll focus on it.

The README for SudachiPy indicates it is still in early development. Is this true?

No. It's an overstatement to say that SudachiPy is in early development. Some features are still incompatible with Sudachi-Java, but we think it's already stable. I discussed this with @kazuma-t and we agree.

I see there's code to handle 空白 tokens specially, but a sentence with multiple half-width spaces still treats some spaces as "words" and gives them a position in the dependency tree, which feels wrong.

I haven't noticed this issue. I'll check it. Thanks.

I think the dictionary is over the default 60MB PyPI limit, but you should be able to get an exception from the policy by asking for it.

We hadn't thought of asking the PyPI organization for an exception. Do you know an example of a package over the 60 MB limit? If so, I'd like to look at the issues or discussions around it. We'd like to do that.

@hiroshi-matsuda


commented Jul 31, 2019

@izziiyt Thanks a lot. This issue should help you get the PyPI size limit bumped:
pypa/packaging-problems#86

@polm

Contributor Author

commented Jul 31, 2019

Thanks for your speedy reply! @hiroshi-matsuda already beat me to the point about PyPI; if you file an issue on that repo I think you should have no trouble getting an exception.

Yes. Now I'm experimenting using Cython but not prioritized issue. If you say it's the main concern for integrating SudachiPy to spacy, I'll focus.

I do think the speed is the main issue holding it back at this point.

No. It's too much to say that SudachiPy is under early development. Some features are still incompatible with Sudachi-Java but we think it's already stable. I discussed with @kazuma-t and have consensus.

OK, I was hoping that was just a leftover in the README, seems that was the case. Sounds good to me 👍

Let me know if there's anything at all I can help with.

@hiroshi-matsuda-rit


commented Aug 8, 2019

@izziiyt Thanks for your hard work.
I've integrated SudachiPy 0.3.11 into #3945 and it's working very well.
Please let me know if you find anything that needs work in spacy.lang.ja in #3945.

@izziiyt


commented Aug 11, 2019

Thank you for your cooperation. I have plenty of time next week, so the effort itself isn't a problem. But I'm wondering how fast it needs to be; right now I'm working without a concrete target.

F.Y.I

Using PyPy, single-core SudachiPy v0.3.11 runs as fast as Sudachi-Java on 4 cores. Using CPython, it is 4 times slower than PyPy. If I Cythonize it, we can't benefit from the JIT.

@polm

Contributor Author

commented Aug 15, 2019

Ideally a Cython version of SudachiPy would be as fast as MeCab. Since (I think?) Sudachi's architecture is pretty similar to MeCab, I would expect to get similar performance from a Cython version.

@izziiyt


commented Aug 21, 2019

@polm Sorry for the late reply. Hmm, it sounds like hard work. I'll first discuss with the Tokushima NLP team how we should allocate resources to this task. Thank you.

@icoxfog417


commented Aug 29, 2019

UD_Japanese-GSD's license problem is now resolved: the 'NC' clause is going to be removed! All we have to do now is the NER annotation.

@polm

Contributor Author

commented Aug 30, 2019

@izziiyt Any updates on the Cython development schedule?

If it's unclear when you'd have time to work on porting SudachiPy to Cython, or if it's just not a priority, I'd be delighted to work on it. I have some other tasks I'm working on now but would be able to focus on this starting in a few weeks. Let me know if I can help!

@izziiyt


commented Aug 30, 2019

@polm Sorry for the late reply.

I'm sorry, I was enjoying my summer vacation and neglecting SudachiPy development. Right now I'm finishing some bug fixes; after that, I'll do the tasks below (which @kazuma-t asked me to do):

  • Cythonize the double-array
  • Cythonize the connection-matrix access
  • Cythonize the other core components of the tokenize function

I'll finish these tasks in 2 weeks (maybe 1 week if the debugging is light).

I've already made SudachiPy about 2 times faster (v0.3.11) by fixing the code.

I don't plan to Cythonize other parts like the plugins, so if you can help us, I'd be glad if you update the plugins! (e.g. the join-numeric and join-katakana plugins take time during parsing)

But I think Cythonization alone won't make SudachiPy as fast as MeCab or anywhere close. SudachiPy has normalization functions that MeCab lacks, and its dictionary is basically much bigger. However, I'd like to note that using SudachiPy with a small dictionary and without plugins means losing the good parts of SudachiPy.

If we truly want to make SudachiPy as fast as MeCab, I think the best way is to implement Sudachi in C(++) and call it from Python.

@izziiyt


commented Sep 10, 2019

@hiroshi-matsuda-rit Can you check SudachiPy v0.4.0 with GiNZA? I want to verify the compatibility. Part of the code is Cythonized (the improvement is small).

@hiroshi-matsuda-rit


commented Sep 17, 2019

@izziiyt Sorry for the late reply. I'm going to update #3945 with the latest SudachiPy version. Please wait a few days.

@hiroshi-matsuda-rit


commented Sep 21, 2019

I tried to merge the recent updates from spaCy's master branch into #3945, but it degraded readability, so I decided to re-create the branch from the latest master.
I'm testing SudachiPy v0.4.0 and a new model trained with the latest train command of spaCy.
Please wait a moment.
megagonlabs@c45af2e

By the way, we (Megagon Labs) are planning to annotate OntoNotes 5 NE labels on UD_Japanese-GSD with the help of NINJAL. We'd like to release this NE annotation in a few months.
