Classical Chinese Model needed #100

Open
KoichiYasuoka opened this issue May 7, 2019 · 31 comments
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed)

Comments

@KoichiYasuoka
Contributor

I've almost finished building up the UD_Classical_Chinese-Kyoto Treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But my model's sentence_accuracy is below 35, and I can't sentencize "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" (check the gold standard here). How do I tune up sentence segmentation for Classical Chinese?

@tiberiu44
Contributor

I looked over the corpus, and I see there are no delimiters (punctuation marks) for sentences. Is this OK?

@KoichiYasuoka
Contributor Author

KoichiYasuoka commented May 7, 2019

Yes, that's OK. Classical Chinese does not have any punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentence segmentation is a hard task without dependency parsing...

@tiberiu44
Contributor

I think we could go for joint POS-tagging and tokenization. Unfortunately, the algorithm we use for dependency parsing requires us to build an N×N matrix over all N words, which is likely to cause an out-of-memory error if we use all tokens. Do you know of any other approach that does not require dependency parsing for sentence segmentation?
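
For a rough sense of scale, here is a back-of-the-envelope sketch (just an estimate, not a measurement of our actual code) of how a pairwise score matrix grows when an undelimited text is treated as a single sentence:

# hypothetical estimate: memory for an N x N float32 arc-score matrix
def arc_matrix_mb(n_tokens, bytes_per_cell=4):
    return n_tokens * n_tokens * bytes_per_cell / 1e6

print(arc_matrix_mb(100))     # ~0.04 MB for an ordinary sentence
print(arc_matrix_mb(50_000))  # ~10,000 MB (10 GB) for one undelimited document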

@KoichiYasuoka
Contributor Author

Umm... I only know the Straka & Straková (2017) approach using dynamic programming (see section 4.3), but it requires tentative parse trees...

@tiberiu44
Contributor

I see. I can imagine joint sentence segmentation and parsing working with an arc-based transition system: whenever the stack is emptied, a sentence boundary should be generated.
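
Just to sketch the idea (toy code, not what we ship; predict_action stands in for a trained classifier):

def parse_and_segment(tokens, predict_action):
    # predict_action(stack, buffer) -> "SHIFT", "LEFT-ARC", "RIGHT-ARC" or "BOUNDARY"
    stack, buffer = [], list(tokens)
    arcs, boundaries = [], []
    while buffer or stack:
        action = predict_action(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) >= 2:
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))      # head is the new stack top
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        elif action == "BOUNDARY" and len(stack) == 1:
            boundaries.append(stack.pop())           # emptied stack => sentence boundary
        elif buffer:                                 # fall back on an invalid prediction
            stack.append(buffer.pop(0))
        elif len(stack) >= 2:
            arcs.append((stack[-2], stack.pop()))
        else:
            boundaries.append(stack.pop())
    return arcs, boundaries

The boundary decision then becomes just one more transition for the classifier to learn, rather than a separate pre-processing step.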

We've finished work for the Parser and Tagger for version 2.0, but we still haven't found a good solution for tokenization/sentence splitting.

I think I will give this new approach a try, but it will take some time to implement. I'll let you know when it's done and maybe you can test it on your corpus.

Thanks for the feedback,
Tibi

tiberiu44 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Jun 5, 2019
@tiberiu44
Contributor

@KoichiYasuoka - I haven't had any success with the tokenizer/sentence splitter so far. We are working on rolling out version 2.0, which uses a single model conditionally trained with language embeddings. We have great accuracy figures for the parser and tagger. However, we are still experiencing difficulties with the tokenizer (for all languages).

We tried jointly tagging/parsing and tokenizing, but we simply got the same results as if we did these two tasks independently. Any suggestions on how to proceed?

@KoichiYasuoka
Contributor Author

Umm... For Japanese tokenisation (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in Kudo et al. (2004). For Classical Chinese, we also use a CRF in our UD-Kanbun.
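
As a rough sketch of the character-level CRF idea (toy code assuming sklearn-crfsuite, not the actual UD-Kanbun implementation), each character gets a label that jointly encodes the word boundary and the POS:

# sketch only: joint tokenization + POS tagging as character tagging with a linear-chain CRF
# (assumes `pip install sklearn-crfsuite`; the real feature set is much richer)
import sklearn_crfsuite

def char_features(sent, i):
    return {"char": sent[i],
            "prev": sent[i-1] if i > 0 else "<s>",
            "next": sent[i+1] if i < len(sent)-1 else "</s>"}

train_sents = ["不亦君子乎", "人不知而不慍"]                              # toy data
train_tags  = [["B-ADV","B-ADV","B-NOUN","I-NOUN","B-PART"],            # B- starts a word,
               ["B-NOUN","B-ADV","B-VERB","B-CCONJ","B-ADV","B-VERB"]]  # I- continues it

X = [[char_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict([[char_features("不亦君子乎", i) for i in range(5)]]))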

For sentence segmentation in Classical Chinese, recent progress has been made by Hu et al. (2019) at https://seg.shenshen.wiki/. Hu et al. use a BERT model trained on an enormous amount of Classical Chinese text, about 3.3×10⁹ characters...

@tiberiu44
Contributor

tiberiu44 commented May 1, 2020

@KoichiYasuoka - I hope you are doing well in this time of crisis.

It's been a long time since our last progress update on this issue. We started training the 2.0 models for NLP-Cube and they should be out soon. I saw the Classical Chinese corpus in the UD Treebanks (v2.5). The model will be included in this release. Congratulations and thank you for your work.

I thought you might be interested to know that we are also setting up a "model zoo" for NLP-Cube, so contributors can publish their pre-trained models. We will try to make research attribution easy by printing a banner with copyright and/or citation options for these models.

@KoichiYasuoka
Contributor Author

KoichiYasuoka commented May 1, 2020

@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for NLP-Cube. We've just finished adding 19 more volumes from "禮記" into https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of the UD Treebanks (scheduled for May 15, 2020). Enjoy!

@tiberiu44
Contributor

Hi @KoichiYasuoka ,

We've finished releasing the current version of NLPCube, and we included the Classical Chinese model from UD 2.7. Sentence segmentation seems to be problematic for this treebank. You can check the 3.0 branch of the repo for more info: https://github.com/adobe/NLP-Cube/tree/3.0

If you have any suggestions regarding sentence segmentation, please let me know. Right now we are using xlm-roberta-base for language modeling, but maybe there is some other LM that can provide better results.

Best,
Tiberiu

@KoichiYasuoka
Contributor Author

Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well, pytorch-lightning==1.1.7 is too old for the recent torchtext==0.10.0, so I use pytorch-lightning==1.2.10 instead:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不入虎穴不得虎子	叔津	PROPN	n,名詞,人,複合的人名	NameType=Prs	0	root	_	_

Umm... tokenization of classical Chinese doesn't work here...

@tiberiu44
Contributor

Yes, I see something is definitely wrong with the model. I just tried your example and tokenization did not work. However, on longer examples it seems to behave differently:

Out[13]:
1	子曰學而時習之不亦說乎	子春城	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	有	有	VERB	v,動詞,存在,存在	_	0	root	_	_
3	朋	朋	NOUN	n,名詞,人,関係	_	2	obj	_	_
4	自	自	ADP	v,前置詞,経由,*	_	6	case	_	_
5	遠	遠	VERB	v,動詞,描写,量	Degree=Pos|VerbForm=Part	6	amod	_	_
6	方	方	NOUN	n,名詞,固定物,関係	Case=Loc	7	obl	_	_
7	來	來	VERB	v,動詞,行為,移動	_	2	ccomp	_	_
8	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
9	亦	亦	ADV	v,副詞,頻度,重複	_	10	advmod	_	_
10	樂	樂	VERB	v,動詞,行為,態度	_	2	conj	_	_
11	乎	乎	ADP	v,前置詞,基盤,*	_	12	case	_	_
12	人	人	NOUN	n,名詞,人,人	_	7	obl	_	_
13	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	14	advmod	_	_
14	知	知	VERB	v,動詞,行為,動作	_	10	parataxis	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	3	advmod	_	_
2	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	3	advmod	_	_
3	慍	慍	VERB	v,動詞,行為,態度	_	6	csubj	_	_
4	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	_
5	亦	亦	ADV	v,副詞,頻度,重複	_	6	advmod	_	_
6	君子	君子	NOUN	n,名詞,人,役割	_	0	root	_	_
7	乎	乎	PART	p,助詞,句末,*	_	6	discourse:sp	_	_

I will try retraining the tokenizer with a different LM.

@KoichiYasuoka
Contributor Author

Umm... the first eleven characters seem untokenized:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("子曰道千乘之國敬事而信節用而愛人使民以時")
>>> print(doc)
1	子曰道千乘之國敬事而信	子春于	PROPN	n,名詞,人,名	NameType=Giv	2	nsubj	_	_
2	節	節	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
3	用	用	VERB	v,動詞,行為,動作	_	2	flat:vv	_	_

1	而	而	CCONJ	p,助詞,接続,並列	_	2	advmod	_	_
2	愛	愛	VERB	v,動詞,行為,交流	_	6	csubj	_	_
3	人	人	NOUN	n,名詞,人,人	_	2	obj	_	_
4	使	使	VERB	v,動詞,行為,使役	_	2	parataxis	_	_
5	民	民	NOUN	n,名詞,人,人	_	4	obj	_	_
6	以	以	VERB	v,動詞,行為,動作	_	0	root	_	_
7	時	時	NOUN	n,名詞,時,*	Case=Tem	6	obj	_	_

@tiberiu44
Contributor

Yes, seems to be a recurring issue with any text I try. I'm retraining the tokenizer/sentence splitter right now (it will take a couple of hours). Hopefully, this will solve the problem. I'll let you know as soon as I publish the new model.

@KoichiYasuoka
Contributor Author

Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun, using the segmentation algorithm of 一种基于循环神经网络的古文断句方法 (a recurrent-neural-network-based method for sentence segmentation of ancient Chinese texts). I hope these help you.

@tiberiu44
Contributor

This is perfect. I will use your model to train the Classical Chinese pipeline:

python3 cube/trainer.py --task=tokenizer --train=scripts/train/2.7/language/lzh.yaml --store=data/lzh-trf-tokenizer --num-workers=0 --lm-device=cuda:0 --gpus=1 --lm-model=transformer:KoichiYasuoka/roberta-classical-chinese-large-char

Given that this is a dedicated model, I hope it will provide better results than any other LM.

Thank you for this.

KoichiYasuoka added a commit to KoichiYasuoka/NLP-Cube that referenced this issue Aug 12, 2021
@KoichiYasuoka
Contributor Author

Thank you @tiberiu44 for releasing nlpcube 0.3.0.7. I tried the new model of classical Chinese with pytorch-lightning==1.2.10 and torchtext==0.10.0:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	_
3	虎	虎	NOUN	n,名詞,主体,動物	_	4	nmod	_	_
4	穴	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_

1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	得	得	VERB	v,動詞,行為,得失	_	0	root	_	_

1	虎	虎	NOUN	n,名詞,主体,動物	_	0	root	_	_

1	子 	子產	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_;compund

The tokenization seems to work well this time. Now the problem is the sentence segmentation...

@tiberiu44
Contributor

Thank you for the feedback. I'm working on that right now. Hope to get it fixed soon.

@tiberiu44
Contributor

So far, I only got a sentence F-score of 20 (best result, using your RoBERTa model):

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.40 |     97.34 |     97.87 |
Sentences  |     34.06 |     15.03 |     20.86 |
Words      |     98.40 |     97.34 |     97.87 |
UPOS       |     92.36 |     91.37 |     91.86 |     93.86
XPOS       |     89.27 |     88.31 |     88.78 |     90.72
UFeats     |     92.95 |     91.95 |     92.45 |     94.46
AllTags    |     87.35 |     86.41 |     86.88 |     88.77
Lemmas     |     92.01 |     91.02 |     91.51 |     93.51
UAS        |     66.76 |     66.04 |     66.40 |     67.84
LAS        |     61.46 |     60.80 |     61.13 |     62.46
CLAS       |     60.49 |     59.19 |     59.83 |     60.96
MLAS       |     56.81 |     55.59 |     56.20 |     57.25
BLEX       |     56.06 |     54.86 |     55.45 |     56.49

The UAS and LAS scores are low because every time it gets a sentence wrong, the system will also mislabel the root node.

@KoichiYasuoka
Contributor Author

KoichiYasuoka commented Aug 15, 2021

20.86% is much worse than the result (80%) of 一种基于循环神经网络的古文断句方法. OK, here I try it myself with transformers on Google Colab:

!pip install 'transformers>=4.7.0' datasets seqeval
!test -d UD_Classical_Chinese-Kyoto || git clone https://github.com/universaldependencies/UD_Classical_Chinese-Kyoto
!test -f run_ner.py || curl -LO https://raw.githubusercontent.com/huggingface/transformers/v`pip list | sed -n 's/^transformers *\([^ ]*\) *$/\1/p'`/examples/pytorch/token-classification/run_ner.py

for d in ["train","dev","test"]:
  # convert each CoNLL-U file into JSON Lines for run_ner.py:
  # one character per "token", tagged with its position in the sentence
  # (S=single-character sentence, B=first, M=middle, E3/E2/E=last three characters)
  with open("UD_Classical_Chinese-Kyoto/lzh_kyoto-ud-"+d+".conllu","r",encoding="utf-8") as f:
    r=f.read()
  with open(d+".json","w",encoding="utf-8") as f:
    tokens=[]
    tags=[]
    i=0
    for s in r.split("\n"):
      t=s.split("\t")
      if len(t)==10:
        # token line: split the word form into single characters
        for c in t[1]:
          tokens.append(c)
          i+=1
      else:
        # comment or blank line: emit the tags for the sentence just read (i characters)
        if i==1:
          tags.append("S")
        elif i==2:
          tags+=["B","E"]
        elif i==3:
          tags+=["B","E2","E"]
        elif i>3:
          tags+=["B"]+["M"]*(i-4)+["E3","E2","E"]
        i=0
        # write one training example once more than 80 characters have accumulated
        if len(tokens)>80:
          print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
          tokens=[]
          tags=[]

!python run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-large-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir my.danku --do_train --do_eval

I got "eval metrics" as follows:

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9212
  eval_f1                 =     0.8995
  eval_loss               =     0.2794
  eval_precision          =     0.8991
  eval_recall             =     0.8998
  eval_runtime            = 0:00:09.70
  eval_samples            =        329
  eval_samples_per_second =     33.901
  eval_steps_per_second   =      4.328

Then I tried to sentencize the paragraph I wrote two years ago (#100 (comment)):

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tkz=AutoTokenizer.from_pretrained("my.danku")
mdl=AutoModelForTokenClassification.from_pretrained("my.danku")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
e=tkz.encode(s,return_tensors="pt")
# predict one label per character, dropping the special tokens at both ends
p=[mdl.config.id2label[q] for q in torch.argmax(mdl(e)[0],dim=2)[0].tolist()[1:-1]]
# re-insert "。" after each sentence-final (E or S) character
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

And I got the result "天平二年正月十三日萃于帥老之宅。申宴會也。于時初春令月。氣淑風和。梅披鏡前之粉。蘭薰珮後之香。加以曙嶺移雲。松掛羅而傾盖。夕岫結霧。鳥封縠而迷林。庭舞新蝶。空歸故鴈。於是盖天坐地。促膝飛觴。忘言一室之裏。開衿煙霞之外。淡然自放。快然自足。若非翰苑何以攄情。詩紀落梅之篇。古今夫何異矣。宜賦園梅。聊成短詠。"
How about your system @tiberiu44?

@tiberiu44
Contributor

Unfortunately, I cannot run the test right now, and I will be away from the keyboard most of the day. I will try your approach with transformers tomorrow.

The latest models are pushed if you want to try them. If you already loaded lzh, you will need to trigger a redownload of the model.

The easiest way is to remove all lzh files located in ~/.nlpcube/3.0 (anything that starts with lzh, including the folder).

@KoichiYasuoka
Contributor Author

Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my ~/.nlpcube/3.0/lzh:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠")
>>> print("".join(s.text.replace(" ","")+"。" for s in doc.sentences))

And I got the result "天平二年正月十三日萃于帥老之宅申宴會也。于時初春令月氣淑風和。梅披鏡前之粉蘭薰珮後之香。加以曙嶺移雲松掛羅而傾盖。夕岫結霧。鳥封縠而迷林庭舞新蝶空歸故鴈。於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情。詩紀落梅之篇古今夫何異矣。宜賦園梅。聊。成。短詠。" Umm... "聊。成。短詠。" seems meaningless, but the other segmentations are rather good. Then, how do we improve...

@tiberiu44
Contributor

On your previous example, the current version of the tokenizer generates this sentence segmentation:

1	天平	天平	NOUN	n,名詞,時,*	Case=Tem	3	nmod	_	_
2	二	二	NUM	n,数詞,数字,*	_	3	nummod	_	_
3	年	年	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
4	正	正	NOUN	n,名詞,時,*	_	5	amod	_	_
5	月	月	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
6	十三	十三	NUM	n,数詞,数,*	_	7	nummod	_	_
7	日	日	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
8	萃	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
9	于	于	ADP	v,前置詞,基盤,*	_	13	case	_	_
10	帥	帥	NOUN	n,名詞,人,役割	_	11	amod	_	_
11	老	老	NOUN	n,名詞,人,人	_	13	nmod	_	_
12	之	之	SCONJ	p,助詞,接続,属格	_	11	case	_	_
13	宅	宅	NOUN	n,名詞,固定物,建造物	Case=Loc	8	obl:lmod	_	_
14	申	申	VERB	v,動詞,行為,動作	_	8	parataxis	_	_
15	宴	宴	VERB	v,動詞,行為,交流	VerbForm=Part	14	obj	_	_
16	會	會	VERB	v,動詞,行為,交流	_	15	flat:vv	_	_
17	也	也	PART	p,助詞,句末,*	_	8	discourse:sp	_	_

1	于	于	ADP	v,前置詞,基盤,*	_	2	case	_	_
2	時	時	NOUN	n,名詞,時,*	Case=Tem	8	obl:tmod	_	_
3	初	初	NOUN	n,名詞,時,*	Case=Tem	4	nmod	_	_
4	春	春	NOUN	n,名詞,時,*	Case=Tem	6	nmod	_	_
5	令	令	NOUN	n,名詞,人,役割	_	6	nmod	_	_
6	月	月	NOUN	n,名詞,時,*	Case=Tem	8	nsubj	_	_
7	氣	氣	NOUN	n,名詞,描写,形質	_	8	nsubj	_	_
8	淑	淑	VERB	v,動詞,描写,態度	Degree=Pos	0	root	_	_
9	風	風	NOUN	n,名詞,天象,気象	_	10	nsubj	_	_
10	和	和	VERB	v,動詞,描写,形質	Degree=Pos	8	conj	_	_

1	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	nsubj	_	_
2	披	披	VERB	v,動詞,行為,動作	_	0	root	_	_
3	鏡	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	前	前	NOUN	n,名詞,固定物,関係	Case=Loc	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	粉	<UNK>	NOUN	n,名詞,不可譲,身体	_	2	obj	_	_

1	蘭	蘭	NOUN	n,名詞,可搬,道具	_	2	nsubj	_	_
2	薰	<UNK>	NOUN	n,名詞,可搬,道具	_	0	root	_	_
3	珮	<UNK>	NOUN	n,名詞,可搬,道具	_	4	nmod	_	_
4	後	後	NOUN	n,名詞,固定物,関係	Case=Tem	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	香	香	NOUN	n,名詞,描写,形質	_	2	obj	_	_

1	加	加	VERB	v,動詞,行為,得失	_	5	advmod	_	_
2	以	以	VERB	v,動詞,行為,動作	_	5	advcl	_	_
3	曙	<UNK>	NOUN	n,名詞,描写,形質	_	4	nmod	_	_
4	嶺	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	2	obj	_	_
5	移	移	VERB	v,動詞,行為,移動	_	0	root	_	_
6	雲	雲	NOUN	n,名詞,天象,気象	_	5	obj	_	_

1	松	松	PROPN	n,名詞,人,名	NameType=Giv	0	root	_	_

1	掛	<UNK>	VERB	v,動詞,行為,動作	_	0	root	_	_
2	羅	羅	NOUN	n,名詞,可搬,道具	_	1	obj	_	_
3	而	而	CCONJ	p,助詞,接続,並列	_	4	cc	_	_
4	傾	傾	VERB	v,動詞,行為,動作	_	1	conj	_	_
5	盖	<UNK>	NOUN	n,名詞,可搬,道具	_	4	obj	_	_

1	夕	夕	NOUN	n,名詞,時,*	Case=Tem	2	nmod	_	_
2	岫	<UNK>	NOUN	n,名詞,固定物,地形	Case=Loc	3	nsubj	_	_
3	結	結	VERB	v,動詞,行為,動作	_	0	root	_	_
4	霧	<UNK>	NOUN	n,名詞,可搬,道具	_	3	obj	_	_

1	鳥	鳥	NOUN	n,名詞,主体,動物	_	2	nsubj	_	_
2	封	封	VERB	v,動詞,行為,役割	_	45	csubj	_	_
3	縠	<UNK>	NOUN	n,名詞,可搬,道具	_	2	obj	_	_
4	而	而	CCONJ	p,助詞,接続,並列	_	5	cc	_	_
5	迷	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
6	林	林	NOUN	n,名詞,固定物,地形	Case=Loc	31	obj	_	_
7	庭	庭	NOUN	n,名詞,固定物,建造物	Case=Loc	40	obl:lmod	_	_
8	舞	舞	VERB	v,動詞,行為,動作	_	2	conj	_	_
9	新	新	VERB	v,動詞,描写,形質	Degree=Pos|VerbForm=Part	10	amod	_	_
10	蝶	<UNK>	NOUN	n,名詞,可搬,道具	_	5	obj	_	_
11	空	空	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	40	advmod	_	_
12	歸	歸	VERB	v,動詞,行為,移動	_	2	conj	_	_
13	故	故	NOUN	n,名詞,時,*	Case=Tem	14	nmod	_	_
14	鴈	<UNK>	NOUN	n,名詞,主体,動物	_	40	nsubj	_	_
15	於	於	ADP	v,前置詞,基盤,*	_	16	case	_	_
16	是	是	PRON	n,代名詞,指示,*	PronType=Dem	2	obl	_	_
17	盖	<UNK>	NOUN	n,名詞,不可譲,身体	_	40	nsubj	_	_
18	天	天	NOUN	n,名詞,制度,場	Case=Loc	2	obl	_	_
19	坐	坐	VERB	v,動詞,行為,動作	_	2	conj	_	_
20	地	地	NOUN	n,名詞,固定物,地形	Case=Loc	5	obj	_	_
21	促	<UNK>	VERB	v,動詞,行為,動作	_	2	conj	_	_
22	膝	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
23	飛	飛	VERB	v,動詞,行為,動作	_	2	conj	_	_
24	觴	<UNK>	NOUN	n,名詞,可搬,道具	_	31	obj	_	_
25	忘	忘	VERB	v,動詞,行為,動作	_	2	conj	_	_
26	言	言	NOUN	n,名詞,可搬,伝達	_	31	obj	_	_
27	一	一	NUM	n,数詞,数字,*	_	28	nummod	_	_
28	室	室	NOUN	n,名詞,固定物,建造物	Case=Loc	36	nmod	_	_
29	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
30	裏	<UNK>	NOUN	n,名詞,固定物,関係	Case=Loc	2	conj	_	_
31	開	開	VERB	v,動詞,行為,動作	_	2	conj	_	_
32	衿	<UNK>	NOUN	n,名詞,不可譲,身体	_	31	obj	_	_
33	煙	<UNK>	NOUN	n,名詞,固定物,樹木	_	31	obj	_	_
34	霞	<UNK>	NOUN	n,名詞,固定物,樹木	_	33	flat	_	_
35	之	之	SCONJ	p,助詞,接続,属格	_	28	case	_	_
36	外	外	NOUN	n,名詞,固定物,関係	Case=Loc	2	obj	_	_
37	淡	<UNK>	ADV	v,動詞,描写,形質	Degree=Pos|VerbForm=Conv	2	conj	_	_
38	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
39	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	40	nsubj	_	_
40	放	放	VERB	v,動詞,行為,動作	_	2	conj	_	_
41	快	<UNK>	VERB	v,動詞,描写,態度	Degree=Pos	40	advmod	_	_
42	然	然	PART	p,接尾辞,*,*	_	37	fixed	_	_
43	自	自	PRON	n,代名詞,人称,他	PronType=Prs|Reflex=Yes	50	obj	_	_
44	足	足	VERB	v,動詞,描写,量	Degree=Pos	2	conj	_	_
45	若	若	VERB	v,動詞,行為,分類	Degree=Equ	0	root	_	_
46	非	非	ADV	v,副詞,否定,体言否定	Polarity=Neg	48	amod	_	_
47	翰	翰	NOUN	n,名詞,可搬,道具	_	48	nmod	_	_
48	苑	苑	NOUN	n,名詞,固定物,建造物	Case=Loc	51	nsubj	_	_
49	何	何	PRON	n,代名詞,疑問,*	PronType=Int	50	obj	_	_
50	以	以	VERB	v,動詞,行為,動作	_	51	advcl	_	_
51	攄	<UNK>	VERB	v,動詞,行為,動作	_	44	parataxis	_	_
52	情	情	NOUN	n,名詞,描写,態度	_	51	obj	_	_

1	詩	詩	NOUN	n,名詞,主体,書物	_	2	nsubj	_	_
2	紀	紀	VERB	v,動詞,行為,動作	_	0	root	_	_
3	落	落	VERB	v,動詞,行為,移動	VerbForm=Part	4	amod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	6	nmod	_	_
5	之	之	SCONJ	p,助詞,接続,属格	_	4	case	_	_
6	篇	篇	NOUN	n,名詞,可搬,伝達	_	2	obj	_	_

1	古	古	NOUN	n,名詞,時,*	Case=Tem	5	nsubj	_	_
2	今	今	NOUN	n,名詞,時,*	Case=Tem	1	conj	_	_
3	夫	夫	PART	p,助詞,句頭,*	_	5	discourse	_	_
4	何	何	ADV	v,副詞,疑問,原因	AdvType=Cau	5	advmod	_	_
5	異	異	VERB	v,動詞,描写,形質	Degree=Pos	0	root	_	_
6	矣	矣	PART	p,助詞,句末,*	_	5	discourse:sp	_	_

1	宜	宜	AUX	v,助動詞,必要,*	Mood=Nec	2	aux	_	_
2	賦	賦	VERB	v,動詞,行為,動作	_	0	root	_	_
3	園	園	NOUN	n,名詞,固定物,建造物	Case=Loc	4	nmod	_	_
4	梅	梅	NOUN	n,名詞,固定物,樹木	_	2	obj	_	_

1	聊	<UNK>	ADV	v,動詞,行為,動作	VerbForm=Conv	2	advmod	_	_
2	成	成	VERB	v,動詞,行為,生産	_	0	root	_	_
3	短	短	VERB	v,動詞,描写,量	Degree=Pos	4	advmod	_	_
4	詠	詠	VERB	v,動詞,行為,伝達	_	2	ccomp	_	_

Is this an improvement?

@KoichiYasuoka
Contributor Author

Yes, yes @tiberiu44, it seems a much better result, except for "松". But I could not download the improved model after I cleaned up ~/.nlpcube/3.0/lzh. Well, has the new model been released?

@tiberiu44
Contributor

It's not published yet. The sentence segmentation is still bad. Also, the token score is worse:

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     93.29 |     92.62 |     92.96 |
Sentences  |     27.12 |      7.65 |     11.94 |
Words      |     93.29 |     92.62 |     92.96 |
UPOS       |     87.02 |     86.40 |     86.71 |     93.28
XPOS       |     84.06 |     83.46 |     83.76 |     90.11
UFeats     |     88.16 |     87.53 |     87.84 |     94.50
AllTags    |     82.22 |     81.64 |     81.93 |     88.14
Lemmas     |     89.80 |     89.15 |     89.47 |     96.26
UAS        |     43.40 |     43.09 |     43.24 |     46.52
LAS        |     39.54 |     39.26 |     39.40 |     42.38
CLAS       |     38.00 |     36.86 |     37.42 |     39.96
MLAS       |     35.55 |     34.49 |     35.01 |     37.39
BLEX       |     36.87 |     35.76 |     36.31 |     38.77

@KoichiYasuoka
Contributor Author

I've released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation for sentence segmentation of classical Chinese. You can use it with transformers>=4.1:

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

@tiberiu44
Contributor

Do we have permission to use your model in NLPCube? Do you need any citation or notice when somebody loads it?

@KoichiYasuoka
Contributor Author

The models are distributed under the Apache License 2.0. You can use them (almost) freely except for trademarks.

@tiberiu44
Contributor

This sounds good. I will update the runtime code for the tokenizer to be able to use transformer models for tokenization.

@tiberiu44
Contributor

One more question: does your model also support tokenization or just sentence segmentation?

@KoichiYasuoka
Contributor Author

KoichiYasuoka commented Aug 18, 2021

https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation is only for sentence segmentation. And I've just released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-upos for POS-tagging with tokenization:

>>> import torch
>>> from transformers import AutoTokenizer,AutoModelForTokenClassification
>>> tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> s="子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
>>> p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
>>> print(list(zip(s,p)))
[('子', 'NOUN'), ('曰', 'VERB'), ('學', 'VERB'), ('而', 'CCONJ'), ('時', 'NOUN'), ('習', 'VERB'), ('之', 'PRON'), ('不', 'ADV'), ('亦', 'ADV'), ('說', 'VERB'), ('乎', 'PART'), ('有', 'VERB'), ('朋', 'NOUN'), ('自', 'ADP'), ('遠', 'VERB'), ('方', 'NOUN'), ('來', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('樂', 'VERB'), ('乎', 'PART'), ('人', 'NOUN'), ('不', 'ADV'), ('知', 'VERB'), ('而', 'CCONJ'), ('不', 'ADV'), ('慍', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('君', 'B-NOUN'), ('子', 'I-NOUN'), ('乎', 'PART')]

You can see that "君子" is tokenized as a single word with the POS tags B-NOUN and I-NOUN.
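
If it helps for wiring this into a pipeline, a tiny post-processing step (just a sketch, not part of the model) can merge the B-/I- prefixed characters back into words:

# sketch: merge character-level B-/I- POS tags into (word, POS) pairs
def merge_tokens(chars, tags):
    words = []
    for c, t in zip(chars, tags):
        if t.startswith("I-") and words:
            w, pos = words[-1]
            words[-1] = (w + c, pos)                     # continue the previous word
        else:
            words.append((c, t[2:] if t.startswith("B-") else t))
    return words

print(merge_tokens("不亦君子乎", ["ADV", "ADV", "B-NOUN", "I-NOUN", "PART"]))
# [('不', 'ADV'), ('亦', 'ADV'), ('君子', 'NOUN'), ('乎', 'PART')]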

tiberiu44 added a commit that referenced this issue Aug 27, 2021
* Partial update

* Bugfix

* API update

* Bugfixing and API

* Bugfix

* Fix long words OOM by skipping sentences

* bugfixing and api update

* Added language flavour

* Added early stopping condition

* Corrected naming

* Corrected permissions

* Bugfix

* Added GPU support at runtime

* Wrong config package

* Refactoring

* refactoring

* add lightning to dependencies

* Dummy test

* Dummy test

* Tweak

* Tweak

* Update test

* Test

* Finished loading for UD CONLL-U format

* Working on tagger

* Work on tagger

* tagger training

* tagger training

* tagger training

* Sync

* Sync

* Sync

* Sync

* Tagger working

* Better weight for aux loss

* Better weight for aux loss

* Added save and printing for tagger and shared options class

* Multilanguage evaluation

* Saving multiple models

* Updated ignore list

* Added XLM-Roberta support

* Using custom ro model

* Score update

* Bugfixing

* Code refactor

* Refactor

* Added option to load external config

* Added option to select LM-model from CLI or config

* added option to overwrite config lm from CLI

* Bugfix

* Working on parser

* Sync work on parser

* Parser working

* Removed load limit

* Bugfix in evaluation

* Added bi-affine attention

* Added experimental ChuLiuEdmonds tree decoding

* Better config for parser and bugfix

* Added residuals to tagging

* Model update

* Switched to AdamW optimizer

* Working on tokenizer

* Working on tokenizer

* Training working - validation to do

* Bugfix in language id

* Working on tokenization validation

* Tokenizer working

* YAML update

* Bug in LMHelper

* Tagger is working

* Tokenizer is working

* bfix

* bfix

* Bugfix for bugfix :)

* Sync

* Tokenizer worker

* Tagger working

* Trainer updates

* Trainer process now working

* Added .DS_Store

* Added datasets for Compound Word Expander and Lemmatizer

* Added collate function for lemma+compound

* Added training and validation step

* Updated config for Lemmatizer

* Minor fixes

* Removed duplicate entries from lemma and cwe

* Added training support for lemmatizer

* Removed debug directives

* Lemmatizer in testing phase

* removed unused line

* Bugfix in Lemma dataset

* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing

* Lemmatizier training done

* Compound word expander ready

* Sync

* Added support for FastText, Transformers and Languasito LM models

* Added multi-lm support for tokenizer

* Added support for multiword tokens

* Sync

* Bugfix in evaluation

* Added Languasito as a subpackage

* Added path to local Languasito

* Bugfixing all around

* Removed debug printing

* Bugfix for no-space languages that actually contain spaces :)

* Bugfix for no-space languages that actually contain spaces :)

* Fixed GPU support

* Biaffine transform for LAS and relative head location (RHL) for UAS

* Bugfix

* Tweaks

* moved rhl to lower layer

* Added configurable option for RHL

* Safenet for spaces in languages that should use no spaces

* Better defaults

* Sync

* Cleanup parser

* Bilinear xpos and attrs

* Added Biaffine module from Stanza

* Tagger with reduced number of parameters:

* Parser with conditional attrs

* Working on tokenizer runtime

* Tokenizer process 90% done

* Added runtime for parser, tokenizer and tagger

* Added quick test for runtime

* Test for e2e

* Added support for multiple word embeddings at the same time

* Bugfix

* Added multiple word representations for tokenizer

* moved mask_concat to utils.py

* Added XPOS prediction to pipeline

* Bugfix in tokenizer shifted word embeddings

* Using Languasito tokenizer for HF tokenization

* Bugfix

* Bugfixing

* Bugfixing

* Bugfix

* Runtime fixing

* Sync

* Added spa for FT and Languasito

* Added spa for FT and Languasito

* Minor tweaks

* Added configuration for RNN layers

* Bugfix for spa

* HF runtime fix

* Mixed test fasttext+transformer

* Added word reconstruction and MHA

* Sync

* Bugfix

* bugfix

* Added masked attention

* Sync

* Added test for runtime

* Bugfix in mask values

* Updated test

* Added full mask dropout

* Added resume option

* Removed useless printouts

* Removed useless printouts

* Switched to eval at runtime

* multiprocessing added

* Added full mask dropout for word decoder

* Bugfix

* Residual

* Added lexical-contextual cosine loss

* Removed full mask dropout from WordDecoder

* Bugfix

* Training script generation update

* Added residual

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Not training for seq len > max_seq_len

* Added seq limmits for collates

* Passing seq limits from collate to tokenizer

* Skipping complex parsing

* Working on word decomposer

* Model update

* Sync

* Bugfix

* Bugfix

* Bugfix

* Using all reprs

* Dropped immediate context

* Multi train script added

* Changed gpu parameter type to string, for multiple gpus int failed

* Updated pytorch_lightning callback method to work with newer version

* Updated pytorch_lightning callback method to work with newer version

* Transparently pass PL args from the command line; skip over empty compound word datasets

* Fix typo

* Refactoring and on the way to working API

* API load working

* Partial _call_ working

* Partial _call_ working

* Added partly working api and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.

* api is working

* Fixing api

* Updated readme

* Update Readme to include flavours

* Device support

* api update

* Updated package

* Tweak + results

* Clarification

* Test update

* Update

* Sync

* Update README

* Bugfixing

* Bugfix and api update

* Fixed compound

* Evaluation update

* Bugfix

* Package update

* Bugfix for large sentences

* Pip package update

* Corrected spanish evaluation

* Package version update

* Fixed tokenization issues on transformers

* Removed pinned memory

* Bugfix for GPU tensors

* Update package version

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Sync

* Evaluation update

* Package update

* Bugfix

* Bugfixing

* Package version update

* Bugfix

* Package version update

* Update evaluation for Italian

* tentative support torchtext>=0.9.0 (#127)

as mentioned in Lightning-AI/pytorch-lightning#6211 and #100

* Update package dependencies

Co-authored-by: Stefan Dumitrescu <sdumitre@adobe.com>
Co-authored-by: dumitrescustefan <dumitrescu.stefan@gmail.com>
Co-authored-by: Tiberiu Boros <boros@adobe.com>
Co-authored-by: Tiberiu Boros <boros@boros-macos.local>
Co-authored-by: Koichi Yasuoka <yasuoka@kanji.zinbun.kyoto-u.ac.jp>
tiberiu44 added a commit that referenced this issue Feb 17, 2023
* Corrected permissions

* Bugfix

* Added GPU support at runtime

* Wrong config package

* Refactoring

* refactoring

* add lightning to dependencies

* Dummy test

* Dummy test

* Tweak

* Tweak

* Update test

* Test

* Finished loading for UD CONLL-U format

* Working on tagger

* Work on tagger

* tagger training

* tagger training

* tagger training

* Sync

* Sync

* Sync

* Sync

* Tagger working

* Better weight for aux loss

* Better weight for aux loss

* Added save and printing for tagger and shared options class

* Multilanguage evaluation

* Saving multiple models

* Updated ignore list

* Added XLM-Roberta support

* Using custom ro model

* Score update

* Bugfixing

* Code refactor

* Refactor

* Added option to load external config

* Added option to select LM-model from CLI or config

* added option to overwrite config lm from CLI

* Bugfix

* Working on parser

* Sync work on parser

* Parser working

* Removed load limit

* Bugfix in evaluation

* Added bi-affine attention

* Added experimental ChuLiuEdmonds tree decoding

* Better config for parser and bugfix

* Added residuals to tagging

* Model update

* Switched to AdamW optimizer

* Working on tokenizer

* Working on tokenizer

* Training working - validation to do

* Bugfix in language id

* Working on tokenization validation

* Tokenizer working

* YAML update

* Bug in LMHelper

* Tagger is working

* Tokenizer is working

* bfix

* bfix

* Bugfix for bugfix :)

* Sync

* Tokenizer worker

* Tagger working

* Trainer updates

* Trainer process now working

* Added .DS_Store

* Added datasets for Compound Word Expander and Lemmatizer

* Added collate function for lemma+compound

* Added training and validation step

* Updated config for Lemmatizer

* Minor fixes

* Removed duplicate entries from lemma and cwe

* Added training support for lemmatizer

* Removed debug directives

* Lemmatizer in testing phase

* removed unused line

* Bugfix in Lemma dataset

* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing

* Lemmatizier training done

* Compound word expander ready

* Sync

* Added support for FastText, Transformers and Languasito LM models

* Added multi-lm support for tokenizer

* Added support for multiword tokens

* Sync

* Bugfix in evaluation

* Added Languasito as a subpackage

* Added path to local Languasito

* Bugfixing all around

* Removed debug printing

* Bugfix for no-space languages that actually contain spaces :)

* Bugfix for no-space languages that actually contain spaces :)

* Fixed GPU support

* Biaffine transform for LAS and relative head location (RHL) for UAS

* Bugfix

* Tweaks

* moved rhl to lower layer

* Added configurable option for RHL

* Safenet for spaces in languages that should use no spaces

* Better defaults

* Sync

* Cleanup parser

* Bilinear xpos and attrs

* Added Biaffine module from Stanza

* Tagger with reduced number of parameters:

* Parser with conditional attrs

* Working on tokenizer runtime

* Tokenizer process 90% done

* Added runtime for parser, tokenizer and tagger

* Added quick test for runtime

* Test for e2e

* Added support for multiple word embeddings at the same time

* Bugfix

* Added multiple word representations for tokenizer

* moved mask_concat to utils.py

* Added XPOS prediction to pipeline

* Bugfix in tokenizer shifted word embeddings

* Using Languasito tokenizer for HF tokenization

* Bugfix

* Bugfixing

* Bugfixing

* Bugfix

* Runtime fixing

* Sync

* Added spa for FT and Languasito

* Added spa for FT and Languasito

* Minor tweaks

* Added configuration for RNN layers

* Bugfix for spa

* HF runtime fix

* Mixed test fasttext+transformer

* Added word reconstruction and MHA

* Sync

* Bugfix

* bugfix

* Added masked attention

* Sync

* Added test for runtime

* Bugfix in mask values

* Updated test

* Added full mask dropout

* Added resume option

* Removed useless printouts

* Removed useless printouts

* Switched to eval at runtime

* multiprocessing added

* Added full mask dropout for word decoder

* Bugfix

* Residual

* Added lexical-contextual cosine loss

* Removed full mask dropout from WordDecoder

* Bugfix

* Training script generation update

* Added residual

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Updated languasito to pickle tokenized lines

* Not training for seq len > max_seq_len

* Added seq limmits for collates

* Passing seq limits from collate to tokenizer

* Skipping complex parsing

* Working on word decomposer

* Model update

* Sync

* Bugfix

* Bugfix

* Bugfix

* Using all reprs

* Dropped immediate context

* Multi train script added

* Changed gpu parameter type to string, for multiple gpus int failed

* Updated pytorch_lightning callback method to work with newer version

* Updated pytorch_lightning callback method to work with newer version

* Transparently pass PL args from the command line; skip over empty compound word datasets

* Fix typo

* Refactoring and on the way to working API

* API load working

* Partial _call_ working

* Partial _call_ working

* Added partly working api and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.

* api is working

* Fixing api

* Updated readme

* Update Readme to include flavours

* Device support

* api update

* Updated package

* Tweak + results

* Clarification

* Test update

* Update

* Sync

* Update README

* Bugfixing

* Bugfix and api update

* Fixed compound

* Evaluation update

* Bugfix

* Package update

* Bugfix for large sentences

* Pip package update

* Corrected spanish evaluation

* Package version update

* Fixed tokenization issues on transformers

* Removed pinned memory

* Bugfix for GPU tensors

* Update package version

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Automatically detecting hidden state size

* Sync

* Evaluation update

* Package update

* Bugfix

* Bugfixing

* Package version update

* Bugfix

* Package version update

* Update evaluation for Italian

* tentative support torchtext>=0.9.0 (#127)

as mentioned in Lightning-AI/pytorch-lightning#6211 and #100

* Update package dependencies

* Dummy word embeddings

* Update params

* Better dropout values

* Skipping long words

* Skipping long words

* dummy we -> float

* Added gradient clipping

* Update tokenizer

* Update tokenizer

* Sync

* DCWE

* Working on DCWE

---------

Co-authored-by: Stefan Dumitrescu <sdumitre@adobe.com>
Co-authored-by: Tiberiu Boros <boros@adobe.com>
Co-authored-by: Koichi Yasuoka <yasuoka@kanji.zinbun.kyoto-u.ac.jp>