
Ancient Chinese Word Segmentation and POS Tagging Using Distant Supervision (Word Alignment)

Code for ICASSP2023: Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Distant Supervision (Word Alignment)

We propose a novel augmentation method for ancient Chinese word segmentation (WSG) and POS tagging data, using word alignment over a parallel corpus.

Ⅰ、Get the labelled ancient Chinese and modern Chinese data

1、Get the raw unlabelled data:

(1) Download the data from https://github.com/NiuTrans/Classical-Modern;

(2) Rename the folder “双语数据” (bilingual data) to “bitext” and place it under the “ACDS” folder;

(3) Get the modern Chinese sentences ("src") and the ancient Chinese sentences ("tgt").

python make_data.py
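If it helps to sanity-check this step, "src" and "tgt" are parallel, line-aligned files: line i of "src" is the modern Chinese counterpart of line i of "tgt". The snippet below is only an illustrative check, not part of the pipeline:

```python
# Illustrative sanity check (not part of the original pipeline):
# "src" holds modern Chinese sentences and "tgt" the parallel ancient
# Chinese sentences, one pair per line, so the files must be the same length.
with open("src", encoding="utf-8") as f_src, open("tgt", encoding="utf-8") as f_tgt:
    src_lines = f_src.read().splitlines()
    tgt_lines = f_tgt.read().splitlines()

assert len(src_lines) == len(tgt_lines), "src/tgt must stay line-aligned"
print(len(src_lines), "sentence pairs")
print("modern :", src_lines[0])
print("ancient:", tgt_lines[0])
```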

(4) Shuffle the modern Chinese and ancient Chinese sentences to get "src_shuf" and "tgt_shuf".

python shuffle_data.py
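The important point in this step is that both files are shuffled with the same permutation, so sentence pairs stay aligned. A minimal sketch of that idea (the actual logic of shuffle_data.py may differ):

```python
import random

# Shuffle the parallel files with one shared permutation so that line i of
# "src_shuf" still corresponds to line i of "tgt_shuf".
with open("src", encoding="utf-8") as f:
    src = f.read().splitlines()
with open("tgt", encoding="utf-8") as f:
    tgt = f.read().splitlines()

pairs = list(zip(src, tgt))
random.seed(42)          # any fixed seed keeps the shuffle reproducible
random.shuffle(pairs)

with open("src_shuf", "w", encoding="utf-8") as f_src, \
     open("tgt_shuf", "w", encoding="utf-8") as f_tgt:
    for s, t in pairs:
        f_src.write(s + "\n")
        f_tgt.write(t + "\n")
```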

(5) Get the segmented and POS-tagged modern Chinese data ("src_shuf_seg", "src_shuf_pos") using LTP (https://github.com/HIT-SCIR/ltp), and the single-character split ancient Chinese data ("tgt_shuf_seg").

python make_segpos_data.py 
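Roughly, the modern side is word-segmented and POS-tagged with LTP while the ancient side is split into single characters. Below is a sketch of that idea, assuming the LTP 4.x seg/pos interface; the exact LTP API differs across versions, and the real make_segpos_data.py may format its output differently:

```python
from ltp import LTP  # assumes LTP 4.x; the seg/pos calls below may differ in other versions

ltp = LTP()

with open("src_shuf", encoding="utf-8") as f:
    modern = f.read().splitlines()
with open("tgt_shuf", encoding="utf-8") as f:
    ancient = f.read().splitlines()

# Modern side: word segmentation + POS tagging with LTP (LTP 4.x-style call, an assumption).
segments, hidden = ltp.seg(modern)
pos_tags = ltp.pos(hidden)

with open("src_shuf_seg", "w", encoding="utf-8") as f_seg, \
     open("src_shuf_pos", "w", encoding="utf-8") as f_pos:
    for words, tags in zip(segments, pos_tags):
        f_seg.write(" ".join(words) + "\n")
        f_pos.write(" ".join(tags) + "\n")

# Ancient side: split every sentence into single characters.
with open("tgt_shuf_seg", "w", encoding="utf-8") as f:
    for sent in ancient:
        f.write(" ".join(sent) + "\n")
```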

Ⅱ、Get the alignment probabilities between ancient Chinese and modern Chinese

1、Download the alignment tool giza:

(1) Download the giza-py tool from https://github.com/sillsdev/giza-py and place it under "ACDS" as a folder named "giza";

(2) Configure the tool giza as described in https://github.com/sillsdev/giza-py.

2、Get the alignment probabilities between the segmented modern Chinese words and the single-character split ancient Chinese words. The output is written to the file "alignment".

cd giza
python giza.py --source ../src_shuf_seg --target ../tgt_shuf_seg --alignments ../alignment --model ibm4 --m1 10 --mh 10 --m3 10 --m4 10 --include-probs
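For downstream processing it can help to know what the "alignment" file looks like. The parser below assumes Pharaoh-style index pairs with an optional probability after a colon (e.g. "0-0:0.91 1-2:0.45"); this format is an assumption, so verify it against the actual giza-py output:

```python
# Illustrative parser for one line of the "alignment" file, under the assumed
# format "srcIdx-tgtIdx[:prob]" separated by spaces. Check the real giza-py
# output before relying on this.
def parse_alignment_line(line):
    links = []
    for item in line.split():
        pair, _, prob = item.partition(":")
        src_idx, tgt_idx = map(int, pair.split("-"))
        links.append((src_idx, tgt_idx, float(prob) if prob else None))
    return links

print(parse_alignment_line("0-0:0.91 1-2:0.45"))
# [(0, 0, 0.91), (1, 2, 0.45)]
```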

Ⅲ、Get the word boundaries and parts of speech of ancient Chinese words

Get the labelled ancient Chinese data "tgt.shuf.seg_pos".

cd ..
python align-pos_tag_ltp.py
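Conceptually, this step projects each modern word's boundary and POS tag onto the ancient characters aligned to it: consecutive ancient characters aligned to the same modern word are merged into one ancient word and inherit that word's POS tag. A simplified sketch of the projection follows; the real align-pos_tag_ltp.py also uses the alignment probabilities and handles unaligned characters and conflicts:

```python
def project_labels(modern_words, modern_pos, ancient_chars, links):
    """Project word boundaries and POS tags from modern words onto ancient
    characters through word-alignment links.

    links: list of (modern_word_index, ancient_char_index) pairs.
    Returns a list of (ancient_word, pos_tag) pairs.
    """
    # Map each ancient character to the modern word it is aligned to.
    char_to_word = {}
    for m_idx, a_idx in links:
        char_to_word.setdefault(a_idx, m_idx)

    result, buf, cur = [], [], None
    for a_idx, ch in enumerate(ancient_chars):
        m_idx = char_to_word.get(a_idx)
        if m_idx != cur and buf:
            result.append(("".join(buf), modern_pos[cur] if cur is not None else "x"))
            buf = []
        buf.append(ch)
        cur = m_idx
    if buf:
        result.append(("".join(buf), modern_pos[cur] if cur is not None else "x"))
    return result

# Toy example (invented data, for illustration only):
modern_words = ["我", "喜欢", "读书"]
modern_pos   = ["r", "v", "v"]
ancient      = list("吾好读书")
links        = [(0, 0), (1, 1), (2, 2), (2, 3)]
print(project_labels(modern_words, modern_pos, ancient, links))
# [('吾', 'r'), ('好', 'v'), ('读书', 'v')]
```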

Ⅳ、Train and evaluate the model

After obtaining the augmented data "tgt.shuf.seg_pos" from word alignment, you can train SIKU-RoBERTa (https://huggingface.co/SIKU-BERT/sikuroberta) following this repository (https://github.com/farlit/The-first-ancient-Chinese-word-segmentation-and-part-of-speech-tagging-code-and-analysis).

The augmented training dataset is "tgt.shuf.seg_pos"; the annotated training dataset is "zuozhuan_train_utf8"; the validation datasets are "EvaHan_testa_gold" and "EvaHan_testb_gold".

The annotated dataset and validation datasets are from https://github.com/RacheleSprugnoli/LT4HALA/tree/master/2022/data_and_doc.
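For orientation, here is a minimal sketch of how sikuroberta could be fine-tuned for joint segmentation and POS tagging as character-level token classification, using BMES-style position tags combined with POS labels. The label list, toy example, and setup are illustrative assumptions, not the training code of the linked repository (it also assumes the model ships a fast tokenizer):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Joint label space: segmentation position (B/M/E/S) combined with a POS tag,
# e.g. "B-v", "E-v", "S-n". This label list is a placeholder; the real tag set
# comes from the training data.
labels = ["B-v", "M-v", "E-v", "S-v", "B-n", "M-n", "E-n", "S-n"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = AutoModelForTokenClassification.from_pretrained(
    "SIKU-BERT/sikuroberta",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

# One toy character-level example (invented labels, for illustration only).
chars = ["学", "而", "时", "习", "之"]
tags  = ["S-v", "S-n", "S-n", "S-v", "S-n"]

enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
# Align labels to wordpieces; special tokens get the ignore index -100.
word_ids = enc.word_ids(0)
label_ids = [label2id[tags[w]] if w is not None else -100 for w in word_ids]
enc["labels"] = torch.tensor([label_ids])

loss = model(**enc).loss   # one forward pass; real training loops over batches
print(float(loss))
```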

Our code builds on https://github.com/hemingkx/WordSeg; thanks for their work.
