PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts

This repository contains the data and source code of the paper accepted at the InterSpeech2023, PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts.

Data

The dataset would be available upon request to this email address: yunxiang.li@link.cuhk.edu.hk

Model

We propose a Transformer-based neural network model to evaluate the PunCantonese corpus.

The model exploits pre-trained language models to obtain a good network initialization, a multi-task learning objective to prevent the network from paying too much attention to the largest subset of written-style sentences and a novel Jyutping embedding layer to represent a Cantonese character with its Jyutping sequence which potentially enables the model to incorporate phonetic features that are not explicitly available in Cantonese characters. Then we have one Bi-LSTM and one linear layer for classification.

Training

To train our model with default settings and with both multitask learning and jyutping embedding on, run the following command

python src/train_jyupin_multi.py --cuda=True --pretrained-model=bert-base-multilingual-uncased --freeze-bert=False --lstm-dim=-1 --seed=0 --lr=2e-5 --epoch=15 --use-crf=False --data-path=data --save-path=out --batch-size=32 --sequence-length=128 --loss=focal --multitask=True --jyutping=True

Acknowledgements

This project is based on modifications made to the original code from Punctuation Restoration using Transformer Models. Special thanks to the authors from this work for providing the excellent code base.

License

This repository is under the Apache License 2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
src		src
LICENSE		LICENSE
README.md		README.md
apache.org_licenses_LICENSE-2.0.txt		apache.org_licenses_LICENSE-2.0.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assets

assets

src

src

LICENSE

LICENSE

README.md

README.md

apache.org_licenses_LICENSE-2.0.txt

apache.org_licenses_LICENSE-2.0.txt

Repository files navigation

PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts

Data

Model

Training

Acknowledgements

License

About

Releases

Packages

Contributors 2

Languages

License

cpii-cai/PunCantonese

Folders and files

Latest commit

History

Repository files navigation

PunCantonese: A Benchmark Corpus for Low-Resource Cantonese Punctuation Restoration from Speech Transcripts

Data

Model

Training

Acknowledgements

License

About

Resources

License

Stars

Watchers

Forks

Languages