neural-punctuator

Companion code for our paper "Automatic punctuation restoration with BERT models", submitted to the XVII. Conference on Hungarian Computational Linguistics.

Abstract

We present an approach to automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on TED talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged F1-score of 79.8 in English and 82.2 in Hungarian.
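The task is typically framed as token classification: a BERT encoder produces a representation for each subword token, and a classification head predicts the punctuation mark that follows it. Below is a minimal sketch of that formulation using HuggingFace transformers; the checkpoint name and the four-label scheme are illustrative assumptions, not the exact configuration used in this repository.

```python
# Minimal sketch: punctuation restoration as token classification.
# "bert-base-uncased" and the label set are assumptions for illustration;
# the classification head is randomly initialized here, so real use
# requires fine-tuning (or loading a trained checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["EMPTY", "PERIOD", "COMMA", "QUESTION_MARK"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

text = "hello how are you i am fine"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
preds = logits.argmax(dim=-1).squeeze(0)   # one predicted label per subword

for token, label_id in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), preds
):
    print(f"{token:>10} -> {LABELS[int(label_id)]}")
```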

Repository Structure

.
├── docs
│   └── paper               # The submitted paper
├── notebooks               # Notebooks for data preparation/preprocessing
└── src
    └── neural_punctuator
        ├── base            # Base classes for training Torch models
        ├── configs         # YAML files defining the parameters of each model (example below)
        ├── models          # Torch model definitions
        ├── preprocessors   # Preprocessor class
        ├── trainers        # Training logic
        ├── utils           # Utility scripts (logging, metrics, tensorboard, etc.)
        └── wrappers        # Wrapper classes for the models, containing all the components needed for training/prediction
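As a purely hypothetical illustration of how such a YAML config might be consumed (the keys below are invented for this example and PyYAML is assumed; this is not the repository's actual schema):

```python
# Hypothetical config-loading sketch; keys are invented for illustration.
import yaml  # PyYAML

config_text = """
model:
  name: bert-base-uncased
  num_labels: 4
trainer:
  batch_size: 32
  learning_rate: 3.0e-5
  epochs: 5
"""

config = yaml.safe_load(config_text)
print(config["model"]["name"], config["trainer"]["learning_rate"])
```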

Datasets

TED Talks dataset (English) - http://hltc.cs.ust.hk/iwslt/index.php/evaluation-campaign/ted-task.html

Szeged Treebank (Hungarian) - https://rgai.inf.u-szeged.hu/node/113
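For training, punctuated corpora like these are typically converted into word/label pairs, where each word's label is the punctuation mark that follows it. A hypothetical sketch of that preprocessing step (label names are assumptions, not the repository's exact scheme):

```python
# Hypothetical preprocessing: strip trailing punctuation from each word
# and keep it as that word's target label. Label names are assumed.
PUNCT_TO_LABEL = {".": "PERIOD", ",": "COMMA", "?": "QUESTION_MARK"}

def make_examples(text):
    """Return (word, label) pairs; the label is the punctuation following the word."""
    pairs = []
    for token in text.split():
        trailing = token[-1] if token[-1] in PUNCT_TO_LABEL else None
        word = token.rstrip(".,?").lower()
        if word:
            pairs.append((word, PUNCT_TO_LABEL.get(trailing, "EMPTY")))
    return pairs

print(make_examples("Hello, how are you? I am fine."))
# [('hello', 'COMMA'), ('how', 'EMPTY'), ('are', 'EMPTY'),
#  ('you', 'QUESTION_MARK'), ('i', 'EMPTY'), ('am', 'EMPTY'),
#  ('fine', 'PERIOD')]
```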

Citation

If you use our work, please cite the following paper:

@article{nagy2021automatic,
  title={Automatic punctuation restoration with {BERT} models},
  author={Nagy, Attila and Bial, Bence and {\'A}cs, Judit},
  journal={arXiv preprint arXiv:2101.07343},
  year={2021}
}

Authors

Attila Nagy, Bence Bial, Judit Ács

Budapest University of Technology and Economics - Department of Automation and Applied Informatics