GitHub - amazon-science/statistical-byte-pair-encoding

A Statistical Extension of Byte-Pair Encoding

Code for the paper: A Statistical Extension of Byte-Pair Encoding by David Vilar and Marcello Federico.

Citation:

@inproceedings{vilar-federico-2021-statistical,
    title = "A Statistical Extension of Byte-Pair Encoding",
    author = "Vilar, David  and
      Federico, Marcello",
    booktitle = "Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)",
    month = aug,
    year = "2021",
    address = "Bangkok, Thailand (online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.iwslt-1.31",
    doi = "10.18653/v1/2021.iwslt-1.31",
    pages = "263--275",
    abstract = "Sub-word segmentation is currently a standard tool for training neural machine translation (MT) systems and other NLP tasks. The goal is to split words (both in the source and target languages) into smaller units which then constitute the input and output vocabularies of the MT system. The aim of reducing the size of the input and output vocabularies is to increase the generalization capabilities of the translation model, enabling the system to translate and generate infrequent and new (unseen) words at inference time by combining previously seen sub-word units. Ideally, we would expect the created units to have some linguistic meaning, so that words are created in a compositional way. However, the most popular word-splitting method, Byte-Pair Encoding (BPE), which originates from the data compression literature, does not include explicit criteria to favor linguistic splittings nor to find the optimal sub-word granularity for the given training data. In this paper, we propose a statistically motivated extension of the BPE algorithm and an effective convergence criterion that avoids the costly experimentation cycle needed to select the best sub-word vocabulary size. Experimental results with morphologically rich languages show that our model achieves nearly-optimal BLEU scores and produces morphologically better word segmentations, which allows to outperform BPE{'}s generalization in the translation of sentences containing new words, as shown via human evaluation.",
}
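For readers unfamiliar with the baseline that the paper extends, the following is a minimal sketch of classic BPE merge learning: repeatedly count adjacent symbol pairs and merge the most frequent one. It is not the code in `learn_bpe.py` and does not implement the statistical extension or the convergence criterion described in the paper. The function name `learn_plain_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def learn_plain_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` classic BPE merges (hypothetical helper).

    `word_freqs` maps each word to its corpus frequency.
    Returns the list of merged symbol pairs, in the order they were learned.
    """
    # Represent every word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself so the result
        # is deterministic (an arbitrary choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    # Toy word-frequency table, invented for this example.
    print(learn_plain_bpe({"abab": 3, "ab": 2}, num_merges=2))
```

In this plain variant the number of merges is a hyperparameter that has to be tuned by experiment; according to the abstract, avoiding that tuning cycle is what the paper's convergence criterion is meant to address.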

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

