Skip to content

Code for Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

License

Notifications You must be signed in to change notification settings

ahmed-alqazzaz/wtpsplit

 
 

Repository files navigation

wtpsplit🪓

Code for the paper Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation with Jonas Pfeiffer and Ivan Vulić, accepted at ACL 2023.

This repository contains wtpsplit, a package for robust and adaptible sentence segmentation across 85 languages, as well as the code and configs to reproduce the experiments in the paper.

Installation

pip install wtpsplit

Usage

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# optionally run on GPU for better performance
# also supports TPUs via e.g. wtp.to("xla:0"), in that case pass `pad_last_batch=True` to wtp.split
wtp.half().to("cuda")

# returns ["Hello ", "This is a test."]
wtp.split("Hello This is a test.")

# returns an iterator yielding a lists of sentences for every text
# do this instead of calling wtp.split on every text individually for much better performance
wtp.split(["Hello This is a test.", "And some more texts..."])

# if you're using a model with language adapters, also pass a `lang_code`
wtp.split("Hello This is a test.", lang_code="en")

# depending on your usecase, adaptation to e.g. the Universal Dependencies style may give better results
# this always requires a language code
wtp.split("Hello This is a test.", lang_code="en", style="ud")

ONNX support

You can enable ONNX inference for the wtp-bert-* models:

wtp = WtP("wtp-bert-mini", onnx_providers=["CUDAExecutionProvider"])

This requires onnxruntime and onnxruntime-gpu. It should give a good speedup on GPU!

>>> from wtpsplit import WtP
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model = WtP("wtp-bert-mini")
>>> model.half().to("cuda")
>>> %timeit list(model.split(texts))
272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# onnxruntime GPU
>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
>>> %timeit list(model.split(texts))
198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

  • The wtp-canine-* models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
  • This does not work with Python 3.7 because onnxruntime does not support the opset we need for py37.

Available Models

Pro tips: I recommend wtp-bert-mini for speed-sensitive applications, otherwise wtp-canine-s-12l. The *-no-adapters models provide a good tradeoff between speed and performance. You should probably not use wtp-bert-tiny.

Model English Score English Score
(adapted)
Multilingual Score Multilingual Score
(adapted)
wtp-bert-tiny 83.8 91.9 79.5 88.6
wtp-bert-mini 91.8 95.9 84.3 91.3
wtp-canine-s-1l 94.5 96.5 86.7 92.8
wtp-canine-s-1l-no-adapters 93.1 96.4 85.1 91.8
wtp-canine-s-3l 94.4 96.8 86.7 93.4
wtp-canine-s-3l-no-adapters 93.8 96.4 86 92.3
wtp-canine-s-6l 94.5 97.1 87 93.6
wtp-canine-s-6l-no-adapters 94.4 96.8 86.4 92.8
wtp-canine-s-9l 94.8 97 87.7 93.8
wtp-canine-s-9l-no-adapters 94.3 96.9 86.6 93
wtp-canine-s-12l 94.7 97.1 87.9 94
wtp-canine-s-12l-no-adapters 94.5 97 87.1 93.2

The scores are macro-average F1 score across all available datasets for "English", and macro-average F1 score across all datasets and languages for "Multilingual". "adapted" means adapation via WtP Punct; check out the paper for details.

For comparison, here's the English scores of some other tools:

Model English Score
SpaCy (sentencizer) 86.8
PySBD 69.8
SpaCy (dependency parser) 93.1
Ersatz 91.6
Punkt (nltk.sent_tokenize) 92.5

Paragraph Segmentation

Since WtP models are trained to predict newline probablity, they can segment text into paragraphs in addition to sentences.

# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
wtp.split(text, do_paragraph_segmentation=True)

Adaptation

WtP can adapt to the Universal Dependencies, OPUS100 or Ersatz corpus segmentation style in many languages by punctuation adaptation (preferred) or threshold adaptation.

Punctuation Adaptation

# this requires a `lang_code`
# check the paper or `wtp.mixtures` for supported styles
wtp.split(text, lang_code="en", style="ud")

This also allows changing the threshold, but inherently has higher thresholds values since it is not newline probablity anymore being thresholded:

wtp.split(text, lang_code="en", style="ud", threshold=0.7)

To get the default threshold for a style:

wtp.get_threshold("en", "ud", return_punctuation_threshold=True)

Threshold Adaptation

threshold = wtp.get_threshold("en", "ud")

wtp.split(text, threshold=threshold)

Advanced Usage

Get the newline or sentence boundary probabilities for a text:

# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")

Load a WtP model in HuggingFace transformers:

# import wtpsplit to register the custom models 
# (character-level BERT w/ hash embeddings and canine with language adapters)
import wtpsplit
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name

** NEW ** Adapt to your own corpus using WtP_Punct:

Clone the repository:

git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit

Create your data:

import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)

Run adaptation:

python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en

This should print something like

en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json

i.e. run adaptation on your data and save the mixtures and evaluation results. You can then load and use the mixture like this:

from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")

... and adjust the dataset name, language and model in the above to your needs.

Reproducing the paper

configs/ contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:

python wtpsplit/train/train.py configs/<config_name>.json

In addition:

  • wtpsplit/data_acquisition contains the code for obtaining evaluation data and raw text from the mC4 corpus.
  • wtpsplit/evaluation contains the code for:
    • intrinsic evaluation (i.e. sentence segmentation results) via adapt.py. The raw intrinsic results in JSON format are also at evaluation_results/
    • extrinsic evaluation on Machine Translation in extrinsic.py
    • baseline (PySBD, nltk, etc.) intrinsic evaluation in intrinsic_baselines.py
    • punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py

Supported Languages

iso Name
af Afrikaans
am Amharic
ar Arabic
az Azerbaijani
be Belarusian
bg Bulgarian
bn Bengali
ca Catalan
ceb Cebuano
cs Czech
cy Welsh
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fr French
fy Western Frisian
ga Irish
gd Scottish Gaelic
gl Galician
gu Gujarati
ha Hausa
he Hebrew
hi Hindi
hu Hungarian
hy Armenian
id Indonesian
ig Igbo
is Icelandic
it Italian
ja Japanese
jv Javanese
ka Georgian
kk Kazakh
km Central Khmer
kn Kannada
ko Korean
ku Kurdish
ky Kirghiz
la Latin
lt Lithuanian
lv Latvian
mg Malagasy
mk Macedonian
ml Malayalam
mn Mongolian
mr Marathi
ms Malay
mt Maltese
my Burmese
ne Nepali
nl Dutch
no Norwegian
pa Panjabi
pl Polish
ps Pushto
pt Portuguese
ro Romanian
ru Russian
si Sinhala
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
ta Tamil
te Telugu
tg Tajik
th Thai
tr Turkish
uk Ukrainian
ur Urdu
uz Uzbek
vi Vietnamese
xh Xhosa
yi Yiddish
yo Yoruba
zh Chinese
zu Zulu

Citation

Please cite wtpsplit as

@inproceedings{minixhofer-etal-2023-wheres,
    title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
    author = "Minixhofer, Benjamin  and
      Pfeiffer, Jonas  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.398",
    pages = "7215--7235",
    abstract = "Many NLP pipelines split text into sentences as one of the crucial preprocessing steps. Prior sentence segmentation tools either rely on punctuation or require a considerable amount of sentence-segmented training data: both central assumptions might fail when porting sentence segmenters to diverse languages on a massive scale. In this work, we thus introduce a multilingual punctuation-agnostic sentence segmentation method, currently covering 85 languages, trained in a self-supervised fashion on unsegmented text, by making use of newline characters which implicitly perform segmentation into paragraphs. We further propose an approach that adapts our method to the segmentation in a given corpus by using only a small number (64-256) of sentence-segmented examples. The main results indicate that our method outperforms all the prior best sentence-segmentation tools by an average of 6.1{\%} F1 points. Furthermore, we demonstrate that proper sentence segmentation has a point: the use of a (powerful) sentence segmenter makes a considerable difference for a downstream application such as machine translation (MT). By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points over the best prior segmentation tool, as well as massive gains over a trivial segmenter that splits text into equally-sized blocks.",
}

Acknowledgments

Ivan Vulić is supported by a personal Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no 221137; 2022–). Research supported with Cloud TPUs from Google’s TPU Research Cloud (TRC). We thank Christoph Minixhofer for advice in the initial stage of this project. We also thank Sebastian Ruder and Srini Narayanan for helpful feedback on a draft of the paper.

Previous Version

This repository previously contained nnsplit, the precursor to wtpsplit. You can still use the nnsplit branch (or the nnsplit PyPI releases) for the old version, however, this is highly discouraged and not maintained! Please let me know if you have a usecase which nnsplit can solve but wtpsplit can not.

About

Code for Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%