PolyNorm

This dataset accompanies the research paper, PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech.

PolyNorm-Bench: Description

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can achieve high accuracy, but they require substantial engineering effort, are difficult to scale, and limit language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in word error rate (WER) compared to a production-grade system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
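For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between a reference and a system output, divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation code used in the paper. The function name `word_error_rate` and the example sentences are hypothetical, and whitespace tokenization is an assumption that does not hold for languages written without spaces, such as Japanese and Mandarin Chinese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. This is only an
    illustrative sketch, not the paper's evaluation pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one unnormalized token left in the output.
    print(word_error_rate("it costs five dollars", "it costs 5 dollars"))
```

Here one of the four reference words is substituted, so the score is 0.25. A lower WER means the normalized output is closer to the reference spoken form.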

Documentation

PolyNorm-Bench is a dataset of paired unnormalized and normalized text, generated with DeepSeek-R1 as the foundation model, covering 8 locales and 27 categories. Each synthetic example is edited and verified by internal language experts to ensure linguistic precision, naturalness, and overall quality. The orthography and formatting of the synthetic data follow the conventions of each target language. There are 20 examples per category per language, resulting in 540 data points per language.

Languages

American English, German, French, Mexican Spanish, Italian, Lithuanian, Japanese, and Mandarin Chinese.
