This dataset accompanies the research paper, PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech.
Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
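For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance between a reference normalization and a system output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring may tokenize differently, and whitespace splitting is not suitable as-is for languages written without spaces, such as Japanese and Mandarin Chinese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization here is plain whitespace splitting,
    which is an assumption and not necessarily what the paper uses.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of three reference words.
    print(word_error_rate("twenty five dollars", "twenty dollars"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.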
PolyNorm-Bench is a dataset of unnormalized and normalized text covering 8 locales and 27 categories, with examples generated using DeepSeek-R1 as the foundation model. Each synthetic example is edited and verified by internal language experts to ensure the highest standards of linguistic precision, naturalness, and overall quality. The orthography and formatting of the synthetic data follow the conventions of each target language. There are 20 examples per category per language, resulting in 540 high-quality data points per language. The 27 categories are:
- Cardinal
- Date
- Decimal
- Ordinal
- Fraction
- Time
- Currency
- Unit (Measure)
- Address
- Acronym/Initialism
- ISBN
- Biological Classification
- Roman Numeral
- Telephone
- Sports Score
- Mathematical Expression
- Symbol
- Abbreviation
- Chemical Formula
- Legal Reference
- Vehicle/Product Code
- Geographic Coordinates
- Version Number
- License/Serial Number
- Musical Notation
- Stock Ticker
- Electronic (URL/Email)
The 8 locales are American English, German, French, Mexican Spanish, Italian, Lithuanian, Japanese, and Mandarin Chinese.
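The stated sizes can be checked with a short sketch like the one below. It derives the per-language figure (20 × 27 = 540) and, by multiplying across the 8 locales, a total of 4,320 examples. It also includes a helper that flags any locale and category pair whose count differs from 20. The record layout (dictionaries with `locale` and `category` keys) and the locale codes in the usage example are hypothetical, since the file format is not described here.

```python
from collections import Counter

EXAMPLES_PER_CATEGORY = 20  # stated in the dataset description
NUM_CATEGORIES = 27         # stated in the dataset description
NUM_LOCALES = 8             # stated in the dataset description


def expected_sizes() -> tuple[int, int]:
    """Return (examples per locale, examples overall)."""
    per_locale = EXAMPLES_PER_CATEGORY * NUM_CATEGORIES
    return per_locale, per_locale * NUM_LOCALES


def find_count_mismatches(records, expected=EXAMPLES_PER_CATEGORY):
    """Map each (locale, category) pair to its count when it differs from `expected`.

    `records` is assumed to be an iterable of dicts with hypothetical
    "locale" and "category" keys. Pairs that are missing entirely are
    not reported, because only observed pairs are counted.
    """
    counts = Counter((r["locale"], r["category"]) for r in records)
    return {pair: n for pair, n in counts.items() if n != expected}


if __name__ == "__main__":
    print(expected_sizes())
    # Hypothetical records: one complete pair and one pair short by one.
    sample = [{"locale": "en-US", "category": "Cardinal"}] * 20
    sample += [{"locale": "de-DE", "category": "Date"}] * 19
    print(find_count_mismatches(sample))
```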