apple/ml-speech-polynorm-bench

PolyNorm

This dataset accompanies the research paper, PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech.

PolyNorm-Bench: Description

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but they involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade-based system. To support further research, we release PolyNorm-Benchmark, a multilingual dataset covering a diverse range of text normalization phenomena.
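For readers unfamiliar with the metric, the following is a minimal sketch of word error rate computed as word-level edit distance divided by the number of reference words. It is not the evaluation pipeline from the paper: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and whitespace splitting is not suitable for languages written without spaces, such as Japanese and Mandarin Chinese.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace. This is an
    illustrative simplification, not the paper's tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, comparing the reference "three point five kilograms" with the hypothesis "three point five kg" gives one substitution over four reference words, so the function returns 0.25.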

Documentation

PolyNorm-Bench is a dataset of unnormalized and normalized text across 8 locales and 27 categories, generated with DeepSeek-R1 as the foundation model. Each synthetic example is edited and verified by internal language experts for linguistic precision, naturalness, and overall quality. The orthography and formatting of the synthetic data follow the conventions of each target language. There are 20 examples per category per language, giving 540 data points per language (27 × 20).

Normalization Categories

  1. Cardinal
  2. Date
  3. Decimal
  4. Ordinal
  5. Fraction
  6. Time
  7. Currency
  8. Unit (Measure)
  9. Address
  10. Acronym/Initialism
  11. ISBN
  12. Biological Classification
  13. Roman Numeral
  14. Telephone
  15. Sports Score
  16. Mathematical Expression
  17. Symbol
  18. Abbreviation
  19. Chemical Formula
  20. Legal Reference
  21. Vehicle/Product Code
  22. Geographic Coordinates
  23. Version Number
  24. License/Serial Number
  25. Musical Notation
  26. Stock Ticker
  27. Electronic (URL/Email)

Languages

American English, German, French, Mexican Spanish, Italian, Lithuanian, Japanese, and Mandarin Chinese
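The stated layout (8 locales, 27 categories, 20 examples per category per language) can be sanity-checked after loading the data. The sketch below is hypothetical: this README does not describe the file format, so the record shape (dictionaries with `locale` and `category` keys) and the locale codes used in the usage example are assumptions for illustration only.

```python
from collections import Counter

LOCALES = 8
CATEGORIES = 27
EXAMPLES_PER_CATEGORY = 20


def expected_sizes():
    """Return (examples per language, examples overall) from the stated layout."""
    per_language = CATEGORIES * EXAMPLES_PER_CATEGORY
    return per_language, per_language * LOCALES


def find_unbalanced(records, expected=EXAMPLES_PER_CATEGORY):
    """Return the (locale, category) pairs whose example count differs from `expected`.

    Assumption: each record is a dict with "locale" and "category" keys.
    Pairs that are missing from `records` entirely are not reported.
    """
    counts = Counter((r["locale"], r["category"]) for r in records)
    return {pair: n for pair, n in counts.items() if n != expected}
```

A usage sketch with made-up records:

```python
records = [{"locale": "en-US", "category": "Cardinal"}] * 20
records += [{"locale": "de-DE", "category": "Date"}] * 19
print(expected_sizes())          # (540, 4320)
print(find_unbalanced(records))  # {('de-DE', 'Date'): 19}
```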
