Commas and points inside numbers are considered like punctuation #87

Closed
donand opened this issue Nov 15, 2021 · 2 comments
donand commented Nov 15, 2021

Describe the bug
The library does not transcribe commas and points inside numbers as decimal separators; it treats them as ordinary punctuation.

Phonemizer version

phonemizer-3.0
available backends: espeak-ng-1.50, segments-2.2.0
uninstalled backends: espeak-mbrola, festival

System

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
Python 3.9.5

To reproduce
Italian

phonemize("4,16 metri", language='it', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> kwˈatːro ,sˈeditʃɪ mˈetrɪ

English

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> fˈoːɹ .sˈɪkstiːn mˈiːɾɚz

Expected behavior
Commas and points inside numbers should be transcribed as part of the number instead of being treated as punctuation.
Italian

phonemize("4,16 metri", language='it', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> kwˈatːro vˈirɡola sˈeditʃɪ mˈetrɪ

English

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags')

>> fˈoːɹ pɔɪnt wˈʌn sˈɪks mˈiːɾɚz

Additional context
If I set preserve_punctuation=False, the comma or point inside the number is simply dropped and not transcribed.

Collaborator

mmmaat commented Nov 15, 2021

Hi, indeed this may be problematic. We need to experiment with this to detect whether a comma or point is surrounded by digits.
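
As a rough illustration of that idea (a minimal sketch, not phonemizer's actual implementation), a regular expression with a lookbehind and a lookahead can tell a comma or point sitting between two digits apart from ordinary punctuation. The names `NUMERIC_SEPARATOR` and `find_numeric_separators` are hypothetical and only use Python's standard `re` module.

```python
import re

# Matches a comma or point that has a digit on both sides, e.g. "4,16" or "4.16".
# Hypothetical helper pattern, not part of phonemizer.
NUMERIC_SEPARATOR = re.compile(r'(?<=\d)[,.](?=\d)')


def find_numeric_separators(text):
    """Return the indices of commas/points that sit between two digits."""
    return [m.start() for m in NUMERIC_SEPARATOR.finditer(text)]


# Only the separator inside the number is reported; the comma after "metri"
# and the final point are left to be handled as regular punctuation.
print(find_numeric_separators("4,16 metri, circa."))
print(find_numeric_separators("4.16 meters. Done."))
```

Anything this pattern does not match could still be handled as regular punctuation, while the matched positions would be passed through to the backend untouched.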

Collaborator

jncasey commented Apr 22, 2022

@donand With the most recently merged PR, it's now possible to achieve what you're after by defining the punctuation with regular expressions.

The default marks are defined as the string ;:,.!?¡¿—…"«»“”

If you instead set the marks to the regular expression [;:!?¡¿—…"«»“”]|[,.](?!\d), commas and periods that are immediately followed by a digit won't be treated as punctuation.

import re
from phonemizer import phonemize

phonemize("4.16 meters", language='en-us', backend='espeak', preserve_punctuation=True, with_stress=True, language_switch='remove-flags', punctuation_marks=re.compile(r'[;:!?¡¿—…"«»“”]|[,.](?!\d)'))

Or, via the command line, with the new parameter --punctuation-marks-is-regex:

echo "4.16 meters" | phonemize --preserve-punctuation --with-stress --language-switch remove-flags --punctuation-marks '[;:!?¡¿—…"«»“”]|[,.](?!\d)' --punctuation-marks-is-regex 

returns

fˈoːɹ pɔɪnt wˈʌn sˈɪks mˈiːɾɚz 
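
To see which characters that pattern would pick out as punctuation, here is a small sketch using only Python's standard `re` module; it does not call phonemizer. The helper name `punctuation_found` is made up for illustration.

```python
import re

# Pattern quoted from the comment above: the usual marks, plus commas and
# periods only when they are NOT immediately followed by a digit.
MARKS = re.compile(r'[;:!?¡¿—…"«»“”]|[,.](?!\d)')


def punctuation_found(text):
    """Return the substrings the pattern would treat as punctuation."""
    return MARKS.findall(text)


# The decimal point in "4.16" is skipped; the comma and final period match.
print(punctuation_found("4.16 meters, roughly."))
# The decimal comma in "4,16" is skipped; only the exclamation mark matches.
print(punctuation_found("4,16 metri!"))
```

Note that the lookahead only checks the character after the mark, so a comma or period directly followed by a digit is kept even when no digit precedes it (for example ".5").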

mmmaat closed this as completed Nov 25, 2022