Request: more flexibility around punctuation definitions #99
Comments
@hadware Thinking about this request again. How about adding an option to define a punctuation regex directly? It would override the default punctuation marks, and probably throw an exception if a user tries to define both custom punctuation and a punctuation regex? This is probably the easiest path to get the features I need, without changing the default behavior people have come to expect. I can try working something up if that feature makes sense to you.
Hi @jncasey and thank you for your involvement in this project! Actually, you can already provide punctuation marks as a parameter, both from the CLI and the API, as a If you prefer to pass a precompiled regex directly instead of a string, it would be possible, for instance, by adding a
I think I agree with @mmmaat's suggestion. I'd rather have the signature of
def __init__(self, marks: Optional[str] = None, pattern: Optional[Union[str, re.Pattern]] = None):
    ...
to make things more obvious maybe? (and if both args are
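For readers following along, here is a minimal sketch of how a constructor with that signature could behave, including the "raise if both are given" rule suggested earlier in the thread. It is an illustration only, not phonemizer's actual code: the `DEFAULT_MARKS` value, the `regex` attribute, and the `remove` method are assumptions made for the example.

```python
import re
from typing import Optional, Union

# Hypothetical default for this sketch; not necessarily phonemizer's default.
DEFAULT_MARKS = ';:,.!?'


class Punctuation:
    """Illustrative stand-in showing the 'marks' vs 'pattern' option."""

    def __init__(self, marks: Optional[str] = None,
                 pattern: Optional[Union[str, re.Pattern]] = None):
        # The two options are mutually exclusive.
        if marks is not None and pattern is not None:
            raise ValueError("specify either 'marks' or 'pattern', not both")
        if pattern is not None:
            # re.compile accepts a str or an already compiled pattern.
            self.regex = re.compile(pattern)
        else:
            chars = DEFAULT_MARKS if marks is None else marks
            # Escape the marks so characters like ']' or '-' stay literal.
            self.regex = re.compile('[' + re.escape(chars) + ']+')

    def remove(self, text: str) -> str:
        """Replace every punctuation match by a space and squeeze spaces."""
        return ' '.join(self.regex.sub(' ', text).split())
```

With this shape, `Punctuation()` keeps the default behavior, `Punctuation(marks='.,')` narrows the marks, and `Punctuation(pattern=r'[^a-z ]+')` hands full control to the caller.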
Thanks, @mmmaat! I'm aware that it's possible to pass punctuation marks as a string, but in my case, I'm looking to treat all characters that aren't alphanumeric (or accented alpha) as punctuation. That would include the standard marks on the keyboard, but also emoji and lots of other Unicode characters. So it's not really practical to pass a list of every possibility, which is why I'd like to pass the punctuation regex directly as a parameter. I like your thought of an additional Does that matter?
@hadware Oops, we responded at the same time. I can work off of that signature. And the corresponding new
Yes! That would be perfect!
(And, again, you can start right away with a [WIP] PR if you feel like working on this now) PS: you're the best 🚀
Ok @jncasey my idea is incompatible with the CLI. Why not add a boolean flag What do you think?
In that case, the flag would be checked in main.py, and if it's true, simply But ultimately, it seems like a stylistic choice. Happy to implement it either way – I'll follow the lead of whatever you guys collectively decide.
Is your feature request related to a problem? Please describe.
I'd like more flexibility in defining punctuation, ideally by having access directly to the regex.
Specifically, instead of defining the characters to be counted as punctuation, I think it'd be more useful to me to define which characters are words to be phonemized, and treat everything else as punctuation.
Describe the solution you'd like
Something as broad as
[^\p{L}\p{M}0-9']
could work as a default, which from what I understand would capture everything that's not a number, a Unicode letter, or its diacritics. That may be overly broad, though, because I've already run into trouble with espeak and characters from the Cyrillic and Korean sets, and I'd imagine characters from other less-supported languages could also be problematic.
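To make the proposed character class concrete: Python's built-in `re` module does not support `\p{L}` or `\p{M}` (the third-party `regex` package does), so the sketch below approximates the same rule with `unicodedata`. The function names are made up for illustration. Note that, read literally, the class also treats whitespace as punctuation, which the example output reflects.

```python
import unicodedata
from itertools import groupby


def is_word_char(ch: str) -> bool:
    """True for Unicode letters (L*), combining marks (M*), 0-9 and "'"."""
    return unicodedata.category(ch)[0] in ('L', 'M') or ch in "0123456789'"


def split_runs(text: str):
    """Group consecutive characters into (is_word, chunk) runs."""
    return [(is_word, ''.join(chars))
            for is_word, chars in groupby(text, key=is_word_char)]


if __name__ == '__main__':
    # Everything that is not a letter, mark, digit or apostrophe
    # (including spaces and emoji) ends up in the False runs.
    print(split_runs("don't stop, wörld 🙂"))
```

Letters from any script (Cyrillic, Hangul, and so on) count as word characters here, which is exactly the over-breadth concern raised above when a backend cannot pronounce them.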
Describe alternatives you've considered
In my local copy of phonemizer, I've played with hard-coding the punctuation regex like so:
which captures everything that's not a Latin character or one of a set of marks that the backends can pronounce, like "æt" for "@".
The back half of this is also an attempt to handle the problem raised in #87, though I haven't tested it much, and there may be some cases where it breaks.