Request: more flexibility around punctuation definitions #99

jncasey · 2021-12-13T23:48:04Z

Is your feature request related to a problem? Please describe.
I'd like more flexibility in defining punctuation, ideally by having access directly to the regex.

Specifically, instead of defining the characters to be counted as punctuation, I think it'd be more useful to me to define which characters are words to be phonemized, and treat everything else as punctuation.

Describe the solution you'd like
Something as broad as [^\p{L}\p{M}0-9'] could work as a default, which from what I understand would capture everything that's not a number, unicode letter or its diacritics.

That may be overly broad, though, because I've run into trouble with espeak and characters from Cyrillic and Korean sets already, and I'd imagine characters from other less-supported languages could also be problematic.

Describe alternatives you've considered
In my local copy of phonemizer, I've played with hard-coding the punctuation regex like so:

marks = "[^a-zA-ZÀ-ÖØ-öø-ÿ0-9',.$@&+\\-=/\\\]"
self._marks_re = re.compile(fr'(\s*{marks}+\s*|\s*(?<!\d),\s*|\s*(?<!\d)\.(?!\d)\s*)+')

which captures everything that's not a latin character or a set of marks that the backends can pronounce, like "æt" for "@".

The back half of this is also an attempt to handle the problem raised in #87, though I haven't tested it much, and there may be some cases where it breaks.

The text was updated successfully, but these errors were encountered:

jncasey · 2022-04-16T18:11:35Z

@hadware Thinking about this request again. How about adding an option to define a punctuation regex directly? It would override the default punctuation marks, and probably throw an exception if a user tries to define both custom punctuation and a punctuation regex? This is probably the easiest path to get the features I need, without changing the default behavior people have come to expect.

I can try working something up if that feature makes sense to you.

mmmaat · 2022-04-19T15:51:18Z

Hi @jncasey and thank you for your implication in this project!

Actually, you can already provide punctuation marks as parameter, both from CLI and API as a str that is fed to the Punctuation class and compiled as a regex.

If you prefer to specify directly a precompiled regex instead of a string, it would be possible, for instance, by adding a isinstance check in Punctuation.__init__ (here). But, in that case, you'll lose the Punctuation.mark attribute. I don't think it is really used in the code, so we can imagine raising an exception when accessing a mark attribute from a Punctuation instance built from a regex...

hadware · 2022-04-19T16:28:31Z

I think I agree with @mmmaat 's suggestion. I'd rather have the signature of Punctuation.__init__ be something like

def __init__(self, marks: Optional[str] = None, pattern: Optional[Union[str, re.Pattern]] = None):
   ...

to make things more obvious maybe? (and if both args are None, it'd still default to DEFAULT_MARKS)

jncasey · 2022-04-19T16:29:21Z

Thanks, @mmmaat!

I'm aware that it's possible to pass punctuation marks as a string, but in my case, I'm looking to treat all characters that aren't alphanumeric (or accented alpha) as punctuation. That would include the standard marks on the keyboard, but also emoji and lots of other unicode characters. So it's not really practical to pass a list of every possibility.

Which is why I'd like to pass the punctuation regex directly as a parameter.

I like your thought of an additional isinstance check allowing for marks to be a precompiled regex. That would work perfectly in my use case, and require minimal changes everywhere else. But with that approach, it wouldn't be possible to expose the functionality to CLI version, right? There'd be no way to know if the --punctuation-marks parameter was intended as a regex or not.

Does that matter?

jncasey · 2022-04-19T16:33:24Z

@hadware Oops, we responded at the same time.

I can work off of that signature. And the corresponding new phonemize() param would be punctuation_pattern and the CLI param would be --punctuation-pattern?

hadware · 2022-04-19T16:34:52Z

Yes! That would be perfect!

hadware · 2022-04-19T16:35:34Z

(And, again, you can start right away with a [WIP] PR if you feel like working on this now)

PS: you're the best 🚀

mmmaat · 2022-04-19T16:48:41Z

Ok @jncasey my idea is incompatible with the CLI.

Why not to add a boolean flag --punctuation-marks-is-regex defaulting to False from the CLI? The name is pretty long, OK, but the punctuation_pattern name is very ambiguous to me and it's confusing to have both marks and pattern for the same functionality.

What do you think?

jncasey · 2022-04-19T17:00:28Z

In that case, the flag would be checked in main.py, and if it's true, simply args.punctuation_marks = re.compile(args.punctuation_marks)? That's pretty clean.

But ultimately, it seems like a stylistic choice. Happy to implement it either way – I'll follow the lead of whatever you guys collectively decide.

mmmaat assigned hadware Dec 14, 2021

mmmaat added the feature request label Dec 14, 2021

hadware assigned jncasey Apr 19, 2022

jncasey mentioned this issue Apr 19, 2022

add direct access to punctuation regex #120

Merged

mmmaat closed this as completed in #120 Apr 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request: more flexibility around punctuation definitions #99

Request: more flexibility around punctuation definitions #99

jncasey commented Dec 13, 2021

jncasey commented Apr 16, 2022

mmmaat commented Apr 19, 2022

hadware commented Apr 19, 2022 •

edited

jncasey commented Apr 19, 2022

jncasey commented Apr 19, 2022

hadware commented Apr 19, 2022

hadware commented Apr 19, 2022

mmmaat commented Apr 19, 2022 •

edited

jncasey commented Apr 19, 2022

Request: more flexibility around punctuation definitions #99

Request: more flexibility around punctuation definitions #99

Comments

jncasey commented Dec 13, 2021

jncasey commented Apr 16, 2022

mmmaat commented Apr 19, 2022

hadware commented Apr 19, 2022 • edited

jncasey commented Apr 19, 2022

jncasey commented Apr 19, 2022

hadware commented Apr 19, 2022

hadware commented Apr 19, 2022

mmmaat commented Apr 19, 2022 • edited

jncasey commented Apr 19, 2022

hadware commented Apr 19, 2022 •

edited

mmmaat commented Apr 19, 2022 •

edited