Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: more flexibility around punctuation definitions #99

Closed
jncasey opened this issue Dec 13, 2021 · 9 comments · Fixed by #120
Closed

Request: more flexibility around punctuation definitions #99

jncasey opened this issue Dec 13, 2021 · 9 comments · Fixed by #120
Assignees

Comments

@jncasey
Copy link
Collaborator

jncasey commented Dec 13, 2021

Is your feature request related to a problem? Please describe.
I'd like more flexibility in defining punctuation, ideally by having access directly to the regex.

Specifically, instead of defining the characters to be counted as punctuation, I think it'd be more useful to me to define which characters are words to be phonemized, and treat everything else as punctuation.

Describe the solution you'd like
Something as broad as [^\p{L}\p{M}0-9'] could work as a default, which from what I understand would capture everything that's not a number, unicode letter or its diacritics.

That may be overly broad, though, because I've run into trouble with espeak and characters from Cyrillic and Korean sets already, and I'd imagine characters from other less-supported languages could also be problematic.

Describe alternatives you've considered
In my local copy of phonemizer, I've played with hard-coding the punctuation regex like so:

marks = "[^a-zA-ZÀ-ÖØ-öø-ÿ0-9',.$@&+\\-=/\\\]"
self._marks_re = re.compile(fr'(\s*{marks}+\s*|\s*(?<!\d),\s*|\s*(?<!\d)\.(?!\d)\s*)+')

which captures everything that's not a latin character or a set of marks that the backends can pronounce, like "æt" for "@".

The back half of this is also an attempt to handle the problem raised in #87, though I haven't tested it much, and there may be some cases where it breaks.

@jncasey
Copy link
Collaborator Author

jncasey commented Apr 16, 2022

@hadware Thinking about this request again. How about adding an option to define a punctuation regex directly? It would override the default punctuation marks, and probably throw an exception if a user tries to define both custom punctuation and a punctuation regex? This is probably the easiest path to get the features I need, without changing the default behavior people have come to expect.

I can try working something up if that feature makes sense to you.

@mmmaat
Copy link
Collaborator

mmmaat commented Apr 19, 2022

Hi @jncasey and thank you for your implication in this project!

Actually, you can already provide punctuation marks as parameter, both from CLI and API as a str that is fed to the Punctuation class and compiled as a regex.

If you prefer to specify directly a precompiled regex instead of a string, it would be possible, for instance, by adding a isinstance check in Punctuation.__init__ (here). But, in that case, you'll lose the Punctuation.mark attribute. I don't think it is really used in the code, so we can imagine raising an exception when accessing a mark attribute from a Punctuation instance built from a regex...

@hadware
Copy link
Collaborator

hadware commented Apr 19, 2022

I think I agree with @mmmaat 's suggestion. I'd rather have the signature of Punctuation.__init__ be something like

def __init__(self, marks: Optional[str] = None, pattern: Optional[Union[str, re.Pattern]] = None):
   ...

to make things more obvious maybe? (and if both args are None, it'd still default to DEFAULT_MARKS)

@jncasey
Copy link
Collaborator Author

jncasey commented Apr 19, 2022

Thanks, @mmmaat!

I'm aware that it's possible to pass punctuation marks as a string, but in my case, I'm looking to treat all characters that aren't alphanumeric (or accented alpha) as punctuation. That would include the standard marks on the keyboard, but also emoji and lots of other unicode characters. So it's not really practical to pass a list of every possibility.

Which is why I'd like to pass the punctuation regex directly as a parameter.

I like your thought of an additional isinstance check allowing for marks to be a precompiled regex. That would work perfectly in my use case, and require minimal changes everywhere else. But with that approach, it wouldn't be possible to expose the functionality to CLI version, right? There'd be no way to know if the --punctuation-marks parameter was intended as a regex or not.

Does that matter?

@jncasey
Copy link
Collaborator Author

jncasey commented Apr 19, 2022

@hadware Oops, we responded at the same time.

I can work off of that signature. And the corresponding new phonemize() param would be punctuation_pattern and the CLI param would be --punctuation-pattern?

@hadware
Copy link
Collaborator

hadware commented Apr 19, 2022

Yes! That would be perfect!

@hadware
Copy link
Collaborator

hadware commented Apr 19, 2022

(And, again, you can start right away with a [WIP] PR if you feel like working on this now)

PS: you're the best 🚀

@mmmaat
Copy link
Collaborator

mmmaat commented Apr 19, 2022

Ok @jncasey my idea is incompatible with the CLI.

Why not to add a boolean flag --punctuation-marks-is-regex defaulting to False from the CLI? The name is pretty long, OK, but the punctuation_pattern name is very ambiguous to me and it's confusing to have both marks and pattern for the same functionality.

What do you think?

@jncasey
Copy link
Collaborator Author

jncasey commented Apr 19, 2022

In that case, the flag would be checked in main.py, and if it's true, simply args.punctuation_marks = re.compile(args.punctuation_marks)? That's pretty clean.

But ultimately, it seems like a stylistic choice. Happy to implement it either way – I'll follow the lead of whatever you guys collectively decide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants