Word level mapping of phonemized output #96

Open
mmmaat opened this issue Dec 3, 2021 · 6 comments

Comments

@mmmaat
Collaborator

mmmaat commented Dec 3, 2021

Suggested by @CorentinJ, see his implementation here.

At some point we got interested in being able to map from characters in the input text of our TTS system to its audio output. That required being able to map from an orthographic input to its phonemized output. Since your library does not provide such mappings, and since espeak doesn't seem to either, I wrote an algorithm to figure them out. It operates at the word level, and we verified that it is correct even for complex edge cases.

For instance, an example with edge cases (on the -> "ɔnðɪ", Youtubers -> "juː ɾuːbɚz"):
|Youtubers| |no| |longer| |belong| |on the| |internet| becomes
|juː ɾuːbɚz| |noʊ| |lɑːŋɡɚ| |bᵻlɔŋ| |ɔnðɪ| |ɪntɚnɛt|.

It must still be decided how to implement that:

  • As a custom word separator: juːɾuːbɚz [Youtubers] noʊ [no] lɑːŋɡɚ [longer] bᵻlɔŋ [belong] ɔnðɪ [on the] ɪntɚnɛt [internet].
  • As an extension of the prepend-text option with a tree-like structure as output:
    [('Youtubers no longer belong on the internet', [
        ('Youtubers', ['juː', 'ɾuːbɚz']),
        ('no', ['noʊ']),
        ...
        ('internet', ['ɪntɚnɛt'])
    ])]
  • A completely new option
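
To make the first two options concrete, here is a minimal sketch that derives the inline form from the tree-like form. The `render_inline` helper and the `WordMapping` alias are hypothetical and not part of phonemizer. The data is copied from the example above, and joining a word's phonemized chunks without a space is an assumption based on how `juːɾuːbɚz` is written in the first option.

```python
from typing import List, Tuple

# One entry per orthographic unit: (source text, phonemized chunks).
WordMapping = List[Tuple[str, List[str]]]


def render_inline(mapping: WordMapping) -> str:
    """Render a word-level mapping as space-separated 'phonemes [word]' pairs."""
    return " ".join(f"{''.join(chunks)} [{word}]" for word, chunks in mapping)


# Data taken from the example in this issue.
mapping: WordMapping = [
    ("Youtubers", ["juː", "ɾuːbɚz"]),
    ("no", ["noʊ"]),
    ("longer", ["lɑːŋɡɚ"]),
    ("belong", ["bᵻlɔŋ"]),
    ("on the", ["ɔnðɪ"]),
    ("internet", ["ɪntɚnɛt"]),
]

print(render_inline(mapping))
```

Keeping the tree-like structure as the primary representation would let the inline form be produced on demand, whereas the reverse direction requires parsing.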

This feature seems to be incompatible with phone/syllable separators. What about punctuation preservation?

@trenslow

trenslow commented May 9, 2023

First off, thanks for the awesome tool. It has made my life so much easier in a lot of respects.

Has there been any progress made on this topic? It's something that could be really useful.

One simple solution I tried was to use the same word separator that I sent to .phonemize to split the original text. However, sometimes eSpeak-NG merges words, so the number of 'words' I get back from .phonemize doesn't align with the number I get back from the original text split (e.g. the "That's it, words are merged." example from the documentation).
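
A minimal sketch of that naive alignment, using the example from the issue description rather than a live phonemizer call. The phonemized strings are assumed to be space-separated, and `naive_align` is a hypothetical helper. It suggests that a count check catches a merge on its own, but that equal counts do not guarantee a correct pairing:

```python
from typing import List, Optional, Tuple


def naive_align(
    text: str, phonemized: str, sep: str = " "
) -> Optional[List[Tuple[str, str]]]:
    """Pair words positionally; return None when the word counts differ."""
    words = text.split(sep)
    phones = phonemized.split(sep)
    if len(words) != len(phones):
        return None
    return list(zip(words, phones))


# "on the" is merged into one phonemized word: 3 words vs 2, so no alignment.
merged = naive_align("on the internet", "ɔnðɪ ɪntɚnɛt")

# Counts match (7 vs 7) because one split offsets one merge,
# yet every pair between "Youtubers" and "internet" is shifted by one.
shifted = naive_align(
    "Youtubers no longer belong on the internet",
    "juː ɾuːbɚz noʊ lɑːŋɡɚ bᵻlɔŋ ɔnðɪ ɪntɚnɛt",
)

print(merged)
print(shifted[1])
```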

I don't think the merging is configurable on the eSpeak side, as the merging comes from the underlying pronunciation dictionaries. So what remains is sending the split words one-by-one. This also isn't perfect, as you lose information about e.g. sandhi effects.
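
The word-by-word fallback could look roughly like the sketch below. `phonemize_words` is a hypothetical helper, and the phonemizing function is passed in as a callable so the sketch does not depend on a particular backend. The commented wiring assumes phonemizer's `phonemize` function with the espeak backend, and the language code is only an example.

```python
from typing import Callable, List, Tuple


def phonemize_words(
    text: str, phonemize_fn: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Phonemize each whitespace-separated word in isolation.

    The word-to-phonemes mapping is exact by construction, but cross-word
    effects (merging, sandhi) are lost because the backend never sees the
    neighbouring words.
    """
    return [(word, phonemize_fn(word)) for word in text.split()]


# Possible wiring with phonemizer (requires espeak-ng to be installed):
# from phonemizer import phonemize
# pairs = phonemize_words(
#     "on the internet",
#     lambda w: phonemize(w, language="en-us", backend="espeak", strip=True),
# )
```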

Maybe there's some information flowing from eSpeak about which words are merged which could be provided to phonemizer users? That would allow people to at least have the choice of what to do about that information.

@mmmaat
Collaborator Author

mmmaat commented May 9, 2023

No progress so far... Did you try the code by @CorentinJ here?

@trenslow

I didn't, as my use case is for languages other than English.

@CorentinJ

Word-level mappings should work for all languages with the algo I provided.

@trenslow

Ah ok! I will explore it ASAP.

@mmmaat
Collaborator Author

mmmaat commented May 25, 2023

If someone wants to open a PR for this, that would be great. I have no time for this project in the next few months...


3 participants