Multilingual tokenizer #2229

WeberJulian · 2022-12-20T17:16:05Z

Add ability to specify language and tokenizer for each dataset.

from TTS.tts.utils.text.phonemizers.multi_phonemizer import MultiPhonemizer

# One sample sentence per language code.
texts = {
    "tr": "Merhaba, bu Türkçe bir örnek!",
    "en-us": "Hello, this is an English example!",
    "de": "Hallo, das ist ein deutsches Beispiel!",
    "zh-cn": "这是中国的例子",
}
phonemes = {}
# Map each language to a phonemizer; "" selects the default phonemizer for that language.
ph = MultiPhonemizer({"tr": "espeak", "en-us": "", "de": "gruut", "zh-cn": ""})
for lang, text in texts.items():
    # Pass the language by keyword so it cannot be mistaken for another positional argument.
    phonemes[lang] = ph.phonemize(text, language=lang)
print(phonemes)

You set the language and phonemizer for each dataset in the config/recipe. If no phonemizer is specified, the default phonemizer for that language is used.
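The fallback rule described here can be illustrated with a small standalone sketch. It does not use the TTS library: `DatasetSpec`, `DEFAULT_PHONEMIZER`, `build_lang_to_phonemizer`, and the dataset names are all hypothetical stand-ins, and the default table is made up for the example rather than copied from the library.

```python
from dataclasses import dataclass

# Hypothetical defaults for illustration only; not the library's actual table.
DEFAULT_PHONEMIZER = {"en-us": "gruut", "de": "gruut", "tr": "espeak"}


@dataclass
class DatasetSpec:
    """Stand-in for a dataset entry in a recipe (not the real config class)."""

    name: str
    language: str
    phonemizer: str = ""  # empty string means "use the language default"


def build_lang_to_phonemizer(datasets):
    """Return {language: phonemizer_name}, filling in defaults where unset."""
    mapping = {}
    for ds in datasets:
        chosen = ds.phonemizer or DEFAULT_PHONEMIZER.get(ds.language)
        if chosen is None:
            raise ValueError(f"No phonemizer available for language {ds.language!r}")
        # Two datasets in the same language must agree on the phonemizer.
        previous = mapping.setdefault(ds.language, chosen)
        if previous != chosen:
            raise ValueError(
                f"Conflicting phonemizers for {ds.language!r}: {previous} vs {chosen}"
            )
    return mapping


datasets = [
    DatasetSpec("corpus_de", "de"),  # no phonemizer given, falls back to default
    DatasetSpec("corpus_tr", "tr", phonemizer="espeak"),
    DatasetSpec("corpus_en", "en-us", phonemizer="espeak"),  # overrides the default
]
lang_to_phonemizer = build_lang_to_phonemizer(datasets)
print(lang_to_phonemizer)
```

A mapping of this shape is what the `MultiPhonemizer` constructor in the snippet above receives; how the library itself collects it from the dataset configs may differ from this sketch.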

Also added a recipe to train on M-AILABS with phonemes (espeak).

WeberJulian · 2022-12-20T17:18:12Z

I'll add tests when that approach/design is validated.

Edresson

Great PR :)

erogol

This just needs tests for the multi-language phonemizer.

erogol · 2022-12-22T14:30:56Z

I approved it but it is still Draft. Is there something more to come?

WeberJulian · 2022-12-22T14:34:17Z

> I approved it but it is still Draft. Is there something more to come?

I just wanted to see if the tests were passing.
Also wondering about the text_cleaners: should we allow setting them the same way we allow setting the phonemizer for each dataset?
Or do we consider that phonemizers should also do the work that multilingual_cleaners doesn't (like espeak does)?

MuruganR96 · 2022-12-26T10:33:24Z

@WeberJulian This PR is not working for me. Training got stuck (froze) after a few steps.

I tried multilingual, multi-speaker training (English, Tamil, Telugu).

I checked that MultiPhonemizer works on its own, but during training the run froze after a certain number of batches.

erogol · 2022-12-26T10:42:20Z

I think we should be able to set different cleaners per dataset.

MuruganR96 · 2022-12-26T11:41:52Z

> I think we should be able to set different cleaners per dataset.

What is the root cause of the training freeze described above? Is there a reason why we should set different cleaners per dataset?

How should it be implemented? I need some initial guidance.

My understanding is that I need to add a set of different cleaners per dataset in cleaners.py, add text_cleaners to BaseDatasetConfig,
and implement MultiCleaners in a similar way in tokenizer.py init_from_config.

I had better move this conversation to issues.
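To make the per-dataset cleaner idea concrete, here is a minimal standalone sketch of a cleaner dispatcher keyed by language, mirroring how the multi-phonemizer maps languages to phonemizers. Everything in it is an assumption for illustration: `MultiCleaner`, the `CLEANERS` registry, and the two cleaning steps are hypothetical and are not part of this PR or of cleaners.py.

```python
import re


def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text):
    return text.lower()


# Hypothetical registry: cleaner name -> ordered list of cleaning steps.
CLEANERS = {
    "basic": [collapse_whitespace],
    "lowercase": [lowercase, collapse_whitespace],
}


class MultiCleaner:
    """Pick a cleaner pipeline per language, falling back to a default."""

    def __init__(self, lang_to_cleaner, default="basic"):
        self.lang_to_cleaner = dict(lang_to_cleaner)
        self.default = default

    def clean(self, text, language):
        # An empty or missing entry selects the default cleaner.
        name = self.lang_to_cleaner.get(language) or self.default
        for step in CLEANERS[name]:
            text = step(text)
        return text


cleaner = MultiCleaner({"en": "lowercase", "ta": ""})
print(cleaner.clean("  Hello   World ", "en"))
print(cleaner.clean("  Hello   World ", "ta"))
```

Whether such a dispatcher belongs in the tokenizer or whether the phonemizer should handle normalization is exactly the open design question in this thread; the sketch only shows the shape of the first option.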

Edresson · 2022-12-26T14:47:09Z

> I need to add a set of different cleaners per dataset in cleaners.py, and BaseDatasetConfig includes text_cleaners and a similar way of implementing MultiCleaners in tokenizer.py init_from_config.
>
> better I will move to issues for this conversation

It would be nice, but I don't think @MuruganR96's issue is related to it. multilingual_cleaners is simple and should be compatible with all languages.

MuruganR96 · 2022-12-26T15:21:50Z

Thank you @WeberJulian @erogol @Edresson

The PR is working fine. espeak does not support the Telugu language, and espeak-ng was causing this problem, so I uninstalled espeak-ng. Now I am able to train multilingual, multi-speaker TTS in English and Tamil.
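A freeze like this may be easier to diagnose if the training languages are checked against the installed espeak backend before training starts. The sketch below is one possible pre-flight check, not part of the PR. It assumes the `--voices` listing prints a header line followed by rows whose second column is the language code; the helper names are hypothetical, and `SAMPLE` is an invented listing, not real output from any installation.

```python
import shutil
import subprocess


def parse_espeak_voices(listing):
    """Extract language codes from the table printed by `--voices`."""
    languages = set()
    for line in listing.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 2:
            languages.add(fields[1].lower())
    return languages


def installed_espeak_languages():
    """Query whichever espeak binary is on PATH; empty set if none is found."""
    binary = shutil.which("espeak-ng") or shutil.which("espeak")
    if binary is None:
        return set()
    result = subprocess.run(
        [binary, "--voices"], capture_output=True, text=True, check=True
    )
    return parse_espeak_voices(result.stdout)


def unsupported_languages(requested, supported):
    """Return the requested language codes missing from the supported set."""
    return sorted(set(requested) - set(supported))


# Invented listing for illustration; real output depends on the installation.
SAMPLE = """Pty Language       Age/Gender VoiceName          File                 Other Languages
 5  en-us           --/M      English_(America)  gmw/en-US            (en 2)
 5  ta              --/M      Tamil              dra/ta
"""
supported = parse_espeak_voices(SAMPLE)
missing = unsupported_languages(["en-us", "ta", "te"], supported)
print(missing)
```

In a real run, `installed_espeak_languages()` would replace the parsed sample, and a non-empty result from `unsupported_languages` would be a reason to stop before launching training.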

Msahiitrpr · 2023-06-18T07:54:25Z

@erogol @Edresson
espeak-ng does not support my training language. Is there a specific reason for that? How can I solve it?

WeberJulian added 2 commits December 20, 2022 18:10

Implement multilingual tokenizer

2ba1d1e

Add multi_phonemizer receipe

707d231

Edresson approved these changes Dec 20, 2022

erogol requested changes Dec 20, 2022

WeberJulian added 2 commits December 22, 2022 14:55

Fix lint

58471b7

Add TestMultiPhonemizer

6c2d118

erogol approved these changes Dec 22, 2022

WeberJulian marked this pull request as ready for review December 22, 2022 14:31

WeberJulian added 2 commits December 22, 2022 15:35

Fix lint

0dbaa0e

make style

49da913

erogol merged commit a073977 into dev Jan 2, 2023

erogol deleted the multilingual-tokenizer branch January 2, 2023 09:03
