Basic cleaners or phoneme cleaners #312

JRMeyer · 2021-03-07T09:07:14Z


>>> nana_nan
[December 2, 2020, 6:24am]

I am trying to prepare cleaners.py for Turkish, and I want to train on
phonemes. However, I am confused by what config.json, the FAQ, and the
docstring of portuguese_cleaners in cleaners.py say.

My question is:
The docstring of the Portuguese cleaner in cleaners.py says that
phonemizer handles expanding abbreviations and numbers. If phonemizer
does that, can I skip writing a turkish_cleaners function and use
'phoneme_cleaners' as 'text_cleaner'? How can I check whether phonemizer
expands abbreviations and numbers for Turkish? How is phoneme_cleaners
different from the other cleaners, and why is it used in config.json?

> 2. If you have a dataset with a different alphabet than English
> Latin, you need to add your alphabet in utils.text.symbols.
>
> - If you use phonemes for training and your language is supported
> here, you don't need to do that.
>
> 3. Write your own text cleaner in utils.text.cleaners. It is not
> always necessary, except when you have a different alphabet or
> language-specific requirements.
>
> - This step is used to expand numbers and abbreviations and to
> normalize the text.

Step 2 says: 'If you use phonemes for training and your language is
supported [here](https://github.com/bootphon/phonemizer#supported-languages),
you don't need to do that.' I checked, and Turkish is supported, so I
don't need to change utils.text.symbols.

Also, when I examined cleaners.py, I came across this docstring in the
Portuguese cleaner:

> def portuguese_cleaners(text):
>     '''Basic pipeline for Portuguese text. There is no need to
>     expand abbreviation and numbers, phonemizer already does that'''
>     text = lowercase(text)
>     text = replace_symbols(text, lang='pt')
>     text = remove_aux_symbols(text)
>     text = collapse_whitespace(text)
>     return text

But there is a section in config.json like this:

> // DATA LOADING
> 'text_cleaner': 'phoneme_cleaners',
> 'enable_eos_bos_chars': false,
> 'num_loader_workers': 4,
> 'num_val_loader_workers': 4,
> 'batch_group_size': 0,
> 'min_seq_len': 6,
> 'max_seq_len': 153,
>
> // PHONEMES
> 'phoneme_cache_path': 'phoneme_cache/',
> 'use_phonemes': true,
> 'phoneme_language': 'en-us',

Here 'text_cleaner' is set to phoneme_cleaners. I want to train on
phonemes, not on the normal alphabet. For that, should 'text_cleaner'
stay like this?

> 'text_cleaner': 'phoneme_cleaners'

Or should I change it to this?

> 'text_cleaner': 'turkish_cleaners',

To sum up, my question is:

The docstring of the Portuguese cleaner in cleaners.py says that
phonemizer handles expanding abbreviations and numbers. If phonemizer
does that, can I skip writing a turkish_cleaners function and use
'phoneme_cleaners' as 'text_cleaner'? How can I check whether phonemizer
expands abbreviations and numbers for Turkish? How is phoneme_cleaners
different from the other cleaners, and why is it used in config.json?

Related thread: Do we need to change symbols when using phonemic text
as input?

[This is an archived TTS discussion thread from discourse.mozilla.org/t/basic-cleaners-or-phoneme-cleaners]

JRMeyer · 2021-03-07T09:07:16Z


>>> othiele
[December 2, 2020, 8:49am]

Why don't you join forces, as you seem to be working on the same
problem?

> If phonemizer does it, can I get rid of typing the turkish_cleaner
> function using 'phoneme_cleaners' as 'text_cleaner' ?

Check the phonemizer docs and test how it handles Turkish. You will
usually need some sort of pre-processing. This is done by the cleaner
function.
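
As a starting point, a Turkish cleaner modelled on portuguese_cleaners might look like the sketch below. The helpers are self-contained stand-ins, not the ones in cleaners.py, and turkish_lowercase plus the '&' to 've' replacement are my own assumptions rather than existing project code. The explicit İ/I mapping is there because Python's str.lower() on its own turns 'I' into 'i' instead of 'ı'.

```python
import re

_whitespace_re = re.compile(r"\s+")


def turkish_lowercase(text):
    """Lowercase with Turkish casing rules (hypothetical helper).

    The dotted and dotless capitals are mapped by hand first, because
    str.lower() maps 'I' to 'i' and 'İ' to 'i' plus a combining dot.
    """
    return text.replace("İ", "i").replace("I", "ı").lower()


def collapse_whitespace(text):
    """Stand-in helper: squeeze runs of whitespace into single spaces."""
    return _whitespace_re.sub(" ", text).strip()


def turkish_cleaners(text):
    """Sketch of a basic pipeline for Turkish text (assumption, not project code)."""
    text = turkish_lowercase(text)
    text = text.replace("&", " ve ")  # 've' is Turkish for 'and'
    text = collapse_whitespace(text)
    return text
```

Whether numbers and abbreviations also need handling here depends on what phonemizer does with them for Turkish, so test that before adding more steps.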

> How can I check if Phonemizer does expanding abbreviation and numbers
> for Turkish?

Install phonemizer separately and run it on the command line.
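
The same check can be scripted from Python. This is a minimal sketch, assuming phonemizer and an espeak backend with Turkish ('tr') are installed. report_expansion and espeak_turkish are hypothetical names of mine, and the digit test is only a rough heuristic for numbers that were left unexpanded. Abbreviations still have to be judged by reading the output.

```python
import re


def espeak_turkish(text):
    """Phonemize Turkish text with phonemizer's espeak backend.

    The import is done lazily so the rest of this sketch can be used
    (for example with a fake phonemizer) without phonemizer installed.
    """
    from phonemizer import phonemize

    return phonemize(text, language="tr", backend="espeak", strip=True)


def report_expansion(samples, phonemize_fn):
    """Return (sample, output, digits_left) for each sample sentence.

    digits_left is True when the phonemized output still contains a
    digit, which suggests the number was not expanded into words.
    """
    rows = []
    for sample in samples:
        output = phonemize_fn(sample)
        digits_left = bool(re.search(r"\d", output))
        rows.append((sample, output, digits_left))
    return rows
```

Calling `report_expansion(["3 elma aldım.", "Dr. Ayşe geldi."], espeak_turkish)` and reading the rows shows what phonemizer does with each sample.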

> How is phoneme_cleaner different from other cleaners and why is it
> used in config.json?

It is in config.json so that you can switch cleaners quickly. So if you
come up with a good Turkish cleaner, open a PR so others can benefit
from it. As you saw in the script, different cleaners perform different
pre-processing steps.
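
To illustrate the switch: as I understand it, the name in 'text_cleaner' selects a function from cleaners.py, and that cleaner runs before the phonemizer when 'use_phonemes' is true. Below is a rough sketch of that flow, not the actual TTS code. Both cleaners are simplified stand-ins, and CLEANERS, prepare_input and phonemize_fn are hypothetical names.

```python
import re


def basic_cleaners(text):
    """Stand-in: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def turkish_cleaners(text):
    """Stand-in for a language-specific cleaner (assumption, not project code)."""
    text = text.replace("&", " ve ")
    return re.sub(r"\s+", " ", text.lower()).strip()


# Registry keyed by the names that may appear in config['text_cleaner'].
CLEANERS = {
    "basic_cleaners": basic_cleaners,
    "turkish_cleaners": turkish_cleaners,
}


def prepare_input(text, config, phonemize_fn=None):
    """Apply the configured cleaner, then phonemize if the config asks for it."""
    name = config["text_cleaner"]
    if name not in CLEANERS:
        raise ValueError(f"Unknown text_cleaner: {name!r}")
    cleaned = CLEANERS[name](text)
    if config.get("use_phonemes"):
        if phonemize_fn is None:
            raise ValueError("use_phonemes is set but no phonemizer was given")
        # The phonemizer sees the cleaned text, not the raw input.
        return phonemize_fn(cleaned, config["phoneme_language"])
    return cleaned
```

Under this reading, switching from 'phoneme_cleaners' to 'turkish_cleaners' changes only the pre-processing step. Phonemes are still used as long as 'use_phonemes' stays true and 'phoneme_language' matches your language.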

[Archived Post]
