[This is an archived TTS discussion thread from discourse.mozilla.org/t/basic-cleaners-or-phoneme-cleaners]
> If phonemizer does it, can I get rid of typing the turkish_cleaners
> function using 'phoneme_cleaners' as 'text_cleaner'?
Check the phonemizer docs and test how it handles Turkish. You will usually need some sort of pre-processing. This is done by the cleaner function.
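As a rough idea of what such a cleaner could look like for Turkish, here is a minimal sketch using only the standard library. The abbreviation table and the function names are made up for illustration and are not part of the TTS repo. The one Turkish-specific point it shows is lowercasing: Python's str.lower() turns "I" into "i" and "İ" into "i" plus a combining dot, so the dotted and dotless pair has to be mapped by hand. Number expansion is left out on purpose; whether you need it depends on what the phonemizer test shows.

```python
import re

_whitespace_re = re.compile(r"\s+")

# Hypothetical abbreviation table; extend it for your own dataset.
_abbreviations_tr = [
    (re.compile(r"\bdr\.", re.IGNORECASE), "doktor"),
    (re.compile(r"\bprof\.", re.IGNORECASE), "profesör"),
]


def turkish_lowercase(text):
    """Lowercase text using the Turkish dotted/dotless i mapping."""
    # Map the two capital letters explicitly, then lowercase everything else.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_cleaners(text):
    """Sketch of a Turkish pipeline: expand abbreviations, lowercase, tidy spaces."""
    for pattern, expansion in _abbreviations_tr:
        text = pattern.sub(expansion, text)
    text = turkish_lowercase(text)
    text = _whitespace_re.sub(" ", text).strip()
    return text
```

Registering a function like this under the name 'turkish_cleaners' would let config.json select it through 'text_cleaner'.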
> How can I check whether phonemizer expands abbreviations and numbers
> for Turkish?
Install phonemizer separately and try it on the command line.
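The same check can be done from Python. The idea is to phonemize a sentence written with digits or an abbreviation, phonemize the same sentence fully spelled out, and compare the two results. If they differ, the backend did not read the raw form the way you expect and the cleaner has to do the expansion. This is a sketch under assumptions: it presumes the phonemizer package and espeak are installed and that Turkish is available under the language code "tr". The helper names and the sample pairs are hypothetical.

```python
def compare_expansion(pairs, phonemize_fn):
    """For each (raw, spelled_out) pair, phonemize both and flag whether they match."""
    report = []
    for raw, spelled in pairs:
        raw_ph = phonemize_fn(raw).strip()
        spelled_ph = phonemize_fn(spelled).strip()
        report.append((raw, raw_ph, spelled_ph, raw_ph == spelled_ph))
    return report


def espeak_turkish(text):
    """Phonemize Turkish text with phonemizer's espeak backend (must be installed)."""
    # Imported lazily so compare_expansion itself has no external dependency.
    from phonemizer import phonemize
    return phonemize(text, language="tr", backend="espeak", strip=True)


# Hypothetical sample pairs: raw form vs. the spelling you expect to hear.
SAMPLE_PAIRS = [
    ("3 elma aldım", "üç elma aldım"),
    ("Dr. Ahmet geldi", "doktor Ahmet geldi"),
]


def print_report(pairs, phonemize_fn):
    """Print both transcriptions for every pair so mismatches are easy to spot."""
    for raw, raw_ph, spelled_ph, same in compare_expansion(pairs, phonemize_fn):
        print(f"{raw!r}: {'same' if same else 'DIFFERENT'}")
        print(f"  raw:     {raw_ph}")
        print(f"  spelled: {spelled_ph}")
```

Calling print_report(SAMPLE_PAIRS, espeak_turkish) on a machine with espeak installed shows which raw forms need handling in the cleaner.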
> How is phoneme_cleaners different from other cleaners and why is it
> used in config.json?
It is in config.json so that you can switch cleaners quickly. As you saw in the script, different cleaners perform different pre-processing steps. So if you come up with a good Turkish cleaner, open a PR and others can benefit from it.
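To make the "switch quickly" point concrete, here is a small self-contained sketch of selecting a cleaner by the name stored in the config. It is not the library's code: both cleaner functions are simplified stand-ins, and the CLEANERS registry and clean_text helper are hypothetical names used only to show the pattern.

```python
def basic_cleaners(text):
    """Stand-in: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def phoneme_cleaners(text):
    """Stand-in: like basic_cleaners, but also drops bracket and quote symbols."""
    for ch in '<>()[]"':
        text = text.replace(ch, "")
    return basic_cleaners(text)


# Hypothetical registry mapping config names to functions.
CLEANERS = {
    "basic_cleaners": basic_cleaners,
    "phoneme_cleaners": phoneme_cleaners,
}


def clean_text(text, config):
    """Pick the cleaner named by config['text_cleaner'] and apply it."""
    name = config["text_cleaner"]
    if name not in CLEANERS:
        raise ValueError(f"Unknown cleaner: {name}")
    return CLEANERS[name](text)
```

With this pattern, changing one string in the config changes the whole pre-processing pipeline, and a new language only needs one more function and one more registry entry.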
>>> nana_nan
[December 2, 2020, 6:24am]
I am trying to prepare cleaners.py for Turkish. I want to use
phonemes for training, but I am confused by what is written in
config.json, in the FAQ, and in the Portuguese function in cleaners.py.
My question is:
The docstring of the Portuguese cleaner in cleaners.py says that
phonemizer handles expanding abbreviations and numbers. If phonemizer
does that, can I skip writing a turkish_cleaners function and use
'phoneme_cleaners' as 'text_cleaner'? How can I check whether phonemizer
expands abbreviations and numbers for Turkish? How is phoneme_cleaners
different from the other cleaners, and why is it used in config.json?
The FAQ says:
> 2. If you have a dataset with a different alphabet than English
>    Latin, you need to add your alphabet in utils.text.symbols.
>
>    - If you use phonemes for training and your language is supported
>      here, you don't need to do that.
>
> 3. Write your own text cleaner in utils.text.cleaners. It is not
>    always necessary, except when you have a different alphabet or
>    language-specific requirements.
>
>    - This step is used to expand numbers and abbreviations and to
>      normalize the text.
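For the non-phoneme case described in step 2, a quick way to see whether a dataset needs extra symbols is to list the characters that occur in the transcripts but not in the symbol inventory. The sketch below is only an illustration: the symbol string is an assumed English-style inventory, not the contents of utils.text.symbols, and the helper name is hypothetical.

```python
def uncovered_characters(transcripts, symbols):
    """Return the sorted characters that appear in transcripts but not in symbols."""
    known = set(symbols)
    seen = set("".join(transcripts))
    return sorted(seen - known)


# Assumed symbol inventory, for illustration only.
example_symbols = "abcdefghijklmnopqrstuvwxyz!'(),-.:;? "

# Two made-up Turkish transcripts.
missing = uncovered_characters(["çok güzel bir gün", "ışık söndü"], example_symbols)
print(missing)
```

Any character reported here would have to be added to the symbol set, or removed by the cleaner, before training on raw characters.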
Step 2 says: 'If you use phonemes for training and your language is
supported here
(https://github.com/bootphon/phonemizer#supported-languages), you
don't need to do that.' I checked, and Turkish is supported, so I don't
need to change utils.text.symbols.
Also, when I examined cleaners.py, I came across this docstring in the
Portuguese cleaner:
> def portuguese_cleaners(text):
>     '''Basic pipeline for Portuguese text. There is no need to
>     expand abbreviation and numbers, phonemizer already does that'''
>     text = lowercase(text)
>     text = replace_symbols(text, lang='pt')
>     text = remove_aux_symbols(text)
>     text = collapse_whitespace(text)
>     return text
But there is a section in config.json like this:
> // DATA LOADING
> 'text_cleaner': 'phoneme_cleaners',
> 'enable_eos_bos_chars': false,
> 'num_loader_workers': 4,
> 'num_val_loader_workers': 4,
> 'batch_group_size': 0,
> 'min_seq_len': 6,
> 'max_seq_len': 153,
>
> // PHONEMES
> 'phoneme_cache_path': 'phoneme_cache/',
> 'use_phonemes': true,
> 'phoneme_language': 'en-us',
Here the text_cleaner entry is set to phoneme_cleaners. I want to train
on phonemes, not on the normal alphabet. For that, should 'text_cleaner'
stay like this?
> 'text_cleaner': 'phoneme_cleaners',
Or should I change it to this?
> 'text_cleaner': 'turkish_cleaners',
To repeat my question:
The docstring of the Portuguese cleaner in cleaners.py says that
phonemizer handles expanding abbreviations and numbers. If phonemizer
does that, can I skip writing a turkish_cleaners function and use
'phoneme_cleaners' as 'text_cleaner'? How can I check whether phonemizer
expands abbreviations and numbers for Turkish? How is phoneme_cleaners
different from the other cleaners, and why is it used in config.json?
[Do we need to change symbols when using phonemic text as input?]