
[Feature request] [TTS] Support SSML in input text #752

Closed
mariusa opened this issue Aug 19, 2021 · 40 comments

@mariusa

mariusa commented Aug 19, 2021

Is your feature request related to a problem? Please describe.

For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:

  • rendering a dialog with different voices
  • rendering a part with a specific emotion (joy, fear, sadness, surprise...)

Describe the solution you'd like
Support SSML / coqui markup in input text. Example:

- <tts model="male_voice_1"> Check the box under the tree </tts>
- <tts model="child_voice_1"> This one? <tts emotion="joy">Wow, it's the Harry Potter lego!</tts> </tts>
@mariusa mariusa added the "feature request" label Aug 19, 2021
@nitinthewiz

Seconded! SSML would be a very valuable addition to the TTS. It would be especially useful for controlling pauses, line breaks, emotion (if possible, using heightened pitch), and urgency (by increasing the speed of spoken text, e.g. to 1.5x).

It would also be useful in multi-speaker models, where we could give the speaker ID in the SSML itself and Coqui would string them together. Though this would be a stretch goal beyond the basic SSML implementation.

Please let us know how we can help in this. Is there an SSML implementation you need us to research? (like gruut was integrated, perhaps we can integrate an existing SSML framework as well). Is there some code we can contribute?

@mariusa mariusa changed the title [Feature request] [TTS] Support coqui markup in input text [Feature request] [TTS] Support SSML in input text Aug 27, 2021
@synesthesiam
Contributor

Which SSML tags/properties do you think would be the most valuable to implement?

@nitinthewiz

nitinthewiz commented Aug 28, 2021

Well, here are the tags that I think would be most relevant (in rough order of relevance):

  • <speak></speak>: encapsulates the entire SSML section, telling Coqui that this section of the text must be treated as SSML. This also allows us to intersperse SSML and non-SSML text in the same input.
  • <break />: inserts a pause in the speech. We can also add time="3s" and other parameters to control how long the break must be.
  • <say-as interpret-as="spell-out"> or <say-as interpret-as="cardinal"></say-as>: tells Coqui that the enclosed text must be treated as special. One of the things I've noticed with gruut is that it doesn't know how to say capitalized initialisms like USA, CDN, AWS. I haven't tried numbers, but that's what cardinal is for. This can also be extended to currency and the country of the currency, so we can do localization, e.g. a million in the US is spoken as 10 lakhs in India.
  • <voice name="Mike"> or <voice id="p235"></voice>: very useful for multi-speaker models, to specify the voice we want to use, and also (as a stretch goal) to create multi-voice audio, creating the potential for dialogs between voices.
  • <prosody>: useful for a number of things, such as setting the volume of the enclosed text, the rate of the speech, and the pitch so that voices can be made more unique.

There are a lot of implementations and features that are non-standard but can be very useful, such as <emotion name="excited"> or <alias>, which can be used to expand words like element symbols (Mg spoken as "Magnesium") or abbreviations like "Mr.". But those are enhancements that companies have made to their own SSML implementations, and we do not necessarily have to follow them.
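For illustration, here is a minimal sketch (standard-library Python only, not Coqui TTS code) of how these tags and their attributes could be pulled out of an input string:

# Illustrative only: walk an SSML snippet with the standard library and
# list each tag with its attributes. Not Coqui TTS code.
import xml.etree.ElementTree as ET

ssml = """<speak>
  Welcome. <break time="3s"/>
  <say-as interpret-as="spell-out">USA</say-as>
  <voice name="p235">
    <prosody rate="150%" pitch="+10%">Faster and higher.</prosody>
  </voice>
</speak>"""

for elem in ET.fromstring(ssml).iter():
    print(elem.tag, elem.attrib)
# speak {}
# break {'time': '3s'}
# say-as {'interpret-as': 'spell-out'}
# voice {'name': 'p235'}
# prosody {'rate': '150%', 'pitch': '+10%'}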

Sources -
W3 SSML documentation
Amazon's SSML implementation
Microsoft's SSML implementation

By the way, I have a question about how SSML is implemented in neural TTS - I do not understand how the SSML tags would be translated to the voice. Would we need to train models which have different pitch, pauses, and volumes? Would we need to train models that know how to pronounce certain words we ask them to spell out (like USA, AWS, ISIS etc)?

Could you help me understand how this would be implemented?

@erogol
Member

erogol commented Aug 30, 2021

@nitinthewiz thx for the great post. All the use-cases make sense; however, implementing SSML requires a lot of effort. I think we can start by implementing some of the basic functionality and expand it as we go.

I don't know when we can start implementing SSML, but I'll add it to our task list here: #378

When it comes to your question, some basic manipulations (speed, volume, etc.) are straightforward to implement with a single model. However, some need model-level architectural changes or improvements, as you noted: emotions, pitch, and so on.

@synesthesiam
Contributor

@erogol I would be interested in starting on this. Some tags can be handled by gruut, such as <say-as>, while others will need to be passed through to TTS.

It may be worth (me) implementing support for PLS lexicons as well, so users could expand gruut's vocabulary.

@nitinthewiz

@erogol thanks a lot for following up and for the explanation!

@synesthesiam let me know how I can help with the lexicon, or once you've implemented it, we can start contributing to the vocab.

@synesthesiam
Contributor

Small update: I've got preliminary SSML functionality in a side branch of gruut now with support for:

  • <speak>, <p>, <s>, and <w> tags (allowing for manual tokenization)
  • <say-as> with support for numbers (cardinal/ordinal/year/digits), dates, currency, and spell-out
  • <voice> (currently just name)

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I've gone the extra mile and made full use of the lang attribute, so you can have:

<speak>
  <w lang="en_US">1</w> 
  <w lang="es_ES">1</w>
</speak>

verbalized as "one uno". This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with <voice> to generate multi-lingual utterances.

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:

  • Default date format - can be any combination of day/month/year, where day can be either cardinal ("one") or ordinal ("first")
  • Default currency - which currency symbol/name (e.g., "$" / "USD")
  • Default punctuation - what set of characters/strings should (by default) break apart sentences, phrases, and words (e.g., "ninety-nine" -> "ninety", "nine")
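For illustration, those per-language defaults could be captured in something like the following (field names are hypothetical, not gruut's actual configuration):

# Hypothetical container for per-language defaults; names are illustrative,
# not gruut's real settings.
from dataclasses import dataclass

@dataclass
class LanguageDefaults:
    date_format: str = "MDY"                 # order of month/day/year when verbalizing dates
    day_style: str = "ordinal"               # "first" (ordinal) vs. "one" (cardinal)
    currency: str = "USD"                    # assumed when no symbol/name is given
    word_breakers: tuple = ("-", "_")        # may split "ninety-nine" into "ninety", "nine"
    sentence_breakers: tuple = (".", "!", "?")

DEFAULTS = {
    "en_US": LanguageDefaults(),
    "de_DE": LanguageDefaults(date_format="DMY", currency="EUR"),
}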

@erogol
Member

erogol commented Sep 20, 2021

I think default formats need to be handled by the text normalizer in a way that the model can read. Is this what you mean as well, @synesthesiam?

@synesthesiam
Contributor

Yes, and also the normalization needs to mirror what the speaker likely did when reading the text. So when gruut comes across "4/1/2021" in text, it needs to come out as the most likely verbalization in the given language/locale.

For U.S. English, "4/1/2021" becomes "April first twenty twenty one". For German, it is "Januar vierte zweitausendeinundzwanzig" instead, which I'm hoping is the right thing to do.

Regarding punctuation, I know that dashes and underscores (and even camelCasing) can be used to break apart English words for the purpose of phonemization -- "ninety" and "nine" are likely in the lexicon, but "ninety-nine" may not be. But this gets more complicated in French: "est-que" is present in the lexicon and is not the same as phonemes("est") + phonemes("que"). So what I'm doing now is checking the lexicon first, and only breaking words apart if they're not present.
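A tiny sketch of that lookup-before-splitting rule (the lexicon contents here are made up):

# Sketch of "check the lexicon first, only split words that are missing".
# The lexicon contents are made up for illustration.
LEXICON = {"ninety", "nine", "est-que"}   # words with known phonemizations

def words_for_phonemizer(word):
    if word in LEXICON:
        return [word]                     # "est-que" stays whole
    parts = word.replace("_", "-").split("-")
    return [p for p in parts if p]        # "ninety-nine" -> ["ninety", "nine"]

print(words_for_phonemizer("est-que"))      # ['est-que']
print(words_for_phonemizer("ninety-nine"))  # ['ninety', 'nine']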

@synesthesiam
Contributor

@erogol It might be worth moving this to a discussion

I've completed my first prototype of 🐸 TTS with SSML support (currently here)! I'm using a gruut side branch for now (supported SSML tags).

Now something like this works:

SSML=$(cat << EOF
<speak>
  <s lang="en">123</s>
  <s lang="de">123</s>
  <s lang="es">123</s>
  <s lang="fr">123</s>
  <s lang="nl">123</s>
</speak>
EOF
)

python3 TTS/bin/synthesize.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/es/mai/tacotron2-DDC \
    --extra_model_name tts_models/fr/mai/tacotron2-DDC \
    --extra_model_name tts_models/nl/mai/tacotron2-DDC \
    --text "$SSML" --ssml true --out_path ssml.wav

Which outputs a WAV file with:

  • "one hundred and twenty three" in English
  • "einhundertdreiundzwanzig" in German
  • "ciento veintitrés" in Spanish
  • "cent vingt trois" in French, and
  • "honderddrieëntwintig" in Dutch

Before getting any deeper, I wanted to see if I'm on the right track.

The three main changes I've made are:

  1. Support for multiple TTS models/SSML input in the Synthesizer
  2. Ability to load additional TTS models when running the server.py and synthesize.py scripts (--extra_model_name)
  3. Changes to the web UI and API to support SSML and TTS model selection

Synthesizer

I created a VoiceConfig class that holds the TTS/vocoder models and configs. When creating a Synthesizer, there is now an extra_voices argument that accepts a list of VoiceConfig objects to load in addition to the "default" voice.

The Synthesizer.tts method now operates in two modes: when the ssml argument is True, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:

  • By voice name, such as <voice name="tts_models/en/ljspeech/tacotron2-DDC">...</voice>
    • For multi-speaker models, the format name#speaker_idx is used (e.g., tts_models/en/vctk/vits#p228)
  • By language, such as <s lang="de">...</s>

If no voice or language is specified, the default voice is used.
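As a side note, resolving the name#speaker_idx convention is simple; this sketch is illustrative and not the actual branch code:

# Illustrative resolution of a <voice name="..."> reference using the
# name#speaker_idx convention described above; not the actual branch code.
def resolve_voice(name):
    model_name, _, speaker_idx = name.partition("#")
    return model_name, (speaker_idx or None)

print(resolve_voice("tts_models/en/vctk/vits#p228"))
# ('tts_models/en/vctk/vits', 'p228')
print(resolve_voice("tts_models/en/ljspeech/tacotron2-DDC"))
# ('tts_models/en/ljspeech/tacotron2-DDC', None)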

Command-Line

The server.py and synthesize.py scripts now accept a --extra_model_name argument, which is used to load additional voices by model name:

python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits

The default voice is specified as normal (with --model_name or --model_path). All of the extra voices can (currently) only be loaded by name with their default vocoders.

Additionally, the synthesize.py script accepts a --ssml true argument to tell 🐸 TTS that the input text is SSML.

Web UI

(screenshot: coqui-ssml web UI)

The two web UI changes are:

  • SSML checkbox that adds ssml=true to GET variables
  • Ability to select different voices (only shown if more than one TTS model is loaded)
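For reference, calling the server from code might then look like this; the /api/tts endpoint and parameter names are assumed here, not confirmed:

# Assumed usage of the demo server with the new GET variables; the endpoint
# and parameter names may differ from the prototype.
import requests

ssml = '<speak><s lang="en">123</s><s lang="de">123</s></speak>'
resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": ssml, "ssml": "true"},
)
with open("ssml.wav", "wb") as f:
    f.write(resp.content)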

@erogol
Member

erogol commented Sep 29, 2021

@synesthesiam it is a great start to SSML!!!

I think we should also decide how we want to land SSML in the library architecture. Before saying anything, I'd be interested in hearing your opinions about that.

@synesthesiam
Contributor

The biggest change so far is SSML being able to reference multiple voices and languages. In the future, <break> tags and prosody will also introduce new challenges.

Architecturally with SSML, text processing, model loading, and synthesis are all tied together at roughly the same abstraction level. Some important questions are:

  • When should models be loaded?

Is the user (or code) required to pre-load all relevant models, or could it happen dynamically? If dynamic, is the user able to specify a custom model? Perhaps the <voice> tag could be extended to support model/vocoder paths or URIs.

With the proper use of a ThreadPoolExecutor, model loading can be semi-parallelized along with synthesis (I've done something very similar in Larynx already). This is especially useful in a multi-voice context, where synthesis for already-loaded models can proceed during the loading of newly encountered models.
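A rough sketch of that overlap, with stand-in load_model/synthesize functions rather than the real Coqui TTS API:

# Overlap model loading with synthesis; load_model() and synthesize() are
# stand-ins, not real Coqui TTS functions.
import time
from concurrent.futures import ThreadPoolExecutor

def load_model(name):
    time.sleep(1.0)                # pretend this is an expensive checkpoint load
    return f"<model {name}>"

def synthesize(model, sentence):
    return f"{model}: {sentence}"  # pretend this returns audio

def synthesize_document(sentences, voices):
    """sentences: list of (text, voice_name) pairs; voices: voice names to load."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(load_model, name) for name in voices}
        out = []
        for text, voice in sentences:
            model = futures[voice].result()  # blocks only if that model is still loading
            out.append(synthesize(model, text))
        return out

print(synthesize_document([("Hello", "en"), ("Hallo", "de")], {"en", "de"}))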

  • How do text cleaners, text processing, and phonemization interact?

Text processing with str.replace and re.sub is too low-level for SSML, even just operating on the text between tags. Explicit sentence and word boundaries (<s> and <w>) need to be respected, as well as <say-as> and <alias>.

Phonemization is no longer an independent stage either, since the <phoneme> tag can override a word or phrase. gruut goes even further by avoiding post-processing of words that are already in the lexicon. For example, "NASA" is correctly pronounced as /nˈæsə/ whereas NASAA is pronounced like "N", "A", "S", "A", "A".

gruut's TextProcessor constructs a tree from the initial SSML, and then iteratively refines the leaves during each stage of its pipeline. This keeps the overall structure intact, but allows for sentences/words to be moved, tagged, broken apart, or ignored.

Maybe 🐸 TTS could plug user-defined functions into this pipeline? They don't have to operate on the whole graph; many of mine just work on a single word at a time. For example, this code converts numbers into words for any language supported by num2words.

Depending on where you are in the pipeline, user-defined functions could also operate specifically on numbers, dates, currency, etc. I have code, for instance, that verbalizes numbers as years similar to your code, but done in a (mostly) language-independent manner.
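In that spirit, a minimal per-word pipeline function could look like the following; it uses the third-party num2words package and is illustrative rather than the linked code:

# Minimal per-word pipeline step using the third-party num2words package
# (pip install num2words). Illustrative only.
from num2words import num2words

def verbalize_number(word, lang="en"):
    """Replace a bare number token with words; leave other tokens alone."""
    try:
        return num2words(int(word), lang=lang)
    except (ValueError, NotImplementedError):
        return word

print(verbalize_number("123", lang="en"))  # one hundred and twenty-three
print(verbalize_number("123", lang="de"))  # einhundertdreiundzwanzig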


I'll stop before this gets any more long-winded and see what your thoughts are 🙂

@erogol
Member

erogol commented Sep 30, 2021

  • When should models be loaded?

I think users should define not only the language but also the model name and we can load models dynamically. Something like en/tacotron-ddc. Also, we can define default models for each language to be loaded when no model name is defined.

Threading would be a nice perf improvement too.

  • How do text cleaners, text processing, and phonemization interact?

I think before we go and solve SSML we need to write a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit from it or pass it as a class member to SSMLParser.

My understanding of SSMLParser is that it parses the given text and returns the text with the metadata (SSML values) alongside it. This metadata is then taken by the Synthesizer, which calls the right set of functions to do TTS, interfacing with the model.

But the SSMLParser should also know what options are available for the chosen model, since different models support different sets of SSML tags.

@erogol erogol mentioned this issue Sep 30, 2021
@synesthesiam
Contributor

Also, we can define default models for each language to be loaded when no model name is defined.

A default model for each dataset may be worth it too. So, "ljspeech" could default to whatever the best sounding model is currently.

I think before we go and solve SSML we need to write up a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit it or pass as a class member for/to SSMLParser.

This sounds reasonable, though I suspect over time that the Tokenizer and SSMLParser will end up merging. The non-SSML case can always be handled by just escaping the input text (relative to XML) and wrapping it in a <speak> tag. Then you can have a single class with overridable methods for splitting text into words, verbalizing numbers [1], expanding abbreviations, etc.

[1] For example, is "1234" a cardinal number, ordinal number, year, or digits? The Tokenizer would still need this context in order for the SSMLParser to use it properly.
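The escape-and-wrap step mentioned above is only a couple of lines with the standard library:

# Wrap plain (non-SSML) text so the same SSML code path can handle it.
from xml.sax.saxutils import escape

def to_ssml(text):
    return f"<speak>{escape(text)}</speak>"

print(to_ssml("Bed & breakfast for < 50 EUR"))
# <speak>Bed &amp; breakfast for &lt; 50 EUR</speak>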

But also the SSMLParser should know what options are available for the chosen model since different models support different sets of SSML tags.

Another approach is to parse everything into the metadata and leave it up to the Synthesizer to decide what to ignore. If the Tokenizer/SSMLParser returns Sentence/Word objects with the metadata embedded, this would be straightforward. A word may be marked with <emphasis> in the input text, and show up as Word(text="word", emphasis=True), but the Synthesizer can just focus on the text if the underlying model doesn't support emphasis.
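A small sketch of that shape (the Word class here is illustrative, not an existing TTS type):

# Parse everything into metadata; the Synthesizer ignores what the model
# cannot use. The Word class is illustrative, not an existing TTS type.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    emphasis: bool = False

def to_model_text(words, supports_emphasis=False):
    # A model without emphasis support just reads the plain text; the marker
    # (crudely shown here as upper-casing) is applied only when supported.
    return " ".join(
        w.text.upper() if (supports_emphasis and w.emphasis) else w.text
        for w in words
    )

words = [Word("is"), Word("this", emphasis=True), Word("your"), Word("bag")]
print(to_model_text(words))                          # is this your bag
print(to_model_text(words, supports_emphasis=True))  # is THIS your bag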

@erogol
Member

erogol commented Oct 4, 2021

When I say Tokenizer, I mean something that the model can also use in training. Since there is no use for SSML in training, it makes sense to use the Tokenizer as the base class, I guess. But I mostly agree with you for inference.

Tokenizer can have preprocess, tokenize, and postprocess steps and we can deal with the contextual information in the preprocess step by providing the right set of preprocessing steps for the selected language.
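For illustration, that split could look roughly like this; the method names follow the comment above and are not an existing API:

# Rough sketch of a Tokenizer with preprocess / tokenize / postprocess steps;
# method names follow the comment above and are not an existing API.
class Tokenizer:
    def preprocess(self, text):
        return text.lower()                    # e.g. language-specific normalization

    def tokenize(self, text):
        return text.split()                    # e.g. word/sentence splitting

    def postprocess(self, tokens):
        return [t.strip(".,!?") for t in tokens]

    def __call__(self, text):
        return self.postprocess(self.tokenize(self.preprocess(text)))

class SSMLParser(Tokenizer):
    def preprocess(self, text):
        # Interpret/strip SSML tags here before the base steps (placeholder).
        return super().preprocess(text)

print(Tokenizer()("Hello, world!"))  # ['hello', 'world']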

I don't like the "ignoring" idea, since then the user does not really know what works and what doesn't. To me, defining the available tags based on the selected models makes more sense. But it is also definitely harder than just ignoring. Maybe we should start with ignoring for simplicity.

@synesthesiam
Contributor

I'll implement a proof of concept with the Tokenizer idea 👍

@stale stale bot added the "wontfix" label Nov 4, 2021
@davidak

davidak commented Nov 4, 2021

oh no

@erogol erogol removed the "wontfix" label Nov 5, 2021
@coqui-ai coqui-ai deleted a comment from stale bot Nov 5, 2021
@stale

stale bot commented Dec 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Dec 5, 2021
@mariusa
Author

mariusa commented Dec 5, 2021

not stale

@stale stale bot removed the "wontfix" label Dec 5, 2021
@lebigsquare

lebigsquare commented Dec 30, 2021

<emphasis></emphasis> would be a good one to implement.

https://cloud.google.com/text-to-speech/docs/ssml#emphasis

Is this <emphasis level="moderate">your</emphasis> bag?

@erogol
Member

erogol commented Jan 14, 2022

Tokenizer API is WIP: #1079

@erogol
Member

erogol commented Jan 14, 2022

@synesthesiam any updates?

@stale

stale bot commented Feb 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Feb 13, 2022
@davidak

davidak commented Feb 13, 2022

No activity does not mean that it is not important anymore.

@stale stale bot removed the "wontfix" label Feb 13, 2022
@jeremy-hyde

The <mark> tag is also very useful.
Google uses it to extract timestamps from the generated audio.
Useful for knowing where things are said.

Google doc
W3C ref
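For illustration, collecting the mark names in document order needs only the standard library; a synthesizer could then report a timestamp whenever it reaches one:

# Collect <mark> names in document order so timestamps can be reported for
# them during synthesis. Standard-library sketch, not Coqui TTS code.
import xml.etree.ElementTree as ET

ssml = '<speak>Turn left <mark name="left"/> then right <mark name="right"/>.</speak>'
marks = [m.get("name") for m in ET.fromstring(ssml).iter("mark")]
print(marks)  # ['left', 'right']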

@stale

stale bot commented Mar 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Mar 19, 2022
@stale stale bot closed this as completed Mar 26, 2022
@erogol
Member

erogol commented Mar 26, 2022

@WeberJulian 👀

@erogol erogol removed the "wontfix" label Mar 26, 2022
@WeberJulian
Contributor

Hey, I started hacking on some basic SSML support using gruut for parsing.
For now it's really just a draft, but you can try it out here: #1452

@liaeh

liaeh commented Nov 7, 2022

@WeberJulian is there an update on adding SSML support to coqui?

Thanks for the info!

@s-wel

s-wel commented Nov 8, 2022

I think this feature is even more relevant now. Consider that, for instance, many future research projects will need alternatives to Google TTS (which is quite strong in SSML) because of the European Union's ambition to strengthen solutions that contribute to trustworthy AI. Could anyone outline what the specific bottleneck for this feature is? What makes it difficult to implement?

@thetrebor

I really like where this is going. What are the chances it could get merged with main so we can continue with it?

@jav-ed

jav-ed commented Apr 9, 2023

Having the possibility to add user-defined pauses in the speech would be great.

@MesumRaza

Any update?

@erogol
Member

erogol commented May 1, 2023

Nobody's working on it from the dev team.

@Hunanbean

For what it is worth, I consider this absolutely essential. I very much hope that this is re-opened and worked towards. You quickly hit a brick wall in what you can do without SSML.

@vodiylik
Contributor

vodiylik commented Aug 2, 2023

Any updates? 👀

@grabani

grabani commented Aug 3, 2023

Still hanging around to see if there is any progress...

@joska1993

Any update??

@genglinxiao

Do we have basic support of SSML now? Or is it only supported in the Gruut branch?

@erogol
Member

erogol commented Nov 16, 2023

No, we don't have SSML, and there is no timeline for it unless someone contributes it.
