
[Feature request] [TTS] Support SSML in input text #752

Closed
mariusa opened this issue Aug 19, 2021 · 40 comments

@mariusa

mariusa commented Aug 19, 2021

Is your feature request related to a problem? Please describe.

For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:

  • rendering a dialog with different voices
  • rendering a part with a specific emotion (joy, fear, sadness, surprise...)

Describe the solution you'd like
Support SSML / coqui markup in input text. Example:

- <tts model="male_voice_1"> Check the box under the tree </tts>
- <tts model="child_voice_1"> This one? <tts emotion="joy">Wow, it's the Harry Potter lego!</tts> </tts>
@mariusa mariusa added the "feature request" label Aug 19, 2021
@nitinthewiz

Seconded! SSML would be a very valuable addition to the TTS. It would be especially useful for controlling pauses, line breaks, emotion (if possible, using heightened pitch), and urgency (by increasing the speed of spoken text, e.g. to 1.5x).

It would also be useful in multi-speaker models, where we could give the speaker ID in the SSML itself and Coqui would string them together. Though this would be a stretch goal beyond the basic SSML implementation.

Please let us know how we can help in this. Is there an SSML implementation you need us to research? (like gruut was integrated, perhaps we can integrate an existing SSML framework as well). Is there some code we can contribute?

@mariusa mariusa changed the title [Feature request] [TTS] Support coqui markup in input text [Feature request] [TTS] Support SSML in input text Aug 27, 2021
@synesthesiam
Contributor

Which SSML tags/properties do you think would be the most valuable to implement?

@nitinthewiz

nitinthewiz commented Aug 28, 2021

Well, here are the tags that I think would be most relevant (in rough order of relevance):

  • <speak></speak>: encapsulates the entire SSML section, telling Coqui that this section of the text must be treated as SSML. This also allows us to intersperse SSML and non-SSML text in the same input.
  • <break />: inserts a pause in the speech. We can also add time="3s" and other parameters to control how long the break must be.
  • <say-as interpret-as="spell-out"> or <say-as interpret-as="cardinal"></say-as>: tells Coqui that the enclosed text must be treated as special. One of the things I've noticed with gruut is that it doesn't know how to say capitalized initialisms like USA, CDN, AWS. I haven't tried numbers, but that's what cardinal is for. This can also be extended to currency and the country of the currency, so we can do localization, e.g. a million in the US is spoken as 10 lakhs in India.
  • <voice name="Mike"> or <voice id="p235"></voice>: very useful for multi-speaker models, to specify the voice we want to use, and also (as a stretch goal) to create multi-voice audio, creating the potential for dialogs between voices.
  • <prosody>: useful for a number of things, such as setting the volume of the enclosed text, the rate of the speech, and the pitch so that voices can be made more unique.

There are a lot of implementations and features that are non-standard but can be very useful, such as <emotion name="excited"> or <alias>, which can be used to expand words like element symbols (Mg spoken as "Magnesium") or abbreviations like "Mr.". But those are enhancements that companies have made to their own SSML implementations, and we do not necessarily have to follow them.
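For illustration, here is a minimal sketch (standard-library Python only, not Coqui TTS code) of how these tags and their attributes could be pulled out of an input string:

# Illustrative only: walk an SSML snippet with the standard library and
# list each tag with its attributes. Not Coqui TTS code.
import xml.etree.ElementTree as ET

ssml = """<speak>
  Welcome. <break time="3s"/>
  <say-as interpret-as="spell-out">USA</say-as>
  <voice name="p235">
    <prosody rate="150%" pitch="+10%">Faster and higher.</prosody>
  </voice>
</speak>"""

for elem in ET.fromstring(ssml).iter():
    print(elem.tag, elem.attrib)
# speak {}
# break {'time': '3s'}
# say-as {'interpret-as': 'spell-out'}
# voice {'name': 'p235'}
# prosody {'rate': '150%', 'pitch': '+10%'}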

Sources -
W3 SSML documentation
Amazon's SSML implementation
Microsoft's SSML implementation

By the way, I have a question about how SSML is implemented in neural TTS - I do not understand how the SSML tags would be translated to the voice. Would we need to train models which have different pitch, pauses, and volumes? Would we need to train models that know how to pronounce certain words we ask them to spell out (like USA, AWS, ISIS etc)?

Could you help me understand how this would be implemented?

@erogol
Member

erogol commented Aug 30, 2021

@nitinthewiz thx for the great post. All the use-cases make sense; however, implementing SSML requires a lot of effort. I think we can start by implementing some of the basic functionality and expand it as we go.

I don't know when we can start implementing SSML, but I'll add it to our task list here: #378

When it comes to your question, some basic manipulations (speed, volume, etc.) are straightforward to implement with a single model. However, some need model-level architectural changes or improvements, as you noted: emotions, pitch, and so on.

@synesthesiam
Contributor

@erogol I would be interested in starting on this. Some tags can be handled by gruut, such as <say-as>, while others will need to be passed through to TTS.

It may be worth (me) implementing support for PLS lexicons as well, so users could expand gruut's vocabulary.

@nitinthewiz

@erogol thanks a lot for following up and for the explanation!

@synesthesiam let me know how I can help with the lexicon, or once you've implemented it, we can start contributing to the vocab.

@synesthesiam
Contributor

Small update: I've got preliminary SSML functionality in a side branch of gruut now with support for:

  • <speak>, <p>, <s>, and <w> tags (allowing for manual tokenization)
  • <say-as> with support for numbers (cardinal/ordinal/year/digits), dates, currency, and spell-out
  • <voice> (currently just name)

Numbers, dates, currency, and initialisms are automatically detected and verbalized. I've gone the extra mile and made full use of the lang attribute, so you can have:

<speak>
  <w lang="en_US">1</w> 
  <w lang="es_ES">1</w>
</speak>

verbalized as "one uno". This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with <voice> to generate multi-lingual utterances.

The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:

  • Default date format - can be any combination of day/month/year, where day can be either cardinal ("one") or ordinal ("first")
  • Default currency - which currency symbol/name (e.g., "$" / "USD")
  • Default punctuation - what set of characters/strings should (by default) break apart sentences, phrases, and words (e.g., "ninety-nine" -> "ninety", "nine")
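For illustration, those per-language defaults could be captured in something like the following (field names are hypothetical, not gruut's actual configuration):

# Hypothetical container for per-language defaults; names are illustrative,
# not gruut's real settings.
from dataclasses import dataclass

@dataclass
class LanguageDefaults:
    date_format: str = "MDY"                 # order of month/day/year when verbalizing dates
    day_style: str = "ordinal"               # "first" (ordinal) vs. "one" (cardinal)
    currency: str = "USD"                    # assumed when no symbol/name is given
    word_breakers: tuple = ("-", "_")        # may split "ninety-nine" into "ninety", "nine"
    sentence_breakers: tuple = (".", "!", "?")

DEFAULTS = {
    "en_US": LanguageDefaults(),
    "de_DE": LanguageDefaults(date_format="DMY", currency="EUR"),
}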

@erogol
Member

erogol commented Sep 20, 2021

I think default formats need to be handled by the text normalizer in a way that the model can read. Is this what you mean as well, @synesthesiam?

@synesthesiam
Contributor

Yes, and also the normalization needs to mirror what the speaker likely did when reading the text. So when gruut comes across "4/1/2021" in text, it needs to come out as the most likely verbalization in the given language/locale.

For U.S. English, "4/1/2021" becomes "April first twenty twenty one". For German, it is "Januar vierte zweitausendeinundzwanzig" instead, which I'm hoping is the right thing to do.

Regarding punctuation, I know that dashes and underscores (and even camelCasing) can be used to break apart English words for the purpose of phonemization -- "ninety" and "nine" are likely in the lexicon, but "ninety-nine" may not be. But this gets more complicated in French: "est-que" is present in the lexicon and is not the same as phonemes("est") + phonemes("que"). So what I'm doing now is checking the lexicon first, and only breaking words apart if they're not present.
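A tiny sketch of that lookup-before-splitting rule (the lexicon contents here are made up):

# Sketch of "check the lexicon first, only split words that are missing".
# The lexicon contents are made up for illustration.
LEXICON = {"ninety", "nine", "est-que"}   # words with known phonemizations

def words_for_phonemizer(word):
    if word in LEXICON:
        return [word]                     # "est-que" stays whole
    parts = word.replace("_", "-").split("-")
    return [p for p in parts if p]        # "ninety-nine" -> ["ninety", "nine"]

print(words_for_phonemizer("est-que"))      # ['est-que']
print(words_for_phonemizer("ninety-nine"))  # ['ninety', 'nine']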

@synesthesiam
Contributor

@erogol It might be worth moving this to a discussion

I've completed my first prototype of 🐸 TTS with SSML support (currently here)! I'm using a gruut side branch for now (supported SSML tags).

Now something like this works:

SSML=$(cat << EOF
<speak>
  <s lang="en">123</s>
  <s lang="de">123</s>
  <s lang="es">123</s>
  <s lang="fr">123</s>
  <s lang="nl">123</s>
</speak>
EOF
)

python3 TTS/bin/synthesize.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/es/mai/tacotron2-DDC \
    --extra_model_name tts_models/fr/mai/tacotron2-DDC \
    --extra_model_name tts_models/nl/mai/tacotron2-DDC \
    --text "$SSML" --ssml true --out_path ssml.wav

Which outputs a WAV file with:

  • "one hundred and twenty three" in English
  • "einhundertdreiundzwanzig" in German
  • "ciento veintitrés" in Spanish
  • "cent vingt trois" in French, and
  • "honderddrieëntwintig" in Dutch

Before getting any deeper, I wanted to see if I'm on the right track.

The three main changes I've made are:

  1. Support for multiple TTS models/SSML input in the Synthesizer
  2. Ability to load additional TTS models when running the server.py and synthesize.py scripts (--extra_model_name)
  3. Changes to the web UI and API to support SSML and TTS model selection

Synthesizer

I created a VoiceConfig class that holds the TTS/vocoder models and configs. When creating a Synthesizer, there is now an extra_voices argument that accepts a list of VoiceConfig objects to load in addition to the "default" voice.

The Synthesizer.tts method now operates in two modes: when the ssml argument is True, it uses gruut to partially parse and split the SSML into multiple sentence objects. Each sentence object is synthesized with the correct TTS model, referenced in one of two ways:

  • By voice name, such as <voice name="tts_models/en/ljspeech/tacotron2-DDC">...</voice>
    • For multi-speaker models, the format name#speaker_idx is used (e.g., tts_models/en/vctk/vits#p228)
  • By language, such as <s lang="de">...</s>

If no voice or language is specified, the default voice is used.
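As a side note, resolving the name#speaker_idx convention is simple; this sketch is illustrative and not the actual branch code:

# Illustrative resolution of a <voice name="..."> reference using the
# name#speaker_idx convention described above; not the actual branch code.
def resolve_voice(name):
    model_name, _, speaker_idx = name.partition("#")
    return model_name, (speaker_idx or None)

print(resolve_voice("tts_models/en/vctk/vits#p228"))
# ('tts_models/en/vctk/vits', 'p228')
print(resolve_voice("tts_models/en/ljspeech/tacotron2-DDC"))
# ('tts_models/en/ljspeech/tacotron2-DDC', None)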

Command-Line

The server.py and synthesize.py scripts now accept a --extra_model_name argument, which is used to load additional voices by model name:

python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits

The default voice is specified as normal (with --model_name or --model_path). All of the extra voices can (currently) only be loaded by name with their default vocoders.

Additionally, the synthesize.py script accepts a --ssml true argument to tell 🐸 TTS that the input text is SSML.

Web UI

(screenshot: coqui-ssml web UI)

The two web UI changes are:

  • SSML checkbox that adds ssml=true to GET variables
  • Ability to select different voices (only shown if more than one TTS model is loaded)
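For reference, calling the server from code might then look like this; the /api/tts endpoint and parameter names are assumed here, not confirmed:

# Assumed usage of the demo server with the new GET variables; the endpoint
# and parameter names may differ from the prototype.
import requests

ssml = '<speak><s lang="en">123</s><s lang="de">123</s></speak>'
resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": ssml, "ssml": "true"},
)
with open("ssml.wav", "wb") as f:
    f.write(resp.content)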

@erogol
Member

erogol commented Sep 29, 2021

@synesthesiam it is a great start to SSML!!!

I think we should also decide how we want to land SSML in the library architecture. Before saying anything, I'd be interested in hearing your opinions about that.

@synesthesiam
Contributor

The biggest change so far is SSML being able to reference multiple voices and languages. In the future, <break> tags and prosody will also introduce new challenges.

Architecturally with SSML, text processing, model loading, and synthesis are all tied together at roughly the same abstraction level. Some important questions are:

  • When should models be loaded?

Is the user (or code) required to pre-load all relevant models, or could it happen dynamically? If dynamic, is the user able to specify a custom model? Perhaps the <voice> tag could be extended to support model/vocoder paths or URIs.

With the proper use of a ThreadPoolExecutor, model loading can be semi-parallelized along with synthesis (I've done something very similar in Larynx already). This is especially useful in a multi-voice context, where synthesis for already-loaded models can proceed during the loading of newly encountered models.
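A rough sketch of that overlap, with stand-in load_model/synthesize functions rather than the real Coqui TTS API:

# Overlap model loading with synthesis; load_model() and synthesize() are
# stand-ins, not real Coqui TTS functions.
import time
from concurrent.futures import ThreadPoolExecutor

def load_model(name):
    time.sleep(1.0)                # pretend this is an expensive checkpoint load
    return f"<model {name}>"

def synthesize(model, sentence):
    return f"{model}: {sentence}"  # pretend this returns audio

def synthesize_document(sentences, voices):
    """sentences: list of (text, voice_name) pairs; voices: voice names to load."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(load_model, name) for name in voices}
        out = []
        for text, voice in sentences:
            model = futures[voice].result()  # blocks only if that model is still loading
            out.append(synthesize(model, text))
        return out

print(synthesize_document([("Hello", "en"), ("Hallo", "de")], {"en", "de"}))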

  • How do text cleaners, text processing, and phonemization interact?

Text processing with str.replace and re.sub is too low-level for SSML, even just operating on the text between tags. Explicit sentence and word boundaries (<s> and <w>) need to be respected, as well as <say-as> and <alias>.

Phonemization is no longer an independent stage either, since the <phoneme> tag can override a word or phrase. gruut goes even further by avoiding post-processing of words that are already in the lexicon. For example, "NASA" is correctly pronounced as /nˈæsə/ whereas NASAA is pronounced like "N", "A", "S", "A", "A".

gruut's TextProcessor constructs a tree from the initial SSML, and then iteratively refines the leaves during each stage of its pipeline. This keeps the overall structure intact, but allows for sentences/words to be moved, tagged, broken apart, or ignored.

Maybe 🐸 TTS could plug user-defined functions into this pipeline? They don't have to operate on the whole graph; many of mine just work on a single word at a time. For example, this code converts numbers into words for any language supported by num2words.

Depending on where you are in the pipeline, user-defined functions could also operate specifically on numbers, dates, currency, etc. I have code, for instance, that verbalizes numbers as years similar to your code, but done in a (mostly) language-independent manner.
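In that spirit, a minimal per-word pipeline function could look like the following; it uses the third-party num2words package and is illustrative rather than the linked code:

# Minimal per-word pipeline step using the third-party num2words package
# (pip install num2words). Illustrative only.
from num2words import num2words

def verbalize_number(word, lang="en"):
    """Replace a bare number token with words; leave other tokens alone."""
    try:
        return num2words(int(word), lang=lang)
    except (ValueError, NotImplementedError):
        return word

print(verbalize_number("123", lang="en"))  # one hundred and twenty-three
print(verbalize_number("123", lang="de"))  # einhundertdreiundzwanzig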


I'll stop before this gets any more long-winded and see what your thoughts are 🙂

@erogol
Member

erogol commented Sep 30, 2021

  • When should models be loaded?

I think users should define not only the language but also the model name and we can load models dynamically. Something like en/tacotron-ddc. Also, we can define default models for each language to be loaded when no model name is defined.

Threading would be a nice perf improvement too.

  • How do text cleaners, text processing, and phonemization interact?

I think before we go and solve SSML we need to write a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit from it or pass it as a class member to SSMLParser.

My understanding of SSMLParser is that it parses the given text and returns the text with the metadata (SSML values) alongside it. This metadata is then taken by the Synthesizer, which calls the right set of functions to do TTS, interfacing with the model.

But the SSMLParser should also know what options are available for the chosen model, since different models support different sets of SSML tags.

@erogol erogol mentioned this issue Sep 30, 2021
@synesthesiam
Contributor

Also, we can define default models for each language to be loaded when no model name is defined.

A default model for each dataset may be worth it too. So, "ljspeech" could default to whatever the best sounding model is currently.

I think before we go and solve SSML we need to write up a Tokenizer class to handle all the text processing steps. This would make the code easier to manage. Then we can inherit it or pass as a class member for/to SSMLParser.

This sounds reasonable, though I suspect over time that the Tokenizer and SSMLParser will end up merging. The non-SSML case can always be handled by just escaping the input text (relative to XML) and wrapping it in a <speak> tag. Then you can have a single class with overridable methods for splitting text into words, verbalizing numbers [1], expanding abbreviations, etc.

[1] For example, is "1234" a cardinal number, ordinal number, year, or digits? The Tokenizer would still need this context in order for the SSMLParser to use it properly.
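The escape-and-wrap step mentioned above is only a couple of lines with the standard library:

# Wrap plain (non-SSML) text so the same SSML code path can handle it.
from xml.sax.saxutils import escape

def to_ssml(text):
    return f"<speak>{escape(text)}</speak>"

print(to_ssml("Bed & breakfast for < 50 EUR"))
# <speak>Bed &amp; breakfast for &lt; 50 EUR</speak>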

But also the SSMLParser should know what options are available for the chosen model since different models support different sets of SSML tags.

Another approach is to parse everything into the metadata and leave it up to the Synthesizer to decide what to ignore. If the Tokenizer/SSMLParser returns Sentence/Word objects with the metadata embedded, this would be straightforward. A word may be marked with <emphasis> in the input text, and show up as Word(text="word", emphasis=True), but the Synthesizer can just focus on the text if the underlying model doesn't support emphasis.
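A small sketch of that shape (the Word class here is illustrative, not an existing TTS type):

# Parse everything into metadata; the Synthesizer ignores what the model
# cannot use. The Word class is illustrative, not an existing TTS type.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    emphasis: bool = False

def to_model_text(words, supports_emphasis=False):
    # A model without emphasis support just reads the plain text; the marker
    # (crudely shown here as upper-casing) is applied only when supported.
    return " ".join(
        w.text.upper() if (supports_emphasis and w.emphasis) else w.text
        for w in words
    )

words = [Word("is"), Word("this", emphasis=True), Word("your"), Word("bag")]
print(to_model_text(words))                          # is this your bag
print(to_model_text(words, supports_emphasis=True))  # is THIS your bag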

@erogol
Member

erogol commented Oct 4, 2021

When I say Tokenizer, I mean something that the model can also use in training. Since there is no use for SSML in training, it makes sense to use the Tokenizer as the base class, I guess. But I mostly agree with you for inference.

Tokenizer can have preprocess, tokenize, and postprocess steps and we can deal with the contextual information in the preprocess step by providing the right set of preprocessing steps for the selected language.
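For illustration, that split could look roughly like this; the method names follow the comment above and are not an existing API:

# Rough sketch of a Tokenizer with preprocess / tokenize / postprocess steps;
# method names follow the comment above and are not an existing API.
class Tokenizer:
    def preprocess(self, text):
        return text.lower()                    # e.g. language-specific normalization

    def tokenize(self, text):
        return text.split()                    # e.g. word/sentence splitting

    def postprocess(self, tokens):
        return [t.strip(".,!?") for t in tokens]

    def __call__(self, text):
        return self.postprocess(self.tokenize(self.preprocess(text)))

class SSMLParser(Tokenizer):
    def preprocess(self, text):
        # Interpret/strip SSML tags here before the base steps (placeholder).
        return super().preprocess(text)

print(Tokenizer()("Hello, world!"))  # ['hello', 'world']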

I don't like the "ignoring" idea, since then the user does not really know what works and what doesn't. To me, defining the available tags based on the selected models makes more sense. But it is also definitely harder than just ignoring. Maybe we should start with ignoring for simplicity.

@synesthesiam
Contributor

I'll implement a proof of concept with the Tokenizer idea 👍

@stale stale bot added the "wontfix" label Nov 4, 2021
@davidak

davidak commented Nov 4, 2021

oh no

@erogol erogol removed the "wontfix" label Nov 5, 2021
@coqui-ai coqui-ai deleted a comment from stale bot Nov 5, 2021
@stale

stale bot commented Dec 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Dec 5, 2021
@mariusa
Author

mariusa commented Dec 5, 2021

not stale

@stale stale bot removed the "wontfix" label Dec 5, 2021
@lebigsquare

lebigsquare commented Dec 30, 2021

<emphasis></emphasis> would be a good one to implement.

https://cloud.google.com/text-to-speech/docs/ssml#emphasis

Is this <emphasis level="moderate">your</emphasis> bag?

@erogol
Member

erogol commented Jan 14, 2022

Tokenizer API is WIP: #1079

@erogol
Member

erogol commented Jan 14, 2022

@synesthesiam any updates?

@stale

stale bot commented Feb 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Feb 13, 2022
@davidak

davidak commented Feb 13, 2022

No activity does not mean that it is not important anymore.

@stale stale bot removed the "wontfix" label Feb 13, 2022
@jeremy-hyde

The <mark> tag is also very useful.
Google uses it to extract timestamps from the generated audio.
Useful for knowing where things are said.

Google doc
W3C ref
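For illustration, collecting the mark names in document order needs only the standard library; a synthesizer could then report a timestamp whenever it reaches one:

# Collect <mark> names in document order so timestamps can be reported for
# them during synthesis. Standard-library sketch, not Coqui TTS code.
import xml.etree.ElementTree as ET

ssml = '<speak>Turn left <mark name="left"/> then right <mark name="right"/>.</speak>'
marks = [m.get("name") for m in ET.fromstring(ssml).iter("mark")]
print(marks)  # ['left', 'right']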

@stale

stale bot commented Mar 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

@stale stale bot added the "wontfix" label Mar 19, 2022
@stale stale bot closed this as completed Mar 26, 2022
@erogol
Member

erogol commented Mar 26, 2022

@WeberJulian 👀

@erogol erogol removed the "wontfix" label Mar 26, 2022
@WeberJulian
Contributor

Hey, I started hacking on some basic SSML support using gruut for parsing.
For now it's really just a draft, but you can try it out here: #1452

@liaeh

liaeh commented Nov 7, 2022

@WeberJulian is there an update on adding SSML support to coqui?

Thanks for the info!

@s-wel

s-wel commented Nov 8, 2022

I think this feature is even more relevant now. Consider that, for instance, many future research projects will need alternatives to Google TTS (which is quite strong in SSML) because of the European Union's ambition to strengthen solutions that contribute to trustworthy AI. Could anyone outline what the specific bottleneck for this feature is? What makes it difficult to implement?

@thetrebor

I really like where this is going. What are the chances it could get merged with main so we can continue with it?

@jav-ed

jav-ed commented Apr 9, 2023

Having the possibility to add user-defined pauses in the speech would be great.

@MesumRaza

Any update?

@erogol
Member

erogol commented May 1, 2023

Nobody's working on it from the dev team.

@Hunanbean

For what it is worth, I consider this absolutely essential. I very much hope that this is re-opened and worked towards. You quickly hit a brick wall in what you can do without SSML.

@vodiylik
Contributor

vodiylik commented Aug 2, 2023

Any updates? 👀

@grabani

grabani commented Aug 3, 2023

Still hanging around to see if there is any progress...

@joska1993

Any update??

@genglinxiao

Do we have basic support of SSML now? Or is it only supported in the Gruut branch?

@erogol
Member

erogol commented Nov 16, 2023

No, we don't have SSML, and there is no timeline for it unless someone contributes it.
