# Seamless and Coqui AI TTS

The goal here is to play with text-to-speech (TTS), speech-to-speech (STS), and speech-to-text (STT) models. The flow right now is basically:

```
Text --> Seamless TTS (translation) --> Seamless STT (text) --> Coqui AI TTS (custom voice)
```

On my M3 MacBook Pro this runs at close to real-time speed: the Coqui runs logged below report real-time factors of roughly 0.7 to 0.8. The final custom-voice TTS model from Coqui is very good; I can't believe this is an open source model. Same with Meta's Seamless model. Being able to pick between TTS, STS, and STT is pretty insane. It's also so damn fast!
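
As a rough check on the speed claim, the real-time factor that Coqui prints can be related back to how much audio was produced. The sketch below assumes the factor is processing time divided by the duration of the generated audio, so values below 1.0 mean synthesis finished faster than playback. The helper names `real_time_factor` and `generated_audio_seconds` are hypothetical and not part of either library.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def generated_audio_seconds(processing_seconds: float, rtf: float) -> float:
    """Invert the ratio to estimate how much audio was produced."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return processing_seconds / rtf


# Values copied from the logged XTTS run further down in this notebook.
duration = generated_audio_seconds(20.204262018203735, 0.8112580442234434)
print(f"about {duration:.1f} s of audio generated in 20.2 s")
```

Under that assumption, the longer XTTS run in this notebook produced a bit more audio than the time it took to synthesize it.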


Models used:

1. https://huggingface.co/facebook/seamless-m4t-v2-large
2. https://github.com/coqui-ai/TTS

In [1]:
# import numpy
# print(numpy.__path__)
from transformers import AutoProcessor, SeamlessM4Tv2Model
import torchaudio
from IPython.display import Audio



In [46]:
# model_name = "facebook/seamless-streaming"
model_name = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_name)
model = SeamlessM4Tv2Model.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [44]:
# from text: English text in, German speech out
text_inputs = processor(text = "My dog loves to play outside under the tree.", src_lang="eng", return_tensors="pt")
audio_array_from_text = model.generate(**text_inputs, tgt_lang="deu")[0].cpu().numpy().squeeze()

# demo
sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

In [47]:
# back to English (speech-to-speech: German audio in, English audio out)
audio_inputs = processor(audios=audio_array_from_text, return_tensors="pt")
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()

# demo
sample_rate = model.config.sampling_rate
Audio(audio_array_from_audio, rate=sample_rate)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
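
The warning above should go away if the rate is passed explicitly. This is a minimal sketch, assuming the processor accepts a `sampling_rate` keyword alongside `audios` (which is what the warning suggests) and that the rate passed matches the one the processor expects. `encode_audio` is a hypothetical helper name, not part of `transformers`.

```python
def encode_audio(processor, waveform, sampling_rate):
    """Run the processor on a waveform with an explicit sampling rate.

    `processor` is expected to behave like the Seamless AutoProcessor used
    in this notebook; support for `sampling_rate` is an assumption here.
    """
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be a positive integer")
    return processor(
        audios=waveform,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
```

In the cell above this would be called as `encode_audio(processor, audio_array_from_text, model.config.sampling_rate)`.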


In [11]:
# from audio
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/gettysburg.wav")
# audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")

audio =  torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
audio_inputs = processor(audios=audio, return_tensors="pt")
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="ukr")[0].cpu().numpy().squeeze()

# to text: transcribe the generated Ukrainian speech
audio_inputs = processor(audios=audio_array_from_audio, return_tensors="pt")
output_tokens = model.generate(**audio_inputs, tgt_lang="ukr", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(translated_text_from_audio)


It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.
It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Чотири скор і сім років тому наші батьки принесли на цей континент нову націю, задуману в свободі і присвячену пропозиції, що всі люди створені рівними. Тепер ми втягнуті в велику громадянську війну, перевіряючи, чи може ця нація, чи будь-яка нація, задумана і присвячена довго витримати.


In [12]:
sample_rate = model.config.sampling_rate
Audio(audio_array_from_audio, rate=sample_rate)

In [27]:
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts


In [31]:
print(tts.languages)

['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru', 'nl', 'cs', 'ar', 'zh-cn', 'hu', 'ko', 'ja', 'hi']
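
Note that this list uses two-letter codes, while the Seamless calls in this notebook take three-letter codes such as `eng` and `deu`, and there is no Ukrainian entry here. A small lookup can catch mismatches before any audio is generated. This is only a sketch: `XTTS_LANGUAGES` is copied from the output above, the mapping covers just the three pairs used in this notebook, and `resolve_langs` is a hypothetical helper.

```python
# Copied from the printed tts.languages output.
XTTS_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'pt', 'pl', 'tr', 'ru',
                  'nl', 'cs', 'ar', 'zh-cn', 'hu', 'ko', 'ja', 'hi']

# XTTS code -> Seamless code; only the pairs used in this notebook.
XTTS_TO_SEAMLESS = {"en": "eng", "ru": "rus", "de": "deu"}


def resolve_langs(xtts_code: str) -> tuple[str, str]:
    """Return (xtts_code, seamless_code) or raise a descriptive error."""
    if xtts_code not in XTTS_LANGUAGES:
        raise ValueError(f"XTTS does not list {xtts_code!r}")
    if xtts_code not in XTTS_TO_SEAMLESS:
        raise ValueError(f"no Seamless code mapped for {xtts_code!r} yet")
    return xtts_code, XTTS_TO_SEAMLESS[xtts_code]


print(resolve_langs("de"))
```

Keeping the two checks separate makes it clear whether a language is missing from XTTS itself or just from the hand-written mapping.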


In [28]:
# Note: the text is Ukrainian, but tts.languages has no Ukrainian code; "ru" is passed here.
tts.tts_to_file(
    text=translated_text_from_audio,
    speaker_wav="/tmp/michelle.wav",
    file_path="/tmp/out2.wav",
    language="ru"
)

# sample_rate = model.config.sampling_rate
Audio("/tmp/out2.wav") #, rate=sample_rate)

 > Text splitted to sentences.
['Чотири скор і сім років тому наші батьки принесли на цей континент нову націю, задуману в свободі і присвячену пропозиції, що всі люди створені рівними.', 'Тепер ми втягнуті в велику громадянську війну, перевіряючи, чи може ця нація, чи будь-яка нація, задумана і присвячена довго витримати.']
 > Processing time: 20.204262018203735
 > Real-time factor: 0.8112580442234434


In [15]:
Audio("/tmp/out.wav") #, rate=sample_rate)

In [16]:
Audio("/tmp/michelle.wav") #, rate=sample_rate)

In [37]:
def speak_as_me(text, tts_model, seamless_model, seamless_processor, speaker_voice, source_lang="en", target_lang="ru", output_path="/tmp/speak_as_me.wav"):
    seamless_lang_lookup = {
        "en": "eng",
        "ru": "rus",
        "de": "deu",
    }

    # Translate the input text into speech in the target language
    text_inputs = seamless_processor(
        text=text,  # use the caller's text rather than a hardcoded sentence
        src_lang=seamless_lang_lookup[source_lang],
        return_tensors="pt"
    )
    audio_array_from_text = seamless_model.generate(
        **text_inputs, 
        tgt_lang=seamless_lang_lookup[target_lang]
    )[0].cpu().numpy().squeeze()

    # Convert this back into text until seamless has a better vocoder
    audio_inputs = seamless_processor(audios=audio_array_from_text, return_tensors="pt")
    output_tokens = seamless_model.generate(
        **audio_inputs, 
        tgt_lang=seamless_lang_lookup[target_lang], 
        generate_speech=False
    )
    translated_text_from_audio = seamless_processor.decode(
        output_tokens[0].tolist()[0], 
        skip_special_tokens=True
    )

    tts_model.tts_to_file(
        text=translated_text_from_audio, 
        speaker_wav=speaker_voice, 
        file_path=output_path,
        language=target_lang
    )

    return output_path
    


In [49]:
speak_as_me(
    text="My dog loves to play outside under the tree.",
    source_lang="en",
    target_lang="ru",
    output_path="/tmp/out45.wav",
    speaker_voice="/tmp/garrett_voice.wav",
    tts_model=tts,
    seamless_processor=processor,
    seamless_model=model,
)

Audio("/tmp/out45.wav") #, rate=sample_rate)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


 > Text splitted to sentences.
['моя собака любит играть на улице под деревом']
 > Processing time: 2.589897394180298
 > Real-time factor: 0.7241230160995583
