# Mandarin Text to Speech with Coqui TTS

---

[GitHub](https://github.com/eugenesiow/practical-ml/) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

This notebook automatically converts an input piece of text into a speech audio file.

[Text-To-Speech synthesis](https://paperswithcode.com/task/text-to-speech-synthesis) is the task of converting written text in natural language to speech.

The Mandarin model used is one of the pre-trained [Coqui TTS](https://github.com/coqui-ai/TTS) models. It dates from the Mozilla TTS days (Coqui TTS is a hard fork of Mozilla TTS). The model was trained on the [中文标准女声音库](https://www.data-baker.com/data/index/source/) (Chinese Standard Female Voice Corpus), a dataset of 10,000 sentences from [DataBaker Technology](https://www.data-baker.com/).

The notebook is structured as follows:
* Setting up the Environment
* Using the Model (Running Inference)
* Apply Speech Enhancement/Noise Reduction (Optional)

# Setting up the Environment

#### Dependencies and Runtime

If you're running this notebook in Google Colab, most of the dependencies are already installed and we **don't need the GPU** for this particular example. 

This example needs the Coqui TTS library, which is published as the `TTS` package, so execute the command below to set up the dependencies.

In [1]:
!pip install -q TTS==0.4.1
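
Because the install is pinned to `TTS==0.4.1`, it can help to confirm which version the runtime actually picked up before going further. The following is a minimal sketch using only the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of Coqui TTS.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "TTS" is the distribution name used by the pip command above.
print("TTS version:", installed_version("TTS"))
```

If this prints `None`, the install cell has not run successfully in the current runtime.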

[?25l[K     |▎                               | 10 kB 18.3 MB/s eta 0:00:01[K     |▌                               | 20 kB 24.5 MB/s eta 0:00:01[K     |▊                               | 30 kB 29.2 MB/s eta 0:00:01[K     |█                               | 40 kB 30.5 MB/s eta 0:00:01[K     |█▏                              | 51 kB 31.4 MB/s eta 0:00:01[K     |█▍                              | 61 kB 34.0 MB/s eta 0:00:01[K     |█▊                              | 71 kB 26.7 MB/s eta 0:00:01[K     |██                              | 81 kB 28.2 MB/s eta 0:00:01[K     |██▏                             | 92 kB 28.6 MB/s eta 0:00:01[K     |██▍                             | 102 kB 30.2 MB/s eta 0:00:01[K     |██▋                             | 112 kB 30.2 MB/s eta 0:00:01[K     |██▉                             | 122 kB 30.2 MB/s eta 0:00:01[K     |███▏                            | 133 kB 30.2 MB/s eta 0:00:01[K     |███▍                            | 143 kB 30.2 MB/s eta 0:

# Using the Model (Running Inference)

Next, we load the specific Mandarin speaker model. You can browse the full set of [available models](https://github.com/coqui-ai/TTS/blob/main/TTS/.models.json) from Coqui.

Specifically we are running the following steps:

* `manager.download_model()` - Downloads the `tts_models/zh-CN/baker/tacotron2-DDC-GST` pre-trained model from Coqui. This model is a female `zh-cn` (Mandarin) speaker.
* `Synthesizer()` - Sets up a `Synthesizer` from our model.

In [4]:
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

manager = ModelManager()
model_path, config_path, model_item = manager.download_model("tts_models/zh-CN/baker/tacotron2-DDC-GST")
synthesizer = Synthesizer(
    model_path, config_path, None, None, None,
)

 > tts_models/zh-CN/baker/tacotron2-DDC-GST is already downloaded.
 > Using model: tacotron2
 > Model's reduction rate `r` is set to: 2
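
The identifier passed to `manager.download_model()` appears to follow a `type/language/dataset/model` pattern, which matches how the entries in the linked `.models.json` are nested. As a small illustration, the sketch below splits the identifier into those four parts. The `ModelName` tuple and `parse_model_name` helper are hypothetical names introduced here, not part of the `TTS` library.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four slash-separated parts of a model identifier (assumed layout)."""
    model_type: str
    language: str
    dataset: str
    model: str


def parse_model_name(name: str) -> ModelName:
    """Split an identifier such as 'tts_models/zh-CN/baker/tacotron2-DDC-GST'."""
    parts = name.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 slash-separated parts, got {len(parts)}: {name!r}")
    return ModelName(*parts)


info = parse_model_name("tts_models/zh-CN/baker/tacotron2-DDC-GST")
print(info.language, info.dataset, info.model)
```

Reading the identifier this way makes it easier to find other models for the same language in the model list.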


Now we define the `example_text` variable, a piece of Mandarin text that we want to convert to a speech audio file. This example text means "How are you? I'm fine."

Next, we synthesize the audio with the `synthesizer.tts()` function.

The notebook then displays an audio player so that we can play back the generated sample.

In [11]:
from IPython.display import Audio, display

example_text = '你好吗？我很好。'

wavs = synthesizer.tts(example_text)

display(Audio(wavs, rate=synthesizer.output_sample_rate))

 > Text splitted to sentences.
['你好吗？', '我很好。']
 > Processing time: 1.8753728866577148
 > Real-time factor: 0.7543777756640873
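
The log above reports a processing time and a real-time factor. Assuming the real-time factor is the processing time divided by the duration of the generated audio (the usual definition), a value below 1 means synthesis ran faster than playback. The sketch below works through the numbers from the log; the function names are illustrative only.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono waveform with `num_samples` samples."""
    return num_samples / sample_rate


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (assumed definition)."""
    return processing_seconds / audio_seconds


# Values copied from the log output above.
processing = 1.8753728866577148
reported_rtf = 0.7543777756640873

# Rearranging the definition gives the duration of the generated audio.
implied_duration = processing / reported_rtf
print(f"Implied audio duration: {implied_duration:.2f} s")
```

Under that assumption, the sample is roughly 2.5 seconds long. In the notebook, the same duration could be checked directly with `audio_duration_seconds(len(wavs), synthesizer.output_sample_rate)`.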


The generated sample already contains very little noise. If you want to try to improve the speech quality further with a speech enhancement algorithm, move on to the next section. This step is entirely optional.

# Apply Speech Enhancement/Noise Reduction

We use the simple and convenient LogMMSE algorithm (Log Minimum Mean Square Error) with the [logmmse library](https://github.com/wilsonchingg/logmmse).

In [12]:
!pip install -q logmmse

Run the LogMMSE algorithm on the generated audio `wavs` and display the enhanced audio sample in an audio player.

In [24]:
import numpy as np
from logmmse import logmmse

# Denoise the synthesized waveform; output_file=None means no file is written here.
enhanced = logmmse(
    np.array(wavs, dtype=np.float32),
    synthesizer.output_sample_rate,
    output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15,
)
display(Audio(enhanced, rate=synthesizer.output_sample_rate))

Save the enhanced audio to file.

In [None]:
from scipy.io.wavfile import write

write('/content/audio.wav', synthesizer.output_sample_rate, enhanced)
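
If `enhanced` is a floating-point array, `scipy.io.wavfile.write` stores it as a floating-point WAV file, which some audio players do not open. One common workaround is to save 16-bit PCM instead. The following is a minimal sketch using only the standard library; `float_to_pcm16` and `write_pcm16_wav` are hypothetical helpers, and it assumes mono audio with samples nominally in the range -1.0 to 1.0.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to 16-bit signed integers."""
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def write_pcm16_wav(path, samples, sample_rate):
    """Write a mono 16-bit PCM WAV file from an iterable of float samples."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16 bits
        wav_file.setframerate(int(sample_rate))
        # Pack as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```

In the notebook this could be called as `write_pcm16_wav('/content/audio_pcm16.wav', enhanced, synthesizer.output_sample_rate)`, where the output filename is just an example.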

We can connect to Google Drive with the following code. You can also click the `Files` icon on the left panel and click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the output files which are saved in the `/content/` directory to the root of your Google Drive.

In [None]:
import shutil
shutil.move('/content/audio.wav', '/content/drive/My Drive/audio.wav')
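
The move fails with a less obvious error if the Drive is not mounted or the audio file was never written. As a more defensive variant, the sketch below checks both conditions first. The helper name `move_to_drive` is hypothetical and the paths in the usage note are the ones from this notebook.

```python
import shutil
from pathlib import Path


def move_to_drive(src, drive_root):
    """Move `src` into `drive_root` after checking that both exist."""
    src_path = Path(src)
    root = Path(drive_root)
    if not src_path.is_file():
        raise FileNotFoundError(f"Output file not found: {src_path}")
    if not root.is_dir():
        raise NotADirectoryError(f"Drive folder not found (is it mounted?): {root}")
    destination = root / src_path.name
    shutil.move(str(src_path), str(destination))
    return destination
```

For this notebook the call would be `move_to_drive('/content/audio.wav', '/content/drive/My Drive/')`.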

More notebooks are available at [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Please star the [GitHub repo](https://github.com/eugenesiow/practical-ml/) or leave feedback there on how to improve the notebooks.