# Singlish Text to Speech with Malaya Speech

---

[Github](https://github.com/eugenesiow/practical-ml/) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

This notebook automatically converts a piece of input text into a speech audio file.

[Text-To-Speech synthesis](https://paperswithcode.com/task/text-to-speech-synthesis) is the task of converting written text in natural language to speech.

The [model used](https://malaya-speech.readthedocs.io/en/latest/tts-singlish.html) is one of the pre-trained models from `malaya_speech`.

The notebook is structured as follows:
* Setting up the Environment
* Using the Model (Running Inference)
* Apply Speech Enhancement/Noise Reduction

# Setting up the Environment

#### Dependencies and Runtime

If you're running this notebook in Google Colab, most of the dependencies are already installed, and this particular example does not need a GPU.

If you decide to synthesize speech for many (thousands of) text inputs and want inference to go faster, select `Runtime` > `Change Runtime Type` from the menu bar and ensure that `GPU` is selected as the `Hardware accelerator`.

We need to install `malaya` and `malaya_speech` for this example to run, so execute the command below to set up the dependencies.

In [3]:
!pip install -q malaya malaya_speech
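
Before moving on, it can help to confirm that both packages are visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of `malaya_speech`.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical helper: an empty list means both packages were found.
missing = missing_packages(['malaya', 'malaya_speech'])
print('Missing packages:', missing or 'none')
```

If the list is not empty, re-running the install cell (and restarting the runtime if Colab asks for it) is the usual first thing to try.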



# Using the Model (Running Inference)

Now we load and run the Singlish text-to-speech model.

Specifically, we run the following steps:

* Define `load_models` - We define the `load_models` function, which downloads the FastSpeech2 and MelGAN models named `model_name` using the `malaya_speech` library.
* Define `predict` - We define the `predict` function for inference. It takes an `input_text` string and the models that we loaded. In this function, we call `.predict` on the FastSpeech2 model and then pass its `postnet-output` through the MelGAN model to get the output audio vector.
* Run `load_models` with the `female-singlish` model name.

In [6]:
import malaya_speech


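def load_models(model_name):
  # Load the pre-trained FastSpeech2 model and the matching MelGAN vocoder.
  fs2 = malaya_speech.tts.fastspeech2(model = model_name)
  melgan = malaya_speech.vocoder.melgan(model = model_name)
  return fs2, melgan

def predict(input_text, singlish, melgan):
  # Text -> FastSpeech2 outputs (a dict that includes 'postnet-output').
  r_singlish = singlish.predict(input_text)
  # FastSpeech2 output -> waveform via the vocoder.
  y_ = melgan(r_singlish['postnet-output'])
  # Convert the float waveform to integer samples for playback and saving.
  data = malaya_speech.astype.float_to_int(y_)
  return data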

fs2, melgan = load_models('female-singlish')
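
The `float_to_int` call at the end of `predict` turns the vocoder's floating-point waveform into integer samples. As a rough illustration of that kind of conversion (an assumption about the general technique, not the library's actual implementation), a float sample in the range [-1.0, 1.0] can be clipped and scaled to 16-bit PCM. The function name `to_int16_pcm` is hypothetical.

```python
def to_int16_pcm(samples):
    """Clip float samples to [-1.0, 1.0] and scale them to the int16 range."""
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        pcm.append(int(round(clipped * 32767)))
    return pcm


# The out-of-range value 1.5 is clipped to full scale.
print(to_int16_pcm([0.0, 1.0, -1.0, 1.5]))  # [0, 32767, -32767, 32767]
```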

Now we define the `input_text` variable, a piece of text that we want to convert to a speech audio file. Next, we synthesize/generate the audio file.

The notebook then displays the generated audio sample in a player so we can play it back.

In [10]:
from IPython.display import Audio, display

sample_rate = 22050
input_text = 'The second Rental Support Scheme payout will be disbursed about a month early to ensure businesses get cash flow relief as soon as possible, say IRAS and MOF.'
audio = predict(input_text, fs2, melgan)
display(Audio(audio, rate=sample_rate))
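
If you want to synthesize several pieces of text in one go, a small loop around the synthesis call is enough. The sketch below is an assumption-laden illustration: `synthesize_many` and `fake_synthesize` are hypothetical names, and the stand-in synthesizer only exists so the example runs without the models loaded.

```python
def synthesize_many(texts, synthesize, sample_rate=22050):
    """Run `synthesize` on each text and record the clip length in seconds."""
    results = []
    for text in texts:
        audio = synthesize(text)
        results.append({
            'text': text,
            'audio': audio,
            'seconds': len(audio) / sample_rate,
        })
    return results


# Stand-in synthesizer so the sketch runs on its own:
# it returns one second of silence per word.
def fake_synthesize(text):
    return [0] * (22050 * len(text.split()))


clips = synthesize_many(['hello there', 'ok lah can'], fake_synthesize)
for clip in clips:
    print(clip['text'], clip['seconds'])
```

In this notebook, you would pass something like `lambda text: predict(text, fs2, melgan)` as the `synthesize` argument instead of the stand-in.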

Other available models can be seen by running the code below:

In [12]:
malaya_speech.tts.available_fastspeech2()

| Model | Size (MB) | Quantized Size (MB) | Combined loss | Understands punctuation |
|---|---|---|---|---|
| male | 125.0 | 31.7 | 1.846 | False |
| male-v2 | 65.5 | 16.7 | 1.886 | False |
| female | 125.0 | 31.7 | 1.744 | False |
| female-v2 | 65.5 | 16.7 | 1.804 | False |
| husein | 125.0 | 31.7 | 0.6411 | False |
| husein-v2 | 65.5 | 16.7 | 0.7712 | False |
| haqkiem | 125.0 | 31.7 | 0.5663 | True |
| female-singlish | 125.0 | 31.7 | 0.5112 | True |
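
One possible way to shortlist a model from this table is to keep only the models that understand punctuation and then pick the lowest combined loss. This is just a heuristic sketch: the rows are copied by hand from the table above, the function name `lowest_loss_model` is hypothetical, and losses from models trained on different voices are not necessarily comparable.

```python
# Rows copied from the table above: (model, combined loss, understands punctuation)
models = [
    ('male', 1.846, False),
    ('male-v2', 1.886, False),
    ('female', 1.744, False),
    ('female-v2', 1.804, False),
    ('husein', 0.6411, False),
    ('husein-v2', 0.7712, False),
    ('haqkiem', 0.5663, True),
    ('female-singlish', 0.5112, True),
]


def lowest_loss_model(rows, need_punctuation=True):
    """Return the name of the lowest-loss model, optionally requiring punctuation support."""
    candidates = [row for row in rows if row[2] or not need_punctuation]
    return min(candidates, key=lambda row: row[1])[0]


print(lowest_loss_model(models))  # female-singlish
```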


We notice that there is some noise in the previously generated sample. A speech enhancement model can reduce this noise and improve the quality of the speech. We try this in the next section, which is entirely optional.

# Apply Speech Enhancement/Noise Reduction

We use the simple and convenient LogMMSE algorithm (Log Minimum Mean Square Error) with the [logmmse library](https://github.com/wilsonchingg/logmmse).

In [8]:
!pip install -q logmmse

Run the LogMMSE algorithm on the generated `audio` and display the enhanced audio sample in an audio player.

In [11]:
import numpy as np
from logmmse import logmmse

enhanced = logmmse(np.array(audio), sample_rate, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)
display(Audio(enhanced, rate=sample_rate))

Save the enhanced audio to file.

In [None]:
from scipy.io.wavfile import write

write('/content/audio.wav', sample_rate, enhanced)
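
To check a saved file, you can read its header back with the standard-library `wave` module. The sketch below assumes the file is 16-bit PCM, which `wave` can read (it does not handle floating-point WAV files). `wav_info` is a hypothetical helper, and the demo writes its own one-second tone to a temporary path so it runs without the notebook's output.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return channel count, sample rate and duration of a PCM WAV file."""
    with wave.open(path, 'rb') as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            'channels': wav_file.getnchannels(),
            'sample_rate': rate,
            'seconds': frames / rate,
        }


# Demonstration: write a one-second 440 Hz tone as 16-bit mono PCM, then inspect it.
demo_rate = 22050
tone = [int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / demo_rate))
        for n in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), 'demo_tone.wav')
with wave.open(demo_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
    wav_file.setframerate(demo_rate)
    wav_file.writeframes(struct.pack('<%dh' % len(tone), *tone))

info = wav_info(demo_path)
print(info)
```

In this notebook, `wav_info('/content/audio.wav')` should report the same `sample_rate` that was passed to `write`, provided the file was saved as integer PCM.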

We can connect to Google Drive with the following code. You can also click the `Files` icon on the left panel and click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the output file, which is saved in the `/content/` directory, to the root of your Google Drive.

In [None]:
import shutil
shutil.move('/content/audio.wav', '/content/drive/My Drive/audio.wav')
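
If the drive is not mounted, the move fails with an error. A slightly more defensive variant checks that the source file and the destination folder exist first. This is a sketch with a hypothetical helper name, `move_if_mounted`; the demo uses temporary folders as stand-ins for `/content/` and the Drive mount.

```python
import os
import shutil
import tempfile


def move_if_mounted(src, dest_dir):
    """Move `src` into `dest_dir` only if both exist; return the new path or None."""
    if not os.path.isfile(src) or not os.path.isdir(dest_dir):
        return None
    return shutil.move(src, os.path.join(dest_dir, os.path.basename(src)))


# Demonstration with temporary paths standing in for /content/ and the Drive mount.
work_dir = tempfile.mkdtemp()
drive_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, 'audio.wav')
with open(src_path, 'wb') as handle:
    handle.write(b'placeholder')

# The destination folder does not exist, so nothing is moved.
not_moved = move_if_mounted(src_path, os.path.join(drive_dir, 'missing'))
# The destination folder exists, so the file is moved.
moved_to = move_if_mounted(src_path, drive_dir)
print(not_moved, moved_to)
```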

More notebooks are available at [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Please star the repo or leave feedback on how to improve the notebooks on the [GitHub repo](https://github.com/eugenesiow/practical-ml/).