<a href="https://colab.research.google.com/github/baiju012/TTS/blob/main/TTS_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1.1 Set up the Environment

First, make sure you're using a GPU. In Colab, go to Runtime > Change runtime type and select GPU as the hardware accelerator.
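
To confirm that the runtime really has a GPU attached before installing anything, a quick check along these lines can help. This is a minimal sketch that only looks for the nvidia-smi tool on the PATH and asks it to list devices; the helper name gpu_visible is illustrative and not part of any library.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi exists and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools found on this machine
    # "nvidia-smi -L" lists the attached GPUs, one per line
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU available:", gpu_visible())
```

If this prints False in Colab, the runtime type has most likely not been switched to GPU yet.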

Then install the required Python packages. The Tacotron 2 and WaveGlow repositories are cloned in step 1.2.

In [None]:
# Install required libraries
!pip install numpy scipy torch unidecode librosa matplotlib gdown


Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.5/235.5 kB 3.5 MB/s eta 0:00:00
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8


1.2 Download Pre-trained Models

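You can either train the models from scratch or use pre-trained models to save time.

The cell below clones the Tacotron 2 and WaveGlow code repositories. The pre-trained checkpoint files loaded in section 2 (tacotron2_statedict.pt and waveglow_256channels.pt) are not part of these repositories and must be downloaded separately into /content/tacotron2.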

In [None]:
# Clone the Tacotron 2 repository
!git clone https://github.com/NVIDIA/tacotron2.git

# Clone the WaveGlow repository
!git clone https://github.com/NVIDIA/waveglow.git


Cloning into 'tacotron2'...
remote: Enumerating objects: 412, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 412 (delta 2), reused 2 (delta 1), pack-reused 406 (from 1)
Receiving objects: 100% (412/412), 2.70 MiB | 4.09 MiB/s, done.
Resolving deltas: 100% (202/202), done.
Cloning into 'waveglow'...
remote: Enumerating objects: 196, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 196 (delta 2), reused 2 (delta 0), pack-reused 190 (from 1)
Receiving objects: 100% (196/196), 437.57 KiB | 2.21 MiB/s, done.
Resolving deltas: 100% (108/108), done.
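
Cloning fetches only the source code. Before moving on, it can be worth checking that the checkpoint files loaded in section 2 are actually present. The sketch below assumes they are expected in /content/tacotron2 (the directory the notebook switches into later); the helper name missing_checkpoints is illustrative.

```python
from pathlib import Path

# File names as used in section 2 of this notebook
CHECKPOINTS = ["tacotron2_statedict.pt", "waveglow_256channels.pt"]

def missing_checkpoints(directory, names=CHECKPOINTS):
    """Return the checkpoint file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

# Assumed location: the directory the notebook later switches into
missing = missing_checkpoints("/content/tacotron2")
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
else:
    print("All checkpoint files found.")
```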


1.3 Install Text Processing Dependencies

Tacotron 2's text cleaners depend on the unidecode package. It was already installed in step 1.1, so this cell simply confirms that it is present:

In [None]:
!pip install unidecode



1.4 Import Libraries

Before importing anything from the Tacotron 2 repository, switch into its directory so that its modules (hparams, model, text) can be found.

In [None]:
import os
os.chdir("/content/tacotron2")

2. Load the Pre-trained Models

You’ll need to load the downloaded Tacotron 2 and WaveGlow models into the Colab environment.

In [None]:
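import sys
import torch
from hparams import create_hparams
from model import Tacotron2

sys.path.append('/content/waveglow')  # lets torch.load find WaveGlow's glow module

# Build Tacotron 2, then load its weights (the checkpoint holds a state dict, not a model)
tacotron2 = Tacotron2(create_hparams())
tacotron2.load_state_dict(torch.load('tacotron2_statedict.pt')['state_dict'])

# Load WaveGlow; NVIDIA's checkpoint stores the whole pickled model under the 'model' key
waveglow = torch.load('waveglow_256channels.pt')['model']

# Move both models to the GPU and set them to evaluation mode
tacotron2.cuda().eval()
waveglow.cuda().eval()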


3. Text to Mel-Spectrogram (Tacotron 2)

This function converts input text into a mel-spectrogram using Tacotron 2.

In [None]:
import numpy as np
from text import text_to_sequence

def text_to_mel(text):
    # Encode the text as a batch containing one sequence of symbol IDs
    sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
    sequence = torch.from_numpy(sequence).cuda().long()

    # Generate the mel-spectrogram; the post-net output is the refined version
    with torch.no_grad():
        _, mel_outputs_postnet, _, _ = tacotron2.inference(sequence)

    return mel_outputs_postnet
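
For intuition, text_to_sequence maps each character of the cleaned text to an integer ID, and the [None, :] index adds a leading batch dimension of size 1. The toy encoder below illustrates the idea with a made-up symbol table; it is not the symbol table or the cleaning logic used by the Tacotron 2 repository.

```python
# Hypothetical symbol table, for illustration only
SYMBOLS = list("_ abcdefghijklmnopqrstuvwxyz!?,.'")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}

def toy_text_to_sequence(text):
    """Lowercase the text and map each known character to its integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]

ids = toy_text_to_sequence("Hi!")
batch = [ids]  # a batch holding one sequence, analogous to sequence[None, :]

print(ids)                       # [9, 10, 28]
print(len(batch), len(batch[0])) # 1 3
```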


4. Mel-Spectrogram to Audio (WaveGlow)

This function converts the mel-spectrogram into audio using WaveGlow.

In [None]:
from denoiser import Denoiser  # provided by the WaveGlow repository

def mel_to_audio(mel_spectrogram):
    with torch.no_grad():
        # Vocode the mel-spectrogram into a waveform
        audio = waveglow.infer(mel_spectrogram)

        # Denoise the audio using WaveGlow's denoiser
        denoiser = Denoiser(waveglow)
        audio = denoiser(audio, strength=0.01)[:, 0]

    return audio
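
The length of the generated audio follows from the number of mel frames. Assuming the Tacotron 2 repository's default hop length of 256 samples (an assumption, since this notebook never sets it) and the 22050 Hz sampling rate used below, each frame corresponds to roughly 11.6 ms of audio. The helper name is illustrative.

```python
def mel_frames_to_seconds(n_frames, hop_length=256, sampling_rate=22050):
    """Approximate duration of the audio produced from `n_frames` mel frames."""
    return n_frames * hop_length / sampling_rate

# Example: a spectrogram with 400 frames
print(round(mel_frames_to_seconds(400), 2))  # 4.64
```

For a spectrogram returned by text_to_mel, the frame count should be the size of its last dimension.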


5. Speech Synthesis Function

Now, we combine everything into one function that converts text into speech.

In [None]:
import matplotlib.pyplot as plt
import librosa.display
from scipy.io.wavfile import write

def synthesize_speech(text):
    # Convert text to mel-spectrogram
    mel_spectrogram = text_to_mel(text)

    # Convert mel-spectrogram to audio waveform
    audio = mel_to_audio(mel_spectrogram)

    # Drop the batch dimension so the array is 1-D (mono), as write() expects
    audio_np = audio[0].cpu().numpy()

    # Save the generated audio to a .wav file
    write("output.wav", 22050, audio_np)

    # Display the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio_np, sr=22050)
    plt.title('Generated Speech Waveform')
    plt.show()

    return "output.wav"
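
When given a floating-point array, scipy's write produces a floating-point WAV file, which not every audio player supports. If a 16-bit PCM file is preferred, the samples can be scaled and written with Python's standard wave module instead. This is a minimal sketch, assuming a one-dimensional sequence of floats in [-1, 1]; the function name save_pcm16 and the demo file name are illustrative.

```python
import math
import os
import struct
import tempfile
import wave

def save_pcm16(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to the int16 range
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))

# Demo: 0.1 s of a 440 Hz tone written to the system temp directory
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
demo_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
save_pcm16(demo_path, tone)
```

Inside synthesize_speech, the equivalent call would pass the one-dimensional audio array (for example as a list) in place of the demo tone.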


6. Run the Model (Synthesize Speech)

You can now pass in English text, and the pipeline will generate the corresponding speech, save it to output.wav, and play it back in the notebook.

In [None]:
# Input text to synthesize
text = "Hello! How are you doing today?"

# Generate the audio and save it to output.wav
output_wav = synthesize_speech(text)

# Play the audio in the notebook
import IPython.display as ipd
ipd.Audio(output_wav)
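
Long inputs are a practical limit: Tacotron 2 stops decoding after a maximum number of steps (1000 by default in the NVIDIA repository's hparams, which is worth verifying against the cloned code), so long passages may be cut off. One workaround is to split the text into sentences and synthesize them one at a time. The splitter below is a simple regex-based sketch, not a full sentence tokenizer; split_sentences is an illustrative name.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("Hello! How are you doing today? I am fine.")
print(sentences)
# ['Hello!', 'How are you doing today?', 'I am fine.']
```

Each sentence could then be passed through text_to_mel and mel_to_audio and the resulting waveforms concatenated. Note that calling synthesize_speech repeatedly would overwrite output.wav each time.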
