# Results of GAN-based speech synthesis
The GitHub repository for the implementation of GAN TTS can be found at [gantts](https://github.com/r9y9/gantts).
- Clone the above repository to your local system.
- Set up the Python environment as required, and run the following two shell scripts:
    - `tts_demo.sh` for text-to-speech synthesis;
    - `vc_demo.sh` for voice conversion.
    
The audio files below were generated by these two demos. This notebook presents them side by side for comparison.
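
Since everything in this notebook depends on the demos having written their output, a quick check of the result folders can save some confusion. The following is a minimal sketch, not part of gantts: the helper `missing_result_dirs` and the constant `EXPECTED_LAYOUT` are hypothetical names, and the folder names are simply the ones this notebook reads from under `../results`.

```python
from os import path

# Folder layout this notebook reads from. The constant itself is a
# hypothetical convenience for this sketch, not something gantts defines.
EXPECTED_LAYOUT = {
    'tts': ['groundtruth', 'baseline', 'gan'],
    'vc': ['source', 'target', 'baseline', 'gan'],
}

def missing_result_dirs(results_root, layout=EXPECTED_LAYOUT):
    """Return the expected result folders that do not exist under results_root."""
    missing = []
    for task, folders in layout.items():
        for folder in folders:
            folder_path = path.join(results_root, task, folder)
            if not path.isdir(folder_path):
                missing.append(folder_path)
    return missing
```

Calling `missing_result_dirs('../results')` returns an empty list when every folder is in place. Any path it does return points at a demo that has not produced (or copied) its output yet.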

## Import relevant packages and read transcripts for the audio files

In [1]:
from scipy.io import wavfile
from os import path
import IPython
from IPython.display import Audio   # to display audios in notebook

In [2]:
# read the transcript for each utterance into a dictionary
transcripts = {}
with open('../results/cmuarctic.data.txt') as f:
    for line in f:
        line = line.strip('() \r\n') # remove the leading and trailing parentheses/whitespace
        name, text = line.split(' ', maxsplit=1)
        transcripts[name] = text
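
To make the parsing step concrete, here is a small sketch of the same logic applied to a single line. It assumes each line of `cmuarctic.data.txt` has the form `( utterance_id "text" )`. The sample line is illustrative (built from a transcript shown later in this notebook), and the helper name `parse_prompt_line` is hypothetical.

```python
def parse_prompt_line(line):
    """Split one prompt line into (utterance name, quoted text)."""
    # Strip the enclosing parentheses plus any surrounding whitespace.
    line = line.strip('() \r\n')
    # The name is everything before the first space; the rest is the text.
    name, text = line.split(' ', maxsplit=1)
    return name, text

# Illustrative input line, assumed to match the file's layout.
sample = '( arctic_b0535 "He read his fragments aloud." )\n'
name, text = parse_prompt_line(sample)
print(name)  # arctic_b0535
print(text)  # "He read his fragments aloud."
```

Note that `strip` stops at the first character outside the given set, so the double quotes around the text are kept. That is why the transcripts printed below appear quoted.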

# TTS results
**Dataset**
- CMU_ARCTIC database, a female speaker called *[slt](http://www.festvox.org/cmu_arctic/dbs_slt.html)*.
- 1132 utterances in total. The last five utterances are separated as a held-out test set.

**Ground truth**: the original human utterance is also included for comparison as the ground truth.  
**Baseline**: in this demo, a conventional MGE (minimum generation error) based method is used.  
**GAN**: GAN-based method.

In [3]:
test_utterance_names = ['arctic_b0535', 'arctic_b0536', 'arctic_b0537', 'arctic_b0538', 'arctic_b0539']
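
The hard-coded names above are the last five utterance IDs in sorted order. As a rough sketch, they could also be derived rather than typed out. This assumes the usual ARCTIC naming (`arctic_a0001` to `arctic_a0593`, then `arctic_b0001` to `arctic_b0539`, 1132 in total). The helper `held_out_names` is hypothetical, and the ID list is generated here only to keep the example self-contained.

```python
def held_out_names(names, n_test=5):
    """Return the last n_test utterance names in sorted order."""
    return sorted(names)[-n_test:]

# Assumed ID scheme: 593 'a' prompts followed by 539 'b' prompts.
all_names = [f'arctic_a{i:04d}' for i in range(1, 594)]
all_names += [f'arctic_b{i:04d}' for i in range(1, 540)]

print(len(all_names))  # 1132
print(held_out_names(all_names))
```

In the notebook itself, passing the keys of `transcripts` to `held_out_names` should give the same five names, provided the transcript file lists exactly these utterances.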

In [4]:
root = '../results/tts'
print('[From top to bottom: ground truth, baseline, GAN.]\n')
for name in test_utterance_names:
    print(transcripts[name])
    fs, gt_wav = wavfile.read(path.join(root, 'groundtruth', f'{name}.wav'))
    IPython.display.display(Audio(gt_wav, rate=fs))
    
    fs, baseline_wav = wavfile.read(path.join(root, 'baseline', f'{name}.wav'))
    IPython.display.display(Audio(baseline_wav, rate=fs))
    
    fs, gan_wav = wavfile.read(path.join(root, 'gan', f'{name}.wav'))
    IPython.display.display(Audio(gan_wav, rate=fs))
    
    print('')

[From top to bottom: ground truth, baseline, GAN.]

"He read his fragments aloud."



"Typhoid -- did I tell you."



"But she had become an automaton."



"At the best, they were necessary accessories."



"You were making them talk shop, Ruth charged him."





# Voice conversion results
**Dataset**
- CMU_ARCTIC database. Source: a female speaker called *[clb](http://www.festvox.org/cmu_arctic/dbs_clb.html)*. Target: another female speaker called *[slt](http://www.festvox.org/cmu_arctic/dbs_slt.html)*.
- 1132 utterances in total. Five utterances are separated as a held-out test set.

**Source and Target**: utterances from the source speaker and the target speaker as the ground truth.  
**Baseline**: in this demo, a conventional MGE (minimum generation error) based method is used.  
**GAN**: GAN-based method.

In [5]:
test_utterance_names = ['arctic_a0496', 'arctic_a0497', 'arctic_a0498', 'arctic_a0499', 'arctic_a0500']

In [6]:
root = '../results/vc'
print('[From top to bottom: source, target, baseline, GAN.]\n')
for name in test_utterance_names:
    print(transcripts[name])
    for folder in ['source', 'target', 'baseline', 'gan']:
        fs, wav = wavfile.read(path.join(root, folder, f'{name}.wav'))
        IPython.display.display(Audio(wav, rate=fs))
    print('')

[From top to bottom: source, target, baseline, GAN.]

"And Tom King patiently endured."



"King took every advantage he knew."



"The lines were now very taut."



"And right there I saw and knew it all."



"Who the devil gave it to you to be judge and jury."



