<h1><center>Joint-Attention Learning in Prosody Transfer Speech Synthesis</center></h1>

<h2><center>Demo</center></h2>


High-quality text-to-speech (TTS) synthesis has remained a challenging research topic for years. Pushing the general naturalness of synthesized utterances, state-of-the-art models such as Tacotron and DeepVoice3 achieve excellent results in improving the quality of synthesized speech. Aiming at more realistic speech synthesis, prosody-flexible TTS, also called expressive TTS, has recently become a topic of significant research. For example, Google has proposed an expressive TTS framework that learns a reference utterance’s prosody and transfers it to a new utterance synthesized by the system. In this paper, we propose a prosody transfer text-to-speech synthesis model. Our work builds on the end-to-end, CNN-block-based model of Baidu’s DeepVoice3 (DV3). Unlike former models, our work uses a joint-attention learning process over the reference prosody and the text. This comparatively simpler model learns the reference input’s prosody along with the text input. A token table and its weights are also learned from the reference input to factorize the possible styles in an unsupervised manner. The results show that our model can successfully factorize the reference prosodies to represent characteristics of different speakers and styles, learned without supervision from the training data.


#### Deep-Voice3 Model
![title](./DV3_mine.png)

#### Style Encoder and Tokens
In this paper, our prosody transfer TTS system is built on an open-source Baidu DV3 system. The model is shown in the figure that follows. On top of DV3, we added a reference encoder to the framework. The reference encoder learns to extract weights directly from the reference audio. The weight matrix is then multiplied directly with the randomly initialized token table to give the combined token used as the reference embedding. Our model uses no explicit attention module to learn the similarity between the reference audio’s features and the global token table. Instead, predicted weights are extracted directly from the reference utterance input and combined with the global token table to give a reference embedding for further use.
The learned reference embedding is inserted into the text encoder in the Encoder PreNet and Convolution Blocks, so that a jointly learned (key, value) pair based on the text and the reference style is fed into the attention block in the decoder.
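
The weight-times-token-table step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the softmax normalization of the weights, and the choice to add the reference embedding to every text time step are all assumptions made for illustration. The source only states that predicted weights are multiplied with a token table and that the result is inserted into the text encoder.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def reference_embedding(ref_features, w_proj, token_table):
    """Predict token weights from reference features and mix the token table.

    ref_features: (batch, feat_dim) summary of the reference audio
                  (hypothetical output of a reference encoder)
    w_proj:       (feat_dim, num_tokens) learned projection (assumed)
    token_table:  (num_tokens, embed_dim) randomly initialized, learned tokens
    """
    logits = ref_features @ w_proj      # (batch, num_tokens)
    weights = softmax(logits)           # normalization is an assumption
    return weights @ token_table, weights


def condition_text_encoding(text_hidden, ref_embed):
    """Add the reference embedding to every text time step (assumed scheme)."""
    return text_hidden + ref_embed[:, None, :]


# Toy dimensions, chosen only for the example
rng = np.random.default_rng(0)
batch, feat_dim, num_tokens, embed_dim, text_len = 2, 16, 5, 8, 7

ref_features = rng.normal(size=(batch, feat_dim))
w_proj = rng.normal(size=(feat_dim, num_tokens))
token_table = rng.normal(size=(num_tokens, embed_dim))
text_hidden = rng.normal(size=(batch, text_len, embed_dim))

ref_embed, weights = reference_embedding(ref_features, w_proj, token_table)
conditioned = condition_text_encoding(text_hidden, ref_embed)

print(weights.shape, ref_embed.shape, conditioned.shape)
```

Note that no similarity between reference features and tokens is computed here, which mirrors the statement that there is no explicit attention module over the token table.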

#### Proposed model
![title](./reference_encoderDV3.png)


### 1. Token factorization

<h3><center>Assigned tokens give different speaker voices in the synthesis results</center></h3>

<center>Figure 1. Spectrograms of the synthesis results for 5 tokens, trained on the VCTK dataset. Tokens 1 to 5 are shown from top to bottom. </center>

![title](./Tokens/VCTK/spectrogram.png)
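
One plausible reading of "assigning" a token at synthesis time is to replace the predicted weights with a one-hot vector, so that the reference embedding equals a single row of the token table. The sketch below illustrates that reading only; the source does not state how tokens were assigned, and the function name and table shape are hypothetical.

```python
import numpy as np


def single_token_embedding(token_table, token_index):
    """Select one token by multiplying the table with a one-hot weight vector."""
    one_hot = np.zeros(token_table.shape[0])
    one_hot[token_index] = 1.0
    return one_hot @ token_table


# Toy token table: 5 tokens, 8-dimensional embeddings (sizes are assumptions)
rng = np.random.default_rng(0)
token_table = rng.normal(size=(5, 8))

# Each one-hot weight vector recovers exactly one row of the table
embeddings = [single_token_embedding(token_table, i) for i in range(5)]
for i, emb in enumerate(embeddings):
    print('Token' + str(i + 1) + ':', emb.shape)
```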

#### Utterance: "I’ve felt the chance that I have a number of options."

To listen, play the audio files below:

In [6]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Token examples
    dirn = './Tokens/'
    # VCTK dataset
    path = dirn + 'VCTK/'

    # Load and play the synthesis result for each of the 5 tokens
    for i in range(5):
        print('Token'+ str(i+1)+ ': ')
        Bsrc_path = path + 'token' + str(i+1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        
    

Token1: 


Token2: 


Token3: 


Token4: 


Token5: 




<h3><center>Assigned tokens gave different synthesis results</center></h3>

<center>Figure 2. Spectrograms of the synthesis results for 5 tokens, trained on an internal dataset. Tokens 1 to 5 are shown from top to bottom. </center>

![title](./Tokens/otherDataset/spectrogram.png)



#### Utterance: "Just recovered a fumble on ensuing kickoff."

To listen, play the audio files below:

In [5]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Token examples
    dirn = './Tokens/'
    # Internal dataset
    path = dirn + 'otherDataset/'

    # Load and play the synthesis result for each of the 5 tokens
    for i in range(5):
        print('Token'+ str(i+1)+ ': ')
        Bsrc_path = path + 'Token' + str(i+1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        
    


Token1: 


Token2: 


Token3: 


Token4: 


Token5: 


<h3><center> Token factorization on Blizzard 2013 dataset </center></h3>

<center>Figure 3. Spectrograms of the synthesis results for 5 tokens, trained on the Blizzard 2013 dataset. Tokens 1 to 5 are shown from top to bottom. </center>

![title](./Tokens/Blizzards/spectrogram.png)

#### Utterance: "Just recovered a fumble on ensuing kickoff."

To listen, play the audio files below:

In [7]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Token examples
    dirn = './Tokens/'
    # Blizzard 2013 dataset
    path = dirn + 'Blizzards/'

    # Load and play the synthesis result for each of the 5 tokens
    for i in range(5):
        print('Token'+ str(i+1)+ ': ')
        Bsrc_path = path + 'token' + str(i+1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        

Token1: 


Token2: 


Token3: 


Token4: 


Token5: 


### 2. Prosody Transfer

<center><h3> Parallel utterances </h3></center>

The following shows three examples of prosody transfer synthesis.

In each example, the text of the utterance to synthesize is the same as the reference's. The first utterance in each example is the reference. The second is the synthesis result using neutral prosody. The third is the prosody transfer result.

#### Example 1

Utterance text content: My mother always took him to the town on a market day in a light gig.

In [12]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Parallel case, example 1
    path = dirn + 'parallel/example1/'

    # File stems and labels: reference, neutral synthesis, prosody transfer synthesis
    name = ['ref','neutral_16bitPCM','prosodyT_16bitPCM']
    content = ['Reference utterance:','Neutral prosody result:','Prosody Transfer result:']
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
    
    

Reference utterance:


Neutral prosody result:


Prosody Transfer result:


#### Example 2

Utterance text content: So we never saw Dick any more.

In [11]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Parallel case, example 2
    path = dirn + 'parallel/example2/'

    # File stems and labels: reference, neutral synthesis, prosody transfer synthesis
    name = ['ref','neutral_16bitPCM','prosodyT_16bitPCM']
    content = ['Reference utterance:','Neutral prosody result:','Prosody Transfer result:']
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))

Reference utterance:


Neutral prosody result:


Prosody Transfer result:


#### Example 3

Utterance text content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

In [10]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Parallel case, example 3
    path = dirn + 'parallel/example3/'

    # File stems and labels: reference, neutral synthesis, prosody transfer synthesis
    name = ['ref','neutral_16bitPCM','prosodyT_16bitPCM']
    content = ['Reference utterance:','Neutral prosody result:','Prosody Transfer result:']
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))

Reference utterance:


Neutral prosody result:


Prosody Transfer result:


<h3><center> Unparallel utterances </center></h3>

The following shows three examples of unparallel prosody transfer synthesis.

In each example, the text of the utterance to synthesize is different from the reference's. The first utterance in each example is the reference. The second and third are two prosody transfer synthesis results with different text contents.

#### Example 1

Reference text: My mother always took him to the town on a market day in a light gig.

Prosody Transfer result 1's text: So we never saw Dick any more.

Prosody Transfer result 2's text: Just recovered a fumble on ensuing kickoff.

The prosody of the unparallel reference utterance is transferred to synthesis results that have different text contents.

In [27]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Unparallel case, example 1
    path = dirn + 'unparallel/example1/'

    name = ['ref','transfer_16bitPCM','transfer2_16bitPCM']
    content = ['Reference utterance: ', 'Prosody Transfer text 1: ','Prosody Transfer text 2: ']
    textc = ['My mother always took him to the town on a market day in a light gig.', 'So we never saw Dick any more.', 'Just recovered a fumble on ensuing kickoff.']
    # Play each audio clip, then print the text it speaks
    for i in range(3):
        print(content[i])
        
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        print('Text: '+ textc[i])
        print('\n')

Reference utterance: 


Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 1: 


Text: So we never saw Dick any more.


Prosody Transfer text 2: 


Text: Just recovered a fumble on ensuing kickoff.




#### Example 2

Reference text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

Prosody Transfer result 1's text: My mother always took him to the town on a market day in a light gig.

Prosody Transfer result 2's text: There was nothing disagreeable in Mister Rushworth's appearance.

The prosody of the unparallel reference utterance is transferred to synthesis results that have different text contents.

In [26]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Unparallel case, example 2
    path = dirn + 'unparallel/example2/'

    name = ['ref','transfer_16bitPCM','transfer2_16bitPCM']
    content = ['Reference utterance:', 'Prosody Transfer text 1:','Prosody Transfer text 2:']
    textc = ['You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?','My mother always took him to the town on a market day in a light gig.',"There was nothing disagreeable in Mister Rushworth's appearance."]
    # Play each audio clip, then print the text it speaks
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        print('Text: '+ textc[i])
        print('\n')

Reference utterance:


Text: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?


Prosody Transfer text 1:


Text: My mother always took him to the town on a market day in a light gig.


Prosody Transfer text 2:


Text: There was nothing disagreeable in Mister Rushworth's appearance.




#### Example 3

Reference text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.

Prosody Transfer result 1's text: Just recovered a fumble on ensuing kickoff.

Prosody Transfer result 2's text: My mother always took him to the town on a market day in a light gig.

The prosody of the unparallel reference utterance is transferred to synthesis results that have different text contents.


In [28]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Prosody transfer examples
    dirn = './ProsodyTransfer/'
    # Unparallel case, example 3
    path = dirn + 'unparallel/example3/'

    name = ['ref','transfer_16bitPCM','transfer2_16bitPCM']
    content = ['Reference utterance:', 'Prosody Transfer text 1:','Prosody Transfer text 2:']
    textc = ["There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.",'Just recovered a fumble on ensuing kickoff.','My mother always took him to the town on a market day in a light gig.']
    # Play each audio clip, then print the text it speaks
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
        print('Text: '+textc[i])
        print('\n')

Reference utterance:


Text: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.


Prosody Transfer text 1:


Text: Just recovered a fumble on ensuing kickoff.


Prosody Transfer text 2:


Text: My mother always took him to the town on a market day in a light gig.


