# CS224S HW4: Voice Cloning with Neural Models

This part of the homework is worth 40 of the assignment's 100 total points. The goal is simple: clone your own voice using reasonably state-of-the-art neural modeling approaches! The voice cloning framework we are using is described [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning). Roughly, one neural network generates a speaker embedding from audio, and a neural TTS system conditions on this speaker embedding to synthesize audio in that speaker's voice for a given text input.
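
To make the conditioning step more concrete, here is a toy NumPy sketch of one common way a TTS model can condition on a speaker embedding: the same embedding vector is concatenated onto every time step of the text encoder's output. The function name `condition_on_speaker` and all array sizes are illustrative assumptions, not the framework's actual code.

```python
import numpy as np

def condition_on_speaker(text_encoding, speaker_embedding):
    """Append the speaker embedding to every time step of the text encoding.

    text_encoding:     array of shape (T, D), one D-dim vector per input symbol
    speaker_embedding: array of shape (E,), one vector summarizing the voice
    returns:           array of shape (T, D + E)
    """
    T = text_encoding.shape[0]
    tiled = np.tile(speaker_embedding, (T, 1))  # (T, E): same vector at each step
    return np.concatenate([text_encoding, tiled], axis=1)

# Illustrative sizes only (assumptions): 12 symbols, 512-dim text features,
# and a 256-dim unit-length speaker embedding.
rng = np.random.default_rng(0)
text_encoding = rng.normal(size=(12, 512))
speaker_embedding = rng.normal(size=256)
speaker_embedding /= np.linalg.norm(speaker_embedding)

conditioned = condition_on_speaker(text_encoding, speaker_embedding)
print(conditioned.shape)
```

Because the text features are left unchanged, swapping in a different embedding changes only the voice the decoder is asked to imitate, not the words.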

Goals for this work:
* Record a sample of your own voice and use it as input to a voice cloning system
* Try variations of inputs for speaker embeddings and input text to understand how well this voice cloning solution works in practice
* Visualize spectrograms from real vs synthesized audio examples to compare them

**Note:** You will need to make a copy of this Colab notebook in your Google Drive before you can edit it.

# Dependencies

In [None]:
!git clone https://github.com/CorentinJ/Real-Time-Voice-Cloning.git
!apt install -y libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install -r Real-Time-Voice-Cloning/requirements.txt -q
!pip install --upgrade matplotlib

In [None]:
# used to play audio files as in HW1
import IPython.display as ipd
from base64 import b64decode
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

def record(sec=5):
  try:
    from google.colab import output
  except ImportError:
    print('Not possible to import output from google.colab')
    return ''
  else:
    print('Recording')
    display(ipd.Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    fname = 'recorded_audio.wav'
    print('Saving to', fname)
    b = b64decode(s.split(',')[1])
    with open(fname, 'wb') as f:
      f.write(b)
    return fname


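# helper function to plot a mel spectrogram
# arguments: wave array, sampling rate, optional list of (x, y, text) annotations,
#            number of mel bins, max frequency of mel scale,
#            optional figure/axes to draw on, whether to add a title and colorbar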
def plot_melspectrogram(wav, sr, annotations=None, n_mels=256, fmax=4096, 
                        fig=None, ax=None, show_legend=True):
    
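    if ax is None:
        fig, ax = plt.subplots(1,1,figsize=(20,5))
    elif fig is None:
        # fall back to the figure that owns the supplied axes (needed for the colorbar)
        fig = ax.figure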
    M = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels, fmax=fmax, n_fft=2048)
    M_db = librosa.power_to_db(M, ref=np.max)
    img = librosa.display.specshow(M_db, y_axis='mel', x_axis='time', ax=ax, fmax=fmax)
    if show_legend:
        ax.set(title='Mel spectrogram display')
        fig.colorbar(img, ax=ax, format="%+2.f dB")
        
    # iterate over list of text annotations and draw them
    if annotations is not None:
        for x,y,text in annotations:
            ax.annotate(
            text,
            xy=(x,y), xycoords='data',
            xytext=(10, -50), textcoords='offset pixels',
            horizontalalignment='right',
            color='white',
            fontsize=20,
            verticalalignment='bottom',
            arrowprops=dict(
                arrowstyle= '-|>',
                 color='white',
                 lw=1,
                 ls='-')
            ) 

# Record and clone **your** voice

In this colab, we'll be poking around a [toolkit](https://github.com/CorentinJ/Real-Time-Voice-Cloning) for cloning your own voice. It's important that we only use this tool for good--do **not** record others without their permission.

Note: if recording your voice doesn't work, feel free to use, and comment on, one of the examples in the `Real-Time-Voice-Cloning/samples` directory instead.
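
One possible cause of recording trouble: a browser's `MediaRecorder` typically produces a compressed container such as WebM or Ogg rather than true WAV, so the file saved as `recorded_audio.wav` may not be a WAV file despite its name. The sketch below guesses the container from the file's first bytes. The helper names `sniff_audio_container` and `sniff_audio_file` are hypothetical, and only three common formats are recognized.

```python
def sniff_audio_container(header: bytes) -> str:
    """Guess the container format from the first bytes of an audio file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"\x1a\x45\xdf\xa3":  # EBML magic, used by WebM/Matroska
        return "webm"
    if header[:4] == b"OggS":
        return "ogg"
    return "unknown"

def sniff_audio_file(path: str) -> str:
    """Read the first 12 bytes of `path` and guess its container."""
    with open(path, "rb") as f:
        return sniff_audio_container(f.read(12))

# In the notebook, after recording:
# print(sniff_audio_file('recorded_audio.wav'))
```

If the result is not `wav` and a later step rejects the file, converting it with the `ffmpeg` tool installed in the Dependencies section is one option.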


## **Task: Record your voice and plot a spectrogram** (5 points)

In [None]:
record(5)
input_path = 'recorded_audio.wav'
ipd.Audio(input_path)

In [None]:
input_wav, input_sr = librosa.load(input_path)
plot_melspectrogram(input_wav, input_sr)

Now use the tool to synthesize the exact same words you said.

In [None]:
# If the downloads fail, try again after a minute or run this commented code.
# You can also download the models locally and manually drag them into the 
# colab file structure
# !gdown --folder 1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j
# !mv /content/RTVC\ models /content/default
# !mv default Real-Time-Voice-Cloning/saved_models/default

# Use the stop button or Runtime->Interrupt Execution when finished
# Please click into the output cell to enter the path when prompted
!python Real-Time-Voice-Cloning/demo_cli.py

## **Task: Run voice conversion and plot some output** (5 points)

Run the voice conversion steps above on your audio sample. This should produce an audio file in which the TTS system speaks the requested text using the speaker embedding computed from your sample.

Play the audio file and plot a spectrogram.

In [None]:
output_path = 'demo_output_00.wav' # change this as necessary
ipd.Audio(output_path)

In [None]:
    #############################
    #### YOUR CODE GOES HERE ####

    #############################

## **Task: Compare spectrograms of original and synthesized audio** (10 points)

Plot spectrograms for your original utterance and synthesized utterance next to one another. Describe the differences you notice when listening to the audio, and how you think such differences register in the spectrogram plot.
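
As a starting point for the layout only, here is a minimal sketch of drawing two spectrograms in one row with `plt.subplots`. The names `compare_side_by_side` and `_specgram_fallback` are hypothetical, the fallback plotter uses matplotlib's linear-frequency `specgram` rather than a mel scale, and the two signals are synthetic stand-ins, not real recordings.

```python
import numpy as np
import matplotlib.pyplot as plt

def _specgram_fallback(wav, sr, fig=None, ax=None):
    """Stand-in plotter (assumption): matplotlib's linear-frequency spectrogram."""
    ax.specgram(wav, NFFT=1024, Fs=sr, noverlap=512)
    ax.set(xlabel="Time (s)", ylabel="Frequency (Hz)")

def compare_side_by_side(wav_a, sr_a, wav_b, sr_b,
                         titles=("Original", "Synthesized"),
                         plot_fn=_specgram_fallback):
    """Draw two spectrograms in one row so they can be compared by eye."""
    fig, axes = plt.subplots(1, 2, figsize=(20, 5))
    for ax, wav, sr, title in zip(axes, (wav_a, wav_b), (sr_a, sr_b), titles):
        plot_fn(wav, sr, fig=fig, ax=ax)  # any plotter accepting fig= and ax=
        ax.set_title(title)
    return fig, axes

# Synthetic stand-ins for the two recordings: a steady tone and a rising chirp.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
chirp = np.sin(2 * np.pi * (220 * t + 110 * t ** 2))
fig, axes = compare_side_by_side(tone, sr, chirp, sr)
```

In this notebook, passing `plot_fn=plot_melspectrogram` should reuse the helper defined in the Dependencies section, since it accepts `fig` and `ax` arguments.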

In [None]:
    #############################
    #### YOUR CODE GOES HERE ####

    #############################

# Try variations of input examples

Now that we've gotten the hang of it, please use the tool to comment on what you notice from the audio and mel spectrogram for some of these situations (these are just ideas, you may choose what you actually try):

1. Speak in a very monotone voice and synthesize the exact same words you said.
1. Speak very expressively and synthesize the exact same words you said.
1. Speak a fixed input and synthesize a phrase with common words; then, using the same input, synthesize a phrase with rare words.
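
If you want a number to go with your listening impressions for the monotone and expressive cases, one rough proxy is how much the pitch (F0) varies over time. The sketch below estimates F0 per frame from the autocorrelation peak and reports its spread. The function names, frame sizes, and the 75-400 Hz search range are assumptions, and the inputs are synthetic tones rather than speech.

```python
import numpy as np

def frame_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate the F0 of one frame from its autocorrelation peak (rough sketch)."""
    frame = frame - frame.mean()
    # keep non-negative lags only: ac[k] is the correlation at a lag of k samples
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)  # shortest period considered
    lag_max = int(sr / fmin)  # longest period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / lag

def f0_track(wav, sr, frame_len=2048, hop=512):
    """Return one F0 estimate (Hz) per frame of the signal."""
    starts = range(0, len(wav) - frame_len + 1, hop)
    return np.array([frame_f0(wav[s:s + frame_len], sr) for s in starts])

# Synthetic stand-ins: a flat 150 Hz tone vs. a glide from 120 Hz to 220 Hz.
sr = 16000
t = np.arange(sr) / sr
monotone = np.sin(2 * np.pi * 150 * t)
expressive = np.sin(2 * np.pi * (120 * t + 50 * t ** 2))

mono_f0 = f0_track(monotone, sr)
expr_f0 = f0_track(expressive, sr)
print("monotone F0 spread (Hz):  ", np.std(mono_f0))
print("expressive F0 spread (Hz):", np.std(expr_f0))
```

Real speech is noisier than these tones and contains unvoiced stretches, so for actual recordings a more robust tracker such as `librosa.pyin` may work better.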

## **Task: Try at least 2 input variations** (10 points)

Try two different input audio files and/or input audio / transcription request pairings. See how the voice cloning system responds when you are more expressive, accented, or when synthesizing non-standard words (e.g. technical jargon with TTS pronunciation errors).

Run voice conversion on two new examples.

In [None]:
    #############################
    #### YOUR CODE GOES HERE ####

    #############################

## **Task: Comment on what you notice from the mel spectrograms and audio.** (10 points)

For the new examples you just generated, comment on your findings about how the voice conversion system responds when you alter inputs used to build the samples (input text or audio samples). 

Show at least two spectrograms and comment on your findings about how the system works for the types of inputs you tried.

In [None]:
    #############################
    #### YOUR CODE GOES HERE ####

    #############################