# Soft Speech Units for Improved Voice Conversion

Demo for the paper: [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484).

- [Companion webpage](https://bshall.github.io/soft-vc/)
- [Home repo](https://github.com/bshall/soft-vc)
- [HuBERT content encoders](https://github.com/bshall/hubert)
- [Acoustic Models](https://github.com/bshall/acoustic-model)
- [HiFiGAN vocoder](https://github.com/bshall/hifigan)

In [1]:
import torch, torchaudio
import requests
import IPython.display as display

Download the HuBERT content encoder (either hubert_soft or hubert_discrete):

In [2]:
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True).cuda()

Downloading: "https://github.com/bshall/hubert/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/hubert/releases/download/v0.2/hubert-soft-35d9f29f.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-35d9f29f.pt


100%|██████████| 361M/361M [00:07<00:00, 48.2MB/s]


Download the acoustic model (either hubert_soft or hubert_discrete):

In [3]:
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft", trust_repo=True).cuda()

Downloading: "https://github.com/bshall/acoustic-model/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/acoustic-model/releases/download/v0.1/hubert-soft-0321fd7e.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-0321fd7e.pt


100%|██████████| 71.8M/71.8M [00:01<00:00, 42.7MB/s]


Download the vocoder (either hifigan_hubert_soft or hifigan_hubert_discrete):

In [4]:
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft", trust_repo=True).cuda()

Downloading: "https://github.com/bshall/hifigan/zipball/main" to /root/.cache/torch/hub/main.zip


  WeightNorm.apply(module, name, dim)


Downloading: "https://github.com/bshall/hifigan/releases/download/v0.1/hifigan-hubert-soft-65f03469.pt" to /root/.cache/torch/hub/checkpoints/hifigan-hubert-soft-65f03469.pt


100%|██████████| 54.9M/54.9M [00:00<00:00, 59.2MB/s]


Download an example utterance:

In [5]:
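response = requests.get("https://drive.google.com/uc?export=preview&id=1Y3KuPAhB5VcsmIaokBVKu3LUEZOfhSu8")
# Stop here if the download failed, rather than writing an error page to disk
response.raise_for_status()
with open("example.wav", "wb") as file:
  file.write(response.content)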

Or upload your own:

In [6]:
from google.colab import files

uploaded = files.upload()
for name, data in uploaded.items():
  # Store the upload as example.wav, the file name the following cells load
  with open("example.wav", 'wb') as f:
    f.write(data)
  print(f'Saved file: {name}')


Saving wav1.wav to wav1.wav
Saved file: wav1.wav
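Before loading the file, it can help to confirm that `example.wav` really is a WAV file and to see its channel count and sample rate, since the next cell selects one channel and resamples. The helper below is a minimal sketch using only the Python standard library. `describe_wav` is a hypothetical name, not part of the Soft-VC code, and the `wave` module handles only uncompressed PCM files.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return basic properties of a PCM WAV file held in memory.

    Raises ValueError when the bytes do not start with a RIFF/WAVE header,
    for example when a download returned an HTML page instead of audio.
    """
    if len(data) < 12 or data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "frames": wav.getnframes(),
        }
```

A possible use is `describe_wav(open("example.wav", "rb").read())`. A `sample_rate` other than 16000 or a `channels` value above 1 indicates that the resampling and channel selection in the next cell will change the audio.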


Load the source audio (and resample to 16 kHz if necessary):

In [7]:
source, sr = torchaudio.load("example.wav")
# Keep only the first channel and add a batch dimension, giving the shape
# [batch_size, channels, sequence_length] = [1, 1, T] that hubert.units expects
source = source[:1, :].unsqueeze(0)
source = torchaudio.functional.resample(source, sr, 16000)
source = source.cuda()

  s = torchaudio.io.StreamReader(src, format, None, buffer_size)


Convert to the target speaker:

In [9]:
with torch.inference_mode():
    # Extract speech units
    units = hubert.units(source)
    # Generate target spectrogram
    mel = acoustic.generate(units).transpose(1, 2)
    # Generate audio waveform
    target = hifigan(mel)

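If `source` is passed with shape `[1, T]` instead of `[1, 1, T]` (batch, channel, time), this cell fails with:

RuntimeError: Expected number of channels in input to be divisible by num_groups, but got input of shape [512, 50799] and num_groups=512

A 2-D input is most likely treated as a single unbatched signal by the first convolution, so the normalization layer that follows sees 512 as the batch size and the frame count as the number of channels. Passing a 3-D tensor, for example via `source.unsqueeze(0)` on a `[1, T]` tensor, avoids the error.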
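To get a feel for the sequence lengths in the conversion step, the number of unit frames can be estimated from the number of input samples. The sketch below rests on assumptions that this notebook does not state: the standard HuBERT convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2) and 40 samples of padding on each side before encoding. `num_unit_frames` is a hypothetical helper, not a function from the `hubert` package.

```python
# Assumed values, not taken from this notebook: the standard HuBERT
# convolutional feature extractor and 40 samples of padding per side.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
PADDING = 40


def num_unit_frames(num_samples: int) -> int:
    """Estimate how many unit frames a 16 kHz waveform produces."""
    length = num_samples + 2 * PADDING
    # Each unpadded 1-D convolution maps length L to (L - kernel) // stride + 1
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length
```

Under these assumptions, `num_unit_frames(16000)` gives 50, that is, one unit every 20 ms of audio.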

Let's listen to the results!

The source:

In [None]:
display.Audio(source.squeeze().cpu(), rate=16000)

and the converted utterance:

In [None]:
display.Audio(target.squeeze().cpu(), rate=16000)
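To keep the converted utterance after the session ends, the samples can also be written to disk. The following is a minimal sketch using only the standard library. `write_pcm16` is a hypothetical helper, and it assumes the audio is available as a flat sequence of floats in the range [-1, 1], for example from `target.squeeze().cpu().tolist()`.

```python
import struct
import wave


def write_pcm16(path_or_file, samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for x in samples:
        # Clip out-of-range values so they do not overflow a 16-bit integer
        x = max(-1.0, min(1.0, float(x)))
        frames += struct.pack("<h", int(round(x * 32767)))
    with wave.open(path_or_file, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

For example, `write_pcm16("converted.wav", target.squeeze().cpu().tolist())` would save the result at 16 kHz. The per-sample loop favors clarity over speed, which should be acceptable for short utterances.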