<a href="https://colab.research.google.com/github/bshall/urhythmic/blob/main/urhythmic_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Urhythmic: Rhythm Modeling for Voice Conversion

Demo for the paper: [Rhythm Modeling for Voice Conversion]().

*   [Companion webpage](https://ubisoft-laforge.github.io/speech/urhythmic/)
*   [Code repository](https://github.com/bshall/urhythmic)
*   [HuBERT content encoder](https://github.com/bshall/hubert)

In [1]:
import torch, torchaudio
import requests
import IPython.display as display

Download the `HubertSoft` content encoder (see https://github.com/bshall/hubert for details):

In [2]:
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo="check").cuda()

The repository bshall_hubert does not belong to the list of trusted repositories and as such cannot be downloaded. Do you trust this repository and wish to add it to the trusted list of repositories (y/N)?y


Downloading: "https://github.com/bshall/hubert/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/hubert/releases/download/v0.2/hubert-soft-35d9f29f.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-35d9f29f.pt
100%|██████████| 361M/361M [00:22<00:00, 16.5MB/s]


Select the source and target speakers. Pretrained models are available for:
1.   VCTK speakers: p228, p268, p225, p232, p257, p231
2.   LJSpeech



In [20]:
source, target = "p228", "p232"
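A mistyped speaker name would otherwise only surface when the model download fails, so it can be worth checking the names first. The following is a minimal sketch, not part of the Urhythmic API: `check_speakers` is a hypothetical helper, and the `"LJSpeech"` identifier is an assumption about how that speaker is named in the hub entry point.

```python
# Speakers listed above as having pretrained models.
VCTK_SPEAKERS = {"p228", "p268", "p225", "p232", "p257", "p231"}
# Assumption: the exact identifier for LJSpeech is not given in this notebook.
AVAILABLE_SPEAKERS = VCTK_SPEAKERS | {"LJSpeech"}


def check_speakers(source, target):
    """Return (source, target) unchanged, or raise if either name is not listed."""
    unknown = [name for name in (source, target) if name not in AVAILABLE_SPEAKERS]
    if unknown:
        raise ValueError(
            f"No pretrained model listed for {unknown}; "
            f"choose from {sorted(AVAILABLE_SPEAKERS)}"
        )
    return source, target


source, target = check_speakers("p228", "p232")
```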

Download the `Urhythmic` voice conversion model (either `urhythmic_fine` or `urhythmic_global`):

In [21]:
urhythmic, encode = torch.hub.load("bshall/urhythmic:main", "urhythmic_fine", source_speaker=source, target_speaker=target, trust_repo="check")
urhythmic = urhythmic.cuda()

Using cache found in /root/.cache/torch/hub/bshall_urhythmic_main
Downloading: "https://github.com/bshall/urhythmic/releases/download/v0.1/hifigan-p232-e0efc4c3.pt" to /root/.cache/torch/hub/checkpoints/hifigan-p232-e0efc4c3.pt
100%|██████████| 56.6M/56.6M [00:03<00:00, 16.5MB/s]


Download an example utterance:

In [22]:
url = "https://github.com/bshall/urhythmic/raw/gh-pages/samples/urhythmic-fine/source/p228_003.wav"
response = requests.get(url)
response.raise_for_status()  # stop here rather than saving an error page as audio
with open("p228_003.wav", "wb") as file:
  file.write(response.content)

Load the audio file:

In [23]:
wav, sr = torchaudio.load("p228_003.wav")  # wav has shape (channels, samples)
wav = wav.unsqueeze(0).cuda()  # add a batch dimension: (1, channels, samples)

Extract the soft speech units and log probabilities:

In [24]:
units, log_probs = encode(hubert, wav)

Convert to the target speaker:

In [25]:
wav_ = urhythmic(units, log_probs)

Listen to the result!

In [26]:
display.Audio(wav.squeeze().cpu().numpy(), rate=16000)  # source

In [27]:
display.Audio(wav_.squeeze().cpu().numpy(), rate=16000)  # converted
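To keep the converted audio after the session ends, it can be written to disk. This is a minimal sketch using only the Python standard library; `save_wav_16bit` is a hypothetical helper, and the 16 kHz default again assumes the output rate matches the `rate=16000` used for playback.

```python
import array
import sys
import wave


def save_wav_16bit(path, samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM."""
    # Clip to [-1, 1] so out-of-range values do not overflow the 16-bit range.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

For the conversion above, this could be called as `save_wav_16bit("converted.wav", wav_.squeeze().cpu().tolist())`.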