<a href="https://colab.research.google.com/github/bshall/urhythmic/blob/main/urhythmic_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Urhythmic: Rhythm Modeling for Voice Conversion

Demo for the paper: [Rhythm Modeling for Voice Conversion]().

*   [Companion webpage](https://ubisoft-laforge.github.io/speech/urhythmic/)
*   [Code repository](https://github.com/bshall/urhythmic)
*   [HuBERT content encoder](https://github.com/bshall/hubert)

In [None]:
import torch, torchaudio
import requests
import IPython.display as display

Download the `HubertSoft` content encoder (see https://github.com/bshall/hubert for details):

In [None]:
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True).cuda()

Select the source and target speakers. Pretrained models are available for:
1.   VCTK: p228, p268, p225, p232, p257, p231
2.   LJSpeech

In [None]:
source, target = "p228", "p232"
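A typo in a speaker name only shows up later, when the model download fails. The sketch below checks the selection up front against the VCTK speakers listed above. `VCTK_SPEAKERS` and `check_speakers` are hypothetical helpers, not part of the Urhythmic code. The identifier used for LJSpeech is not given here, so it is left out rather than guessed.

```python
# VCTK speakers listed above as having pretrained models.
# The LJSpeech identifier is not stated in this notebook, so it is omitted.
VCTK_SPEAKERS = ("p228", "p268", "p225", "p232", "p257", "p231")


def check_speakers(source, target, available=VCTK_SPEAKERS):
    """Return (source, target) unchanged, or raise ValueError on an unknown speaker."""
    for role, speaker in (("source", source), ("target", target)):
        if speaker not in available:
            raise ValueError(
                f"unknown {role} speaker {speaker!r}; "
                f"expected one of: {', '.join(available)}"
            )
    return source, target


source, target = check_speakers("p228", "p232")
```

Pass a different `available` tuple if you know the identifiers of other pretrained speakers.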

Download the `Urhythmic` voice conversion model (either `urhythmic_fine` or `urhythmic_global`):

In [None]:
urhythmic, encode = torch.hub.load("bshall/urhythmic:main", "urhythmic_fine", source_speaker=source, target_speaker=target, trust_repo=True)
urhythmic = urhythmic.cuda()

Download an example utterance:

In [None]:
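response = requests.get("https://github.com/bshall/urhythmic/raw/gh-pages/samples/urhythmic-fine/source/p228_003.wav")
response.raise_for_status()  # stop here if the download failed
# Only create the file once the request has succeeded
with open("p228_003.wav", "wb") as file:
    file.write(response.content)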

Load the audio file:

In [None]:
wav, sr = torchaudio.load("p228_003.wav")
wav = wav.unsqueeze(0).cuda()
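The playback cells further down pass `rate=16000`, so this notebook assumes 16 kHz audio. If you swap in your own recording, a quick check like the one below can catch a mismatch early. This is a minimal sketch using only the standard-library `wave` module. `wav_info` is a hypothetical helper, and it assumes an uncompressed PCM WAV file.

```python
import wave


def wav_info(path):
    """Return (sample_rate, num_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        channels = f.getnchannels()
        duration = f.getnframes() / rate
    return rate, channels, duration
```

For example, `rate, channels, seconds = wav_info("p228_003.wav")` followed by `assert rate == 16000` would flag a file that needs resampling before it is encoded.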

Extract the soft speech units and log probabilities:

In [None]:
units, log_probs = encode(hubert, wav)

Convert to the target speaker:

In [None]:
wav_ = urhythmic(units, log_probs)

Listen to the result!

In [None]:
display.Audio(wav.squeeze().cpu().numpy(), rate=16000)  # source

In [None]:
display.Audio(wav_.squeeze().cpu().numpy(), rate=16000)  # converted
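Since the model changes rhythm, the converted utterance will generally not have the same length as the source. A rough way to quantify this is to compare the two durations. The helper below is a hypothetical sketch that assumes both signals are sampled at 16 kHz, the rate used for playback above.

```python
def duration_change(num_source_samples, num_converted_samples, rate=16000):
    """Return (source_seconds, converted_seconds, converted / source ratio)."""
    source_s = num_source_samples / rate
    converted_s = num_converted_samples / rate
    return source_s, converted_s, converted_s / source_s
```

With the tensors from this notebook, the call would be `duration_change(wav.shape[-1], wav_.shape[-1])`. A ratio below 1 means the converted speech is shorter than the source, and a ratio above 1 means it is longer.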