# Discrete Speech Resynthesis and Speech Continuation walk-through

Below we will see how to use textless-lib to resynthesize speech and to generate speech continuations.

### Prerequisites

We'll need fairseq, textless and a few other dependencies.

One caveat: at the moment Colab doesn't support numpy versions above 1.21, which are required for some textless-lib functionality.

Here we'll use a small workaround: we clone textless and use the on-disk copy. If you're running locally, you can instead uncomment the two lines in the next cell.

In [None]:
# ! pip install git+https://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8
# ! git clone https://github.com/facebookresearch/textlesslib.git && cd textlesslib && pip install -e .

In [None]:
# this will ask to restart the runtime as it updated
# an already loaded package -- please restart
! pip install numpy==1.21.5

Collecting numpy==1.21.5
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.5 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.5


In [None]:
# The version specifier is quoted so the shell does not treat ">" as an output redirect.
! pip install "torch>=1.1.0" torchaudio AMFM_decompy librosa threadpoolctl==3.0.0 numba==0.53.0 joblib scikit-learn unidecode inflect

In [None]:
! pip install git+https://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8
! git clone https://github.com/facebookresearch/textlesslib.git

Collecting git+https://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8
  Cloning https://github.com/pytorch/fairseq.git (to revision dd106d9534b22e7db859a6b87ffd7780c38341f8) to /tmp/pip-req-build-6v0jsg76
  Running command git clone -q https://github.com/pytorch/fairseq.git /tmp/pip-req-build-6v0jsg76
  Running command git rev-parse -q --verify 'sha^dd106d9534b22e7db859a6b87ffd7780c38341f8'
  Running command git fetch -q https://github.com/pytorch/fairseq.git dd106d9534b22e7db859a6b87ffd7780c38341f8
  Running command git checkout -q dd106d9534b22e7db859a6b87ffd7780c38341f8
  Running command git submodule update --init --recursive -q
  From https://github.com/ngoyal2707/Megatron-LM
   * branch            adb23324c222aad0aad89308e70302d996a5eaeb -> FETCH_HEAD
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ...

In [None]:
cd textlesslib

/content/textlesslib


In [None]:
import textless

In [None]:
import IPython.display as ipd

import torch
import torchaudio
import pathlib

import textless
from textless.data.speech_encoder import SpeechEncoder
from textless.data.quantized_datasets import QuantizedLibriSpeech
from textless.vocoders.tacotron2.vocoder import TacotronVocoder

## Resynthesis

First, let us choose which dense model and quantizer we will use:

In [None]:
dense_model_name = "hubert-base-ls960"
quantizer_name = "kmeans"
vocab_size = 200 # one of [50, 100, 200]

We can initialise a SpeechEncoder by name; the corresponding checkpoint is then downloaded automatically.

In [None]:
encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    need_f0=False,
    deduplicate=True,
    f0_normalizer=None,
    f0_quantizer=None,
).cuda()

We will use a LibriSpeech dataset for our example. We can start with a vanilla version of it, load a single example and listen to it:

In [None]:
! mkdir -p datasets

In [None]:
raw_dataset = torchaudio.datasets.LIBRISPEECH(
    root="./datasets",
    url="dev-clean",
    download=True,
)

In [None]:
audio, input_sample_rate, *_ = raw_dataset[7]
audio

tensor([[-0.0003, -0.0006, -0.0006,  ..., -0.0003, -0.0003, -0.0003]])

In [None]:
ipd.Audio(audio, rate=input_sample_rate)

We can encode this audio example using our SpeechEncoder. The encoded audio is represented as a dictionary with key-value pairs:

In [None]:
encoded_audio = encoder(audio)
encoded_audio.keys()

dict_keys(['units', 'durations', 'dense'])

'units' contains the pseudo-unit stream, while 'durations' encodes per-token durations and 'dense' returns the original HuBERT representation of the audio.

Let's have a look at the units:

In [None]:
encoded_audio['units'][:10]

tensor([ 14, 131, 191,  11,  22,  86,  22, 125,  10, 154], device='cuda:0',
       dtype=torch.int32)

In [None]:
encoded_audio['durations'][:10]

tensor([4, 8, 8, 2, 1, 1, 1, 1, 1, 2], device='cuda:0')
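The relation between 'units' and 'durations' can be sketched as a run-length encoding. This is only a minimal illustration in plain Python, assuming that `deduplicate=True` collapses consecutive repeats of a unit and that each duration counts the collapsed frames; the helper names and the frame-level input are hypothetical and are not part of textless-lib.

```python
from itertools import groupby

# Hypothetical frame-level unit stream (one unit per encoder frame).
frame_units = [14, 14, 14, 14, 131, 131, 191, 11, 11]


def deduplicate(frames):
    """Collapse runs of identical units into (units, durations)."""
    units, durations = [], []
    for unit, run in groupby(frames):
        units.append(unit)
        durations.append(len(list(run)))
    return units, durations


def expand(units, durations):
    """Inverse of deduplicate: repeat every unit by its duration."""
    return [u for u, d in zip(units, durations) for _ in range(d)]


units, durations = deduplicate(frame_units)
print(units)      # [14, 131, 191, 11]
print(durations)  # [4, 2, 1, 2]

# Expanding the deduplicated stream recovers the frame-level stream.
assert expand(units, durations) == frame_units
```

Under this reading, summing the durations gives the number of encoder frames covered by the unit stream.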

Alternatively, textless-lib provides a simple wrapper around the LibriSpeech dataset that returns a "textless" representation of each datapoint:

In [None]:
dataset = QuantizedLibriSpeech(
    encoder,
    root="./datasets",
    url="dev-clean",
    download=False,
)

Here datapoints are encoded in exactly the same way:

In [None]:
datum = dataset[7]
datum['units'][:10]

tensor([ 14, 131, 191,  11,  22,  86,  22, 125,  10, 154], dtype=torch.int32)

Now let us initialise a corresponding Tacotron instance with a matching configuration:

In [None]:
vocoder = TacotronVocoder.by_name(
    dense_model_name,
    quantizer_name,
    vocab_size,
).cuda()

Everything is ready for resynthesising the unit stream back into audio:

In [None]:
resynth_audio = vocoder(datum['units'])

In [None]:
ipd.Audio(resynth_audio.cpu().numpy(), rate=vocoder.output_sample_rate)

## Speech Continuation

One additional component we need is a unit-level language model. Here we re-use one from the GSLM paper

In [None]:
import sys
sys.path.append(str(pathlib.Path(textless.__path__[0]).parent / 'examples' / 'gslm/'))
from sampler import UnitLanguageModelSampler

...and download a pre-trained checkpoint (more checkpoints [here](https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/gslm/ulm)).

In [None]:
! mkdir -p LM && \
    wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/lm_km200/hubert200_lm.tgz -O LM/hubert200_lm.tgz && \
    cd LM/ && \
    tar -xvf hubert200_lm.tgz 

--2022-02-15 15:54:21--  https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/lm_km200/hubert200_lm.tgz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1450463515 (1.4G) [application/gzip]
Saving to: ‘LM/hubert200_lm.tgz’


2022-02-15 15:54:50 (48.0 MB/s) - ‘LM/hubert200_lm.tgz’ saved [1450463515/1450463515]

hubert200_lm/
hubert200_lm/dict.txt
hubert200_lm/checkpoint_best.pt


We take the first 5 seconds of the same audio as a prompt:

In [None]:
prompt = audio[:, :input_sample_rate * 5]
ipd.Audio(prompt, rate=input_sample_rate)

...and encode it into the unit stream and double-check how the resynthesised version sounds:

In [None]:
encoded = encoder(prompt)
units = encoded['units']
units

tensor([ 14, 131, 191,  11,  22, 125,  22, 125,  10, 154,  46,  49,  50,  12,
         93,  66,  31, 127, 160,  17, 112,  23,  96,  12, 172,  85,  89,  31,
         46, 190,  33,   9,  87, 157,  41, 136,   1, 111,  19, 141, 120, 152,
        133,  57, 113,  28,   1, 151, 192,  87,  19, 152,  36, 162, 166, 191,
          8,  11, 149, 125,   8, 125,  22, 125,  89, 174,  37,  79, 143, 104,
        136, 115, 172,  13, 156,  44, 187,  79, 104, 109,  38, 119,  51, 182,
         93,  66, 196, 128, 199,  33, 169, 136, 172,  71,  31, 144,  61, 198,
         12,  85,  89,  31, 100, 115, 177, 106, 193,  72, 170,  78, 111,  19,
         15,  41, 115,  54, 177, 106, 193, 148,  35,  69, 127, 170,   1,  95,
         30,  39, 152,  36, 149, 197,  20, 125,  20, 137,  92],
       device='cuda:0', dtype=torch.int32)

In [None]:
resynth_prompt = vocoder(units).cpu().numpy()
ipd.Audio(resynth_prompt, rate=vocoder.output_sample_rate)

Now we load our downloaded checkpoint and define some parameters for the LM sampling:

In [None]:
sampler = UnitLanguageModelSampler.from_pretrained(model_name_or_path="LM/hubert200_lm")

In [None]:
sampling_kwargs = {
    "temperature": 0.7,
    "sampling": True,
    "beam": 1,
    "prefix_size": -1,
    "max_len_a": 0.0,
    "max_len_b": 400,
}
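To get a feel for two of these parameters, here is a small sketch in plain Python. It is not the sampler's code: it assumes that "temperature" divides the logits before the softmax, and that the length limit follows the fairseq-style rule `max_len_a * source_length + max_len_b`. The logits and the prompt length are made-up example values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_generated_length(source_len, max_len_a, max_len_b):
    """Assumed length cap: max_len_a * source_len + max_len_b."""
    return int(max_len_a * source_len + max_len_b)


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three units
plain = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.7)

# A temperature below 1 puts more probability mass on the top-scoring unit.
print(plain[0] < sharp[0])  # True

# With max_len_a = 0.0 the cap does not depend on the prompt length.
print(max_generated_length(100, 0.0, 400))  # 400
```

Lower temperatures make the continuation more conservative, while higher ones make it more varied.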

In [None]:
# It is a fairseq-based language model, so it accepts text strings as an input.
unit_str = " ".join(list(map(str, units.tolist())))
sampled_unit_str = sampler.sample([unit_str], **sampling_kwargs)[0]
continuation = torch.tensor([int(x) for x in sampled_unit_str.split()]).cuda()
continuation

  unfin_idx = idx // beam_size
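The conversion between unit sequences and the space-separated strings the language model consumes can be checked in isolation. The sketch below uses plain Python lists instead of tensors; the helper names are hypothetical, and the `strip_prompt` helper only applies if the sampled string happens to start with the prompt, which is an assumption about the sampler's output rather than something stated above.

```python
def units_to_str(units):
    """Serialise a unit sequence as a space-separated string."""
    return " ".join(str(u) for u in units)


def str_to_units(unit_str):
    """Parse a space-separated string back into integer units."""
    return [int(tok) for tok in unit_str.split()]


def strip_prompt(sampled_units, prompt_units):
    """Drop the prompt prefix if present; otherwise return the input unchanged."""
    n = len(prompt_units)
    if sampled_units[:n] == prompt_units:
        return sampled_units[n:]
    return sampled_units


prompt_units = [14, 131, 191, 11, 22]
unit_str = units_to_str(prompt_units)
print(unit_str)  # 14 131 191 11 22

# The round trip is lossless.
assert str_to_units(unit_str) == prompt_units

# Hypothetical sampled output that begins with the prompt.
sampled = str_to_units(unit_str + " 125 10 154")
print(strip_prompt(sampled, prompt_units))  # [125, 10, 154]
```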


Now, given this unit-level continuation, we can vocode it into audio:

In [None]:
resynth_continuation = vocoder(continuation).cpu().numpy()[:10 * vocoder.output_sample_rate]
ipd.Audio(resynth_continuation, rate=vocoder.output_sample_rate)
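Both the 5-second prompt and the 10-second trim above rely on the same seconds-to-samples arithmetic. Here is a minimal sketch of it with a flat Python list standing in for a waveform; the helper names are hypothetical, and 16000 Hz is used only as an example rate (it is the LibriSpeech sampling rate, while the vocoder's output rate should be read from `vocoder.output_sample_rate`).

```python
def seconds_to_samples(seconds, sample_rate):
    """Number of samples covering the given duration."""
    return int(round(seconds * sample_rate))


def trim(samples, seconds, sample_rate):
    """Keep at most the first `seconds` of a 1-D sample sequence."""
    return samples[:seconds_to_samples(seconds, sample_rate)]


sample_rate = 16000  # example rate; LibriSpeech audio is sampled at 16 kHz
waveform = [0.0] * (7 * sample_rate)  # 7 seconds of silence as a stand-in

prompt = trim(waveform, 5, sample_rate)
print(len(prompt))  # 80000

# Asking for more than is available leaves the signal unchanged.
print(len(trim(waveform, 10, sample_rate)))  # 112000
```

For a tensor with a leading channel dimension, as returned by torchaudio, the same slice is applied along the last axis, as in `audio[:, :input_sample_rate * 5]`.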