Extracting semantic tokens #3

Closed
jpc opened this issue Feb 20, 2023 · 10 comments
Labels
goal Main sub-tasks of the project


jpc commented Feb 20, 2023

I've experimented with extracting semantic tokens from Whisper and there are two challenges (a sketch of pulling out the encoder activations follows the list):

  1. Whisper does not use fixed windows during speech-to-text decoding. Instead, it uses a sliding-window approach in which each new window starts at the end of the last complete utterance. This probably avoids edge effects and improves quality a lot, but it complicates the encoding because you need to run the full decoding process.
  2. Whisper does not have any bottlenecks in the model, which means there is almost no incentive for the network to drop information. As shown by @sidhantls in his blog post on speaker identification in Lex Fridman podcasts, the last encoder layer retains enough information to perform binary speaker identification (Lex vs. all other speakers) with 80% accuracy.
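For reference, here is a minimal sketch of how the encoder activations could be pulled out with the openai-whisper package; the file name is illustrative, and the exact layers and window handling used in this project may differ:

```python
import torch
import whisper

# Minimal sketch: run only the Whisper encoder on one 30-second window
# and keep its final-layer activations as candidate "semantic" embeddings.
model = whisper.load_model("tiny")

audio = whisper.load_audio("sample.flac")        # hypothetical file name
audio = whisper.pad_or_trim(audio)               # fixed 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    embs = model.encoder(mel.unsqueeze(0))       # (1, 1500, n_audio_state)
print(embs.shape)
```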

jpc commented Feb 20, 2023

I tried using k-means to quantize the Whisper embeddings but the results are terrible. Even with 2048 clusters (4x what Google used) the quantization error is barely smaller than the activation magnitudes, which suggests we destroy basically all of the information in the embeddings.

I was able to train a model on these quantized embeddings but it does not generalize at all (it generates complete noise on the validation set). I think the over-quantized embeddings act only as a strange kind of positional embedding, and the model basically memorizes the acoustic tokens at each position in the training-set samples.
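For concreteness, a rough sketch of the quantization-error check described above, assuming the embeddings have already been dumped to disk (the file name and choice of k-means implementation are illustrative, not the actual code):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# `embs` is assumed to be an (N, dim) array of Whisper encoder activations.
embs = np.load("whisper_encoder_embeddings.npy")   # hypothetical file

km = MiniBatchKMeans(n_clusters=2048, batch_size=8192, n_init=3).fit(embs)
recon = km.cluster_centers_[km.predict(embs)]

quant_err = np.linalg.norm(embs - recon, axis=-1).mean()
act_mag = np.linalg.norm(embs, axis=-1).mean()
print(f"quantization error / activation magnitude: {quant_err / act_mag:.2f}")
# a ratio close to 1.0 means the codebook preserves almost none of the signal
```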

My next idea is to distill a frozen Whisper model with an added VQ bottleneck between encoder and decoder. The loss function should help us drop the redundant data.
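A minimal sketch of that idea, assuming lucidrains' vector-quantize-pytorch package for the VQ layer (the hyperparameters and function names here are illustrative, not the project's actual training code):

```python
import torch
import torch.nn.functional as F
import whisper
from vector_quantize_pytorch import VectorQuantize  # assumed dependency

# Freeze Whisper, insert a VQ bottleneck on top of the encoder output, and
# train only the bottleneck against the frozen decoder's text loss.
model = whisper.load_model("tiny")
for p in model.parameters():
    p.requires_grad = False

vq = VectorQuantize(dim=model.dims.n_audio_state, codebook_size=1024)

def distill_step(mel, text_tokens):
    # mel: batched log-mel spectrogram, text_tokens: teacher-forced token ids
    with torch.no_grad():
        embs = model.encoder(mel)                # (batch, 1500, dim)
    quantized, codes, commit_loss = vq(embs)     # bottlenecked embeddings
    logits = model.decoder(text_tokens, quantized)
    ce = F.cross_entropy(logits[:, :-1].transpose(1, 2), text_tokens[:, 1:])
    return ce + commit_loss.mean()
```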

jpc added the goal (Main sub-tasks of the project) label Feb 20, 2023

jpc commented Feb 24, 2023

Ok, distillation and Residual Quantization seem to be the way to go, but it still needs more data and longer training; see the notebook (towards the end, under "Residual quantization test"). OTOH, with RQ we increase the token count quite substantially.

I wonder whether it would work to use the embeddings as-is, without any quantization. SPEAR-TTS needed quantization to force the w2v-BERT model to ignore unnecessary information because it was trained in an unsupervised manner. Whisper was trained with supervision against text, so maybe there is no need for any further bottlenecks?
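For readers unfamiliar with residual quantization, a tiny sketch using lucidrains' vector-quantize-pytorch (the dimensions and codebook sizes are illustrative, not the values used in the notebook):

```python
import torch
from vector_quantize_pytorch import ResidualVQ  # assumed dependency

# Residual quantization: each quantizer encodes the residual left by the
# previous one, so every frame is described by `num_quantizers` codes
# instead of one -- better reconstruction, but more tokens per frame.
rq = ResidualVQ(dim=384, num_quantizers=4, codebook_size=512)

embs = torch.randn(1, 1500, 384)                # stand-in for encoder output
quantized, codes, commit_losses = rq(embs)      # codes: (1, 1500, 4)
```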


jpc commented Feb 28, 2023

I've added the training code for the semantic token quantization (RQ) model. It turns out learning rate warmup does wonders for training transformers. ;)

It's still far from optimally tuned, but the quantized tokens work pretty well, so I'll get back to optimizing it once the rest of the pipeline is working.
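For reference, a minimal linear-warmup schedule in PyTorch (the model, optimizer, and step counts are placeholders, not the actual training configuration):

```python
import torch

model = torch.nn.Linear(384, 384)                    # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 1000

# Ramp the learning rate linearly from ~0 up to the base LR, then hold it;
# this avoids the unstable first steps that often derail transformer training.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(10_000):
    # ... forward/backward on a batch, then:
    opt.step()
    sched.step()
    opt.zero_grad()
```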

Ground truth (Whisper tiny decoder):

Richard Corrie by Edwin Arlington Robinson, read for librovox.org by Fox and the stars of shininghalf.com. Whenever Richard Corrie went downtown, we people on the pavement looked at him. He He was a gentleman from Seoul to crown, clean favored, and impurially slim.<|endoftext|>

Decoding the RQ quantized embeddings:

<|nospeech|> Richard Coryrie by Edwin Arlington Robinson, read for Libroox.org by Fox in the Stars of shining half.com. Whenever Richard Corrie went downtown, we people on the pavement looked at him. He He was a gentleman from sold to Crown, clean favored and and impurially slim.<|endoftext|>

jpc mentioned this issue Feb 28, 2023
crypticsymmetry commented

Not sure if this is applicable here, or if you have already checked this repo out before, but it might be helpful 🤷‍♂️ WhisperX


jpc commented Mar 14, 2023

Hey @crypticsymmetry, thanks for the comment. I've seen WhisperX when looking for higher-precision timestamps, but I am not sure how it could help me... Would you mind elaborating a bit?


jpc commented Mar 14, 2023

Btw, I've found a pretty serious issue with this model – I was not recreating the positional embeddings after the quantization block, which means my tokens had to carry the positional information themselves. This is counterproductive for generalization and also requires a large codebook and multiple quantizers (RQ).

In some early tests on smaller data I managed to reduce the bottleneck to a single quantizer with 1024 codes and got similar Whisper decoder performance. I will update the VQ/RQ notebooks in a few days.
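A sketch of the fix, reusing the same sinusoidal positional-embedding construction that Whisper's encoder uses (the surrounding bottleneck function is hypothetical, not the actual model code):

```python
import torch

def sinusoids(length, channels, max_timescale=10000):
    """Sinusoidal positional embeddings, as constructed in Whisper's encoder."""
    assert channels % 2 == 0
    log_inc = torch.log(torch.tensor(float(max_timescale))) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_inc * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None] * inv_timescales[None, :]
    return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)

# Hypothetical forward pass around the bottleneck: re-add positions after VQ,
# so the codes themselves no longer have to carry "where am I" information.
def bottleneck(embs, vq):
    quantized, codes, commit_loss = vq(embs)          # positions get dropped here
    pos = sinusoids(quantized.shape[1], quantized.shape[2])
    return quantized + pos.to(quantized.device), codes, commit_loss
```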


jpc commented Mar 15, 2023

I've pushed the RQ notebook and the new model. Now training an S->A model.


jpc commented Mar 29, 2023

I trained another quantization model on more data. This turns out to be quite challenging since FLAC decoding and running the Whisper encoder are quite time-consuming, and at the same time the encoder outputs (which are the input to the quantization model) take up a lot of space (5x more than the FLAC files).

Fortunately, the current model should be good enough for quite a while. We'll revisit this once we move to a bigger Whisper model for the final training.

jpc closed this as completed Mar 29, 2023
seastar105 commented

@jpc It seems the project uses tokens from RQ-quantized Whisper embeddings (the output of the encoder). Could you explain why you chose that instead of the plain encoder output?

As you mentioned in an earlier comment ("I wonder whether it would work to use the embeddings as-is, without any quantization"), I'm wondering about that too: is Whisper's encoder good at capturing content information, especially in a multilingual setting?


jpc commented May 23, 2023

Hey, sorry for the long delay. I chose to quantize the embeddings to make sure they keep only the minimum information needed to predict the text. This regularization should make them more predictable and easier to work with when we do T2S (text to semantic embeddings) modeling. The unquantized embeddings are also a lot bigger (384 or 768 FP16 values vs. a single int16 per semantic token) and don't compress well (the quantized tokens compress 10x with xz).

At some point I did some tests using the original embeddings straight from Whisper when training the S2A model and they seemed to work just as well in this case, but I have not done a proper comparison.
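To put rough numbers on the size difference, a back-of-the-envelope comparison assuming the encoder's 1500 output frames per 30-second window and one code per frame (both assumptions, not figures from the project):

```python
# One 30-second window: 1500 encoder frames, model width 384 (768 for the larger variant).
frames = 1500
dim = 384

unquantized_bytes = frames * dim * 2   # FP16 values per frame
token_bytes = frames * 2               # a single int16 code per frame

print(unquantized_bytes // 1024, "KiB vs", token_bytes // 1024, "KiB")
# -> 1125 KiB vs 2 KiB, before any further compression such as xz
```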
