Extracting semantic tokens #3

Closed
jpc opened this issue Feb 20, 2023 · 10 comments
Labels
goal Main sub-tasks of the project


jpc commented Feb 20, 2023

I've experimented with extracting semantic tokens from Whisper and there are two challenges (a sketch of pulling out the encoder activations follows the list):

  1. Whisper does not use fixed windows during speech-to-text decoding. Instead, it uses a sliding-window approach in which each new window starts at the end of the last complete utterance. This probably avoids edge effects and improves quality a lot, but it complicates the encoding because you need to run the full decoding process.
  2. Whisper does not have any bottlenecks in the model, which means there is almost no incentive for the network to drop information. As shown by @sidhantls in his blog post on speaker identification in Lex Fridman podcasts, the last encoder layer retains enough information to perform binary speaker identification (Lex vs. all other speakers) with 80% accuracy.
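For reference, here is a minimal sketch of how the encoder activations could be pulled out with the openai-whisper package; the file name is illustrative, and the exact layers and window handling used in this project may differ:

```python
import torch
import whisper

# Minimal sketch: run only the Whisper encoder on one 30-second window
# and keep its final-layer activations as candidate "semantic" embeddings.
model = whisper.load_model("tiny")

audio = whisper.load_audio("sample.flac")        # hypothetical file name
audio = whisper.pad_or_trim(audio)               # fixed 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    embs = model.encoder(mel.unsqueeze(0))       # (1, 1500, n_audio_state)
print(embs.shape)
```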

jpc commented Feb 20, 2023

I tried using k-means to quantize the Whisper embeddings but the results are terrible. Even with 2048 clusters (4x what Google used) the quantization error is barely smaller than the activation magnitudes, which suggests we destroy basically all of the information in the embeddings.

I was able to train a model on these quantized embeddings but it does not generalize at all (it generates complete noise on the validation set). I think the over-quantized embeddings act only as a strange kind of positional embedding, and the model basically memorizes the acoustic tokens at each position in the training-set samples.
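For concreteness, a rough sketch of the quantization-error check described above, assuming the embeddings have already been dumped to disk (the file name and choice of k-means implementation are illustrative, not the actual code):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# `embs` is assumed to be an (N, dim) array of Whisper encoder activations.
embs = np.load("whisper_encoder_embeddings.npy")   # hypothetical file

km = MiniBatchKMeans(n_clusters=2048, batch_size=8192, n_init=3).fit(embs)
recon = km.cluster_centers_[km.predict(embs)]

quant_err = np.linalg.norm(embs - recon, axis=-1).mean()
act_mag = np.linalg.norm(embs, axis=-1).mean()
print(f"quantization error / activation magnitude: {quant_err / act_mag:.2f}")
# a ratio close to 1.0 means the codebook preserves almost none of the signal
```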

My next idea is to distill a frozen Whisper model with an added VQ bottleneck between encoder and decoder. The loss function should help us drop the redundant data.
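A minimal sketch of that idea, assuming lucidrains' vector-quantize-pytorch package for the VQ layer (the hyperparameters and function names here are illustrative, not the project's actual training code):

```python
import torch
import torch.nn.functional as F
import whisper
from vector_quantize_pytorch import VectorQuantize  # assumed dependency

# Freeze Whisper, insert a VQ bottleneck on top of the encoder output, and
# train only the bottleneck against the frozen decoder's text loss.
model = whisper.load_model("tiny")
for p in model.parameters():
    p.requires_grad = False

vq = VectorQuantize(dim=model.dims.n_audio_state, codebook_size=1024)

def distill_step(mel, text_tokens):
    # mel: batched log-mel spectrogram, text_tokens: teacher-forced token ids
    with torch.no_grad():
        embs = model.encoder(mel)                # (batch, 1500, dim)
    quantized, codes, commit_loss = vq(embs)     # bottlenecked embeddings
    logits = model.decoder(text_tokens, quantized)
    ce = F.cross_entropy(logits[:, :-1].transpose(1, 2), text_tokens[:, 1:])
    return ce + commit_loss.mean()
```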

jpc added the goal (Main sub-tasks of the project) label Feb 20, 2023

jpc commented Feb 24, 2023

Ok, distillation and Residual Quantization seem to be the way to go, but it still needs more data and longer training; see the notebook (towards the end, under "Residual quantization test"). OTOH, with RQ we increase the token count quite substantially.

I wonder whether it would work to use the embeddings as-is, without any quantization. SPEAR-TTS needed quantization to force the w2v-BERT model to ignore unnecessary information because it was trained in an unsupervised manner. Whisper was trained with supervision against text, so maybe there is no need for any further bottlenecks?
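For readers unfamiliar with residual quantization, a tiny sketch using lucidrains' vector-quantize-pytorch (the dimensions and codebook sizes are illustrative, not the values used in the notebook):

```python
import torch
from vector_quantize_pytorch import ResidualVQ  # assumed dependency

# Residual quantization: each quantizer encodes the residual left by the
# previous one, so every frame is described by `num_quantizers` codes
# instead of one -- better reconstruction, but more tokens per frame.
rq = ResidualVQ(dim=384, num_quantizers=4, codebook_size=512)

embs = torch.randn(1, 1500, 384)                # stand-in for encoder output
quantized, codes, commit_losses = rq(embs)      # codes: (1, 1500, 4)
```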


jpc commented Feb 28, 2023

I've added the training code for the semantic token quantization (RQ) model. It turns out learning rate warmup does wonders for training transformers. ;)

It's still far from optimally tuned, but the quantized tokens work pretty well, so I'll get back to optimizing it once the rest of the pipeline is working.
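For reference, a minimal linear-warmup schedule in PyTorch (the model, optimizer, and step counts are placeholders, not the actual training configuration):

```python
import torch

model = torch.nn.Linear(384, 384)                    # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
warmup_steps = 1000

# Ramp the learning rate linearly from ~0 up to the base LR, then hold it;
# this avoids the unstable first steps that often derail transformer training.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(10_000):
    # ... forward/backward on a batch, then:
    opt.step()
    sched.step()
    opt.zero_grad()
```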

Ground truth (Whisper tiny decoder):

Richard Corrie by Edwin Arlington Robinson, read for librovox.org by Fox and the stars of shininghalf.com. Whenever Richard Corrie went downtown, we people on the pavement looked at him. He He was a gentleman from Seoul to crown, clean favored, and impurially slim.<|endoftext|>

Decoding the RQ quantized embeddings:

<|nospeech|> Richard Coryrie by Edwin Arlington Robinson, read for Libroox.org by Fox in the Stars of shining half.com. Whenever Richard Corrie went downtown, we people on the pavement looked at him. He He was a gentleman from sold to Crown, clean favored and and impurially slim.<|endoftext|>

jpc mentioned this issue Feb 28, 2023
crypticsymmetry commented

Not sure if this is applicable here, or if you have already checked this repo out before, but it might be helpful 🤷‍♂️ WhisperX


jpc commented Mar 14, 2023

Hey @crypticsymmetry, thanks for the comment. I've seen WhisperX when looking for higher-precision timestamps, but I am not sure how it could help me... Would you mind elaborating a bit?


jpc commented Mar 14, 2023

Btw, I've found a pretty serious issue with this model – I was not recreating the positional embeddings after the quantization block, which means my tokens had to carry the positional information themselves. This is counterproductive for generalization and also requires a large codebook and multiple quantizers (RQ).

In some early tests on smaller data I managed to reduce the bottleneck to a single quantizer with 1024 codes and got similar Whisper decoder performance. I will update the VQ/RQ notebooks in a few days.
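A sketch of the fix, reusing the same sinusoidal positional-embedding construction that Whisper's encoder uses (the surrounding bottleneck function is hypothetical, not the actual model code):

```python
import torch

def sinusoids(length, channels, max_timescale=10000):
    """Sinusoidal positional embeddings, as constructed in Whisper's encoder."""
    assert channels % 2 == 0
    log_inc = torch.log(torch.tensor(float(max_timescale))) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_inc * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None] * inv_timescales[None, :]
    return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)

# Hypothetical forward pass around the bottleneck: re-add positions after VQ,
# so the codes themselves no longer have to carry "where am I" information.
def bottleneck(embs, vq):
    quantized, codes, commit_loss = vq(embs)          # positions get dropped here
    pos = sinusoids(quantized.shape[1], quantized.shape[2])
    return quantized + pos.to(quantized.device), codes, commit_loss
```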


jpc commented Mar 15, 2023

I've pushed the RQ notebook and the new model. Now training an S->A model.


jpc commented Mar 29, 2023

I trained another quantization model on more data. This turns out to be quite challenging since FLAC decoding and running the Whisper encoder are quite time-consuming, and at the same time the encoder outputs (which are the input to the quantization model) take up a lot of space (5x more than the FLAC files).

Fortunately, the current model should be good enough for quite a while. We'll revisit this once we move to a bigger Whisper model for the final training.

jpc closed this as completed Mar 29, 2023
seastar105 commented

@jpc It seems the project uses tokens from RQ-quantized Whisper embeddings (the output of the encoder). Could you explain why you chose that instead of the plain encoder output?

As you mentioned in an earlier comment ("I wonder whether it would work to use the embeddings as-is, without any quantization"), I'm wondering about that too: is Whisper's encoder good at capturing content information, especially in a multilingual setting?


jpc commented May 23, 2023

Hey, sorry for the long delay. I chose to quantize the embeddings to make sure they keep only the minimum information needed to predict the text. This regularization should make them more predictable and easier to work with when we do T2S (text to semantic embeddings) modeling. The unquantized embeddings are also a lot bigger (384 or 768 FP16 values vs. a single int16 per semantic token) and don't compress well (the quantized tokens compress 10x with xz).

At some point I did some tests using the original embeddings straight from Whisper when training the S2A model and they seemed to work just as well in this case, but I have not done a proper comparison.
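To put rough numbers on the size difference, a back-of-the-envelope comparison assuming the encoder's 1500 output frames per 30-second window and one code per frame (both assumptions, not figures from the project):

```python
# One 30-second window: 1500 encoder frames, model width 384 (768 for the larger variant).
frames = 1500
dim = 384

unquantized_bytes = frames * dim * 2   # FP16 values per frame
token_bytes = frames * 2               # a single int16 code per frame

print(unquantized_bytes // 1024, "KiB vs", token_bytes // 1024, "KiB")
# -> 1125 KiB vs 2 KiB, before any further compression such as xz
```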
