Extracting semantic tokens #3
I tried using k-means to quantize the Whisper embeddings but the results are terrible. Even with 2048 clusters (4x what Google used) the quantization error is barely smaller than the activation magnitudes, which suggests we destroy basically all of the information in the embeddings. I was able to train a model on these quantized embeddings but it does not generalize at all (it generates complete noise on the validation set). I think the over-quantized embeddings are acting only as a strange positional embedding, and the model basically memorizes the acoustic tokens at each position in the training-set samples. My next idea is to distill a frozen Whisper model with an added VQ bottleneck between the encoder and decoder. The loss function should help us drop the redundant data.
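The failure mode above can be checked numerically: fit k-means on the encoder activations and compare the mean quantization error against the mean activation norm. A minimal sketch, using random data in place of real Whisper encoder output (the shapes and cluster count are illustrative, not the ones used in the experiment):

```python
# Hypothetical sketch: quantify how much k-means quantization destroys,
# by comparing reconstruction error to activation magnitude.
# `embeddings` stands in for Whisper encoder activations (random here).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 384)).astype(np.float32)  # fake (T, dim) activations

km = KMeans(n_clusters=64, n_init=1, random_state=0).fit(embeddings)
reconstructed = km.cluster_centers_[km.labels_]

quant_err = np.linalg.norm(embeddings - reconstructed, axis=-1).mean()
act_mag = np.linalg.norm(embeddings, axis=-1).mean()

# A ratio close to 1 means the codebook preserves almost nothing.
print(f"relative quantization error: {quant_err / act_mag:.2f}")
```

On high-dimensional, near-Gaussian activations this ratio stays close to 1 even for large codebooks, which matches the "barely smaller than activation magnitudes" observation.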
Ok, distillation and Residual Quantization seem to be the way to go, but this still needs more data and longer training: notebook (towards the end, under "Residual quantization test"). OTOH, with RQ we are increasing the token count quite substantially. I wonder whether it would work to use the embeddings as-is, without any quantization. SPEAR-TTS needed quantization to force the w2v-BERT model to ignore unnecessary information when trained in an unsupervised manner. Whisper was trained supervised against text, so maybe there is no need for any further bottlenecks?
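For readers unfamiliar with residual quantization: each stage quantizes the residual left over by the previous stage, so later codebooks capture progressively finer detail, at the cost of emitting one token per stage (hence the higher token count mentioned above). A minimal NumPy sketch with random codebooks (a real RQ learns them):

```python
# Minimal residual-quantization (RQ) encoder sketch.
# Each stage picks the nearest codebook entry for the *residual*
# of the previous stage and subtracts it.
import numpy as np

def rq_encode(x, codebooks):
    """Quantize vectors x (N, D) with a list of codebooks, each (K, D)."""
    codes, residual = [], x.copy()
    for cb in codebooks:
        # nearest codebook entry for the current residual
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        residual = residual - cb[idx]  # next stage only sees what is left
    return np.stack(codes, axis=1), residual

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16)).astype(np.float32)
codebooks = [rng.normal(size=(32, 16)).astype(np.float32) for _ in range(4)]

codes, residual = rq_encode(x, codebooks)
print(codes.shape)  # (256, 4): one token per quantizer stage
```

With 4 stages, every frame costs 4 tokens instead of 1, which is the token-count blowup the comment worries about.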
I've added the training code for the semantic token quantization (RQ) model. It turns out learning-rate warmup does wonders for training transformers. ;) It's still far from optimally tuned, but the quantized tokens work pretty well, so I'll get back to optimizing it once the rest of the pipeline is working.

Ground truth (Whisper tiny decoder): [audio sample]

Decoding the RQ quantized embeddings: [audio sample]
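The warmup trick mentioned above is usually a linear ramp to the peak learning rate followed by a decay. A plain-Python sketch of one common variant (linear warmup, then inverse-square-root decay); the step counts and peak LR here are made up, not the values used in this repo:

```python
# Linear warmup + inverse-square-root decay schedule.
# Warmup keeps early transformer updates small, which stabilizes training.
def lr_at(step, peak_lr=3e-4, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # ramp up from ~0
    return peak_lr * (warmup_steps / (step + 1)) ** 0.5  # then decay

print(lr_at(0))     # tiny LR at the very first step
print(lr_at(999))   # reaches the peak at the end of warmup
print(lr_at(4000))  # roughly half the peak after 4x the warmup length
```

In practice this would be plugged into the optimizer each step (e.g. via a PyTorch `LambdaLR`), but the schedule itself is framework-agnostic.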
Not sure if this is applicable here, or if you have already checked this repo out before, but it might be helpful 🤷♂️: WhisperX
Hey @crypticsymmetry, thanks for the comment. I've seen WhisperX when looking for higher-precision timestamps but I am not sure how it could help me... Would you mind elaborating a bit?
Btw. I've found a pretty serious issue with this model – I am not recreating the positional embeddings after the quantization block, which means my tokens had to carry positional information. This is actually counterproductive for generalization and also requires a large codebook and multiple quantizers (RQ). In some early tests on smaller data I managed to reduce the bottleneck to a single quantizer with 1024 codes and got similar Whisper decoder performance. I will update the VQ/RQ notebooks in a few days.
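The fix described above amounts to re-adding positional embeddings right after the bottleneck, so the discrete tokens no longer have to encode position themselves. A sketch, assuming standard sinusoidal positions (the shapes and the single 1024-code codebook follow the comment; everything else is illustrative):

```python
# Sketch: re-add sinusoidal positional embeddings after dequantizing,
# so VQ codes only need to carry content, not position.
import numpy as np

def sinusoidal_positions(length, dim):
    pos = np.arange(length)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((length, dim), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def dequantize(tokens, codebook, add_pos=True):
    x = codebook[tokens]  # (T, D) embeddings looked up from codes
    if add_pos:
        x = x + sinusoidal_positions(len(tokens), x.shape[-1])  # restore position
    return x

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 384)).astype(np.float32)  # single 1024-code quantizer
tokens = rng.integers(0, 1024, size=150)
decoded = dequantize(tokens, codebook)
print(decoded.shape)  # (150, 384)
```

Because position is injected deterministically downstream, the codebook no longer wastes capacity memorizing where each token sits in the sequence, which is why a single small quantizer became sufficient.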
I've pushed the RQ notebook and the new model. Now training an S->A model.
I trained another quantization model on more data. This turns out to be quite challenging. Fortunately, the current model should be good enough for quite a while. We'll revisit this once we move to a bigger Whisper model for the final training.
@jpc It seems the project uses quantized Whisper embeddings (the encoder output) tokenized with RQ. Could you explain why you chose this instead of the plain encoder output? As you mentioned in a comment,
Hey, sorry for the long delay. I chose to quantize the embeddings to make sure they only keep the minimum information required to predict the text. This regularization should make them more predictable and easier to work with when we do T2S (text to semantic embeddings) modeling. The unquantized embeddings are also a lot bigger (384 or 768 FP16 values vs. a single int16 per semantic token) and don't compress well (the quantized embeddings compress 10x with …). At some point I did some tests using the original embeddings straight from Whisper when training the S2A model and they seemed to work just as well in this case, but I have not done a proper comparison.
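The size argument above is easy to make concrete: per frame, a raw embedding is 384 (or 768) FP16 values, while a semantic token is a single int16 code. A back-of-the-envelope sketch (the frame count is illustrative):

```python
# Size comparison: raw Whisper encoder embeddings vs. quantized semantic
# tokens, before any entropy coding.
import numpy as np

frames = 1500                                   # roughly 30 s of encoder output
embeddings = np.zeros((frames, 384), dtype=np.float16)  # 384 FP16 values/frame
tokens = np.zeros(frames, dtype=np.int16)               # 1 int16 code/frame

print(embeddings.nbytes // tokens.nbytes)  # → 384: embeddings are 384x bigger
```

On top of this fixed 384x (or 768x) factor, a real token stream is highly redundant and compresses further with a generic codec, whereas near-Gaussian float embeddings barely compress at all, which is the "don't compress well" point in the comment.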
I've experimented with extracting semantic tokens from Whisper and there are two challenges: