
Extract audio representations for future use #21

Open
MorenoLaQuatra opened this issue Jun 10, 2023 · 13 comments

Comments

@MorenoLaQuatra

Hi,

First of all, thank you all for the impressive work and for making the code and models available to the community. I would like to use the SSAST models to extract audio embeddings. Specifically, I'm interested in writing a script that accomplishes the following:

  1. Convert the waveform to a spectrogram using specific parameters.
  2. Perform a forward pass through the SSAST model to obtain embeddings for each "frame" or "patch". I would like to have N embeddings instead of just the average.

In previous issues, you pointed out some mean/variance normalization and a way to extract average-pooled tokens (e.g., commenting out the mlp head from the ASTModel forward). Do you have any suggestions about how to do points (1) and (2)?

Thanks

MorenoLaQuatra changed the title from "Extract representation for future use" to "Extract audio representations for future use" on Jun 10, 2023
@YuanGongND (Owner)

May I ask if you wish to pretrain SSAST yourself or use one of our checkpoints?

@MorenoLaQuatra (Author)

I want to use the provided checkpoints. I don't want to fine-tune or pre-train.

@YuanGongND (Owner)

Got it. Just a reminder that the self-supervised checkpoints are not comparable with fine-tuned checkpoints for almost all downstream tasks.

I guess what you need to do is return right before the mean pooling (or before taking the [CLS] token):

x = torch.mean(x[:, self.cls_token_num:, :], dim=1)

or

if self.cls_token_num == 2:
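
For concreteness, here is a minimal, untested sketch of the idea: modify the fine-tuning forward so it returns the token sequence right before that pooling, then call the model on a normalized fbank. The import path and the task argument name below are assumptions; adapt them to your local copy of the repo.

```python
import torch
from models import ASTModel  # assumption: adjust the import to the SSAST repo layout

# In ast_models.py, instead of
#     x = torch.mean(x[:, self.cls_token_num:, :], dim=1)
# return the per-patch tokens:
#     return x[:, self.cls_token_num:, :]   # (batch, n_patches, embed_dim)

model = ASTModel(label_dim=2, fshape=16, tshape=16, fstride=16, tstride=16,
                 input_tdim=128, pretrain_stage=False,
                 load_pretrained_mdl_path="./model/SSAST-Base-Patch-400.pth")
model.eval()

fbank = torch.randn(1, 128, 128)  # (batch, time_frames, mel_bins), already normalized
with torch.no_grad():
    # 'ft_avgtok' is the fine-tuning task name used by the repo's recipes (assumption)
    patch_tokens = model(fbank, task='ft_avgtok')  # after the edit: (1, n_patches, 768)
```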

But as I said, if you wish to take the representation for your downstream task, it won't work very well as there's no finetuning.

You can also consider fine-tuned models that fit your application, e.g., for speech tasks, Whisper; for audio tasks, AudioMAE or the audio branch of the fine-tuned https://github.com/YuanGongND/cav-mae.

-Yuan

@YuanGongND (Owner) commented Jun 10, 2023

For spectrogram generation, we used

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,

But there are other ways, e.g., librosa. Note: these packages generate different outputs, so you have to stick to one.
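
Putting the pieces together, a hedged sketch of the full spectrogram step is below. The extra kaldi.fbank arguments, the mel-bin count, and the dataset mean/std are assumptions based on typical AST/SSAST recipes; verify them against the repo's dataloader for the checkpoint you use.

```python
import torch
import torchaudio

def wav_to_fbank(path, target_len=128, num_mel_bins=128, norm_mean=0.0, norm_std=1.0):
    """Waveform -> Kaldi-style log-mel filterbank, padded/cropped to target_len frames.
    num_mel_bins, norm_mean, norm_std are placeholders; use the values that match
    the checkpoint you load (see the repo's dataloader)."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform - waveform.mean()
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=num_mel_bins, dither=0.0, frame_shift=10)
    # pad or truncate along the time axis to a fixed number of frames
    n_frames = fbank.shape[0]
    if n_frames < target_len:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_len - n_frames))
    else:
        fbank = fbank[:target_len, :]
    # dataset-level mean/variance normalization (check the exact scheme used in the repo)
    return (fbank - norm_mean) / norm_std   # (target_len, num_mel_bins)
```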

@MorenoLaQuatra (Author)

Thank you so much for this information. I'll fine-tune the "head only" on top of the extracted representations, since I want to get an idea of their quality. Just for comparison, is there any checkpoint comparable to the first one in this list: https://github.com/YuanGongND/ast#pretrained-models? If I got it correctly, they are pre-trained (not fine-tuned) on AudioSet, right?

@YuanGongND (Owner) commented Jun 11, 2023

I'll fine-tune the "head only" on top of the extracted representations, since I want to get an idea of their quality.

This is usually referred to as "linear probing", which is a common way to evaluate a model's representations, but in most cases it is (a lot) worse than end-to-end fine-tuning (all parameters trainable). Specifically for SSAST, all results shown in the paper are from end-to-end fine-tuning.
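
For reference, linear probing trains only a classification head on top of frozen features. A minimal sketch of the training step is below; extract_embedding is a placeholder for a frozen SSAST forward pass returning a pooled (batch, embed_dim) representation, not an API of this repo.

```python
import torch
import torch.nn as nn

embed_dim, n_classes = 768, 35           # example sizes, adjust to your task
probe = nn.Linear(embed_dim, n_classes)  # the only trainable module
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(extract_embedding, fbank_batch, labels):
    with torch.no_grad():                        # the SSAST backbone stays frozen
        feats = extract_embedding(fbank_batch)   # (batch, embed_dim)
    logits = probe(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```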

Just for comparison, is there any checkpoint comparable to the first one in this list: https://github.com/YuanGongND/ast#pretrained-models? If I got it correctly, they are pre-trained (not fine-tuned) on AudioSet, right?

No, AST is fine-tuned on AudioSet with supervision (it has seen labels during training), while all checkpoints in this repo are pretrained in a self-supervised way (they haven't seen labels). This is a significant difference. I expect much better linear probing results for AST.

-Yuan

@MorenoLaQuatra (Author)

That's the case indeed. I want to evaluate audio representations using probing (mostly linear). So, from what I understand, there is no way to do it with AST, only with SSAST?

Just to set the context, this would be a fair evaluation if you want to compare against this kind of model (e.g., wav2vec 2.0), right?

@YuanGongND (Owner) commented Jun 12, 2023

So, from what I understand, there is no way to do it with AST, only with SSAST?

Sorry, I didn't make it clear. I actually meant that AST would be better than SSAST for linear probing. Please check my previous reply.

Just to set the context, this would be a fair evaluation if you want to compare against this kind of model (e.g., wav2vec 2.0), right?

wav2vec2-base is analogous to SSAST: it is not fine-tuned, and you cannot use it for ASR.

wav2vec2-base-100h is analogous to AST: it is fine-tuned, and you can use it for ASR directly.

Another major difference between wav2vec and AST/SSAST is the task: wav2vec focuses on speech and should be better for speech tasks, while AST/SSAST is stronger on general audio event recognition. Please see Table 5 of the paper.

-Yuan

@MorenoLaQuatra (Author)

Sorry for the misunderstanding; what I actually meant is that they are not comparable (not that we cannot use AST for linear probing). Thanks for all the clarifications. Sure, I know that they are trained using different pre-training objectives and different data collections (AudioSet vs. speech-related datasets).

What I meant in my previous reply was that there is no way to do a "fair" comparison between AST and SSAST, since there is no pre-trained AST without fine-tuning, am I right?

@YuanGongND (Owner)

What I meant in my previous reply was that there is no way to do a "fair" comparison between AST and SSAST, since there is no pre-trained AST without fine-tuning, am I right?

Yes, that is what I meant. AST is pretrained on ImageNet, a vision dataset; it sounds weird but actually works quite well. Closing the gap between self-supervised models and supervised models is a goal of the research community, but I think we are not there yet.

@sreenivasaupadhyaya

@MorenoLaQuatra and @YuanGongND

Thanks for the discussion.
@MorenoLaQuatra Could you please guide me on whether you managed to get the audio embeddings using AST or SSAST?
My application is also linear probing for audio event classification, and I would like to evaluate the models' embeddings for my case.

Regards,
Srini

@Young973

@YuanGongND Mr. Gong, I encounter a problem as well when extracting audio representations. My model is self.ast_model = ASTModel(label_dim=2, fshape=16, tshape=16, fstride=16, tstride=16, input_tdim=128, pretrain_stage=False, load_pretrained_mdl_path="./model/SSAST-Base-Patch-400.pth")
My input is (24, 128, 128), while the output before line 262 is (24, 66, 768). I understand that 768 may be the hidden dimension of the transformer, but where does the 66 come from? How can I keep the output length the same as the input length?

@YuanGongND (Owner) commented Jul 25, 2023

Hi there,

Your input [24, 128, 128] means batch size 24, an input sequence length of 128 time frames, and 128 features (mel fbanks) per frame. I.e., your input spectrogram dim is 128 x 128.

The output of the model is also a sequence, but of flattened patches (in your case 16 x 16), so (128 x 128) / (16 x 16) = 64 is the actual patch sequence length. There are two prefix ([CLS]) tokens, so 66 in total.

Therefore the input length is the number of time frames and the output length is the number of flattened patches; they are totally different. To make them similar, you would need patch size fshape=128, tshape=1, fstride=128, tstride=1 (i.e., your patch shape is 128 x 1, one time frame); then the output would be something close to [24, 128 + 2 (cls tokens), 768]. Note we haven't tested this setting; the closest is the 128 x 2 patch, where the output would be [24, 128/2 + 2, 768].
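
As a quick sanity check of that arithmetic, here is a small hypothetical helper (not a function from this repo) for non-overlapping patches, i.e., stride equal to patch shape:

```python
def ssast_output_len(time_frames, mel_bins, fshape, tshape, fstride, tstride, n_cls_tokens=2):
    """Number of output tokens for a (time_frames x mel_bins) spectrogram."""
    n_patches_t = (time_frames - tshape) // tstride + 1
    n_patches_f = (mel_bins - fshape) // fstride + 1
    return n_patches_t * n_patches_f + n_cls_tokens

print(ssast_output_len(128, 128, 16, 16, 16, 16))   # 8 * 8 + 2 = 66
print(ssast_output_len(128, 128, 128, 2, 128, 2))   # 64 + 2 = 66  (the 128 x 2 patch)
print(ssast_output_len(128, 128, 128, 1, 128, 1))   # 128 + 2 = 130 (the 128 x 1 patch)
```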

-Yuan
