
Extract audio representations for future use #21

Open
MorenoLaQuatra opened this issue Jun 10, 2023 · 13 comments

Comments

@MorenoLaQuatra

Hi,

First of all, thank you all for the impressive work and for making the code and models available to the community. I would like to use the SSAST models to extract audio embeddings. Specifically, I'm interested in writing a script that accomplishes the following:

  1. Convert the waveform to a spectrogram using specific parameters.
  2. Perform a forward pass through the SSAST model to obtain embeddings for each "frame" or "patch". I would like to have N embeddings instead of just the average.

In previous issues, you pointed out some mean/variance normalization and a way to extract average-pooled tokens (e.g., commenting out the mlp head from the ASTModel forward). Do you have any suggestions about how to do points (1) and (2)?

Thanks

MorenoLaQuatra changed the title from "Extract representation for future use" to "Extract audio representations for future use" on Jun 10, 2023
@YuanGongND (Owner)

May I ask if you wish to pretrain SSAST yourself or use one of our checkpoints?

@MorenoLaQuatra (Author)

I want to use the provided checkpoints. I don't want to fine-tune or pre-train.

@YuanGongND (Owner)

Got it. Just a reminder that the self-supervised checkpoints are not comparable with fine-tuned checkpoints for almost all downstream tasks.

I guess what you need to do is return right before the mean pooling (or before taking the [CLS] token):

x = torch.mean(x[:, self.cls_token_num:, :], dim=1)

or

if self.cls_token_num == 2:
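
For concreteness, here is a minimal, untested sketch of the idea: modify the fine-tuning forward so it returns the token sequence right before that pooling, then call the model on a normalized fbank. The import path and the task argument name below are assumptions; adapt them to your local copy of the repo.

```python
import torch
from models import ASTModel  # assumption: adjust the import to the SSAST repo layout

# In ast_models.py, instead of
#     x = torch.mean(x[:, self.cls_token_num:, :], dim=1)
# return the per-patch tokens:
#     return x[:, self.cls_token_num:, :]   # (batch, n_patches, embed_dim)

model = ASTModel(label_dim=2, fshape=16, tshape=16, fstride=16, tstride=16,
                 input_tdim=128, pretrain_stage=False,
                 load_pretrained_mdl_path="./model/SSAST-Base-Patch-400.pth")
model.eval()

fbank = torch.randn(1, 128, 128)  # (batch, time_frames, mel_bins), already normalized
with torch.no_grad():
    # 'ft_avgtok' is the fine-tuning task name used by the repo's recipes (assumption)
    patch_tokens = model(fbank, task='ft_avgtok')  # after the edit: (1, n_patches, 768)
```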

But as I said, if you wish to take the representation for your downstream task, it won't work very well as there's no finetuning.

You can also consider fine-tuned models that fit your application, e.g., for speech tasks, Whisper; for audio tasks, AudioMAE or the audio branch of the fine-tuned https://github.com/YuanGongND/cav-mae.

-Yuan

@YuanGongND (Owner) commented Jun 10, 2023

For spectrogram generation, we used

fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,

But there are other ways, e.g., librosa. Note: these packages generate different outputs, so you have to stick to one.
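
Putting the pieces together, a hedged sketch of the full spectrogram step is below. The extra kaldi.fbank arguments, the mel-bin count, and the dataset mean/std are assumptions based on typical AST/SSAST recipes; verify them against the repo's dataloader for the checkpoint you use.

```python
import torch
import torchaudio

def wav_to_fbank(path, target_len=128, num_mel_bins=128, norm_mean=0.0, norm_std=1.0):
    """Waveform -> Kaldi-style log-mel filterbank, padded/cropped to target_len frames.
    num_mel_bins, norm_mean, norm_std are placeholders; use the values that match
    the checkpoint you load (see the repo's dataloader)."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform - waveform.mean()
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=num_mel_bins, dither=0.0, frame_shift=10)
    # pad or truncate along the time axis to a fixed number of frames
    n_frames = fbank.shape[0]
    if n_frames < target_len:
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_len - n_frames))
    else:
        fbank = fbank[:target_len, :]
    # dataset-level mean/variance normalization (check the exact scheme used in the repo)
    return (fbank - norm_mean) / norm_std   # (target_len, num_mel_bins)
```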

@MorenoLaQuatra (Author)

Thank you so much for this information. I'll fine-tune the "head only" on top of the extracted representations, since I want to get an idea of their quality. Just for comparison, is there any checkpoint comparable to the first one in this list: https://github.com/YuanGongND/ast#pretrained-models? If I got it correctly, they are pre-trained (not fine-tuned) on AudioSet, right?

@YuanGongND (Owner) commented Jun 11, 2023

I'll fine-tune the "head only" on top of the extracted representations, since I want to get an idea of their quality.

This is usually referred to as "linear probing", which is a common way to evaluate a model's representations, but in most cases it is (a lot) worse than end-to-end fine-tuning (all parameters trainable). Specifically for SSAST, all results shown in the paper are from end-to-end fine-tuning.
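
For reference, linear probing trains only a classification head on top of frozen features. A minimal sketch of the training step is below; extract_embedding is a placeholder for a frozen SSAST forward pass returning a pooled (batch, embed_dim) representation, not an API of this repo.

```python
import torch
import torch.nn as nn

embed_dim, n_classes = 768, 35           # example sizes, adjust to your task
probe = nn.Linear(embed_dim, n_classes)  # the only trainable module
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(extract_embedding, fbank_batch, labels):
    with torch.no_grad():                        # the SSAST backbone stays frozen
        feats = extract_embedding(fbank_batch)   # (batch, embed_dim)
    logits = probe(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```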

Just for comparison, is there any checkpoint comparable to the first one in this list: https://github.com/YuanGongND/ast#pretrained-models? If I got it correctly, they are pre-trained (not fine-tuned) on AudioSet, right?

No, AST is fine-tuned on AudioSet with supervision (it has seen labels during training), while all checkpoints in this repo are pretrained in a self-supervised way (they haven't seen labels). This is a significant difference. I expect much better linear probing results for AST.

-Yuan

@MorenoLaQuatra (Author)

That's the case indeed. I want to evaluate audio representations using probing (mostly linear). So, from what I understand, there is no way to do it with AST, only with SSAST?

Just to set the context, this would be a fair evaluation if you want to compare against this kind of model (e.g., wav2vec 2.0), right?

@YuanGongND (Owner) commented Jun 12, 2023

So, from what I understand, there is no way to do it with AST, only with SSAST?

Sorry, I didn't make it clear. I actually meant that AST would be better than SSAST for linear probing. Please check my previous reply.

Just to set the context, this would be a fair evaluation if you want to compare against this kind of model (e.g., wav2vec 2.0), right?

wav2vec2-base is analogous to SSAST: it is not fine-tuned, and you cannot use it for ASR.

wav2vec2-base-100h is analogous to AST: it is fine-tuned, and you can use it for ASR directly.

Another major difference between wav2vec and AST/SSAST is the task: wav2vec focuses on speech and should be better for speech tasks, while AST/SSAST is stronger on general audio event recognition. Please see Table 5 of the paper.

-Yuan

@MorenoLaQuatra (Author)

Sorry for the misunderstanding; what I actually meant is that they are not comparable (not that we cannot use AST for linear probing). Thanks for all the clarifications. Sure, I know that they are trained using different pre-training objectives and different data collections (AudioSet vs. speech-related datasets).

What I meant in my previous reply was that there is no way to do a "fair" comparison between AST and SSAST, since there is no pre-trained AST without fine-tuning, am I right?

@YuanGongND (Owner)

What I meant in my previous reply was that there is no way to do a "fair" comparison between AST and SSAST, since there is no pre-trained AST without fine-tuning, am I right?

Yes, that is what I meant. AST is pretrained on ImageNet, a vision dataset; it sounds weird but actually works quite well. Closing the gap between self-supervised models and supervised models is a goal of the research community, but I think we are not there yet.

@sreenivasaupadhyaya

@MorenoLaQuatra and @YuanGongND

Thanks for the discussion.
@MorenoLaQuatra Could you please guide me on whether you managed to get the audio embeddings using AST or SSAST?
My application is also linear probing for audio event classification, and I would like to evaluate the models' embeddings for my case.

Regards,
Srini

@Young973

@YuanGongND Mr. Gong, I encounter a problem as well when extracting audio representations. My model is self.ast_model = ASTModel(label_dim=2, fshape=16, tshape=16, fstride=16, tstride=16, input_tdim=128, pretrain_stage=False, load_pretrained_mdl_path="./model/SSAST-Base-Patch-400.pth")
My input is (24, 128, 128), while the output before line 262 is (24, 66, 768). I understand that 768 may be the hidden dimension of the transformer, but where does the 66 come from? How can I keep the output length the same as the input length?

@YuanGongND (Owner) commented Jul 25, 2023

Hi there,

Your input [24, 128, 128] means batch size 24, an input sequence length of 128 time frames, and 128 features (mel fbanks) per frame. I.e., your input spectrogram dim is 128 x 128.

The output of the model is also a sequence, but of flattened patches (in your case 16 x 16), so (128 x 128) / (16 x 16) = 64 is the actual patch sequence length. There are two prefix ([CLS]) tokens, so 66 in total.

Therefore the input length is the number of time frames and the output length is the number of flattened patches; they are totally different. To make them similar, you would need patch size fshape=128, tshape=1, fstride=128, tstride=1 (i.e., your patch shape is 128 x 1, one time frame); then the output would be something close to [24, 128 + 2 (cls tokens), 768]. Note we haven't tested this setting; the closest is the 128 x 2 patch, where the output would be [24, 128/2 + 2, 768].
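
As a quick sanity check of that arithmetic, here is a small hypothetical helper (not a function from this repo) for non-overlapping patches, i.e., stride equal to patch shape:

```python
def ssast_output_len(time_frames, mel_bins, fshape, tshape, fstride, tstride, n_cls_tokens=2):
    """Number of output tokens for a (time_frames x mel_bins) spectrogram."""
    n_patches_t = (time_frames - tshape) // tstride + 1
    n_patches_f = (mel_bins - fshape) // fstride + 1
    return n_patches_t * n_patches_f + n_cls_tokens

print(ssast_output_len(128, 128, 16, 16, 16, 16))   # 8 * 8 + 2 = 66
print(ssast_output_len(128, 128, 128, 2, 128, 2))   # 64 + 2 = 66  (the 128 x 2 patch)
print(ssast_output_len(128, 128, 128, 1, 128, 1))   # 128 + 2 = 130 (the 128 x 1 patch)
```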

-Yuan
