
Configuration for training with CLAP embeddings #441

Open
jbm-composer opened this issue Mar 26, 2024 · 8 comments

Comments

@jbm-composer

I'm wondering if anyone has any configuration info they could share on training with CLAP embeddings?
I want to try the laion/larger_clap_music model from Huggingface, but it's really unclear to me how the project is supposed to be configured.

Any help greatly appreciated.

@jbm-composer
Author

jbm-composer commented Mar 26, 2024

Just adding a bit more info, I managed to at least get to an attempt to load larger_clap_music using this config:

conditioners:
  description:
    model: clap
    clap: # based on
      checkpoint: //reference/clap/larger_clap_music/pytorch_model.bin
      name: laion/larger_clap_music
      model_arch: 'HTSAT-base'
      enable_fusion: false
      sample_rate: 32000
      max_audio_length: 10
      audio_stride: 1
      dim: 512
      attribute: description
      normalize: true
      quantize: true  # use RVQ quantization
      n_q: 12
      bins: 1024
      kmeans_iters: 50
      text_p: 0.  # probability of using text embed at train time
      cache_path: null

But loading the state_dict fails with a laundry list of "Unexpected key(s)"

I also tried just pointing it to the folder (it complained that it was not a file) and the config.json inside the HF download (which gave some kind of parse error).
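For anyone who wants to see the mismatch concretely, a rough diagnostic sketch like this should show it (assuming the laion_clap package that the Audiocraft conditioner builds on; the checkpoint path is the one from my config above):

# Diff the HF checkpoint keys against the keys the laion_clap model expects.
import torch
import laion_clap

# State dict downloaded from laion/larger_clap_music on the Hugging Face Hub
hf_state = torch.load(
    "reference/clap/larger_clap_music/pytorch_model.bin", map_location="cpu"
)

# The model that laion_clap builds for this architecture / fusion setting
clap = laion_clap.CLAP_Module(enable_fusion=False, amodel="HTSAT-base")
laion_keys = set(clap.model.state_dict().keys())
hf_keys = set(hf_state.keys())

print("keys only in the HF checkpoint:", len(hf_keys - laion_keys))
print("keys only in the laion_clap model:", len(laion_keys - hf_keys))
print("shared keys:", len(hf_keys & laion_keys))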

@jbm-composer
Author

Okay, I can load larger_clap_music using the ClapModel (and ClapProcessor) from Huggingface, but not in Audiocraft. I see that Audiocraft is based on CLAP from the Laion repo... Does anybody know if there's a way to load the HF weights into the Laion model? Or has anybody hacked the HF ClapModel into Audiocraft, by any chance?
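For reference, the Hugging Face route is something like the following (a minimal sketch, not exactly what I ran; the example prompt is made up, and the 48 kHz rate is my understanding of what the CLAP processor expects, so double-check the model card):

import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Text embedding (512-dim, matching the `dim: 512` in the config above)
text_inputs = processor(text=["calm piano with soft strings"], return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)

# Audio embedding from one second of placeholder silence at 48 kHz
audio = np.zeros(48000, dtype=np.float32)
audio_inputs = processor(audios=audio, sampling_rate=48000, return_tensors="pt")
with torch.no_grad():
    audio_emb = model.get_audio_features(**audio_inputs)

print(text_emb.shape, audio_emb.shape)  # both should be [1, 512]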

@jbm-composer
Author

I worked out a way around loading the HF weights. Now what I'm wondering about is how to configure a text prompt for running test generations during training. My goal is to test the performance of training on CLAP audio embeddings and using text embeddings for inference.

Any help greatly appreciated.

@yukara-ikemiya

yukara-ikemiya commented Mar 28, 2024

In Audiocraft, 'test generation' during training is a little bit tricky; it is done in the following part of the code:

def generate_audio(self) -> dict:

We may have to prepare a dataset for the generate stage in the same way as the training data.
As you may know, metadata can be attached to each audio file with a .json file, as shown in the example here:
https://github.com/facebookresearch/audiocraft/tree/main/dataset/example

If you don't need to do 'continuation generation' during training, dummy audio should be enough.
In that case, you would have to:

  1. Prepare dummy audio and a metadata file for test generation.
  2. Add the descriptions you want to use for test generation to the metadata file (.json) under the "description" tag (see the sketch below).
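A minimal sketch of what 1. and 2. could look like (the paths are placeholders, and the metadata fields other than "description" are loosely copied from dataset/example, so adjust them to your dataset config):

import json
import os

import torch
import torchaudio

out_dir = "dataset/generate_dummy"  # placeholder location
os.makedirs(out_dir, exist_ok=True)

# 1. Write 10 seconds of silence as the dummy audio
sample_rate = 32000
dummy = torch.zeros(1, sample_rate * 10)
torchaudio.save(os.path.join(out_dir, "dummy_0.wav"), dummy, sample_rate)

# 2. Write the sidecar .json whose "description" holds the generation prompt
meta = {
    "title": "dummy_0",
    "artist": "",
    "description": "calm piano with soft strings",  # prompt used at generation time
    "sample_rate": sample_rate,
    "duration": 10.0,
}
with open(os.path.join(out_dir, "dummy_0.json"), "w") as f:
    json.dump(meta, f, indent=2)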

@jbm-composer
Author

Thanks so much for the reply!

Digging around the solver code (as you pointed out), it did seem like the joint embedding might want a prompt, so I added some super simple metadata files. I haven't run it to the point of a test generation yet, but hopefully it works as expected. I haven't added any dummy audio at this point; I think in the past it just used the audio from the dataset (I think...?).

Another "gotcha" that wasn't obvious to me at first is that dataset.valid.num_samples has to be >= the number of GPUs on the system. Makes sense, of course, but I crashed a few times before figuring it out.

@jbm-composer
Author

Actually, though... what determines when it will generate a sample output? I can see it running through train and valid steps, and it's saving checkpoints, but I don't seem to be getting any audio. I'd also like the audio sent to wandb, ideally... I do have wandb: with_media_logging: true set.

@yukara-ikemiya

It seems that 'test generation' runs at the end of every epoch, the same as evaluation; this is defined in the BaseSolver class (the base class of every solver class).

def run_epoch(self):

As shown in that method, you can first check whether your run gets past the self.should_run_stage('generate') statement. If it doesn't, 'test generation' is being skipped there, and you can trace which configuration setting causes it.

Then, the audio is finally saved in the aforementioned generate_audio method, after the samples are generated, at the following line:

sample_manager.add_samples(

@jbm-composer
Author

Yes, I saw from another issue/comment that the "every" in the "generate" config refers to epochs, not steps. I had it set to 1000, thinking it meant steps, so I would have been waiting a while... heh...
It's not always super clear when steps (or "updates") are meant and when epochs are.
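For anyone else landing here, the settings that seem to matter are roughly these (key names as I understand them from the default solver configs, so verify against your own):

generate:
  every: 1                   # epochs between test generations (epochs, not updates)
logging:
  log_wandb: true            # enable the wandb logger
wandb:
  with_media_logging: true   # attach generated audio to the wandb run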
