First of all, great project!
One question though: in the original paper, you mention training MusicGen on a four-quantizer EnCodec with a pretty large stride (a 50 Hz frame rate). This will produce fairly low-quality output (and it is monophonic and limited to 32 kHz).
Have you done any ablation studies with larger bandwidths? For instance, in the EnCodec paper, you trained a stereo 48 kHz, 24 kbit/s model. What were the issues with using it in MusicGen?
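For context, here is a rough back-of-the-envelope sketch of the gap I'm asking about. The codebook sizes (2048 entries for the MusicGen tokenizer, 1024 for the 48 kHz EnCodec model), the 150 Hz frame rate and the 16 codebooks at 24 kbit/s are my assumptions rather than figures quoted above, so please correct me if they are off.

```python
import math


def bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second of a residual-VQ token stream (per encoded stream)."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_code


def tokens_per_second(frame_rate_hz: float, n_codebooks: int) -> float:
    """Number of discrete codes emitted per second across all codebooks."""
    return frame_rate_hz * n_codebooks


# MusicGen tokenizer: 50 Hz and 4 quantizers (from the paper);
# the codebook size of 2048 is an assumption.
musicgen_bps = bitrate_bps(50, 4, 2048)
musicgen_tps = tokens_per_second(50, 4)

# 48 kHz EnCodec at 24 kbit/s: 150 Hz, 16 codebooks of 1024 entries
# are all assumptions.
encodec48_bps = bitrate_bps(150, 16, 1024)
encodec48_tps = tokens_per_second(150, 16)

print(f"MusicGen tokenizer: {musicgen_bps / 1000:.1f} kbit/s, {musicgen_tps:.0f} tokens/s")
print(f"EnCodec 48 kHz:     {encodec48_bps / 1000:.1f} kbit/s, {encodec48_tps:.0f} tokens/s")
```

If those assumptions hold, the higher-bandwidth model would give the language model roughly 12x more tokens per second to predict, which I guess could be part of the answer.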
@adefossez hopefully you can shed some light here. Thanks!