I would like to fine-tune a small MusicGen model. My dataset consists of various short sounds, and it is not large. I previously trained successfully on single-instrument melodies with an extremely small dataset of only 3 hours and got interesting results. With short sounds, however, the behavior is different: in 80% of cases or more, the generated audio contains a crackling sound after the main attack.
I've tried various ways to optimize training, such as reducing the learning rate or dropout, but so far nothing obvious has helped. My logs also look very suspicious from the very beginning of training:
Train Summary | Epoch 1 | lr=1.00E+00 | grad_norm=INF | grad_scale=45645.824 | ce=0.962 | ppl=2.650 | duration=2472.758
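The summary line above can be checked mechanically for non-finite values. The following is a minimal sketch, not part of any library: `parse_summary` is a hypothetical helper that assumes the `key=value` fields are separated by `|` exactly as in the pasted log. Note that the presence of `grad_scale` suggests dynamic loss scaling under mixed precision, where occasional overflowed steps are typically skipped, so an `INF` in an aggregated summary may not by itself indicate divergence; that interpretation is an assumption here.

```python
import math


def parse_summary(line: str) -> dict:
    """Parse a 'Train Summary | ... | key=value' log line into a dict of floats.

    Hypothetical helper: assumes fields are separated by '|' and that
    metric fields have the form key=value, as in the log pasted above.
    """
    metrics = {}
    for field in line.split("|"):
        field = field.strip()
        if "=" not in field:
            continue  # skips labels such as "Train Summary" and "Epoch 1"
        key, value = field.split("=", 1)
        # float() accepts "INF" (case-insensitive) and returns infinity
        metrics[key.strip()] = float(value)
    return metrics


line = (
    "Train Summary | Epoch 1 | lr=1.00E+00 | grad_norm=INF | "
    "grad_scale=45645.824 | ce=0.962 | ppl=2.650 | duration=2472.758"
)
metrics = parse_summary(line)

# Collect the names of all metrics that are infinite or NaN
non_finite = [name for name, value in metrics.items() if not math.isfinite(value)]
print("non-finite metrics:", non_finite)
```

Running the same check over every epoch's summary would show whether the non-finite gradient norm persists or only appears early in training.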
It's also interesting that writing the same word in the prompt in lowercase versus capitalized gives different results, and the effect depends on the word. In one case the result is as expected; in the other it is a jumble of random sounds, as if the model had not been trained. (By the way, I checked the original models, and they show similar uppercase/lowercase behavior for the same word during generation.) I'd also be very interested to learn how the text tags packaged in JSON format for each sample are merged. I'm new to this model. Thanks in advance for your answer and help!
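One way to rule out letter case as a variable is to normalize text identically in the training metadata and at generation time. The sketch below assumes the sensitivity comes from a case-sensitive text tokenizer, which is not confirmed here. The helpers `normalize_text` and `normalize_sample` and the field names `description` and `keywords` are hypothetical illustrations, not the library's own preprocessing.

```python
import json


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def normalize_sample(sample: dict, fields=("description", "keywords")) -> dict:
    """Return a copy of one metadata record with selected text fields normalized.

    The field names are assumptions for illustration; other keys
    (for example file paths) are left untouched.
    """
    cleaned = dict(sample)
    for name in fields:
        if isinstance(cleaned.get(name), str):
            cleaned[name] = normalize_text(cleaned[name])
    return cleaned


# Hypothetical per-sample JSON record
raw = json.loads('{"description": "  Snare   HIT ", "keywords": "Drum, Short", "path": "A/Hit.wav"}')
sample = normalize_sample(raw)

# Apply the same normalization to the prompt used for generation
prompt = normalize_text("Snare Hit")
print(sample, prompt)
```

If results still differ after both sides are normalized the same way, letter case is unlikely to be the cause.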