Diverging evaluation loss using finetuning scripts Guanaco 7b #152
Comments
Same issue here. I wonder if it is caused by the token_id? I tried "self-instruct" and "alpaca", but still get the same problem; the performance is even worse than using GPT-2.
Glad to hear I'm not the only one. But what do you mean exactly by this comment?
Basically, I tried 4 ways to get rid of it, but none of them worked:
Now I suspect the reason for this issue is that:
Interesting. I tried several things too, including some learning-rate tuning and larger models, as well as different models like falcon-7b and falcon-40b. Also played a bit with the
Same issue. I think maybe the model is too large and the dataset is too small, so the model is overfitting to the training dataset?
I think these PEFT models are overfitting very quickly on small-ish datasets. I have the same issue.
I have the same issue with flan-t5-xxl, ul2, and xglm. But I ran the same code without 4-bit, with just LoRA, and the model converged normally. So it is the 4-bit part. And as far as I can see, performance on the training set is decreasing as well.
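For context, here is a minimal sketch of the two setups being contrasted in the comment above, assuming the standard transformers/peft APIs. The model name, LoRA hyperparameters, and target modules are illustrative, not the exact configuration used by the commenter.

```python
# Hedged sketch of the two setups described above; hyperparameters are illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM",
)

# Setup A: 4-bit quantized base model + LoRA (QLoRA-style), where the eval loss diverged.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", quantization_config=bnb_cfg, device_map="auto"
)
model_4bit = prepare_model_for_kbit_training(model_4bit)
model_4bit = get_peft_model(model_4bit, lora_cfg)

# Setup B: the same base model in bf16 + the same LoRA adapter, which reportedly converged.
model_lora = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", torch_dtype=torch.bfloat16, device_map="auto"
)
model_lora = get_peft_model(model_lora, lora_cfg)
```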
Relieved to see I'm not the only one. 😅 I get practically the same results as you @KJ-Waller when using the Guanaco fine-tuning script for LLaMA 2, and performance on MMLU goes down. Did anybody here figure out what's going on?
Hello! 4-bit quantization is not responsible for what you are describing here. You are observing diverging loss and oscillating MMLU for the following reasons.
Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset. And you should think very carefully about which target benchmark you are optimizing, as it is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper, but the Vicuna eval was much more relevant for chatbot evaluation.
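As a rough illustration of the advice above, one way to drive checkpoint selection by the downstream task rather than by MMLU or eval loss is to attach a custom callback to the HF Trainer. This is only a sketch; `score_on_my_task` is a hypothetical helper you would implement for your own benchmark.

```python
# Hedged sketch: surface a downstream-task metric during training so that
# checkpoint selection is based on the benchmark you actually care about.
# `score_on_my_task` is a hypothetical, user-supplied scoring function.
from transformers import TrainerCallback

class DownstreamEvalCallback(TrainerCallback):
    def __init__(self, eval_fn):
        self.eval_fn = eval_fn  # e.g. score_on_my_task(model) -> float

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # Runs after each regular evaluation; report the task metric alongside eval loss.
        score = self.eval_fn(model)
        print(f"step {state.global_step}: downstream task score = {score:.4f}")

# Usage (illustrative):
# trainer.add_callback(DownstreamEvalCallback(score_on_my_task))
# trainer.train()
```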
Thank you @artidoro! I ran the
I get the reasoning around not depending on MMLU metrics (is there a reason why you have it on by default in the script?), but I thought eval loss should still give some indication of overfitting/underfitting issues. Is that a misconception on my part?
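For checking the overfitting question concretely, the train and eval loss curves that the Trainer already records can be compared directly. A small sketch, assuming the usual `trainer_state.json` written into each checkpoint directory; the path is illustrative.

```python
# Hedged sketch: read the Trainer's logged history and compare train vs. eval loss
# to see whether eval loss diverges while train loss keeps dropping (overfitting).
import json

# Path is illustrative; point it at a real checkpoint directory.
with open("output/guanaco-7b/checkpoint-1000/trainer_state.json") as f:
    log_history = json.load(f)["log_history"]

train_loss = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
eval_loss = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]

print("last train losses:", train_loss[-3:])
print("last eval losses: ", eval_loss[-3:])
```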
@KJ-Waller @FHL1998 I get the same loss trend when doing a full finetune on LLaMA 2. Have you solved this problem in the end?
Sorry, I haven't worked on this in a while and didn't pursue it any further. Good luck.
Is anyone else having this issue when using the finetune_guanaco_7b.sh script? I keep seeing the evaluation loss diverge rather than converge. I originally noticed this while trying to finetune on my own datasets, and after troubleshooting, found that it happens with the original training scripts as well.
The training loss is shown below: