
Diverging evaluation loss using finetuning scripts Guanaco 7b #152

Closed

KJ-Waller opened this issue Jun 9, 2023 · 13 comments

Comments

KJ-Waller commented Jun 9, 2023

Is anyone else having this issue when using the finetune_guanaco_7b.sh script? I keep seeing the evaluation loss diverge rather than converge. I first noticed this while trying to finetune on my own datasets, and after some troubleshooting found that the same thing happens with the original training scripts provided.

[Figure: guanaco-7b evaluation loss curve]
The training loss is below:
[Figure: guanaco-7b training loss curve]
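
(For reference, the curves above come from the Hugging Face Trainer logs that qlora.py writes. A minimal sketch of how they can be re-plotted from a checkpoint's trainer_state.json; the output directory and checkpoint number below are placeholders, not values from this thread.)

```python
# Sketch: re-plot train/eval loss from the Trainer's trainer_state.json.
# The checkpoint path is a placeholder; point it at whatever the script wrote.
import json
import matplotlib.pyplot as plt

with open("output/guanaco-7b/checkpoint-1875/trainer_state.json") as f:
    history = json.load(f)["log_history"]

train = [(h["step"], h["loss"]) for h in history if "loss" in h]
evals = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), label="train loss")
plt.plot(*zip(*evals), label="eval loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```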


FHL1998 commented Jun 9, 2023

Same issue here. I wonder if it is caused by the token_id? I tried "self-instruct" and "alpaca" but still get the same problem; the performance is even worse than with GPT-2.

@KJ-Waller (Author)

> I wonder if it is caused by the token_id?

Glad to hear I'm not the only one. But what do you mean exactly by this comment?


FHL1998 commented Jun 9, 2023

> > I wonder if it is caused by the token_id?
>
> Glad to hear I'm not the only one. But what do you mean exactly by this comment?

Basically, I tried 4 ways to get rid of it, but none of them worked:

  1. different dataset formats;
  2. learning-rate tuning;
  3. changing source_max_len and target_max_len;
  4. a larger model (30B).

Now I suspect the cause is one of the following:

  1. The length and format of the input: some of my inputs are long (around 500 tokens) while others are only about 200 tokens, and my dataset contains a lot of \n and - characters, so I wonder if these are the cause (a quick way to check this is sketched after this list).
  2. Maybe the dataset is not large enough? But it is still strange that my train_loss converges while my eval_loss increases.
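
(A minimal sketch of how the first suspicion could be checked: tokenize the dataset and look at the length distribution before picking source_max_len / target_max_len. The dataset file and the "input"/"output" field names are placeholders, not from this thread.)

```python
# Sketch: tokenize each example and summarize source/target lengths so that
# source_max_len / target_max_len can be chosen to cover most of the data.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")              # base model is an assumption
ds = load_dataset("json", data_files="my_dataset.json", split="train")  # placeholder file

src_lens = [len(tok(ex["input"]).input_ids) for ex in ds]   # field names are assumptions
tgt_lens = [len(tok(ex["output"]).input_ids) for ex in ds]

for name, lens in [("source", src_lens), ("target", tgt_lens)]:
    print(f"{name}: median={int(np.median(lens))} "
          f"p95={int(np.percentile(lens, 95))} max={max(lens)}")
```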

@KJ-Waller (Author)

> > > I wonder if it is caused by the token_id?
> >
> > Glad to hear I'm not the only one. But what do you mean exactly by this comment?
>
> Basically, I tried 4 ways to get rid of it, but none of them worked:
>
>   1. different dataset formats;
>   2. learning-rate tuning;
>   3. changing source_max_len and target_max_len;
>   4. a larger model (30B).
>
> Now I suspect the cause is one of the following:
>
>   1. The length and format of the input: some of my inputs are long (around 500 tokens) while others are only about 200 tokens, and my dataset contains a lot of \n and - characters, so I wonder if these are the cause.
>   2. Maybe the dataset is not large enough? But it is still strange that my train_loss converges while my eval_loss increases.

Interesting. I tried several things too, including some learning-rate tuning, larger models, and different models like falcon-7b and falcon-40b. I also played a bit with the source_max_len parameter and used different datasets. Then I decided to run the default script without changing anything about the model or dataset, and still saw the training loss converge while the eval loss diverged.

KJ-Waller changed the title from "Divering evaluation loss using finetuning scripts Guanaco 7b" to "Diverging evaluation loss using finetuning scripts Guanaco 7b" on Jun 9, 2023
@quannguyen268

Same issue. I think maybe the model is too large and the dataset is too small, so the model is overfitting to the training dataset?

@griff4692

I think these PEFT models are overfitting very quickly on small-ish datasets. I have the same issue.

@OfirArviv

I have the same issue with flan-t5-xxl, ul2, and xglm. But I ran the same code without 4-bit, with just LoRA, and the model converged normally. So it is the 4-bit part.

And as far as I can see, the performance on the training set is decreasing as well.
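
(For reference, the two setups being compared are roughly the following. This is a minimal sketch using transformers/peft/bitsandbytes; the base model, LoRA rank, and target modules are illustrative choices, not the exact values hard-coded in qlora.py.)

```python
# Sketch: LoRA on a 4-bit-quantized base (QLoRA) vs. LoRA on a bf16 base.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder base model

def add_lora(model):
    cfg = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                     target_modules=["q_proj", "v_proj"],  # illustrative subset
                     task_type="CAUSAL_LM")
    return get_peft_model(model, cfg)

# Variant A: QLoRA -- base weights quantized to 4-bit NF4.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
qlora_model = add_lora(prepare_model_for_kbit_training(
    AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)))

# Variant B: plain LoRA -- base weights kept in bf16 (the setup reported above
# as converging normally).
lora_model = add_lora(
    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16))
```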


ghost commented Jun 22, 2023

@artidoro

@marclove

Relieved to see I'm not the only one. 😅 I get practically the same results as you @KJ-Waller when using the Guanaco fine-tuning script for LLaMA 2, and performance on MMLU goes down. Did anybody here figure out what's going on?

[Two screenshots attached]

@artidoro (Owner)

Hello! 4-bit quantization is not responsible for what you are describing here. You are observing diverging loss and oscillating MMLU for the following reasons.

  1. In NLP, eval loss is not always directly related to downstream performance (like task accuracy measured by GPT-4 eval).
  2. The dev set of MMLU is small, which explains the swings in MMLU accuracy during finetuning. These values are only indicative; you have to compute test set performance to get a more stable result. We use the last checkpoint for this.
  3. As shown in our paper, finetuning on the OpenAssistant dataset significantly improves chatbot performance, as measured by the Vicuna GPT-4 eval. However, it does not help much on MMLU (performance degrades or stays the same compared to no OA finetuning).

Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset, and you should think very carefully about which target benchmark you are optimizing, as it is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper, but the Vicuna eval was much more relevant for chatbot evaluation.
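
(Following the point about using the last checkpoint and evaluating on the downstream task: a minimal sketch of loading the final adapter and generating responses for a chatbot-style evaluation, rather than reading too much into the intermediate eval loss. The base model and adapter path below are placeholders.)

```python
# Sketch: load the last adapter checkpoint and generate, so outputs can be
# judged on the actual downstream task (e.g. a Vicuna/GPT-4-style eval).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "huggyllama/llama-7b"                                   # placeholder
adapter_dir = "output/guanaco-7b/checkpoint-1875/adapter_model"   # placeholder

tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

# Guanaco-style prompt format.
prompt = "### Human: Explain what overfitting is.### Assistant:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```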

@marclove

Thank you @artidoro! I ran the finetune_llama2_guanaco_7b.sh script to try to orient myself to what I should expect with qlora before running on my own dataset and thought I had screwed something up. Very much appreciate the clarification!

I get the reasoning around not depending on MMLU metrics (is there a reason why you have it on by default in the script?), but I thought eval loss should still give some indication of overfitting/underfitting issues. Is that a misconception on my part?

@waterluck

@KJ-Waller @FHL1998 I get the same loss trend when doing a full finetune on LLaMA 2. Have you solved this problem in the end?

@KJ-Waller (Author)

> @KJ-Waller @FHL1998 I get the same loss trend when doing a full finetune on LLaMA 2. Have you solved this problem in the end?

Sorry, I haven't worked on this in a while and didn't pursue it any further. Good luck!
