Diverging evaluation loss using finetuning scripts Guanaco 7b #152
Comments
Same issue here. I wonder if it is caused by the token_id? I tried "self-instruct" and "alpaca", but still get the same problem; the performance is even worse than using GPT-2.
Glad to hear I'm not the only one. But what do you mean exactly by this comment?
Basically, I tried 4 ways to get rid of it, but none of them worked:
Now I suspect the reason for this issue is that:
Interesting. I tried several things too, including some learning-rate tuning and larger models, as well as different models like falcon-7b and falcon-40b. Also played a bit with the
Same issue. I think maybe the model is too large and the dataset is too small, so the model is overfitting to the training dataset?
I think these PEFT models are overfitting very quickly on small-ish datasets. I have the same issue.
I have the same issue with flan-t5-xxl, ul2, and xglm. But I ran the same code without 4-bit, with just LoRA, and the model converged normally. So it is the 4-bit part. And as far as I can see, performance on the training set is decreasing as well.
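For context, here is a minimal sketch of the two setups being contrasted in the comment above, assuming the standard transformers/peft APIs. The model name, LoRA hyperparameters, and target modules are illustrative, not the exact configuration used by the commenter.

```python
# Hedged sketch of the two setups described above; hyperparameters are illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM",
)

# Setup A: 4-bit quantized base model + LoRA (QLoRA-style), where the eval loss diverged.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", quantization_config=bnb_cfg, device_map="auto"
)
model_4bit = prepare_model_for_kbit_training(model_4bit)
model_4bit = get_peft_model(model_4bit, lora_cfg)

# Setup B: the same base model in bf16 + the same LoRA adapter, which reportedly converged.
model_lora = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", torch_dtype=torch.bfloat16, device_map="auto"
)
model_lora = get_peft_model(model_lora, lora_cfg)
```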
Relieved to see I'm not the only one. 😅 I get practically the same results as you @KJ-Waller when using the Guanaco fine-tuning script for LLaMA 2, and performance on MMLU goes down. Did anybody here figure out what's going on?
Hello! 4-bit quantization is not responsible for what you are describing here. You are observing diverging loss and oscillating MMLU for the following reasons.
Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset. And you should think very carefully about which target benchmark you are optimizing, as it is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper, but the Vicuna eval was much more relevant for chatbot evaluation.
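As a rough illustration of the advice above, one way to drive checkpoint selection by the downstream task rather than by MMLU or eval loss is to attach a custom callback to the HF Trainer. This is only a sketch; `score_on_my_task` is a hypothetical helper you would implement for your own benchmark.

```python
# Hedged sketch: surface a downstream-task metric during training so that
# checkpoint selection is based on the benchmark you actually care about.
# `score_on_my_task` is a hypothetical, user-supplied scoring function.
from transformers import TrainerCallback

class DownstreamEvalCallback(TrainerCallback):
    def __init__(self, eval_fn):
        self.eval_fn = eval_fn  # e.g. score_on_my_task(model) -> float

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # Runs after each regular evaluation; report the task metric alongside eval loss.
        score = self.eval_fn(model)
        print(f"step {state.global_step}: downstream task score = {score:.4f}")

# Usage (illustrative):
# trainer.add_callback(DownstreamEvalCallback(score_on_my_task))
# trainer.train()
```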
Thank you @artidoro! I ran the
I get the reasoning around not depending on MMLU metrics (is there a reason why you have it on by default in the script?), but I thought eval loss should still give some indication of overfitting/underfitting issues. Is that a misconception on my part?
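For checking the overfitting question concretely, the train and eval loss curves that the Trainer already records can be compared directly. A small sketch, assuming the usual `trainer_state.json` written into each checkpoint directory; the path is illustrative.

```python
# Hedged sketch: read the Trainer's logged history and compare train vs. eval loss
# to see whether eval loss diverges while train loss keeps dropping (overfitting).
import json

# Path is illustrative; point it at a real checkpoint directory.
with open("output/guanaco-7b/checkpoint-1000/trainer_state.json") as f:
    log_history = json.load(f)["log_history"]

train_loss = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
eval_loss = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]

print("last train losses:", train_loss[-3:])
print("last eval losses: ", eval_loss[-3:])
```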
@KJ-Waller @FHL1998 I get the same loss trend when doing a full finetune on LLaMA 2. Have you solved this problem in the end?
Sorry, I haven't worked on this in a while and didn't pursue it any further. Good luck.
Is anyone else having this issue when using the finetune_guanaco_7b.sh script? I keep seeing the evaluation loss diverge rather than converge. I originally noticed this while trying to finetune on my own datasets, and after troubleshooting, found that it happens with the original training scripts as well.
The training loss is shown below: