Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly #7062
Comments
You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.
@slaren Thank you, will try that and update this thread with the result.
@slaren I can't seem to get it to work with other methods either. Do you by chance have an external guide, link, or documentation that demonstrates how to merge it using pytorch? I'm getting the same output regardless of how I save it. Edit: To clarify, this only happens during conversion to GGUF; when merging and loading the safetensors for inference, everything works as expected. It is only after converting to GGUF (regardless of quantization, F16, etc.) that it becomes like this.
@slaren I misunderstood you completely, what you wrote is exactly what I've done. I did not use the llama.cpp lora conversion script. I just merged the lora into the model using https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload So far so good. Then I converted the previously merged model into GGUF format with llama.cpp, and that breaks the model and the lora fine tune; it does not produce the same outputs. The difference seems to be completely random.
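For anyone landing here later, a minimal sketch of that merge step, assuming a PEFT LoRA adapter on top of the Llama 3 8B Instruct base (paths and dtype are placeholders, not from this thread):

```python
# Minimal sketch: fold a LoRA adapter into the base weights with PEFT and save
# the merged model as safetensors, ready for convert-hf-to-gguf.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"
adapter_dir = "./lora-adapter"    # placeholder: directory with the trained adapter
output_dir = "./merged-model"     # placeholder: where the merged model is written

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)

merged = model.merge_and_unload()  # applies the LoRA deltas and drops the adapter wrappers

merged.save_pretrained(output_dir, safe_serialization=True)
AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
```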
Does it work with the CPU backend? (if using a version of llama.cpp built with CUDA, run with CUDA_VISIBLE_DEVICES="" to use the CPU backend)
@slaren That's it! With GPU it ruined the lora, with CPU it works as intended! GREAT!
Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
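For illustration only (values are mine, not from the thread), the kind of thing slaren is describing: numbers that bfloat16 can hold but float16 cannot, because float16 trades exponent range for mantissa bits:

```python
# bfloat16 shares FP32's exponent range; float16 saturates around 65504 and
# loses very small magnitudes, so a BF16-trained value can overflow or vanish.
import torch

x = torch.tensor([70000.0, 1e-8], dtype=torch.bfloat16)
print(x)                    # both magnitudes are representable in bfloat16
print(x.to(torch.float16))  # tensor([inf, 0.], dtype=torch.float16)
```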
@slaren Great! Thanks, been scratching my head at this for weeks. Much appreciated!!!
@Sneakr what's the solution?
@gamercoder153 add CUDA_VISIBLE_DEVICES=0 before the conversion command to run conversion on the CPU. Edit: It's a temporary solution and does not fully fix the bfloat issue, but it's working at least.
Beginning to notice a pattern around here...
Is this more evidence BF16 should be added to the convert scripts? I've been converting BF16 to float32. Does that mitigate these issues? Of course it's not ideal, but if it works, I'll continue doing it until BF16 is natively available.
Update:
I hope they solve the issue somehow.
I think if that were the case then the output would just be NaN / incoherent garbage like it would be with Phi-2. My guess is that this is a difference in rounding error, not necessarily even from the precision of the weights but possibly from other operations as well. In any case, an insufficient numerical range could be tested by using FP32 instead of FP16 cuBLAS matrix multiplication.
@slaren Further testing: I exported to GGUF. Edit: Fine tuning was done in bfloat16, I need to test float16 next. CUDA_VISIBLE_DEVICES="" (empty string) gives the same outcome. This is the result of loading the model directly through safetensors and then loading it as GGUF in LM Studio. The model is fine tuned with a lora adapter.
Hey everyone, I managed to make a minimal reproduction of this issue with unsloth. It uses a single sample. Inference on Unsloth (notebook): Code_yPjvh8Fp0U.mp4. Inference on llama.cpp: WindowsTerminal_VnQLGD4gka.mp4. Here is my training data (single sample):
{
"conversations": [
{
"role": "user",
"content": "!!llama.cpp!!"
},
{
"role": "assistant",
"content": "................,gNp.y@N,\n...............gP..@.@..\"B,\n..............@C...@.@-...$p\n.............]@....@.@.....@\n.............]@....@.@.....@\n..............@,...$U@....]P\n........,gg@P*\"\"....-....#@\n...,g@P*'.................*@,\n.gP\"........%Np@P..........'@\n]@..........................@\n'B,.........................]@\n..\"*MMNBgg,,.................]C\n...........\"`@`..............$P\n............@P...............Bp\n............@-................@.\n............]@................@P\n...........]@................\"@\n...........@-.................]@\n...........]@.................,@\n........../@-................'@.\n.........]@....................Bp\n.........$P....................]K\n.........,@C..................J@\n......../@`.....................@\n.......]@.......................$P........,,,,,,,,\n.......]L.......................\"**$P\"\"\"\"```\"\"\"\"\"``\"\"*\"*\"*PPRNBggg,\n........$N.........................]$...........................][-\"*NNg,\n.......gF...........................j`'``'```''\"'\"\"\"^`\"\"\"\"*\"\"\"\"\"]@......\"Ng\n......]@............................@.,@*b...,@\"B,...gP%,...,@b,.@........'%W\n.......@,...........................@@\"...]g@C...\"NgP`..\"B,@C..\"N@.......@..]@\n.......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@\n.......@............................$P...........................$P......]@...@-\n.......@............................$P....................,,,,,,,$P.......@...$P\n.......\"Bg..........................$P\"```]\"```\"[`\"''',..--]g-.-.@P.......@...@\n........j@..........................]PBggN`%w,gP\"%g,gP\"%wg@\".\"NNP$P.......@..@C\n........][..........................]@.......-..........,,,,,gggg@........@g@'\n........'@...........................`^\"*T\"\"\"\"\"\"\"\"**\"\"*\"`'.`..............@\n.........\"Bw,.............................................................@\n............@.............................................................$\n............]@.....................................g,,.,,@Ngg@P@..........$\n.............\"Ngg,,..............gg,..,ggggg@P*RNP\"]@`\"`....]P.$P.........@\n................-]@.........@BB@P\"-'\"`-@............@.......]P.]@........][\n..................@........]@..@.......@-...........@.......$P.]@........]P\n..................@-.......][..@.......@............@P......@P..@........@\n..................$P.......]P..@.......@............$P......@...@........@\n..................$P.......@`..@......]@............$P......@...@.......]@\n..................]P.......@...@......][............$P......@...@.......]P\n..................][......]@...@......@P............]P.....]@...@.......@-\n..................][......$P..]@......@.............]P.....]P...@-......@\n..................][......@...]@.....]@.............$P.....@P...@......]P\n..................][.....]@...]@.....@P.............$P.....@....@......@-\n..................]@.....@P...][.....@..............$P....]@....@.....]@\n..................][.....@....]@....]P..............@-....@P....@.....@\n..................][....$P....]P....@...............@....]@....]@....]@\n..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N\""
}
]
}
And the parameters used during training. Notebook: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
You will also need to change the dataloader to pull from JSON.
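In case it helps others reproduce this, a rough sketch of the dataloader change (the file name and the use of the tokenizer's chat template are my assumptions, not taken from the notebook):

```python
# Rough sketch: load the single-sample JSON above and render it with the
# Llama 3 chat template so it can be fed to the trainer as a one-row dataset.
import json
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

with open("sample.json") as f:          # placeholder file containing the JSON above
    sample = json.load(f)

text = tokenizer.apply_chat_template(sample["conversations"], tokenize=False)
dataset = Dataset.from_dict({"text": [text]})
print(dataset[0]["text"][:200])
```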
Have you tried this with setting CUDA_VISIBLE_DEVICES="" (an empty string)?
@salieri Yes sorry, I pasted the wrong command here, I ran it with "" (empty string), same outcome. Thanks!
Currently testing @gabriel-peracio's code as well :)
I repeated the training in FP16 and the results are even worse (in the video, I also disable flash attention in llama.cpp). Training in FP16 (instead of BF16): WindowsTerminal_tQ9Y9BbhOy.mp4
My take: the videos show that you can overfit a model to produce a single given reply. If you use the exact same code for training and for inference you get the overfit reply. If you use different code you do not get the exact same reply. This in my opinion does not show a general issue with inference. It only shows that unsloth and llama.cpp do not produce bit-for-bit identical results. So something very specific and fragile, like the exact reproduction of the ASCII art in the training data, breaks. But this does not mean that general inference suffers the same problem, where the distribution of good outputs is much wider and therefore more robust.

In the OP it was reported that the fine-tuned model did not answer "correctly" some of the questions that I assume were in the training data while maintaining the general style of the training data. This is what I would expect to happen if you were to, for example, simply add some noise to the inference. This is what happens in effect if you change the inference code, and also if you apply any quantization. I very much suspect that if you were to use any other inference backend you would run into this same issue. Ultimately I think this issue is fundamentally unfixable unless training code is added to llama.cpp, and even then you would only get exact reproductions of the training data with this particular inference backend.
@JohannesGaessler Thanks for your insight. However, I doubt this is an inference issue. The issue happens only with the GGUF-converted model, not with different inference methods. The style that it retains is completely random: sometimes it loses most of its style and reverts back to the base model, sometimes less. It seems to be some issue with the conversion to GGUF; as I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome. It is not about answering "correctly", but rather that it has overfit data and this should not happen. It's like taking the base instruct model from Meta's HF page directly and asking it who it is and who created it; it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it. There's a clear issue with the GGUF conversion; this is not a mere forgetting of one or two questions.
Which backends did you test?
It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code are what actually break exact reproductions of training data.
And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.
For inference: llama.cpp, ollama, LM Studio.
It shouldn't be about the file format, but in this case it seems that it is, based on the script that converts the model to the specific format, in this case GGUF. I'm yet to test other formats; I'm on it now, AWQ to start with.
That's not how QLoRA and LoRA fine tuning work. You don't need 100K H100 GPUs to fine tune a model to remember how to speak, what style to speak in, or what identity it has. This isn't a slight perturbation; in multiple cases it's a BIG difference, as if it had not even been trained, with only a slight perturbation towards the training data. I'm only presenting the issues, which have been verified by others who are testing it simultaneously. Feel free to test it out yourself and post your findings; speculation without any further testing does not lead anywhere forward, especially when your assumptions are incorrect in this case. It is not a slight deviation from the training data, it is pretty much huge deviations, sometimes more, sometimes less.
ollama and LM Studio both use llama.cpp internally, so all of these use the same inference code.
If you were starting from a base model I would agree. But if you start from an instruct tune you are going to get competing responses that the model is supposed to give. I think a LoRA is just not going to cover the entire parameter space that a full finetune has affected. And especially if Meta has used e.g. Dropout for their instruct tune (I think the LLaMA 3 research paper has still not been released) then the model is going to learn redundant representations of whatever behavior Meta wants. If you add some training on top you will be able to make the model give different responses. But I expect this to be fragile and to break when small amounts of random noise are added in the middle of the evaluation. You are going to get such random noise simply from changing the order of floating point operations, or from changing the data type of the KV cache (which is by default FP16, can be changed to FP32 via CLI args), or from using FP32 instead of BF16 accumulators for sums (this cannot be changed). This is what I meant by "slight perturbation". I'm not talking about the end result, I'm talking about the small changes in the middle which for complex numerical calculations can frequently lead to dramatically different end results.
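A trivial, self-contained illustration of the point about operation order (my example, not from the thread):

```python
# Floating-point addition is not associative, so merely reordering the same
# operations (as a different kernel, backend, or GPU reduction will) changes
# the rounded result; across thousands of layers these tiny differences compound.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))    # 0.6000000000000001 0.6
```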
Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.
@gabriel-peracio, could you run the same test again but with the CPU backend instead of the GPU backend, as slaren pointed out? Are the results the same?
@JohannesGaessler Ok, thanks. So anyway... @abc-nix, as we were discussing: meta-llama/Meta-Llama-3-8B-Instruct Reference:
Referenced file in quotation: Safetensors inference with Python and referenced code format:
With system token:
Without system token:
Converted GGUF model F16:
Without system:
With system:
I think that proves my point clearly.
@JohannesGaessler Yes, but even if the operations aren't exactly 1 to 1, the result should be similar within a small margin of error. I don't see it as such a bad idea to compare the output for "same" precision operations if the weights are almost identical, though maybe it should be in a Discussion instead of an Issue. @Sneakr, if you save the exact prompt used in the safetensors test into a variable (without the <|begin_of_text|> token) and use it as input for llama.cpp, do you get a different result?
(this is with bash). I usually don't try to prompt with a string of text directly to llama.cpp, as I have observed the issues Johannes mentioned (and the need for "--escape" that Georgi also pointed out for new-lines). I prefer using a prompt file. Also, because I switch between models to compare outputs for the same input, I don't want to deal with changing to the different system prompts, and have a better experience using the server. EDIT:
If you can summarize your point, others who don't want to read this long thread could profit from it.
@abc-nix I will continue investigating with fine tunes as you suggested earlier to get a better picture. This mostly shows up during fine tuning and inference with the fine tuned instruct model. The response differs hugely on the fine tuned models for me; I'm not sure if this is isolated to the instruct llama3 model as a solo issue with fine tuning this model. This might simply be llama3 and the instruct template and how it should be trained and inferenced, etc. I'm looking into it further, and will summarize any findings on the fine tunes.
Make sure you check the order of your previous post. I compared in a diff GUI app and the only ones that differ by one word are the "with system prompt" responses.
Yes, the ones without system prompt were identical, and the two with the system prompt only differed by one word. It looked like you used a different prompt for the safetensors one and the GGUF one too. There were newlines present in the safetensors one that were missing in the GGUF one. I'm not sure what the point was that you were trying to show.
@abc-nix Yeah, I also ran the GGUF on CPU btw, as someone suggested; will test it on GPU too.
Based on what? Even the differences in rounding error from atomic adds cause token sequences to diverge relatively quickly. You can easily confirm this for yourself by just loading a model with e.g. Huggingface Transformers in Ooba and regenerating the response for the same seed and prompt a couple of times.
It's just not conclusive evidence of anything and it doesn't help pin down any actual bugs either. The pipeline for inference essentially consists of tokenization, the model evaluation, and sampling. Sampling can be eliminated with
@abc-nix I'm not completely sure yet as I'm still testing; my testing is now more in line with the expected outputs. Slight wording differences here and there, as shown previously, are irrelevant and not a big issue at all. For now, it seems the main issue was the BOS token resolved here, which we discussed previously, as well as how the system tokens affect the output of the model. It's not about having a system message or not; it's the absence or presence of the system tokens in the input prompt that affects the output in major ways, depending on how the model was trained. Some clients using llama.cpp seem to remove the system tokens completely if no system message is present (ollama), and this finding shines light on an issue that could set a standard for how people fine tune their models as well as how they are inferenced by third-party repos and by those using llama.cpp. I have not seen anybody mention this issue anywhere previously, but I'm glad I found it by digging through this initial issue where the outputs differed, as well as shining light on the BOS tokens mentioned previously. I'm also glad that it led to the bfloat16 support that is now officially in llama.cpp as of the latest update.
If you know any paper shining light on these findings I've made, please enlighten me with some links I can look through; that would be interesting. As for how llama-3 instruct models should be prompted (not to speak of other models), this clarifies that the output will be different depending on the presence or absence of system tokens, which is a major thing, as the 8b instruct model on the META account on HF alone has over 1.3 million downloads so far.
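To make the system-token point concrete, here is a small sketch (mine, not from the thread) that renders the same user turn with and without a system message through the Llama 3 Instruct chat template; the prompts the model sees differ by an entire system header block:

```python
# Sketch: the Llama 3 Instruct chat template only emits the
# <|start_header_id|>system<|end_header_id|> ... <|eot_id|> block when a
# system message is present, so the token prefix the model sees differs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

user_only = [{"role": "user", "content": "Who created you?"}]
with_system = [{"role": "system", "content": "You are a helpful assistant."}] + user_only

a = tokenizer.apply_chat_template(user_only, tokenize=False, add_generation_prompt=True)
b = tokenizer.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True)
print(a)
print(b)  # same user turn, but prefixed with the system header tokens
```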
Nice, but am I missing something or is BF16 still not supported in convert-hf-to-gguf.py?
I think there's an open PR for that, but FP32 is a superset of BF16, so HF BF16 -> GGUF FP32 -> GGUF BF16 should be lossless.
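A quick way to sanity-check that claim (a sketch using torch): every BF16 value has an exact FP32 representation, so the round trip reproduces the original bits:

```python
# BF16 -> FP32 -> BF16 round trip: FP32 keeps BF16's exponent range and only
# extends the mantissa, so casting up and back down is exact.
import torch

x = torch.randn(1_000_000, dtype=torch.bfloat16)
roundtrip = x.to(torch.float32).to(torch.bfloat16)
print(torch.equal(x, roundtrip))  # True
```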
Hopefully that PR gets merged, because while yes, I believe the conversion is lossless, it's also a significant amount of wear and tear on our SSDs.
@oldmanjk https://twitter.com/rohanpaul_ai/status/1789060851858543093 I'm glad things got fixed finally :) happy days
Oh nice to see people enjoy the fixed uploads :)
Unfortunately, I don't think they're fully fixed yet. I'm waiting to hear back from Meta.
Could you please clarify what exactly the issue is? I believe everything is solved.
I appreciate the uploads, but I'm not convinced Meta has actually completely fixed it yet. I'm waiting to hear back.
Yeah, my bad, I think I may be misunderstanding something. If it does turn out that a fix is necessary, feel free to ping me here and I will re-upload the models with fixes.
I'm sorry. I'm having a very difficult time trying to explain my thought process (I've written and deleted a lot already) and I think putting certain things out there might actually hurt the odds of us ever being able to be confident in a solution. Here's what I can/will say right now:
I understand this isn't a good explanation and I don't expect everyone to accept it, but my current recommendation is to wait.
I no longer know what's happening and would like to excuse myself from this drama 👍 best of luck
I only have a single Reddit account, namely this one. I have never made a single post on Reddit using any other account. In other news, I am not aware of any bug fixes or improvements that came out of this whole thing. BF16 support was completely unrelated and only happened to get merged at the same time, and I also found no statistically significant advantage over FP16 in #7150. I made a PR to fix the double BOS issue, #7107, but Georgi is of the opinion that this is user error, so it seems that it won't get merged. And I'm not aware of any other fixes or changes that are even remotely related to this issue thread. So I stand by what I actually said on Reddit: this whole thing has been a waste of time.
@JohannesGaessler I don't see where I mentioned you at all. I'm mighty confused right now.... Are you saying... that you... took things personally? I too no longer know what's happening and would like to excuse myself from this drama, best of luck
Epic troll, but the only devs that commented on this PR are slaren, Georgi, and me. Who else would you be accusing of sockpuppeting?
I was only clarifying what I did and did not say on Reddit.
If the best you can do is word games, you've already lost.
@slaren @ggerganov ding ding ding, we have a winner here guys. Crown this @JohannesGaessler; I "lost" and he "won".
I'm running Unsloth to fine tune (LoRA) the Instruct model llama3-8b.
1: I merge the model with the LoRA adapter into safetensors.
2: Running inference in Python, both with the merged model directly and with the Unsloth-loaded model with the adapter on top of it, produces correct outputs as per the fine tune.
Bug:
GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine tune data, while still maintaining most of it.
I can ask it who it is, who created it, etc., and it responds Llama and Meta as usual, but it incorporates the fine tuned speech style and humor into the response. This is not the case for my fine tuned model.
1: I tried merging the LoRA adapter with the original GGUF (non-fine tuned) using llama.cpp, with the same results.
2: I tried running the server on the original GGUF (non-fine tuned) using the llama.cpp server with the adapter loaded via the server terminal command - same results.
It seems that GGUF conversion is losing fine tuned data randomly during conversion.
If this is the case, all GGUF converts of the fine tuned models are basically out the window. And the question is how much the non-fine tuned models are affected by this.
I've tried F16, Q8, same issues.
This is not a quantization issue, as I get the exact same results running FP16 as well as 4-bit in Python running the HF loader or Unsloth; both work fine as mentioned.
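One way to take sampling out of such comparisons is to diff greedy outputs of the merged safetensors model and its GGUF conversion on the same prompt. A sketch under assumptions: the paths are placeholders, and the GGUF side uses the llama-cpp-python bindings:

```python
# Sketch: compare greedy (no sampling) completions from the merged HF model and
# from its GGUF conversion, so any difference comes from the conversion or the
# inference path rather than from random sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama  # llama-cpp-python bindings (assumed installed)

messages = [{"role": "user", "content": "Who created you?"}]

# Merged safetensors model through Hugging Face transformers, greedy decoding.
tok = AutoTokenizer.from_pretrained("./merged-model")
hf_model = AutoModelForCausalLM.from_pretrained("./merged-model", torch_dtype=torch.bfloat16)
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = hf_model.generate(ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

# GGUF conversion of the same merged model, also greedy (temperature 0), on CPU.
llm = Llama(model_path="./merged-model-f16.gguf", n_ctx=2048, n_gpu_layers=0)
res = llm.create_chat_completion(messages=messages, temperature=0.0, max_tokens=64)
print(res["choices"][0]["message"]["content"])
```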