Llama3 GGUF conversion with merged LORA Adapter seems to lose training data randomly #7062

Closed
Sneakr opened this issue May 3, 2024 · 147 comments

Comments

@Sneakr

Sneakr commented May 3, 2024

I'm using Unsloth to fine-tune a LoRA on the llama3-8b Instruct model.

1: I merge the model with the LORA adapter into safetensors
2: Running inference in python both with the merged model directly or the unsloth loaded model with the adapter on top of it produces correct outputs as per the fine tune

Bug:
GGUF conversion of the merged model does not produce the same output. The GGUF has lost some of its fine tune data, while still maintaining most of it.

I can ask it who it is, who created it, etc., and it responds with Llama and Meta as usual, but it incorporates the fine-tuned speech style and humor into the response. The fine-tuned model run in Python does not do this.

1: I tried merging the LORA adapter with the original GGUF (non-fine tuned) using llama.cpp, the same results.
2: I tried running the server on the original GGUF (non-fine tuned) using llama.cpp server and the adapter loaded into the server terminal command - same results.

It seems that GGUF conversion is losing fine-tuned data randomly during conversion.

If this is the case, all GGUF converts of the fine tuned models are basically out the window. And the question is how much the non-fine tuned models are affected by this.

I've tried F16, Q8, same issues.

This is not a quantization issue, as I get the exact same results running FP16 as well as 4-bit in Python with the HF loader or Unsloth; both work fine as mentioned.

@Sneakr
Author

Sneakr commented May 3, 2024

[Image: fientuneerror]

Adding this here as a reference. The model should not remember its creator Meta nor the Llama name. It also lost much of the fine-tuned information that was imposed upon it, while it still managed to retain the humor and the speaking style.

@slaren
Collaborator

slaren commented May 3, 2024

You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.

@Sneakr
Author

Sneakr commented May 3, 2024

@slaren Thank you, I will try that and update this thread with the result.

@Sneakr
Author

Sneakr commented May 3, 2024

@slaren I can't seem to get it to work with other methods either. Do you by chance have a link to a guide or documentation that demonstrates how to merge it using PyTorch? I'm getting the same output regardless of how I save it.

Edit: To clarify, this happens only after conversion to GGUF. When merging and loading the safetensors for inference, everything works as expected. Only the GGUF conversion (regardless of quantization or output type, e.g. F16) behaves like this.

@Sneakr
Author

Sneakr commented May 4, 2024

@slaren I misunderstood you completely, what you wrote is exactly what I've done. I did not use the llama.cpp lora conversion script. I just merged the lora into the model using https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload

So far so good.

Then I convert the previously merged model into GGUF format with llama.cpp. That breaks the model and the LoRA fine-tune: it does not produce the same outputs. The difference seems to be completely random.
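
For readers unfamiliar with what "merging the LoRA into the model" means numerically, here is a minimal pure-Python sketch. It assumes the standard LoRA update W' = W + (alpha / r) * B @ A and uses toy matrices; the helper names `matmul` and `merge_lora` are made up for illustration and are not PEFT's API.

```python
# Minimal LoRA merge sketch in pure Python (no PyTorch or PEFT needed).
# Assumption: the standard LoRA update W' = W + (alpha / r) * B @ A.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, a, b, alpha, r):
    """Fold the low-rank update into the base weight matrix."""
    scale = alpha / r
    delta = matmul(b, a)  # shape: out_features x in_features
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

With the hyperparameters reported later in this thread (r = 8, lora_alpha = 16) the scale factor would likewise be 2. After merging, the adapter no longer exists as a separate object, so any remaining difference between backends has to come from the merged weights or from how they are evaluated.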

@slaren
Collaborator

slaren commented May 4, 2024

Does it work with the CPU backend? (if using a version of llama.cpp built with CUDA, run with CUDA_VISIBLE_DEVICES= to disable GPU usage).

@Sneakr
Author

Sneakr commented May 4, 2024

@slaren That's it! With the GPU it ruined the LoRA; with the CPU it works as intended! GREAT!

@slaren
Collaborator

slaren commented May 4, 2024

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.
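
To illustrate the range argument above, the following sketch uses only the standard library. It relies on `struct`'s half-precision format `"e"` for float16 and emulates bfloat16 by truncating a float32 to its top 16 bits (an assumption made for simplicity; real hardware usually rounds to nearest). The helper names are hypothetical.

```python
import struct

def to_float16(x):
    """Round-trip x through IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def fits_float16(x):
    """True if x is within the finite float16 range."""
    try:
        to_float16(x)
        return True
    except OverflowError:
        return False

print(fits_float16(65504.0))  # largest finite float16 value
print(fits_float16(70000.0))  # beyond the float16 range
print(to_bfloat16(70000.0))   # still representable in bfloat16, with coarser precision
```

The trade-off is visible here: bfloat16 keeps the float32 exponent range but has fewer mantissa bits, while float16 has more mantissa bits but overflows above 65504.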

@Sneakr
Author

Sneakr commented May 4, 2024

@slaren Great! Thanks, been scratching my head at this for weeks. Much appreciated!!!

@gamercoder153

@Sneakr whats the solution?

@Sneakr
Author

Sneakr commented May 4, 2024

@gamercoder153 add this before the conversion command: CUDA_VISIBLE_DEVICES=0 to run the conversion on the CPU. Edit: It's a temporary solution that does not fully fix the bfloat issue, but it works at least.

@oldgithubman

You should merge the model with pytorch and then convert the merged model to gguf. The lora conversion script has issues exporting tensors that are permuted during model conversion, and it should probably be removed.

Beginning to notice a pattern around here...

@oldgithubman

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.

Is this more evidence BF16 should be added to the convert scripts? I've been converting BF16 to float32. Does that mitigate these issues? Of course it's not ideal, but if it works, I'll continue doing it until BF16 is natively available

@Sneakr
Author

Sneakr commented May 5, 2024

Update:
Although the CPU conversion worked better, after experimenting further it still loses valuable data from the fine-tunes.

@gamercoder153

I hope they solve the issue somehow

@JohannesGaessler
Collaborator

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.

I think if that was the case then the output would just be NaN incoherent garbage like it would be with Phi-2. My guess is that this is a difference in rounding error, not necessarily even from the precision of the weights but possibly from other operations as well. In any case, an insufficient numerical range could be tested by using FP32 instead of FP16 cuBLAS matrix multiplication.

@Sneakr
Author

Sneakr commented May 5, 2024

@slaren further testing, I exported to GGUF :
CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32

Edit: Fine-tuning was done in bfloat16; I need to test float16 next. Running with CUDA_VISIBLE_DEVICES="" (empty string) gives the same outcome.

This is the result of loading the model directly from safetensors and then loading the GGUF in LM Studio. The model is fine-tuned with a LoRA adapter.

[Image: ggufvssafetensors]

@gabriel-peracio

gabriel-peracio commented May 5, 2024

Hey everyone, I managed to make a minimal reproduction of this issue with unsloth. It uses a single sample.

Inference on Unsloth (notebook) - Video
Code_yPjvh8Fp0U.mp4
Inference on llama.cpp - Video
WindowsTerminal_VnQLGD4gka.mp4
Here is my training data (single sample)
{
  "conversations": [
    {
      "role": "user",
      "content": "!!llama.cpp!!"
    },
    {
      "role": "assistant",
      "content": "................,gNp.y@N,\n...............gP..@.@..\"B,\n..............@C...@.@-...$p\n.............]@....@.@.....@\n.............]@....@.@.....@\n..............@,...$U@....]P\n........,gg@P*\"\"....-....#@\n...,g@P*'.................*@,\n.gP\"........%Np@P..........'@\n]@..........................@\n'B,.........................]@\n..\"*MMNBgg,,.................]C\n...........\"`@`..............$P\n............@P...............Bp\n............@-................@.\n............]@................@P\n...........]@................\"@\n...........@-.................]@\n...........]@.................,@\n........../@-................'@.\n.........]@....................Bp\n.........$P....................]K\n.........,@C..................J@\n......../@`.....................@\n.......]@.......................$P........,,,,,,,,\n.......]L.......................\"**$P\"\"\"\"```\"\"\"\"\"``\"\"*\"*\"*PPRNBggg,\n........$N.........................]$...........................][-\"*NNg,\n.......gF...........................j`'``'```''\"'\"\"\"^`\"\"\"\"*\"\"\"\"\"]@......\"Ng\n......]@............................@.,@*b...,@\"B,...gP%,...,@b,.@........'%W\n.......@,...........................@@\"...]g@C...\"NgP`..\"B,@C..\"N@.......@..]@\n.......]K...........................@g,,,gggggggggggggggggggggggg@.......]P..]@\n.......@............................$P...........................$P......]@...@-\n.......@............................$P....................,,,,,,,$P.......@...$P\n.......\"Bg..........................$P\"```]\"```\"[`\"''',..--]g-.-.@P.......@...@\n........j@..........................]PBggN`%w,gP\"%g,gP\"%wg@\".\"NNP$P.......@..@C\n........][..........................]@.......-..........,,,,,gggg@........@g@'\n........'@...........................`^\"*T\"\"\"\"\"\"\"\"**\"\"*\"`'.`..............@\n.........\"Bw,.............................................................@\n............@.............................................................$
\n............]@.....................................g,,.,,@Ngg@P@..........$\n.............\"Ngg,,..............gg,..,ggggg@P*RNP\"]@`\"`....]P.$P.........@\n................-]@.........@BB@P\"-'\"`-@............@.......]P.]@........][\n..................@........]@..@.......@-...........@.......$P.]@........]P\n..................@-.......][..@.......@............@P......@P..@........@\n..................$P.......]P..@.......@............$P......@...@........@\n..................$P.......@`..@......]@............$P......@...@.......]@\n..................]P.......@...@......][............$P......@...@.......]P\n..................][......]@...@......@P............]P.....]@...@.......@-\n..................][......$P..]@......@.............]P.....]P...@-......@\n..................][......@...]@.....]@.............$P.....@P...@......]P\n..................][.....]@...]@.....@P.............$P.....@....@......@-\n..................]@.....@P...][.....@..............$P....]@....@.....]@\n..................][.....@....]@....]P..............@-....@P....@.....@\n..................][....$P....]P....@...............@....]@....]@....]@\n..................]@ggg@P.....]BNBNP`...............*NNN**......Bgg@N\""
    }
  ]
}
And the parameters used during training

Notebook: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
max_seq_length = 1024
dtype = torch.bfloat16
load_in_4bit = True
model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit"
r = 8
lora_alpha = 16
num_train_epochs=130
per_device_train_batch_size = 1
gradient_accumulation_steps = 1

You will also need to change the dataloader to pull from JSON
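
As a rough idea of what "pull from JSON" could look like, here is a standard-library sketch that parses a sample in the shape shown above and returns its turns. The function name `load_conversations` and the shortened example string are assumptions for illustration; the actual notebook may load data differently.

```python
import json

def load_conversations(json_text):
    """Parse one training sample and return its (role, content) turns."""
    sample = json.loads(json_text)
    turns = []
    for message in sample["conversations"]:
        role, content = message["role"], message["content"]
        if role not in {"system", "user", "assistant"}:
            raise ValueError(f"unexpected role: {role!r}")
        turns.append((role, content))
    return turns

# Shortened stand-in for the sample above; the real assistant content is the ASCII art.
example = (
    '{"conversations": ['
    '{"role": "user", "content": "!!llama.cpp!!"},'
    '{"role": "assistant", "content": "(ASCII art)"}]}'
)
turns = load_conversations(example)
print(turns)
```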

@salieri

salieri commented May 5, 2024

@slaren further testing, I exported to GGUF : CUDA_VISIBLE_DEVICES=0 python ./llama.cpp/convert-hf-to-gguf.py ./xmerge/NewModel --outfile ./xmerge/NewModel/NewModel_F32.gguf --outtype f32

CUDA_VISIBLE_DEVICES=0 does not disable GPUs; it limits you to the first GPU by device index.

Have you tried this with setting CUDA_VISIBLE_DEVICES= (empty string)?
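
When launching a tool from Python rather than from a shell, the same distinction applies. The sketch below builds an environment in which CUDA sees no devices; `cpu_only_env` is a hypothetical helper name, and passing the result to a child process is left as a comment.

```python
import os

def cpu_only_env(base_env=None):
    """Return a copy of the environment in which CUDA sees no devices."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty string hides every GPU; "0" would instead select the first GPU.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env

env = cpu_only_env({"PATH": "/usr/bin", "CUDA_VISIBLE_DEVICES": "0"})
print(repr(env["CUDA_VISIBLE_DEVICES"]))
# A child process could then be started with: subprocess.run(cmd, env=cpu_only_env())
```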

@Sneakr
Author

Sneakr commented May 5, 2024

@salieri Yes, sorry, I pasted the wrong command here. I ran it with "" (empty string) and got the same outcome. Thanks!

@danielhanchen
Contributor

Currently testing @gabriel-peracio's code as well :)

@gabriel-peracio

gabriel-peracio commented May 5, 2024

I repeated the training in FP16 and the results are even worse (in the video, I also disable flash attention in llama.cpp, -fa):

Training in FP16 (instead of BF16)
WindowsTerminal_tQ9Y9BbhOy.mp4

@JohannesGaessler
Collaborator

JohannesGaessler commented May 5, 2024

My take: the videos show that you can overfit a model to produce a single given reply. If you use the exact same code for training and for inference you get the overfit reply. If you use different code you do not get the exact same reply. This in my opinion does not show a general issue with inference. It only shows that unsloth and llama.cpp do not produce bit-for-bit identical results. So something very specific and fragile like the exact reproduction of the ASCII art in the training data breaks. But this does not mean that general inference suffers from the same problem, since there the distribution of good outputs is much wider and therefore more robust.

In the OP it was reported that the fine-tuned model did not answer "correctly" some of the questions that I assume were in the training data while maintaining the general style of the training data. This is what I would expect to happen if you were to for example simply add some noise to the inference. This is what happens in effect if you change the inference code and also if you apply any quantization. I very much suspect that if you were to use any other inference backend you would run into this same issue.

Ultimately I think this issue is fundamentally unfixable unless training code is added to llama.cpp and even then you would only get exact reproductions of the training data with this particular inference backend.

@Sneakr
Author

Sneakr commented May 5, 2024

@JohannesGaessler Thanks for your insight. However, I doubt this is an inference issue. The issue happens only with the GGUF converted model, not different inference methods. The style that it retains is completely random, sometimes it loses most of its style and reverts back to the base model , sometimes less.

It seems to be some issue with the conversion to GGUF. As I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.

It is not about answering "correctly", but rather that the model was overfit on this data, so this should not happen. It's like taking the base instruct model directly from the Meta HF page and asking it who it is and who created it: it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.

There's a clear issue with the GGUF conversion; this is not a mere forgetting of one or two questions.

@JohannesGaessler
Collaborator

The issue happens only with the GGUF converted model, not different inference methods. The style that it retains is completely random, sometimes it loses most of its style and reverts back to the base model , sometimes less.

Which backends did you test?

It seems to be some issue with the conversion to GGUF. As I'm converting it to f32 on the CPU, there shouldn't be any precision or quantization loss that would affect the outcome.

It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.

It is not about answering "correctly", but rather that the model was overfit on this data, so this should not happen. It's like taking the base instruct model directly from the Meta HF page and asking it who it is and who created it: it will always hint at Meta and LLAMA and that it is an AI, because this has been trained into it.

And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.

@Sneakr
Author

Sneakr commented May 5, 2024

Which backends did you test?

For inference, llama.cpp, ollama, lm studio

It's not about the file format, it's about the inference code. My suspicion is that the random differences caused by a difference in inference code is what actually breaks exact reproductions of training data.

It shouldn't be about the file format, but in this case it seems to be, based on the script that converts the model to the specific format (here GGUF). I have yet to test other formats; I'm on it now, starting with AWQ.

And presumably Meta has thrown a lot more compute and training data at their instruct model than you did for your LoRA. My expectation therefore would be that given even a slight perturbation of the results the model reverts back to the Meta finetune behavior.

That's not how QLORA and LORA fine-tuning work. You don't need 100K H100 GPUs for fine-tuning a model to remember how to speak, what style to speak in, or what identity it has.

This isn't a slight perturbation. In multiple cases it's a BIG difference: it's as if the model had not even been trained, with only a slight perturbation towards the training data.

I'm only presenting issues that have been verified by others who are testing simultaneously. Feel free to test it out yourself and post your findings. Speculation without further testing does not lead anywhere, especially when your assumptions are incorrect: in this case it is not a slight deviation from the training data but a huge one, sometimes more, sometimes less.

@JohannesGaessler
Collaborator

For inference, llama.cpp, ollama, lm studio

ollama and LMStudio internally both use llama.cpp so all of these use the same inference code.

That's not how QLORA and LORA fine-tuning work. You don't need 100K H100 GPUs for fine-tuning a model to remember how to speak, what style to speak in, or what identity it has.

If you were starting from a base model I would agree. But if you start from an instruct tune you are going to get competing responses that the model is supposed to give. I think a LoRA is just not going to cover the entire parameter space that a full finetune has affected. And especially if Meta has used e.g. Dropout for their instruct tune (I think the LLaMA 3 research paper has still not been released) then the model is going to learn redundant representations of whatever behavior Meta wants. If you add some training on top you will be able to make the model give different responses. But I expect this to be fragile and to break when small amounts of random noise are added in the middle of the evaluation. You are going to get such random noise simply from changing the order of floating point operations, or from changing the data type of the KV cache (which is by default FP16, can be changed to FP32 via CLI args), or from using FP32 instead of BF16 accumulators for sums (this cannot be changed). This is what I meant by "slight perturbation". I'm not talking about the end result, I'm talking about the small changes in the middle which for complex numerical calculations can frequently lead to dramatically different end results.
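
The point about operation order can be reproduced without any ML framework. The sketch below sums the same numbers in two different orders using an emulated half-precision accumulator (rounding through `struct`'s `"e"` format after every addition). It is only an illustration of floating-point non-associativity, not a model of what llama.cpp or cuBLAS actually do internally.

```python
import struct

def round_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_fp16(values):
    """Sum with a half-precision accumulator, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fp16(acc + round_fp16(v))
    return acc

values = [2048.0] + [1.0] * 8
forward = sum_fp16(values)                    # large value first: each +1 is rounded away
backward = sum_fp16(list(reversed(values)))   # small values first: they accumulate to 8
print(forward, backward)  # 2048.0 2056.0
```

At magnitude 2048 the spacing between adjacent float16 values is 2, so adding 1 repeatedly never moves the accumulator, while adding the ones together first preserves their contribution.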

I'm only presenting issues that have been verified by others who are testing simultaneously. Feel free to test it out yourself and post your findings. Speculation without further testing does not lead anywhere, especially when your assumptions are incorrect: in this case it is not a slight deviation from the training data but a huge one, sometimes more, sometimes less.

Sorry, but I disagree. I don't need to present any evidence myself in order to express that I disagree with the conclusions drawn from the evidence that other people present.

@abc-nix

abc-nix commented May 5, 2024

@gabriel-peracio, could you run again the same test but with CPU backend and not GPU backend, as slaren pointed out? Are the results the same?

@Sneakr
Author

Sneakr commented May 8, 2024

@JohannesGaessler Ok thanks .

So anyway... @abc-nix as we were discussing.

meta-llama/Meta-Llama-3-8B-Instruct

Reference:
https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

Code to generate this prompt format can be found here.

Referenced file in quotation:
https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L203

Safetensors inference with python and referenced code format:

With system token:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>

How to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think there may be a bit of confusion here!

You can't download a car, as it's a physical object that exists in the real world and can't be transferred digitally. Cars are manufactured and sold through dealerships, online marketplaces, or private sales, but they can't be downloaded like a software or file.

If you're looking to purchase a car, you'll need to research and find a reputable seller, check the vehicle's history and condition, and negotiate a price. Once you've agreed on a price, you'll need to finalize the purchase and arrange for transportation or pickup.

If you're looking for a digital representation of a car, such as a 3D model or a video game asset, you might be able to find it through online marketplaces or game development platforms. However, this would not be a physical car that you could drive or own.

Let me know if you have any other questions or if there's anything else I can help with!<|eot_id|>

Without system token:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

How to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think there may be a bit of confusion here!

You can't actually download a car, as it's a physical object that exists in the real world and can't be transferred digitally. Cars are manufactured and sold through dealerships, online marketplaces, or private sales, but they can't be downloaded like a software or file.

If you're looking to purchase a car, you'll need to research and find a reputable seller, check the car's specifications and condition, and then make a purchase through a traditional sales process.

If you're looking for information on how to download car-related files or software, such as car manuals, repair guides, or GPS navigation systems, I'd be happy to help with that! Just let me know what you're looking for.<|eot_id|>

Converted GGUF model F16:

Without system:

./llama.cpp/main -m ./xmerge/fingerprint2/llama3_f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors \
     --escape --ctx-size 8192 \
     -p '<|start_header_id|>user<|end_header_id|>\n\nHow to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 1024, n_keep = 0


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

How to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think there may be a bit of confusion here!

You can't actually download a car, as it's a physical object that exists in the real world and can't be transferred digitally. Cars are manufactured and sold through dealerships, online marketplaces, or private sales, but they can't be downloaded like a software or file.

If you're looking to purchase a car, you'll need to research and find a reputable seller, check the car's specifications and condition, and then make a purchase through a traditional sales process.

If you're looking for information on how to download car-related files or software, such as car manuals, repair guides, or GPS navigation systems, I'd be happy to help with that! Just let me know what you're looking for.<|eot_id|> [end of text]

llama_print_timings:        load time =     786.94 ms
llama_print_timings:      sample time =      12.02 ms /   155 runs   (    0.08 ms per token, 12895.17 tokens per second)
llama_print_timings: prompt eval time =     485.67 ms /    16 tokens (   30.35 ms per token,    32.94 tokens per second)
llama_print_timings:        eval time =   40017.96 ms /   154 runs   (  259.86 ms per token,     3.85 tokens per second)
llama_print_timings:       total time =   40641.28 ms /   170 tokens
Log end

With system:

./llama.cpp/main -m ./xmerge/fingerprint2/llama3_f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors --escape --ctx-size 8192 -p '<|start_header_id|>system<|end_header_id|><|start_header_id|>user<|end_header_id|>\n\nHow to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 1024, n_keep = 0


<|begin_of_text|><|start_header_id|>system<|end_header_id|><|start_header_id|>user<|end_header_id|>

How to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think there may be a bit of confusion here!

You can't download a car, as it's a physical object that exists in the real world and can't be transferred digitally. Cars are manufactured and sold through dealerships, online marketplaces, or private sales, but they can't be downloaded like a software or digital file.

If you're looking to purchase a car, you'll need to research and find a reputable seller, check the vehicle's history and condition, and negotiate a price. Once you've agreed on a price, you'll need to finalize the purchase and arrange for transportation or pickup.

If you're looking for a digital representation of a car, such as a 3D model or a video game asset, you might be able to find it through online marketplaces or game development platforms. However, this would not be a physical car that you could drive or own.

Let me know if you have any other questions or if there's anything else I can help with!<|eot_id|> [end of text]

llama_print_timings:        load time =     807.70 ms
llama_print_timings:      sample time =      15.48 ms /   200 runs   (    0.08 ms per token, 12918.23 tokens per second)
llama_print_timings: prompt eval time =     488.70 ms /    19 tokens (   25.72 ms per token,    38.88 tokens per second)
llama_print_timings:        eval time =   51897.72 ms /   199 runs   (  260.79 ms per token,     3.83 tokens per second)
llama_print_timings:       total time =   52569.36 ms /   218 tokens
Log end

I think that proves my point clearly.

@abc-nix

abc-nix commented May 8, 2024

@JohannesGaessler Yes, but even if the operations aren't exactly 1 to 1, the result should be similar with a small margin of error. I don't see it as such a bad idea to compare the output for "same" precision operations if the weights are almost identical, though maybe it should be in a Discussion instead of an Issue.

@Sneakr, If you save the exact prompt used in the safetensors test into a variable (without the <|begin_of_text|> token) and use it as input for llama.cpp, do you get a different result?

PROMPT="<|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>

How to download a car?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"

./llama.cpp/main -m ./xmerge/fingerprint2/llama3_f16.gguf -n 1024 --temp 0.0 --verbose-prompt --check-tensors --ctx-size 8192 -p "$PROMPT"

(this is with bash).

I usually don't try to prompt with a string of text directly to llama.cpp, as I have observed the issues Johannes mentioned (and the need for "--escape" that Georgi also pointed out for new-lines). I prefer using a prompt file. Also, because I switch between models to compare outputs for the same input, I don't want to deal with changing to the different system prompts, and have a better experience using the server.
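
The newline problem mentioned here comes from the fact that a shell single-quoted `\n` is two characters (backslash and n), not a line break. The sketch below shows the difference; `process_escapes` is a simplified, hypothetical stand-in for what an escape-processing option does and is not llama.cpp's implementation.

```python
def process_escapes(text):
    """Turn the two-character sequences backslash-n and backslash-t into control characters."""
    return text.replace("\\n", "\n").replace("\\t", "\t")

# A raw string mimics what the program receives from a single-quoted shell argument.
raw = r"<|start_header_id|>user<|end_header_id|>\n\nHow to download a car?<|eot_id|>"
cooked = process_escapes(raw)

print("\n" in raw)         # False: only literal backslash-n pairs are present
print(cooked.count("\n"))  # 2: real newlines after processing
```

Without this processing, the model sees different tokens from the ones used in training, which is one way the "same" prompt can produce different output.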

EDIT:

I think that proves my point clearly.

If you can summarize your point, others who don't want to read this long thread could profit from it.

@Sneakr
Author

Sneakr commented May 8, 2024

@abc-nix I will continue investigating with fine-tunes, as you suggested earlier, to get a better picture.

This shows up mostly during fine-tuning and inference with the fine-tuned instruct model. The responses differ hugely on the fine-tuned models for me. I'm not sure whether this is isolated to fine-tuning the llama3 instruct model. It might simply come down to llama3, the instruct template, and how the model should be trained and run for inference.

I'm looking into it further, and will summarize any findings on the fine tunes.

@abc-nix

abc-nix commented May 8, 2024

Make sure you check the order of your previous post. I compared them in a diff GUI app, and the only responses that differ (by one word) are the "with system prompt" ones.

@adowning12

adowning12 commented May 8, 2024

Yes, the ones without system prompt were identical, and the two with the system prompt only differed by one word. It looked like you used a different prompt for the safetensors one and the GGUF one too. There were newlines present in the safetensors one that were missing in the GGUF one. I'm not sure what the point was that you were trying to show.
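
The comparison described above can also be done programmatically. This sketch uses `difflib` from the standard library to list word-level differences between two outputs; the two strings are the differing sentence fragments from the "with system" responses earlier in this thread, and `word_diff` is a hypothetical helper name.

```python
import difflib

def word_diff(a, b):
    """Return (operation, words_from_a, words_from_b) for every non-matching span."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    return [(op, wa[i1:i2], wb[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

safetensors_out = "they can't be downloaded like a software or file."
gguf_out = "they can't be downloaded like a software or digital file."
changes = word_diff(safetensors_out, gguf_out)
print(changes)  # [('insert', [], ['digital'])]
```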

@Sneakr
Author

Sneakr commented May 8, 2024

@abc-nix Yeah, I also ran the GGUF on CPU btw, as someone suggested. I will test it on GPU too.

@JohannesGaessler
Collaborator

JohannesGaessler commented May 8, 2024

Yes, but even if the operations aren't exactly 1 to 1, the result should be similar with a small margin of error.

Based on what? Even the differences in rounding error from atomic adds cause token sequences to diverge relatively quickly. You can easily confirm this for yourself by just loading a model with e.g. Huggingface Transformers in Ooba and regenerating the response for the same seed and prompt a couple of times.

I don't see it as such a bad idea to compare the output for "same" precision operations if the weights are almost identical, though maybe it should be in a Discussion instead of an Issue.

It's just not conclusive evidence of anything and it doesn't help pin down any actual bugs either. The pipeline for inference essentially consists of tokenization, the model evaluation, and sampling. Sampling can be eliminated with --temp 0.0f. Differences in model evaluation can be checked by e.g. comparing the difference of llama.cpp_HF in Ooba vs. the inherent nondeterminism of HF Transformers. I did not find anything out of the ordinary and as long as the finetuning does not change the shape of the tensors these findings generalize to all finetunes beyond numerical issues. And finally differences in tokenization can be checked by just comparing the token sequences going into the model. As far as I'm concerned that is a far more productive way to debug potential issues.
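
As a sketch of the last suggestion, the helper below finds the first position at which two token-id sequences differ. The token ids are made up for illustration; in practice they would come from each backend's tokenizer output for the same prompt.

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token-id sequences differ, or None if identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None

# Hypothetical token ids, for illustration only.
reference = [1, 10, 20, 30, 40]
candidate = [1, 1, 10, 20, 30, 40]  # e.g. an extra leading token

print(first_divergence(reference, candidate))  # 1
print(first_divergence(reference, reference))  # None
```

If the sequences already differ before the model is evaluated, the discrepancy is in prompt construction or tokenization rather than in the weights.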

@Sneakr
Author

Sneakr commented May 8, 2024

@abc-nix I'm not completely sure yet as I'm still testing. My testing is now more in line with the expected outputs; slight wording differences here and there, as shown previously, are irrelevant and not a big issue at all. For now, it seems the main issues were the BOS token, which we discussed previously and which has been resolved, and how the system tokens affect the output of the model.

It's not about having a system message or not; it's the absence or presence of the system tokens in the input prompt that affects the output in major ways, depending on how the model was trained. Some clients using llama.cpp (ollama) seem to remove the system tokens completely if no system message is present. This finding sheds light on that issue and could set a standard for how people fine-tune their models and how those models are run for inference by third-party projects and others using llama.cpp.
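
To make the distinction concrete, here is a small sketch that renders a single-turn prompt in the Llama 3 Instruct format shown earlier in this thread, with or without an (empty) system block. The function `build_prompt` and its `always_emit_system_block` flag are hypothetical names for illustration, not part of any library.

```python
def build_prompt(user_message, system_message=None, always_emit_system_block=False):
    """Render a single-turn Llama 3 Instruct prompt.

    If always_emit_system_block is True, an empty system block is emitted
    even when no system message is given.
    """
    def block(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    parts = ["<|begin_of_text|>"]
    if system_message is not None or always_emit_system_block:
        parts.append(block("system", system_message or ""))
    parts.append(block("user", user_message))
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

with_system = build_prompt("How to download a car?", always_emit_system_block=True)
without_system = build_prompt("How to download a car?")
print(with_system == without_system)  # False: the model sees different token sequences
```

A model fine-tuned with one of these layouts and queried with the other receives input it was not trained on, which would be consistent with the behavior described above.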

I have not seen anybody mention this issue anywhere previously, but I'm glad I found it by digging through this initial issue where the outputs differed, as well as shining light on the BOS tokens previously mentioned.

I'm glad that it led to the bfloat16 support that has now been officially supported in llama.cpp as of the latest update also.

Then the cause may be that the finetune results in some values that cannot be represented in a float16. Maybe it would be a good idea to use BF16 instead in the cuBLAS mat mul.

If you know of any paper that sheds light on these findings, please share some links; I'd be interested to look through them. As for how Llama 3 instruct models should be prompted (not to speak of other models), this clarifies that the output will differ depending on the presence or absence of system tokens, which is a major thing, as the 8B instruct model on Meta's HF account alone has over 1.3 million downloads so far.

@oldgithubman

I'm glad that it led to the bfloat16 support that has now been officially supported in llama.cpp as of the latest update also.

Nice, but am I missing something or is BF16 still not supported in convert-hf-to-gguf.py?

@JohannesGaessler
Collaborator

I think there's an open PR for that but FP32 is a superset of BF16 so HF BF16 -> GGUF FP32 -> GGUF BF16 should be lossless.
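The losslessness claim can be checked at the bit level: a BF16 value is the upper 16 bits of an FP32 value, so widening appends zero bits and narrowing drops them again. The following is a minimal sketch with hypothetical helper names, not code from llama.cpp or its conversion scripts; NaN inputs are not special-cased in the rounding step.

```python
import struct


def bf16_to_fp32_bits(bf16: int) -> int:
    """Widen a BF16 bit pattern to FP32 by appending 16 zero mantissa bits."""
    return (bf16 & 0xFFFF) << 16


def fp32_to_bf16_bits(fp32: int) -> int:
    """Narrow an FP32 bit pattern to BF16 using round-to-nearest-even.

    Simplified: NaN patterns with nonzero low bits are not special-cased.
    """
    rounding_bias = 0x7FFF + ((fp32 >> 16) & 1)
    return ((fp32 + rounding_bias) >> 16) & 0xFFFF


def bits_to_float(fp32: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", fp32))[0]


# BF16 -> FP32 -> BF16 over every possible BF16 bit pattern.
round_trip_ok = all(
    fp32_to_bf16_bits(bf16_to_fp32_bits(b)) == b for b in range(1 << 16)
)
print(round_trip_ok)

# The other direction is lossy: 1 + 2**-10 needs more mantissa bits than BF16 has.
fp32_value_bits = 0x3F802000
narrowed = fp32_to_bf16_bits(fp32_value_bits)
print(bits_to_float(fp32_value_bits), bits_to_float(bf16_to_fp32_bits(narrowed)))
```

The exhaustive loop is cheap because BF16 has only 65,536 bit patterns, so the round trip can be verified for all of them rather than sampled.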

@Sneakr Sneakr closed this as completed May 9, 2024
@oldgithubman

I think there's an open PR for that but FP32 is a superset of BF16 so HF BF16 -> GGUF FP32 -> GGUF BF16 should be lossless.

Hopefully that PR gets merged, because while yes, I believe the conversion is lossless, it's also a significant amount of wear and tear on our SSDs.

@Sneakr
Author

Sneakr commented May 12, 2024

@oldmanjk

https://twitter.com/rohanpaul_ai/status/1789060851858543093

I'm glad things got fixed finally :) happy days

@ddh0
Contributor

ddh0 commented May 12, 2024

Oh nice to see people enjoy the fixed uploads :)

@oldgithubman

@oldmanjk

https://twitter.com/rohanpaul_ai/status/1789060851858543093

I'm glad things got fixed finally :) happy days

Unfortunately, I don't think they're fully fixed yet. I'm waiting to hear back from meta

@ddh0
Contributor

ddh0 commented May 12, 2024

@oldmanjk
https://twitter.com/rohanpaul_ai/status/1789060851858543093
I'm glad things got fixed finally :) happy days

Unfortunately, I don't think they're fully fixed yet. I'm waiting to hear back from meta

Could you please clarify what exactly the issue is? I believe everything is solved.

@oldgithubman

Oh nice to see people enjoy the fixed uploads :)

Appreciate the uploads, but I'm not convinced meta has actually completely fixed it yet. I'm waiting to hear back

@ddh0
Contributor

ddh0 commented May 12, 2024

Yeah my bad, I think I may be misunderstanding something. If it does turn out that a fix is necessary, feel free to ping me here and I will re-upload the models with fixes.

@oldgithubman

oldgithubman commented May 12, 2024

@oldmanjk
https://twitter.com/rohanpaul_ai/status/1789060851858543093
I'm glad things got fixed finally :) happy days

Unfortunately, I don't think they're fully fixed yet. I'm waiting to hear back from meta

Could you please clarify what exactly the issue is? I believe everything is solved.

I'm sorry. I'm having a very difficult time trying to explain my thought process (I've written and deleted a lot already) and I think putting certain things out there might actually hurt the odds of us ever being able to be confident in a solution. Here's what I can/will say right now:

  • I'm not confident in the current solution. It might be the correct and final solution. I'm just not confident in it, for many reasons I can't/won't explain. Yes, of course it seems to fix the most obvious part of the issue, but I'm worried it might be a hack that could cause unforeseen consequences that we don't understand.
  • I think it's critical to get it right, given the far-reaching effects of this model (I'm shocked meta isn't taking this more seriously).
  • I have requested meta do a thorough review of all the parameters for all four models.
  • When I'm reasonably confident in a solution, I'll let everyone know.

I understand this isn't a good explanation and I don't expect everyone to accept it, but my current recommendation is to wait.
If anyone can help get meta's attention to do the thorough review I mentioned, please do.
I appreciate all the good-faith work everyone is doing

@Sneakr
Author

Sneakr commented May 12, 2024

This is off-topic, but as some people here previously mentioned and also commented on my Reddit threads, we know that I posted reddit threads updating these ongoing issues.

So my top most post was some days ago, and I made a post today that got zerged by fake accounts and harassers.

The curious thing, though, is that these fake accounts were created at exactly the time I created my posts about these issues. That also explains the weird random downvotes on random comments.

I wonder who created multiple fake accounts 8th of May Reddit, as I did my posts , and now they all appear commenting on my latest post today? :)

Who would've taken this thread and these coding issues so personal? While also feeling the need to mention the "llama.cpp devs" and how I have "no skill" but the "llama.cpp devs" are so skilled? I wooonder, curious who this little fellah is. What a coincidence with the dates of the account creation right, such a good timing. =)) curious. I'm not only good at finding bugs, I'm good at finding "bugs" too. Who would have these motives I wonder?

Date posted 8 May: [image 1]

Fake accounts in my recent post, 8 May creation: [image 3]

Fake accounts in my recent post, 8 May creation: [image 4]

First comment in 3 years (same guy as the other two accounts): [image 2]

It's unfortunate that you have people like this among the contributors. And he even accidentally commented from the wrong account to himself and deleted it quickly on one occasion.

Good luck to you all :)

@ddh0
Contributor

ddh0 commented May 12, 2024

I no longer know what's happening and would like to excuse myself from this drama 👍 best of luck

@JohannesGaessler
Collaborator

JohannesGaessler commented May 12, 2024

Who would've taken this thread and these coding issues so personal? While also feeling the need to mention the "llama.cpp devs" and how I have "no skill" but the "llama.cpp devs" are so skilled? I wooonder, curious who this little fellah is. What a coincidence with the dates of the account creation right, such a good timing. =)) curious. I'm not only good at finding bugs, I'm good at finding "bugs" too. Who would have these motives I wonder?

It's unfortunate that you have people like this among the contributors. And he even accidentally commented from the wrong account to himself and deleted it quickly on one occasion.

I only have a single Reddit account, namely this one. I have never made a single post on Reddit using any other account.

In other news, I am not aware of any bug fixes or improvements that came out of this whole thing. BF16 support was completely unrelated, only happened to get merged at the same time, and I also found no statistically significant advantage over FP16 in #7150 . I made a PR to fix the double BOS issue #7107 but Georgi is of the opinion that this is user error so it seems that it won't get merged. And I'm not aware of any other fixes or changes that are even remotely related to this issue thread. So I stand with what I actually said on Reddit: this whole thing has been a waste of time.

@Sneakr
Author

Sneakr commented May 12, 2024

@JohannesGaessler Don't see where I mentioned you at all. I'm mighty confused right now.... Are you saying... that you... took things personally? I too no longer know what's happening and would like to excuse myself from this drama best of luck

@JohannesGaessler
Collaborator

Epic troll, but the only devs that commented on this issue are slaren, Georgi, and me. Who else would you be accusing of sockpuppetting?

@Sneakr
Author

Sneakr commented May 12, 2024

Well, for starters, if you felt I was accusing you specifically of "sockpuppeting", for some unknown reason not mentioned in my post whatsoever, the logical reaction would not be to answer to the "sockpuppeting" in this fashion:

"...In other news, I am not aware of any bug fixes or improvements that came out of this whole thing. BF16 support was completely unrelated, "......."So I stand with what I said on Reddit: this whole thing has been a waste of time."

And dive into a completely different topic to prove some point unrelated to my post, and only related to the troll comments on reddit.

Second: and mainly

Nowhere in my post did I mention you, or that it would be a "dev". 😂 That's only what the troll said on Reddit, which for some reason you had good clear knowledge of and mixed up with my post here.

And why exactly couldn't it be anyone else, for that matter? This is getting weird.

Well, I leave this thread to you now; these are far deeper issues, beyond any bug fixing or pull requests. Good luck to you and your ego.

Ciao! =)

@JohannesGaessler
Collaborator

JohannesGaessler commented May 12, 2024

the logical reaction would not be to answer to the "sockpuppeting" in this fashion:

I was only clarifying what I did and did not say on Reddit.

Nowhere in my post did I mention you, or that it would be a "dev".

If the best you can do is word games you've already lost.

@Sneakr
Author

Sneakr commented May 12, 2024

If the best you can do is word games you've already lost.

@slaren @ggerganov ding ding ding we have a winner here guys. Crown this @JohannesGaessler , I "lost" and he "won".

Well, I leave this thread to you now; these are far deeper issues, beyond any bug fixing or pull requests. Good luck to you and your ego.

Repository owner locked as resolved and limited conversation to collaborators May 12, 2024