[Falcon] Attempting to run Falcon-180B Q5/6 gives "illegal character" #3484

Closed
4 tasks done
zgiles opened this issue Oct 5, 2023 · 15 comments

Comments

@zgiles

zgiles commented Oct 5, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I'm attempting to run llama.cpp, latest master, with TheBloke's Falcon 180B Q5/Q6 quantized GGUF models, but it errors out with "invalid character".
I'm unable to find any issues about this online anywhere.
Another system of mine hits the same problem, and a buddy's system does as well.
llama.cpp functions normally on other models, such as Llama2, WizardLM, etc.

The downloaded GGUF file works with "text-generation-webui" so it is functioning, and verified as a good copy by others in the community.

Current Behavior

$ ./main -t 8 -m ../falcon-180b-chat.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "USER: Write a story about llamas. ASSISTANT:"
# ( OR any number of parameters, just -m <model> is enough )
...
< Many Tensors >
...
llama_model_loader: - tensor  640:          blk.79.attn_norm.weight f32      [ 14848,     1,     1,     1 ]
llama_model_loader: - tensor  641:           blk.79.ffn_down.weight q6_K     [ 59392, 14848,     1,     1 ]
llama_model_loader: - tensor  642:                 output_norm.bias f32      [ 14848,     1,     1,     1 ]                                                                                                                                   
llama_model_loader: - tensor  643:               output_norm.weight f32      [ 14848,     1,     1,     1 ]                                                                                                                                   
llama_model_loader: - kv   0:                       general.architecture str                                                                                                                                                                  
llama_model_loader: - kv   1:                               general.name str                               
llama_model_loader: - kv   2:                      falcon.context_length u32                                                                                                                                                                  
llama_model_loader: - kv   3:                  falcon.tensor_data_layout str                                           
llama_model_loader: - kv   4:                    falcon.embedding_length u32                                           
llama_model_loader: - kv   5:                 falcon.feed_forward_length u32                               
llama_model_loader: - kv   6:                         falcon.block_count u32     
llama_model_loader: - kv   7:                falcon.attention.head_count u32     
llama_model_loader: - kv   8:             falcon.attention.head_count_kv u32     
llama_model_loader: - kv   9:        falcon.attention.layer_norm_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:               general.quantization_version u32     
llama_model_loader: - type  f32:  322 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q5_K:  201 tensors
llama_model_loader: - type q6_K:  120 tensors
error loading model: invalid character
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../falcon-180b-chat.Q5_K_M.gguf'
main: error: unable to load model

Happy to provide longer output, but it was pretty standard model shapes/sizes ahead of the loader and error.

Environment and Context

Dell R740xd, 640 GB RAM, Intel Xeon Silver 4112 (Skylake) @ 2.60 GHz, Ubuntu 20.04 (Focal)

$ git log | head -1
commit 019ba1dcd0c7775a5ac0f7442634a330eb0173cc
$ shasum -a 256 ../falcon-180b-chat.Q5_K_M.gguf 
e49e65f34b807d7cdae33d91ce8bd7610f87cd534a2d17ef965c6cf6b03bf3d8  ../falcon-180b-chat.Q5_K_M.gguf

Please let me know if this is already known (I can't seem to find it), and/or if I can help reproduce it somehow. Thanks.

@BarfingLemurs
Contributor

BarfingLemurs commented Oct 5, 2023

@zgiles I loaded a Q4_0 model on b1305.
b1311 fails to load the model. Can you confirm b1311 has the breaking commit?

Check #3252

@zgiles
Author

zgiles commented Oct 5, 2023

Confirmed b1305 works, b1309 works.
b1311 does not work.
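
(For anyone who wants to pin the exact commit, a git bisect between those release tags should do it. A rough sketch, assuming the b13xx tags are present in your clone and you swap in your own model path:)

git bisect start
git bisect bad b1311
git bisect good b1309
# at each step bisect checks out a candidate commit: rebuild and try the load
make clean && make -j
./main -m ../falcon-180b-chat.Q5_K_M.gguf -p "test" -n 1
# mark the result and repeat until bisect names the first bad commit
git bisect good    # if the model loaded
git bisect bad     # if it failed with "invalid character"
git bisect reset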

@goerch
Collaborator

goerch commented Oct 6, 2023

Did you convert your model after #3525? That change is a breaking change for the BPE vocabulary. Sorry, I didn't consider announcing this more prominently.

@ggerganov
Owner

I tried re-converting the model and it works.
We have to put a notice in the README hot topics.

@Ph0rk0z

Ph0rk0z commented Oct 15, 2023

So is there a way to fix the GGUF file? I don't have the bandwidth to download the FP16 model, and I'm not sure anyone has updated the released quants, have they? Even then it's almost another 100 GB download.

@BarfingLemurs
Contributor

BarfingLemurs commented Oct 15, 2023

On Linux, when trying to convert the HF base model to an f16 GGUF, it wouldn't let me continue writing the file after 100 GB or so (of roughly 200 GB).

output_norm.bias, n_dims = 1, torch.bfloat16 --> float32
output_norm.weight, n_dims = 1, torch.bfloat16 --> float32
gguf: write header
gguf: write metadata
gguf: write tensors
Traceback (most recent call last):
  File "/home/user/llama.cpp/./convert-falcon-hf-to-gguf.py", line 245, in <module>
    gguf_writer.write_tensors_to_file()
  File "/home/user/llama.cpp/gguf-py/gguf/gguf.py", line 836, in write_tensors_to_file
    shutil.copyfileobj(self.temp_file, self.fout)
  File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
    fdst_write(buf)
OSError: [Errno 28] No space left on device

I should have enough space though.
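
Judging from the traceback, the writer stages tensors in a Python temporary file (self.temp_file) and only copies it into the output at the end, and Python's tempfile module puts that staging file wherever TMPDIR points, which is often a small /tmp. A hedged workaround sketch, assuming that is what actually fills up; the path below is just a placeholder for any volume with a few hundred GB free:

export TMPDIR=/path/to/big/volume/tmp   # placeholder path, pick a disk with enough space
mkdir -p "$TMPDIR"
python3 convert-falcon-hf-to-gguf.py falcon-180B-chat 1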

@groovybits

Is there an updated GGUF up yet? I can't see how to download and convert it myself, or else I would try. It seems like someone should be able to share a fixed one soon, so others can avoid that.

@BarfingLemurs
Contributor

@groovybits First install torch, transformers, and the packages in requirements.txt.

Then run python3 convert-falcon-hf-to-gguf.py falcon-180B-chat 1

Here's a tool you can use to get the model: https://github.com/bodaay/HuggingFaceModelDownloader
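
Spelled out, the steps above are roughly the following (a sketch; falcon-180B-chat is whatever local directory holds the downloaded HF checkpoint, and the trailing 1 selects f16 output if this converter follows the usual 0 = f32 / 1 = f16 convention):

pip3 install torch transformers
pip3 install -r requirements.txt
# convert the HF checkpoint to an f16 GGUF
python3 convert-falcon-hf-to-gguf.py falcon-180B-chat 1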

@Ph0rk0z

Ph0rk0z commented Oct 17, 2023

It's not just Falcon 180B either; the other Falcon models are similarly broken.

@only-cliches

only-cliches commented Oct 17, 2023

Falcon 40B is working for me; here is a script that should do the trick. Make sure you have Git LFS installed.

# From the root of llama.cpp
git clone https://huggingface.co/tiiuae/falcon-40b models/falcon-40b

pip3 install -r requirements.txt
pip3 install transformers torch

# convert to gguf
python3 convert-falcon-hf-to-gguf.py models/falcon-40b

# quantize
./quantize ./models/falcon-40b/ggml-model-f16.gguf ./models/falcon-40b/ggml-model-q4_0.gguf q4_0

# Profit

@jxy
Contributor

jxy commented Oct 18, 2023

Is there a way to convert a previous GGUF file to the current GGUF format?

@ggerganov
Owner

There is no way to convert an old GGUF to the new one; you would need to start from the original model.

@Ph0rk0z

Ph0rk0z commented Oct 18, 2023

> Falcon 40B is working for me; here is a script that should do the trick. [...]

Yeah, if you reconvert from scratch it's working. Problem is I can't download 400 GB to try it. Falcon 40B is only interesting to me for checking whether LoRA merging works before doing the same for the 180B model, to finally get good use out of it. Until someone converts it, I'm sunk.

@BarfingLemurs
Contributor

@Ph0rk0z the Falcon 180B chat repository is updated now.

@Ph0rk0z

Ph0rk0z commented Oct 20, 2023

Yes, it's half downloaded; we're back. Still no Falcon 40B. Guess I'll have to test LoRA merges on the big model only.
