
"ï¸ı" is causing chktok to mismatch when using chkhsh #7024

Closed
arch-btw opened this issue May 1, 2024 · 14 comments
Closed

"ï¸ı" is causing chktok to mismatch when using chkhsh #7024

arch-btw opened this issue May 1, 2024 · 14 comments
Labels: bug-unconfirmed, need more info (The OP should provide more details about the issue)

Comments

@arch-btw
Contributor

arch-btw commented May 1, 2024

I think the comma is not being escaped.

[screenshot: qwen1]

convert-hf-to-gguf.py: loses the token
convert-hf-to-gguf-update.py: adds the token correctly

Related: #7018

Tested with Falcon and Qwen2, both fail on the same token.
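
For anyone trying to reproduce the comparison, here is a minimal sketch of how the check works, assuming the same recipe as convert-hf-to-gguf-update.py (tokenize a fixed test string, then hash the resulting ids; the real multi-script test string lives in that script, so the chktxt below is only a stand-in):

from hashlib import sha256
from transformers import AutoTokenizer  # assumes transformers is installed

# Stand-in for the real test string defined in convert-hf-to-gguf-update.py
chktxt = "example text with unusual characters: ï¸ı"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(f"chktok: {chktok}")
print(f"chkhsh: {chkhsh}")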

@arch-btw
Contributor Author

arch-btw commented May 1, 2024

Same problem: #7022

@ggerganov

@arch-btw
Contributor Author

arch-btw commented May 1, 2024

convert-hf-to-gguf-update.py:

[screenshot: falcon]

convert-hf-to-gguf.py:

[screenshot: falcon2]

@teleprint-me
Contributor

@arch-btw What are the sources? I need the links to investigate.

@arch-btw
Contributor Author

arch-btw commented May 1, 2024

Thank you @teleprint-me the sources are:

https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat

and

https://huggingface.co/tiiuae/falcon-7b

@teleprint-me
Contributor

teleprint-me commented May 1, 2024

I have to go to work—have a double today—but I'll check it out between shifts.

@ggerganov
Owner

I don't understand: which comma is this related to?
Are you sure you are using the two scripts with the same tokenizer?

ggerganov added the "need more info" label May 1, 2024
@teleprint-me
Contributor

teleprint-me commented May 1, 2024

@ggerganov The arrays in the output are not equal. There are missing vector encodings, and the missing encodings change the hash output. I don't have time to dig in between shifts, but I can take a closer look tonight. Juggling too many things right now.
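
As a toy illustration (not taken from the scripts), dropping even a single token id changes the whole hash, which is why one lost token surfaces as a chkhsh mismatch:

from hashlib import sha256

full    = [1212, 4824, 1001, 1212]   # hypothetical token ids
missing = [1212, 4824, 1212]         # same list with one encoding lost

print(sha256(str(full).encode()).hexdigest())     # one hash
print(sha256(str(missing).encode()).hexdigest())  # an entirely different hash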

@arch-btw
Contributor Author

arch-btw commented May 1, 2024

@ggerganov Thanks, yes, I'm using the same tokenizers. It's possible that it's not the comma (see below), but something is causing convert-hf-to-gguf.py to not include this token. I think there's something strange about these symbols, ï¸ı, that's causing convert-hf-to-gguf.py to drop it; I thought maybe the comma was being interpreted literally.

I meant this symbol here; it's some unusual type of comma:

[screenshot: comma]
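
One way to pin down exactly which characters are involved is to dump their codepoints (a small sketch; the example string is only a guess, paste the suspicious characters in its place):

import unicodedata

def dump_codepoints(s: str) -> None:
    # Print each codepoint with its Unicode name so look-alike characters
    # (fullwidth commas, variation selectors, dotless i) become visible.
    for ch in s:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")

dump_codepoints("，ı")  # e.g. FULLWIDTH COMMA followed by dotless i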

@arch-btw
Contributor Author

arch-btw commented May 1, 2024

OK, I think I'm getting a bit closer: it's actually the character after the comma.

https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing?lang=en

Update:

The ı symbol is not part of Latin-1 as used here: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py#L1805C71-L1805C78
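
This checks out: ı is U+0131, which has no Latin-1 (ISO-8859-1) encoding, so anything that round-trips text through latin-1 will drop or mangle it. A quick demonstration:

# U+0131 LATIN SMALL LETTER DOTLESS I is outside Latin-1, so this raises:
try:
    "ı".encode("latin-1")
except UnicodeEncodeError as err:
    print(err)

# UTF-8 handles it fine:
print("ı".encode("utf-8"))  # b'\xc4\xb1'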

@teleprint-me
Contributor

teleprint-me commented May 2, 2024

@arch-btw What OS are you using when you attempt to convert?

@arch-btw
Contributor Author

arch-btw commented May 2, 2024

@teleprint-me, I'm attempting to convert on Arch Linux. Do you think it might be OS-related?

@teleprint-me
Contributor

teleprint-me commented May 2, 2024

@arch-btw I didn't want to assume you were using Arch (I am a fellow Arch user, btw ;).

I know encoding issues happen a lot on Windows, occasionally on Mac OS X, and on Linux they're distribution-dependent.

Whenever I've had issues with encodings in Arch, it's because I was missing a dependency.

I don't know what's going on here though. I think I'm going to need to download the model and try it out for myself.
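
As a quick sanity check on the environment side (a generic sketch, not specific to llama.cpp), you can confirm the interpreter is actually running with UTF-8 defaults:

import locale
import sys

print("preferred encoding: ", locale.getpreferredencoding(False))
print("stdout encoding:    ", sys.stdout.encoding)
print("filesystem encoding:", sys.getfilesystemencoding())
# On a healthy Linux setup all three should report UTF-8.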

@teleprint-me
Contributor

teleprint-me commented May 2, 2024

I can't reproduce it.

23:14:27 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ python convert-hf-to-gguf.py /mnt/valerie/models/tiiuae/falcon-7b 
Loading model: falcon-7b
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
chktok: [1212, 4824, 1001, 1212, 192, 204, 663, 49453, 2069, 742, 561, 1501, 193, 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 3346, 111, 231, 2571, 111, 231, 204, 30, 204, 3138, 204, 22287, 204, 22287, 30, 204, 22287, 3138, 204, 22287, 22287, 204, 22287, 22287, 30, 204, 22287, 22287, 3138, 204, 30, 25, 30, 204, 30, 513, 30, 204, 30, 951, 30, 27171, 236, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 20589, 207, 204, 42, 50087, 123, 2727, 20300, 32022, 133, 234, 17419, 30137, 28, 7858, 181, 133, 236, 204, 37057, 2228, 10666, 5052, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795, 204, 7544, 7544, 7544, 8543, 8543, 17593, 3513, 3513, 12844, 51520, 17664, 4247, 295, 18, 298, 650, 204, 18, 95, 693, 332, 18, 94, 629, 23, 204, 18, 1553, 299, 1310, 42, 204, 18, 56, 416, 1310, 295, 18, 567, 717, 334, 23, 204, 18, 47, 299, 606, 596, 6696, 42, 703, 18, 16139, 241, 18, 87, 55]
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
tokenizer.ggml.pre: falcon
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
gguf: Adding 64784 merge(s).
gguf: Setting special token type eos to 11
gguf: Setting special token type bos to 11
Exporting model to '/mnt/valerie/models/tiiuae/falcon-7b/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model-00001-of-00002.bin'
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32

I get the expected hash. I am using my PR, though; not sure if that has anything to do with it.

Something you might be able to try is the following:

1. Remove the tokenizer path along with its contents and test again:

rm -rf models/tokenizers  # be careful here
python convert-hf-to-gguf-update.py 'read-api-token'

2. Copy and paste the generated function from step 1 into convert-hf-to-gguf.py (optional). It should already be in there, but it doesn't hurt to check; the update script ends with this reminder:

!!! Copy-paste the function above into convert-hf-to-gguf.py !!!

3. Generate and copy the vocab over (optional):

python3 convert-hf-to-gguf.py models/tokenizers/falcon/ --outfile models/ggml-vocab-falcon.gguf --vocab-only

4. Then try converting the model again:

python convert-hf-to-gguf.py /path/to/models/tiiuae/falcon-7b

I only needed to do steps 1 and 4; I did steps 2 and 3 just to make sure my PR was working.

I suggest this because I ran into issues the first time I tried it, and I only got it to work after sanitizing the environment. Arch recently upgraded to Python 3.12, which caught me off guard since I've been so swamped, so I had to clear the venv and start fresh.

@arch-btw
Contributor Author

arch-btw commented May 2, 2024

Thank you @teleprint-me, I carefully followed all your steps and now it's working.
This is very strange; I had just updated my llama.cpp and venv maybe 1 or 2 days ago.
Like you said, it could be the sanitized environment and/or the recent Python upgrade.
Thank you for all your help, and I apologize for the confusion.

arch-btw closed this as completed May 2, 2024