
llama.cpp BPE tokenization of wiki.test does not match the HF tokenization #3502

Closed
ggerganov opened this issue Oct 6, 2023 · 10 comments

@ggerganov (Owner) commented Oct 6, 2023

I did the following test to tokenize wiki.test.raw using our tokenizer and the Python tokenizer.
The expectation is that the outputs will match:

# generate ggml-vocab-falcon.gguf
./convert-falcon-hf-to-gguf.py --vocab-only ~/development/huggingface/falcon-7b/ --outfile ./models/ggml-vocab-falcon.gguf

# tokenize using Python
python3 tests/test-tokenizer-0-falcon.py ~/development/huggingface/falcon-7b/ --fname-tok ./build/wikitext-2-raw/wiki.test.raw

# tokenize using llama.cpp
cd build
make -j
./bin/test-tokenizer-0-falcon ../models/ggml-vocab-falcon.gguf ./wikitext-2-raw/wiki.test.raw

# compare the results
cmp ./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp 
./wikitext-2-raw/wiki.test.raw.tok ./wikitext-2-raw/wiki.test.raw.tokcpp differ: char 1, line 1

The results are pretty close, but not exactly the same. Any ideas why the test does not pass?
I thought that #3252 would resolve this.

cc @goerch

@goerch (Collaborator) commented Oct 6, 2023

That is a nice test. I made some modifications to get more detailed outputs of the tests and see differences like:

  1. Problem with endoftext

[screenshot]

  2. Non-greediness

[screenshot]

[screenshot]

@goerch (Collaborator) commented Oct 6, 2023

Intermediate results of debugging: bpe_gpt2_preprocess seems to do the right thing; llm_tokenizer_bpe::tokenize seems to be subtly broken, although it looks very similar to examples/gptneox-wip. Paging @cmp-nct for help, because git blame doesn't work very well here.

@goerch (Collaborator) commented Oct 7, 2023

llm_tokenizer_bpe::tokenize seems to be subtly broken

I implemented an independent port of the gpt2-tokenizer (I will share the code if someone is interested) and it shows the same behavior as the llama.cpp tokenizer. I also tried to use the slow HF tokenizer (i.e. the Python implementation) for comparison, but without success, i.e. I didn't get it working (any tips appreciated!). I tried to understand the fast tokenizer implemented here, which looks like it supports some advanced features. This seems hopeless until we are able to debug the complete HF stack (which I currently don't know how to do). BTW: I'm not talking about the endoftext problem here.
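
For context, the core of a GPT-2-style BPE encoder that such a port (and llm_tokenizer_bpe::tokenize) has to reproduce is the greedy merge loop sketched below. This is an illustrative Python sketch, not the port mentioned above; the example ranks are made up, and pre-tokenization plus the byte-to-unicode mapping are assumed to have already happened.

# Minimal sketch of the greedy GPT-2 BPE merge loop (illustrative only).
# `word` is a tuple of symbols for one pre-tokenized chunk;
# `bpe_ranks` maps a mergeable pair to its priority (lower = merged earlier).
def bpe_merge(word, bpe_ranks):
    while len(word) > 1:
        # pick the adjacent pair with the best (lowest) merge rank
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if best not in bpe_ranks:
            break  # no known merge left to apply
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return word

# Hypothetical ranks: "2"+"0" merges first, then "20"+"0", so "2000" -> ("200", "0")
print(bpe_merge(("2", "0", "0", "0"), {("2", "0"): 0, ("20", "0"): 1}))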

@jploski (Contributor) commented Oct 9, 2023

llm_tokenizer_bpe::tokenize seems to be subtly broken ... I also tried to use the slow tokenizer of HF (i.e. the Python implementation) to compare without success, i.e. I didn't get it working (any tips appreciated!).

You can create the slow (or fast) GPT2 tokenizer in tests/test-tokenizer-0-falcon.py like so:

from transformers import GPT2TokenizerFast, GPT2Tokenizer
slow_tokenizer = GPT2Tokenizer(vocab_file=dir_tokenizer + '/vocab.json', merges_file=dir_tokenizer + '/merges.txt')
fast_tokenizer = GPT2TokenizerFast(tokenizer_file=dir_tokenizer + '/tokenizer.json')

You will have to create the files vocab.json and merges.txt yourself. The file vocab.json should contain only the vocab map from Falcon's tokenizer.json (e.g. see https://huggingface.co/gpt2/blob/main/vocab.json). The file merges.txt should contain only the contents of the merges array, one array element per line (i.e. space separated token pairs, e.g. see https://huggingface.co/gpt2/blob/main/merges.txt).
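
For illustration, a minimal sketch of producing those two files from Falcon's tokenizer.json could look like the following (the directory path is a placeholder; it assumes the merges entries are plain "tokenA tokenB" strings, and it writes a version header line because the slow GPT2Tokenizer skips the first line of merges.txt, as in the linked gpt2 merges.txt):

import json

dir_tokenizer = "falcon-7b"  # placeholder: directory with the downloaded HF files

with open(dir_tokenizer + "/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

# vocab.json: just the token -> id map from the BPE "model" section
with open(dir_tokenizer + "/vocab.json", "w", encoding="utf-8") as f:
    json.dump(tok["model"]["vocab"], f, ensure_ascii=False)

# merges.txt: header line, then one space-separated token pair per line
with open(dir_tokenizer + "/merges.txt", "w", encoding="utf-8") as f:
    f.write("#version: 0.2\n")
    f.write("\n".join(tok["model"]["merges"]) + "\n")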

You will notice that the slow tokenizer tokenizes "2000" differently ("20" "00") than the fast one ("200" "0"). So yes, I think we are running into a HF implementation bug, but the cpp code tokenizes like the (presumably now less popular) slow tokenizer.
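
For instance, with slow_tokenizer and fast_tokenizer constructed as above, the difference shows up directly (the expected splits are the ones quoted above):

# assumes slow_tokenizer / fast_tokenizer were created as shown earlier
print(slow_tokenizer.tokenize("2000"))  # slow: "20", "00"
print(fast_tokenizer.tokenize("2000"))  # fast: "200", "0"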

And the <|endoftext|> at the front is trivial: it's just the artificially injected BOS token (which I believe is a Llama thing and should not be inserted for Falcon).

goerch changed the title from "llama.cpp BPE tokenization of wiki.test does not match the Python tokenization" to "llama.cpp BPE tokenization of wiki.test does not match the HF tokenization" on Oct 10, 2023
@ggerganov (Owner, Author)

So maybe it is best to switch to the slow tokenizer in test-tokenizer-0-falcon.py and close the issue if things match?
Probably also add this as a ctest.

@goerch (Collaborator) commented Oct 10, 2023

I could imagine this to be a hairy problem, because I'd assume a couple of models have been trained with the fast tokenizers?

@jploski (Contributor) commented Oct 10, 2023

I could imagine this to be a hairy problem, because I'd assume a couple of models have been trained with the fast tokenizers?

Yes, I suppose everyone uses the fast ones because they are the default, so having a tokenizer in llama.cpp which behaves differently is not good.

One point I am still unclear about is whether the fast tokenizer, which for some reason (also) wants tokenizer.json rather than just the vocab.json/merges.txt files as input, maybe relies on some extra information from tokenizer.json that makes it behave differently in our test case. So there's still some chance it might not be a bug in the HF implementation after all, but rather our lack of understanding of it. I'm hoping to learn more from HF's response to huggingface/tokenizers#1363.

@apage43 (Contributor) commented Oct 10, 2023

The discrepancy here is because Falcon's tokenizer.json specifies a different pre_tokenizer.

Most BPE-using models use the config that mimics GPT2 - i.e. "pretokenizing" is done with the standard regex:

  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  },

Replacing Falcon's pre_tokenizer with this one gets behavior consistent with the "slow" tokenizer, but that's not really what we want. Falcon instead specifies this:

"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Punctuation",
      "behavior": "Contiguous"
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": true
    },
    {
      "type": "Digits",
      "individual_digits": false
    },
    {
      "type": "Split",
      "pattern": {
        "Regex": "[0-9][0-9][0-9]"
      },
      "behavior": "Isolated",
      "invert": false
    }
  ]
}

That is, it first applies punctuation splitting, then the standard ByteLevel regex, then "Digits" (forcing spans of digits to be separated from non-digits), then "isolated"-mode custom splitting on a regex matching 3 consecutive digits, so that no token longer than 3 digits makes it past the pretokenizer.

It's the last one that seems to cause the discrepancy here - but the problem is that to be fully consistent with tokenizers we will need to implement some of the less common pre_tokenizer options.
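
This is easy to reproduce at the pre-tokenizer level alone, without running full BPE. Here is a small sketch using the HF tokenizers Python bindings (exact constructor arguments may vary by version) that rebuilds Falcon's Sequence from the JSON above next to the plain GPT-2-style ByteLevel pre-tokenizer:

from tokenizers import Regex
from tokenizers.pre_tokenizers import ByteLevel, Digits, Punctuation, Sequence, Split

# Falcon-style pre-tokenizer, rebuilt from the tokenizer.json snippet above
falcon_pre = Sequence([
    Punctuation(behavior="contiguous"),
    ByteLevel(add_prefix_space=False, use_regex=True),
    Digits(individual_digits=False),
    Split(Regex("[0-9][0-9][0-9]"), behavior="isolated", invert=False),
])

# Plain GPT-2-style pre-tokenizer
gpt2_pre = ByteLevel(add_prefix_space=False, use_regex=True)

text = "In the year 2000 there were 12345 cases."
print(falcon_pre.pre_tokenize_str(text))  # digit runs come out in pieces of at most 3 digits
print(gpt2_pre.pre_tokenize_str(text))    # digit runs stay in one piece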

@bobqianic (Contributor) commented Feb 20, 2024

@apage43 cc @ggerganov

Could you take a look at my code? I followed the procedure you outlined and even checked the source code, but I'm still getting inconsistent results. Is there a way to directly test the pre-tokenizer without comparing the final output of the tokenizer? This might help us pinpoint the exact issue. #5613

I'm really confident that my GPT-2 style pre-tokenizer works perfectly. I carefully followed the regex pattern and tested it extensively, using more than 10GB of data that included both synthetic and real examples.

Edit: Ah, I understand now! The = symbol isn't classified under Unicode punctuation!
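
For anyone hitting the same thing, this is quick to check with Python's unicodedata: '=' falls under the Sm (Symbol, math) general category rather than any P* punctuation category, so a punctuation test based only on Unicode categories will miss it:

import unicodedata

# '=' is Sm (Symbol, math), not one of the P* punctuation categories
print(unicodedata.category("="))  # Sm
print(unicodedata.category(","))  # Po
print(unicodedata.category("?"))  # Po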

github-actions bot added the stale label Mar 22, 2024

github-actions bot commented Apr 6, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 6, 2024