@compilade (Collaborator) commented Jul 10, 2024

This should make it easier to explain how parse_special affects tokenization.

I felt the need for this while working on #8228: the tokenizer tests use parse_special = true, but some tokenizers have problems with parse_special = false which were otherwise not easy to visualize.

For example, with OLMo (which uses a tokenizer very similar to GPT-NeoX), there's a problem with tokenization of consecutive spaces:

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces"
 12092 -> 'Hello'
 50275 -> '   '
 10186 -> 'world'
   273 -> ' of'
 50274 -> '    '
 31748 -> 'spaces'

$ ./build/bin/llama-tokenize --log-disable -m models/ggml-vocab-olmo.gguf -p "Hello   world of    spaces" --no-parse-special
 12092 -> 'Hello'
   245 -> '  '
  1533 -> ' world'
   273 -> ' of'
   341 -> '   '
  8470 -> ' spaces'

Notice that with parse_special = false the spaces are not tokenized correctly: the runs of spaces get totally different token ids (245 and 341 instead of 50275 and 50274), and the words that follow them pick up a space prefix (' world' and ' spaces'). This happens because the user-defined multi-space tokens no longer have priority in the pre-tokenization (but they should!).
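To make the priority issue concrete, here is a minimal toy sketch in Python. It is not llama.cpp's implementation: the regex is a simplified GPT-2 style pattern rather than OLMo's real pre-tokenizer, and the names `USER_DEFINED`, `PRETOKENIZE`, `split_on_user_defined` and `pre_tokenize` are hypothetical, chosen only for illustration. It shows how splitting on user-defined tokens before running the pre-tokenizer regex changes the pieces, in the same way as in the two outputs above.

```python
import re

# Hypothetical toy data: multi-space runs registered as user-defined tokens,
# listed longest first so the longest run wins at each position.
USER_DEFINED = ["    ", "   ", "  "]

# Simplified GPT-2 style pre-tokenizer (an assumption, not OLMo's real regex):
# a word with an optional leading space, or whitespace not followed by a
# non-space character, or any remaining whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")


def split_on_user_defined(text):
    """Partition text into (fragment, is_user_token) pairs."""
    pattern = re.compile("|".join(re.escape(t) for t in USER_DEFINED))
    parts, pos = [], 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            parts.append((text[pos:m.start()], False))
        parts.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        parts.append((text[pos:], False))
    return parts


def pre_tokenize(text, user_defined_first):
    """Return the pre-tokenized pieces, with or without user-token priority."""
    if not user_defined_first:
        # The regex sees the raw text, so it decides how spaces are grouped.
        return PRETOKENIZE.findall(text)
    pieces = []
    for fragment, is_user in split_on_user_defined(text):
        if is_user:
            pieces.append(fragment)  # kept whole, never re-split
        else:
            pieces.extend(PRETOKENIZE.findall(fragment))
    return pieces


if __name__ == "__main__":
    text = "Hello   world of    spaces"
    print(pre_tokenize(text, user_defined_first=True))
    print(pre_tokenize(text, user_defined_first=False))
```

With priority, the space runs stay whole and the words have no space prefix. Without it, the regex leaves the last space of each run attached to the next word, so the run is one space shorter.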

This is one of the problems fixed in #8228.

@compilade added the enhancement, Review Complexity : Low and examples labels, and removed the examples label, on Jul 10, 2024
@ggerganov merged commit 9a55ffe into master on Jul 11, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 12, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024