
Fix the tokenizer #2023

Closed
slaren opened this issue Jun 27, 2023 · 6 comments
slaren (Collaborator) commented Jun 27, 2023

We should fix the issues found in the llama tokenizer by @vjeux. They are explained in detail here: #252 (comment)

Might be a good first issue.

@slaren slaren added the bug Something isn't working label Jun 27, 2023
JWNoctis commented Jun 28, 2023

Related to #1931 - Some other changes to EOS token behaviour, maybe hidden behind a command line switch, would be nice to have too.

Right now, models that use the EOS token as a separator (e.g. Vicuna 1.1/1.3 from lmsys, Nous-Hermes-13b) are somewhat broken: the current code doesn't tokenize EOS in the prompt, and replaces a generated EOS with a newline in interactive mode.
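To illustrate the first half of that (the prompt side), here is a minimal Python sketch, not llama.cpp code: all token ids, the toy vocab, and both function names are invented for illustration. A tokenizer that treats `</s>` as literal characters never emits the EOS id, while a special-token-aware one does:

```python
# Hypothetical sketch of why treating "</s>" as plain text breaks
# EOS-separated prompt formats. Ids and vocab are illustrative only.
EOS_ID = 2  # sentencepiece convention for LLaMA-family models

def naive_tokenize(text, vocab):
    # Tokenizes the literal characters, so "</s>" becomes ordinary
    # pieces and the EOS id never appears in the output.
    return [vocab[ch] for ch in text]

def special_aware_tokenize(text, vocab):
    # Splits out the special token first, then tokenizes the rest,
    # emitting the real EOS id between the turns.
    out = []
    for part in text.split("</s>"):
        out.extend(vocab[ch] for ch in part)
        out.append(EOS_ID)
    return out[:-1]  # drop the trailing separator

vocab = {ch: i + 100 for i, ch in enumerate("Hi</s>ok")}
prompt = "Hi</s>ok"
print(naive_tokenize(prompt, vocab))          # [100, 101, 102, 103, 104, 105, 106, 107] -- no EOS
print(special_aware_tokenize(prompt, vocab))  # [100, 101, 2, 106, 107] -- real EOS between turns
```

A model trained with EOS-separated turns sees the naive encoding as one run-on turn, which is the breakage described above.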

howard0su self-assigned this Jun 30, 2023
howard0su (Collaborator) commented:

I will try this over the weekend.

huichen commented Jul 4, 2023

Have you thought about supporting text normalization in the tokenizer, as described in https://github.com/google/sentencepiece/blob/master/doc/normalization.md ?

This is essential to get correct encoding (NFKC) for Unicode-heavy languages like Chinese.

For example, encoding and then decoding the fullwidth parenthesis '（' wouldn't yield the original string.
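A minimal, self-contained illustration of this (assuming the example is the fullwidth parenthesis U+FF08, common in Chinese text; this is plain Python, not llama.cpp code): NFKC folds the fullwidth character into the ASCII '(', so a tokenizer that NFKC-normalizes on encode cannot reproduce the original string on decode.

```python
# NFKC normalization is lossy for compatibility characters such as
# the fullwidth parenthesis （ (U+FF08): it is folded to ASCII '('.
import unicodedata

original = "\uFF08"                                   # （ fullwidth parenthesis
normalized = unicodedata.normalize("NFKC", original)  # "(" after folding
print(normalized == original)                         # False: round trip is lossy
```

This is exactly the trade-off sentencepiece documents: normalization improves token consistency but makes exact encode/decode round trips impossible for such characters.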

goerch (Collaborator) commented Sep 9, 2023

> Have you thought about supporting text normalization in tokenizer like https://github.com/google/sentencepiece/blob/master/doc/normalization.md ?

Spot on: somebody probably should have thought about this earlier. Not sure about the best way forward here.

> Might be a good first issue.

I don't think so.

goerch (Collaborator) commented Sep 16, 2023

@slaren: as soon as #3170 is merged, I'm happy (for now) with the character encoding/decoding behavior of the sentencepiece tokenizer for LLaMA, although we only check invariants, and I have only looked at literal representations sporadically. Do you see a need, or have ideas for how to improve test coverage?

@huichen: I just checked your example of '（' in test-tokenizer-0-llama and it seems to work for me. On the other hand, I could imagine deviations from the sentencepiece tokenizer for non-normalized input strings (even for normalized ones, as sentencepiece seems to do some kind of unique normalization according to the documentation you referenced). Do you have better examples?

@ggerganov: Should we test the character behavior of the Falcon tokenizer the same way as the Llama one? Do we have a strategy for coping with Unicode normalization, if necessary?
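The invariant checks mentioned above can be sketched roughly like this: a stand-in byte-level `encode`/`decode` pair (hypothetical names, not the real test-tokenizer-0-llama / test-tokenizer-1-llama.cpp code) and a round-trip assertion over a small corpus.

```python
# Round-trip invariant sketch: a byte-level tokenizer can promise
# decode(encode(s)) == s exactly; a normalizing tokenizer could only
# promise the weaker decode(encode(s)) == NFKC(s).
import unicodedata

def encode(text):
    # Stand-in for a tokenizer's encode: here, raw UTF-8 bytes as ids.
    return list(text.encode("utf-8"))

def decode(ids):
    # Stand-in for the matching decode.
    return bytes(ids).decode("utf-8")

for s in ["hello world", "（全角括号）", "llama 🦙"]:
    # Exact round trip holds for the byte-level stand-in, including
    # the fullwidth characters that NFKC would have folded away.
    assert decode(encode(s)) == s
print("all round trips ok")
```

Checking only such invariants is why deviations in literal token representations could still go unnoticed, as the comment above concedes.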

slaren (Collaborator, Author) commented Sep 16, 2023

@goerch If I am not mistaken, test-tokenizer-1-llama.cpp covers the issue described here very closely, so maybe this should be closed now? If there are still different issues with the llama tokenizer, then it would be better to open a new issue. I suspect that the biggest issue with the tokenizer at this point is the handling of special tokens, but I haven't been following the recent developments very closely.

slaren closed this as completed Sep 19, 2023