
Unicode support #11

Closed · wizd opened this issue Mar 11, 2023 · 38 comments · Fixed by beiller/llama.cpp#2 or #79
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

wizd commented Mar 11, 2023

Thank you for creating such a great inference engine with a 10x speedup.
Please add Unicode support so that other languages display properly.

Screenshot 2023-03-11 at 7 12 50 PM

wizd changed the title from "Support prompt in Unicode" to "Unicode support" on Mar 11, 2023

beiller commented Mar 11, 2023

I tried to work out how to implement Unicode support and I am not getting far. From everything I can see it seems to work, but yes, the output does contain random characters.

Here is a prompt in text format for easier copy/paste

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'

     1 -> ''
 30313 -> '人'
 30486 -> '生'
 30199 -> 'の'
 31474 -> '意'

The tokenization above seems correct, since I dumped the vocab from the token-parsing code:

llama_model_load: vocab[30313] = '人'
llama_model_load: vocab[30486] = '生'
llama_model_load: vocab[30199] = 'の'
llama_model_load: vocab[31474] = '意'

And the output I get is

人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]

So it outputs some characters correctly, but others come out as �:

llama_model_load: vocab[30140] = '�'


beiller commented Mar 11, 2023

I found a run of unprintable tokens from ID 131 to 258. If I remove those from the vocab, a prompt seems to generate Japanese, but I don't know Japanese!

llama.cpp % ./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 --repeat_penalty 1.0 -n 512 -p $'人生の意味は'

Response

人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう

Google translate

The meaning of life is that one person is one person. Since Abe was standing there, it was only possible to be one person after leaving, but that's right.

Is it possible?

beiller added a commit to beiller/llama.cpp that referenced this issue Mar 11, 2023
Fixes ggerganov#11 

This fixes a Japanese prompt I was attempting to run

EG:

`./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'`

Output before change:

`人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]`

So it is outputting some characters but some �

Output after change:

`人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう`
@blackhole89

Response

人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう

The Japanese text you quote here is fairly agrammatical, in a way that suggests (on top of some other issues that I figure are simply due to LLaMa not having learned the language very well) that some words are simply missing. Where were the unprintable tokens that you removed from this?


beiller commented Mar 12, 2023

I removed "", "�", and "��" from the vocabulary, not from a sentence; that's not how it works. There is a large chunk of the token dictionary in the model that points to the unprintable character �. I removed those tokens from the dictionary of tokens the program uses. I suspect the model learned some corrupted text during training, so when it sees Japanese characters it confuses them with garbled text it has come across, making unprintable characters a likely candidate for the next word. Just my hypothesis.

Here is the pull request, the code change I made to make this work.

https://github.com/ggerganov/llama.cpp/pull/26/files


beiller commented Mar 12, 2023

For anyone interested, here is the chunk in the 13B model file. Not sure if all models contain the same token vocabulary.

� vocab[131] (EFBFBD)
� vocab[132] (EFBFBD)
� vocab[133] (EFBFBD)
� vocab[134] (EFBFBD)
� vocab[135] (EFBFBD)
� vocab[136] (EFBFBD)
� vocab[137] (EFBFBD)
� vocab[138] (EFBFBD)
� vocab[139] (EFBFBD)
� vocab[140] (EFBFBD)
� vocab[141] (EFBFBD)
� vocab[142] (EFBFBD)
� vocab[143] (EFBFBD)
� vocab[144] (EFBFBD)
� vocab[145] (EFBFBD)
� vocab[146] (EFBFBD)
� vocab[147] (EFBFBD)
� vocab[148] (EFBFBD)
� vocab[149] (EFBFBD)
� vocab[150] (EFBFBD)
� vocab[151] (EFBFBD)
� vocab[152] (EFBFBD)
� vocab[153] (EFBFBD)
� vocab[154] (EFBFBD)
� vocab[155] (EFBFBD)
� vocab[156] (EFBFBD)
� vocab[157] (EFBFBD)
� vocab[158] (EFBFBD)
� vocab[159] (EFBFBD)
� vocab[160] (EFBFBD)
� vocab[161] (EFBFBD)
� vocab[162] (EFBFBD)
� vocab[163] (EFBFBD)
� vocab[164] (EFBFBD)
� vocab[165] (EFBFBD)
� vocab[166] (EFBFBD)
� vocab[167] (EFBFBD)
� vocab[168] (EFBFBD)
� vocab[169] (EFBFBD)
� vocab[170] (EFBFBD)
� vocab[171] (EFBFBD)
� vocab[172] (EFBFBD)
� vocab[173] (EFBFBD)
� vocab[174] (EFBFBD)
� vocab[175] (EFBFBD)
� vocab[176] (EFBFBD)
� vocab[177] (EFBFBD)
� vocab[178] (EFBFBD)
� vocab[179] (EFBFBD)
� vocab[180] (EFBFBD)
� vocab[181] (EFBFBD)
� vocab[182] (EFBFBD)
� vocab[183] (EFBFBD)
� vocab[184] (EFBFBD)
� vocab[185] (EFBFBD)
� vocab[186] (EFBFBD)
� vocab[187] (EFBFBD)
� vocab[188] (EFBFBD)
� vocab[189] (EFBFBD)
� vocab[190] (EFBFBD)
� vocab[191] (EFBFBD)
� vocab[192] (EFBFBD)
� vocab[193] (EFBFBD)
� vocab[194] (EFBFBD)
� vocab[195] (EFBFBD)
� vocab[196] (EFBFBD)
� vocab[197] (EFBFBD)
� vocab[198] (EFBFBD)
� vocab[199] (EFBFBD)
� vocab[200] (EFBFBD)
� vocab[201] (EFBFBD)
� vocab[202] (EFBFBD)
� vocab[203] (EFBFBD)
� vocab[204] (EFBFBD)
� vocab[205] (EFBFBD)
� vocab[206] (EFBFBD)
� vocab[207] (EFBFBD)
� vocab[208] (EFBFBD)
� vocab[209] (EFBFBD)
� vocab[210] (EFBFBD)
� vocab[211] (EFBFBD)
� vocab[212] (EFBFBD)
� vocab[213] (EFBFBD)
� vocab[214] (EFBFBD)
� vocab[215] (EFBFBD)
� vocab[216] (EFBFBD)
� vocab[217] (EFBFBD)
� vocab[218] (EFBFBD)
� vocab[219] (EFBFBD)
� vocab[220] (EFBFBD)
� vocab[221] (EFBFBD)
� vocab[222] (EFBFBD)
� vocab[223] (EFBFBD)
� vocab[224] (EFBFBD)
� vocab[225] (EFBFBD)
� vocab[226] (EFBFBD)
� vocab[227] (EFBFBD)
� vocab[228] (EFBFBD)
� vocab[229] (EFBFBD)
� vocab[230] (EFBFBD)
� vocab[231] (EFBFBD)
� vocab[232] (EFBFBD)
� vocab[233] (EFBFBD)
� vocab[234] (EFBFBD)
� vocab[235] (EFBFBD)
� vocab[236] (EFBFBD)
� vocab[237] (EFBFBD)
� vocab[238] (EFBFBD)
� vocab[239] (EFBFBD)
� vocab[240] (EFBFBD)
� vocab[241] (EFBFBD)
� vocab[242] (EFBFBD)
� vocab[243] (EFBFBD)
� vocab[244] (EFBFBD)
� vocab[245] (EFBFBD)
� vocab[246] (EFBFBD)
� vocab[247] (EFBFBD)
� vocab[248] (EFBFBD)
� vocab[249] (EFBFBD)
� vocab[250] (EFBFBD)
� vocab[251] (EFBFBD)
� vocab[252] (EFBFBD)
� vocab[253] (EFBFBD)
� vocab[254] (EFBFBD)
� vocab[255] (EFBFBD)
� vocab[256] (EFBFBD)
� vocab[257] (EFBFBD)
� vocab[258] (EFBFBD)
�� vocab[26308] (EFBFBDEFBFBD)
 vocab[31634] (EFBFBC)

Many token IDs point to 0xEFBFBD, which is the UTF-8 encoding of U+FFFD, the unprintable Unicode replacement character.
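A hedged aside for context: U+FFFD is what a UTF-8 decoder substitutes for byte sequences it cannot interpret. The sketch below (assuming `models/tokenizer.model` is the LLaMA sentencepiece model and the `sentencepiece` Python package is installed) shows how decoding one of these IDs in isolation yields exactly the EF BF BD bytes seen in the dump:

from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor("models/tokenizer.model")

# IDs 131-258 are byte-fallback pieces such as '<0x85>' (confirmed further down
# this thread). A lone byte >= 0x80 is not valid UTF-8, so decoding the token
# by itself produces U+FFFD, whose UTF-8 encoding is EF BF BD.
print(sp.id_to_piece(136))                       # expected: '<0x85>'
print(sp.decode([136]).encode("utf-8").hex())    # expected: 'efbfbd'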


wizd commented Mar 12, 2023

Nice find!

Given the constantly changing encoding history of CJK (Chinese, Japanese, Korean), there is a good chance the training data contained wrongly encoded non-ASCII text. Simply removing it seems fine.


wizd commented Mar 12, 2023

Some more testing shows that we can't simply remove the unprintable tokens. There should be some way to recover their correct encoding; otherwise the generated text becomes unreadable.

Screenshot 2023-03-12 at 11 20 50 AM


beiller commented Mar 12, 2023

I'm not sure you have applied the code change. I can't try your prompt since it's an image; mind pasting it? I think you have to check out my repo, because my code is not merged here yet.

https://github.com/beiller/llama.cpp

Try cloning and building in a different folder. It also includes the repeat-penalty change. Again, my approach is not about removing the characters; it's a code change that outputs something very different (and more accurate).


beiller commented Mar 12, 2023

Here are some more examples:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -p $'人生の意味は'

Outputs:

...
main: prompt: '人生の意味は'
main: number of tokens in prompt = 5
     1 -> ''
 30313 -> '人'
 30486 -> '生'
 30199 -> 'の'
 31474 -> '意'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


人生の意はあまり
物学に世界をする発見 : 空気中でニトルを放つシステムが全球ワー... [end of text]
人生の意から
しようごいます。
I’ve been getting these error messages on my blog for a while now, but I didn't pay too much attention to them until recently when there were quite few. Maybe it was because last weekend (in Japan), the phone lines at my company went down and that screwed up all of our accounts that require login through Internet Explorer 7+ (which is set as default).
Unfortunately, I couldn't afford much time to fix them since they were so many. So now there are even more errors for you guys
人生の意とか、やりめているんだ [ 1 ]
部事情はもうそこまでのキャバラ
く立みなしに上下。自分がよききたかったと、子あればえるだけ作らない人は多数だ [ 2 ]
【キャバラ】 (ビジネスの了)
く
人生の意は知らない。 我が人生は事の意や子をう きた人の自分である人生に存するから よく、意のい期に実力から人の意を心したり、知らない人生の意は子をうしていることが、


wizd commented Mar 12, 2023

I'm not sure you have applied the code change. I can't try your prompt since it's an image; mind pasting it? I think you have to check out my repo, because my code is not merged here yet.

https://github.com/beiller/llama.cpp

Try cloning and building in a different folder. It also includes the repeat-penalty change. Again, my approach is not about removing the characters; it's a code change that outputs something very different (and more accurate).

Thank you. I built your repo and tested again; the unprintable characters are gone, but the meaning of the generated text is gone as well, like below.
Screenshot 2023-03-12 at 11 37 21 AM


wizd commented Mar 12, 2023

There is another bug: the prompt is truncated if it is Chinese, as in #11 (comment).


beiller commented Mar 12, 2023

I just tried Chinese as well and yes, it's truncated. It's possible that it doesn't understand other languages. It seems to be missing some Chinese character tokens entirely!

Further up the code chain, in the model conversion code, I see the following. Before I write more: @ggerganov thank you so much for putting this all together. I wonder if some tokens are getting lost. But maybe not, since there are 32,000 tokens (and that appears to be how Google's tokenizer works). I will try to research and see if some tokens are "lost in translation"!

    # Is this correct??
    for i in range(32000):
        # TODO: this is probably wrong - not sure how this tokenizer works
        text = tokenizer.decode([29889, i]).encode('utf-8')
        # remove the first byte (it's always '.')
        text = text[1:]
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
        

@ggerganov you are too hard on yourself. How can you be wrong when so many tokens are present :P
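As a hedged illustration (not the project's actual fix), a conversion loop along these lines could preserve the byte-fallback entries instead of round-tripping every ID through `decode`; it assumes the sentencepiece Python API and a hypothetical output file `fout` in the same length-prefixed format as the snippet above:

import struct
from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor("models/tokenizer.model")
fout = open("vocab.bin", "wb")   # stand-in for the conversion script's output file

for i in range(32000):
    piece = tokenizer.id_to_piece(i)
    if tokenizer.is_byte(i):
        # byte-fallback pieces look like '<0xE7>': store the single raw byte
        text = bytes([int(piece[3:-1], 16)])
    elif tokenizer.is_control(i) or tokenizer.is_unknown(i):
        # control pieces and the unknown piece carry no literal text here (sketch only)
        text = b""
    else:
        # ordinary pieces use U+2581 ('▁') as the leading-space marker
        text = piece.replace("\u2581", " ").encode("utf-8")
    fout.write(struct.pack("i", len(text)))
    fout.write(text)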


beiller commented Mar 12, 2023

I found the problem via some scripting. The tokenizer works differently than the way we are using it. Also, token 29889 is '.', which is why @ggerganov has to remove the '.' character after tokenizing with it; that isn't affecting anything, so that's good!

from sentencepiece import SentencePieceProcessor

fname_tokenizer = "models/tokenizer.model"

tokenizer = SentencePieceProcessor(fname_tokenizer)

print(tokenizer.decode([29889]))
>>>.

result1 = tokenizer.encode("篇")
print(f'token: {result1}')
>>>[29871, 234, 178, 138]

decode1 = tokenizer.decode(result1)
print(f'decoded: {decode1}')
>>>decoded: 篇

So it seems we need to leverage this tokenizer in the C++ code; the current method of tokenizing is not correct. I can attempt it; it will require adding sentencepiece.

The crux of the issue, if I can try to explain, is that the C++ code tries to find the best matching single token for a piece of the input text. But as we see here, this character actually has to be encoded as multiple tokens! Strange. I think the tokens can be removed from the model files in the conversion script and we should just use the sentencepiece C++ code. Thoughts??


wizd commented Mar 12, 2023

Dump the tokenizer.model file to text with:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')

vocab_list = [sp.id_to_piece(id) for id in range(sp.get_piece_size())]

with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_list))

Some characters like '篇', '雨', '许' were not found.


beiller commented Mar 12, 2023

@wizd see my comment; it seems more complex. A character will "translate" to multiple tokens, but I believe it will actually support Chinese with some big refactors :P I may have time to make it work, we will see!

Edit

I think we just found out what that big chunk of unprintable characters is for :)

[29871, 234, 178, 138] translates, token by token, to:
0x20, 0xEFBFBD, 0xEFBFBD, 0xEFBFBD
AKA ���

But in actuality the last three tokens should be the raw bytes 0xE7, 0xAF, 0x87, which together are the UTF-8 encoding of '篇'.

wizd commented Mar 12, 2023

trying to understand it...
https://unicode.scarfboy.com/?s=%E7%AF%87


wizd commented Mar 12, 2023

seems we should use this library to tokenize: https://github.com/google/sentencepiece


beiller commented Mar 12, 2023

@wizd yes, that is correct. And the code also assumes a 1-"word"-to-1-token mapping, which isn't the case. Also, a "word" here is not really a word; it's more like a word piece.


wizd commented Mar 12, 2023

Yep, and we need to make the code UTF-8 aware: https://github.com/facebookresearch/llama/blob/main/FAQ.md#4-other-languages


beiller commented Mar 12, 2023

The code has no problem with UTF-8 so far. I am working on a very hacky solution right now :)


beiller commented Mar 12, 2023

I actually got it working in a very hacky way. Example:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like chocolate\n祝你一天过得愉快 ='

Output:

J'aime le chocolat = I like chocolate
祝你一天过得愉快 = Have a happy holiday
我喜欢 来自中国

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'什么是生命的意义?生命的意义在于'

Output (admittedly cherry-picked; sometimes it contains half English):

什么是生命的意义?生命的意义在于存活,这就上个理论说了所有。人类对现实世界做了出不可思当的窗口式巡回行为,而又没能找到真正天生份目前且无需死去才是最大缺少之处。


wizd commented Mar 12, 2023

wow, you are so cool! @beiller


beiller commented Mar 12, 2023

Another interesting outcome, it actually can output emojis now!

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'Have you heard this funny joke? '

Have you heard this funny joke? 😆
Are There Any Real Vegans In This House??…If Not, Now’s Your Chance To Change That.
This Is About What Would Happen If You Drank A Gallon of Milk Every Day For One Year

Sadly the joke was not funny or even a joke.


wizd commented Mar 12, 2023

I actually got it working in a very hacky way. Example:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 16 --repeat_last_n 16 -p $'J\'aime le chocolat = I like chocolate\n祝你一天过得愉快 ='

Output:

J'aime le chocolat = I like chocolate
祝你一天过得愉快 = Have a happy holiday
我喜欢 来自中国

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 --repeat_last_n 64 -p $'什么是生命的意义?生命的意义在于'

Output (admittedly cherry-picked; sometimes it contains half English):

什么是生命的意义?生命的意义在于存活,这就上个理论说了所有。人类对现实世界做了出不可思当的窗口式巡回行为,而又没能找到真正天生份目前且无需死去才是最大缺少之处。

This output contains misunderstandings; maybe still an encoding issue?


wizd commented Mar 12, 2023

Maybe we can use a fact check to verify the output, e.g.

关于爱因斯坦的生平。他出生于
About the life of Einstein. He was born in

If the output is wrong, we can catch it easily.

ggerganov added the labels "bug (Something isn't working)" and "help wanted (Extra attention is needed)" on Mar 12, 2023

beiller commented Mar 12, 2023

Response

关于爱因斯坦的生平。他出生于1856年,但已经成为了一个名人在1902年之后,是世界上最知名的数学家和科学家

English response (died in the future lol)

About the life of Einstein. He was born in 1879, he died on April 10th ,2054; his parents were Hermann and Pauline Winteler-Einstein . His first job as a clerk at Bern patent office where

I think the strange responses are due to me using the smaller 13B model. Maybe a bigger model is more accurate. I think the Unicode issue is resolved (very hacky though).

@ggerganov I can fix it, but I'm bad with C++ and don't know how to include the sentencepiece C++ library and have it compile and link. I need help with that part. For my solution here I had the C++ call Python functions, since Python was already installed, but that is a travesty I'm sure.

@ggerganov

@beiller
Adding sentencepiece to the project will be a last resort - I don't want to bloat the repo too much.
Hopefully, we can figure out a concise C++ implementation.
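A hedged sketch (in Python, for illustration only; this is not the implementation that eventually landed) of one concise approach that avoids linking sentencepiece: greedily match the longest known vocab piece at each position and fall back to the per-byte tokens when nothing matches. The `piece_to_id` and `byte_to_id` tables are assumed to be built from the converted model file:

def tokenize(text: str, piece_to_id: dict, byte_to_id: dict) -> list:
    """Greedy longest-match tokenization with byte fallback (illustrative only)."""
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        match = None
        # try the longest piece (as raw bytes) that matches at this position
        for j in range(len(data), i, -1):
            if data[i:j] in piece_to_id:
                match = (piece_to_id[data[i:j]], j)
                break
        if match:
            ids.append(match[0])
            i = match[1]
        else:
            # no piece matches: emit the single-byte fallback token for data[i]
            ids.append(byte_to_id[data[i]])
            i += 1
    return ids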


wizzard0 commented Mar 12, 2023

@beiller key thing to be aware of: tokenizer works on bytes, not on characters. so:

  • one character could take multiple tokens (up to 4 without emoji support, 10+ with emojis)
  • tokens can and will start and/or end in the middle of multibyte characters
  • tokenization is not unique (there could be multiple ways to encode given text)
  • decoding some token sequences you’ll sometimes get invalid utf-8, this is to be expected
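A hedged sketch of the output-side consequence of these points: because one token may carry only part of a multibyte character, the printer has to buffer bytes and emit them only once they form valid UTF-8. An incremental decoder does exactly that (the function name below is illustrative, not from the codebase):

import codecs

# Holds incomplete multibyte sequences across calls, so partial characters
# split over several tokens are printed only when complete.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

def emit(token_bytes):
    text = decoder.decode(token_bytes)   # returns '' while a character is incomplete
    if text:
        print(text, end="", flush=True)

# e.g. '篇' (E7 AF 87) arriving one byte-fallback token at a time:
for b in (b"\xe7", b"\xaf", b"\x87"):
    emit(b)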


wizd commented Mar 12, 2023

Some research: I used sentencepiece to tokenize an input and dumped it. I got this:

piece: ▁
piece: <0xE7>
piece: <0xAF>
piece: <0x87>
piece: <0xE5>
piece: <0xB9>
piece: <0x85>
piece: 已
piece: 经
1
31290
31412

main: prompt: '篇幅已经'
main: number of tokens in prompt = 3
1 -> ''
31290 -> '已'
31412 -> '经'

"篇幅" is not found because in vocab table it is not what it is, but <0xE7>, <0xAF> ... etc.


wizd commented Mar 12, 2023

With sentencepiece, which is full of magic numbers, I can get the result right:

main: prompt: '篇幅已经'
main: number of tokens in prompt = 10
     1 -> '<s>'
 29871 -> '▁'
   234 -> '<0xE7>'
   178 -> '<0xAF>'
   138 -> '<0x87>'
   232 -> '<0xE5>'
   188 -> '<0xB9>'
   136 -> '<0x85>'
 31290 -> '已'
 31412 -> '经'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


<s>▁<0xE7><0xAF><0x87><0xE5><0xB9><0x85>已经<0xE6><0x8A><0xB5>达了50万,我们再期望完成到全部五十平民的目标<0xE8><0xAE><0xA9>最多能安放自<0xE5><0xB7><0xB1>去生活。<0x0A>如果你有人看<0xE9><0x80><0x99>段情景不好▁就可以关注在线<0xE6><0x92><0xAD>客(部分)▁这里为我们一起同行,因此地方▁全是在他实现<0xE5><0xA5><0xB9>的目标。参与<0xE7><0xAF><0x87><0xE5><0x8D><0xB3>将从未来开始开展!</s> [end of text]


beiller commented Mar 12, 2023

@wizzard0 the tokenizer works on sentence pieces. The tokenizer used is here:

https://github.com/google/sentencepiece

But the code here is not using it; it is using a reverse-engineered method, which actually works great most of the time. The problem is that the code assumes a 1-integer-to-1-"string" mapping. To fix this we need to either reverse engineer sentencepiece or include it in the repo. It's a small codebase; I have a branch that can compile it and it's working well.

To reverse engineer it we will need protobuf (or reverse engineer protobuf as well), because the token model is stored in models/tokenizer.model, which appears to be a protobuf file.

@wizzard0

@beiller llama uses sentencepiece in BPE mode. It does not care about characters.
quoting the paper:

We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

So you might need multiple tokens to build 1 printable character, and sometimes they won’t even add up to valid UTF-8. The same as with GPT3/ChatGPT.

But that is not a problem. Just forget the characters and work with the bytes directly.


beiller commented Mar 12, 2023

@wizzard0 yes, it falls back to byte encoding and I understand all of that. But in the code / model / dictionary we are using here, there is nothing mapping 0x85 to token ID 136 (136 being what the model expects as an input). All the mappings are higher up the thread and they all map to 0xEFBFBD:

� vocab[237] (EFBFBD)
� vocab[238] (EFBFBD)
� vocab[239] (EFBFBD)
� vocab[240] (EFBFBD)

There's no way for us to map the raw byte value (0x85) to the proper ID (136) without sentencepiece. @wizd I believe even you had to use sentencepiece to "preprocess" your prompt in order to get the mapping 0x85 -> 136 correct?

Edit

maybe there's a hacky way to find 0x85 -> 136 by digging through the models/tokenizer.model file :)
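A hedged sketch of exactly that hack, done offline with the sentencepiece Python API rather than at runtime: the byte-fallback entries can be recognized by their <0xNN> piece names (as shown in the dumps above), and the resulting table could then be baked into the converted model file:

import re
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor("models/tokenizer.model")

# Byte-fallback entries have piece names like '<0x85>'; collect byte -> token ID.
byte_to_id = {}
for tok_id in range(sp.get_piece_size()):
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", sp.id_to_piece(tok_id))
    if m:
        byte_to_id[int(m.group(1), 16)] = tok_id

print(byte_to_id[0x85])   # expected: 136
print(byte_to_id[0xE7])   # expected: 234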


wizzard0 commented Mar 12, 2023

meh, I guess it's simpler to code than to explain >_< It's dead simple. Just please, please forget characters, words, regexes, etc. The model works with raw bytes.

e.g. the dictionary cannot be encoded as JSON, because tokens won't be valid UTF-8 strings.

have to go to sleep rn, maybe will come back tomorrow and write this


beiller commented Mar 13, 2023

I have a branch to include sentencepiece #66


kharvd commented Mar 13, 2023

(screenshot attached)

Curiously, the https://nat.dev/ implementation also struggles with it


j-f1 commented Mar 13, 2023

Made some improvements in #73!


wizd commented Mar 13, 2023

I think UTF-8 encoding is fixed in #87
