CausalLM ERROR: byte not found in vocab #840

Open
ArtyomZemlyak opened this issue Oct 24, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@ArtyomZemlyak

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Run the CausalLM model without errors:
https://huggingface.co/TheBloke/CausalLM-14B-GGUF/tree/main

Current Behavior

Error when loading the model (llama-cpp-python installed through pip, not from source):

...
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_1:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
ERROR: byte not found in vocab: '
'
fish: Job 1, 'python server.py --api --listen…' terminated by signal SIGSEGV (Address boundary error)

ggerganov/llama.cpp#3732
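
For reference, loading the GGUF directly with the Python API should hit the same error during model load (just a sketch; the file path is only an example):

from llama_cpp import Llama  # pip-installed llama-cpp-python, no GPU options needed to reproduce

# The error is raised while llama.cpp builds the tokenizer vocab, i.e. during
# model load and before any generation happens.
llm = Llama(model_path="./causallm_14b.Q5_1.gguf")  # example path to the downloaded GGUF
print(llm("Hello", max_tokens=8))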

Environment and Context

Docker container with the latest version.

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Failure Logs

@abetlen added the bug (Something isn't working) label on Oct 24, 2023
@Gincioks

Same

@alienatorZ

I am glad to know it's not just me.

@thekitchenscientist

I am facing the same issue using the Q5_0.gguf. The missing byte in the vocab varies. Sometimes all the spaces between words are replaced with !; other times there are no spaces between words in the output.

@jorgerance

This has already been discussed in llama.cpp. The team behind CausalLM and TheBloke are aware of this issue, which is caused by the "non-standard" vocabulary the model uses. The last time I tried, CPU inference was already working for GGUF, and according to the latest comments on one of the related llama.cpp issues, inference seems to be running fine on GPU too: ggerganov/llama.cpp#3740

@jorgerance

Got it working. It's quite straightforward; just follow the steps below.

Try reinstalling llama-cpp-python as follows (I would advise using a Python virtual environment but that's a different topic):

CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-11.8/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade

Set CUDACXX to the path of the nvcc version you intend to use. I have versions 11.8 and 12 installed and succeeded with 11.8 (I didn't try to build with version 12).

Afterwards, generation with CausalLM 14B works smoothly.
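
A minimal check that the rebuilt wheel is the one being imported and that it actually offloads to the GPU (just a sketch; the path and values mirror my settings below):

from llama_cpp import Llama

# With n_gpu_layers > 0 the llama.cpp startup log should print
# "using CUDA for GPU acceleration" and "offloaded ... layers to GPU",
# which confirms the wheel was built with cuBLAS. The path and values
# simply mirror the settings printed below.
llm = Llama(
    model_path="/root/code/localai/models/causallm_14b.Q5_1.gguf",
    n_gpu_layers=45,
    n_ctx=8192,
    chat_format="chatml",
    verbose=True,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Is this chat working?"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])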

python llama_cpp_server_v2.py causallm
Model causallm is not in use, starting...
Starting model causallm

······················ Settings for causallm  ······················
> model:                            /root/code/localai/models/causallm_14b.Q5_1.gguf
> model_alias:                                                              causallm
> seed:                                                                   4294967295
> n_ctx:                                                                        8192
> n_batch:                                                                       128
> n_gpu_layers:                                                                   45
> main_gpu:                                                                        0
> rope_freq_base:                                                                0.0
> rope_freq_scale:                                                               1.0
> mul_mat_q:                                                                       1
> f16_kv:                                                                          1
> logits_all:                                                                      1
> vocab_only:                                                                      0
> use_mmap:                                                                        1
> use_mlock:                                                                       1
> embedding:                                                                       1
> n_threads:                                                                       4
> last_n_tokens_size:                                                            128
> numa:                                                                            0
> chat_format:                                                                chatml
> cache:                                                                           0
> cache_type:                                                                    ram
> cache_size:                                                             2147483648
> verbose:                                                                         1
> host:                                                                      0.0.0.0
> port:                                                                         8040
> interrupt_requests:                                                              1
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from /root/code/localai/models/causallm_14b.Q5_1.gguf (version unknown)
llama_model_loader: - tensor    0:                token_embd.weight q5_1     [  5120, 152064,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q5_1     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q5_1     [  5120,  5120,     1,     1 ]
[...]
llm_load_print_meta: model ftype      = mostly Q5_1
llm_load_print_meta: model params     = 14.17 B
llm_load_print_meta: model size       = 9.95 GiB (6.03 BPW) 
llm_load_print_meta: general.name   = causallm_14b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token  = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  557.00 MB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 9629.41 MB
...........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 6400.00 MB
llama_new_context_with_model: kv self size  = 6400.00 MB
llama_new_context_with_model: compute buffer total size = 177.63 MB
llama_new_context_with_model: VRAM scratch buffer: 171.50 MB
llama_new_context_with_model: total VRAM used: 16200.92 MB (model: 9629.41 MB, context: 6571.50 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
INFO:     Started server process [735011]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8040 (Press CTRL+C to quit)
INFO:     127.0.0.1:52746 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47452 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47456 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47468 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47484 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:47500 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52300 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52302 - "OPTIONS / HTTP/1.0" 200 OK
·································· Prompt ChatML ··································

<|im_start|>system
You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.<|im_end|>
<|im_start|>user
Is this chat working?<|im_end|>
<|im_start|>assistant

INFO:     127.0.0.1:52314 - "POST /v1/chat/completions HTTP/1.1" 200 OK

llama_print_timings:        load time =     173.04 ms
llama_print_timings:      sample time =      12.94 ms /    29 runs   (    0.45 ms per token,  2240.77 tokens per second)
llama_print_timings: prompt eval time =     172.94 ms /    64 tokens (    2.70 ms per token,   370.07 tokens per second)
llama_print_timings:        eval time =     434.08 ms /    28 runs   (   15.50 ms per token,    64.50 tokens per second)
llama_print_timings:       total time =    1158.61 ms
INFO:     127.0.0.1:52328 - "OPTIONS / HTTP/1.0" 200 OK
INFO:     127.0.0.1:52334 - "OPTIONS / HTTP/1.0" 200 OK
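
For anyone who wants to test the endpoint directly, something like this should work against the running server (a sketch; adjust host and port to your settings):

import requests  # any OpenAI-compatible client works; this is just the raw HTTP call

# Host, port and the messages mirror the server settings and ChatML prompt above;
# the response follows the OpenAI chat-completions schema.
resp = requests.post(
    "http://0.0.0.0:8040/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. You can help me by answering my questions. You can also ask me questions."},
            {"role": "user", "content": "Is this chat working?"},
        ],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])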

Feel free to ping me if you don't succeed.

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue on Oct 30, 2023: "Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros"
@javaarchive

javaarchive commented Oct 31, 2023

Odd, I'm still getting the crash with the error. I'm trying to apply this fix to the text-generation-webui Docker container. I'm running the following in Portainer inside the container (a quick ls of /usr/local/ shows cuda 12.1 is present in the directory):

source venv/bin/activate
CMAKE_ARGS='-DLLAMA_CUBLAS=on -DLLAMA_CUDA_MMV_Y=8 -DCMAKE_CUDA_ARCHITECTURES=native' CUDACXX=/usr/local/cuda-12.1/bin/nvcc FORCE_CMAKE=1 pip install git+https://github.com/abetlen/llama-cpp-python.git --force-reinstall --no-cache-dir --verbose --upgrade

Edit: never mind, nvcc was missing from that directory.
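
For anyone else hitting this, a quick sanity check before forcing the rebuild (just a sketch; the path matches the command above):

import os
import shutil

# The compiler that CUDACXX points at has to exist, otherwise the cuBLAS build
# cannot succeed; the path below is the one from the pip command above.
cudacxx = "/usr/local/cuda-12.1/bin/nvcc"
print("nvcc on PATH:  ", shutil.which("nvcc"))
print("CUDACXX exists:", os.path.exists(cudacxx))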

@NotSpooky

I'm assuming for AMD GPUs the command would be different (due to the lack of nvcc). I'm currently getting the error for llama-cpp-python 0.2.18 with a 6800 XT on Manjaro Linux (CPU works fine).
