Steps to Reproduce
Very easy to reproduce. Use --interactive --interactive-first -c 32 and paste in anything over 32 tokens. After processing the prompt, it crashes as soon as it generates the first token. When compiled without BLAS it's just a segfault; with cuBLAS it reports:
CUDA error 12 at ggml-cuda.cu:1567: invalid pitch argument
It appears the normal prompt size checking logic doesn't apply to input from interactive mode (or it's not functioning correctly). I verified that using -f with a prompt from a file does fail gracefully, as expected:
main: error: prompt is too long (185 tokens, max 124)
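To make the expected behavior concrete, here is a minimal sketch of the kind of length guard that seems to be missing for interactive input. This is not the project's actual code: `PromptCheck`, `check_prompt_length`, and the `reserve` default of 4 are hypothetical names and values chosen for illustration. The reserve of 4 is only a guess that happens to reproduce the "max 124" message above if the context size in that run was 128.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: whether the prompt fits, plus an error message if not.
struct PromptCheck {
    bool ok;
    std::string message;
};

// Hypothetical helper (not the project's real API): checks whether a prompt of
// n_tokens fits into a context of n_ctx tokens while keeping `reserve` slots free.
// The default reserve of 4 is an assumption made for this sketch.
PromptCheck check_prompt_length(std::size_t n_tokens,
                                std::size_t n_ctx,
                                std::size_t reserve = 4) {
    assert(n_ctx > reserve);  // otherwise max_tokens would underflow
    const std::size_t max_tokens = n_ctx - reserve;
    if (n_tokens > max_tokens) {
        return {false, "prompt is too long (" + std::to_string(n_tokens) +
                           " tokens, max " + std::to_string(max_tokens) + ")"};
    }
    return {true, ""};
}
```

The idea would be to run the same check on every chunk of tokenized interactive input before it is evaluated, and to print the message and keep waiting for input instead of crashing. With -c 32 and this reserve, anything over 28 tokens would be rejected.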
I didn't fill in the rest of the issue primarily because I'm a jerk but also because I don't think this issue could have anything to do with anything local.