Possible issue with kv shifting #4097
Comments
Not specific to YARN. I pulled down the original branch that included the shifting; same issue. The first shift mostly retains coherency, but subsequent shifts of the same cells cause it to rapidly break down.
Closing due to invalid test case.
Why was this thread closed? I encounter exactly the same problems after context shifting: a few shifts by small amounts work quite well, but the more shifts there are and the greater the shift distance, the weirder the results get. Evaluating the same prompt on a fresh instance returns the expected results. Remark: I'm writing a text editor with a llama.cpp-based auto-completion feature, and naturally users move through the text while editing. A few small forward or backward shifts of part of the context work fine, both for completions referring to unshifted parts and for completions referring to shifted parts of the prompt text. But over time the completions get weirder, until they produce pure nonsense.
I closed the issue because there was a problem with the test case I had set up; the underlying issue was unresolved, though. Unfortunately, because I'm running almost entirely on CPU, it takes a long time for me to run the tests required to be certain I have a valid test case, and without a valid test case I couldn't (in good conscience) push to have the issue fixed when I was the only one having it. I assume (possibly incorrectly) that the root cause is accumulated floating-point imprecision in the RoPE calculations, and to mitigate it I switched to using 1000-position shifts with an F32 KV cache. I would still really like a better solution to this problem, though, because having to cap the number of times I can shift a cache cell really ties me down in terms of what I can do. You can open a new issue and reference this one, and I'll comment/react in support.
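For context, a minimal sketch of that mitigation, assuming the llama.cpp C API as it existed around the time of this issue (llama_kv_cache_seq_rm / llama_kv_cache_seq_shift; newer versions renamed the shift operation to llama_kv_cache_seq_add). The cap constant and bookkeeping are illustrative, not part of llama.cpp, and the F32 cache would come from the context params of that era rather than from anything shown here:

```cpp
// Sketch only: shift in large, fixed steps and cap how many times cells can
// be shifted before forcing a full re-evaluation of the prompt.
#include "llama.h"

static const int SHIFT_SIZE = 1000; // large shifts -> fewer accumulated rotations
static const int MAX_SHIFTS = 4;    // hypothetical cap, tuned empirically

static int shift_count = 0;

// Drop SHIFT_SIZE tokens after n_keep and slide the remainder back.
// Returns false once the cap is hit; the caller should then rebuild the
// cache from the prompt instead of shifting again.
bool shift_cache(llama_context * ctx, int n_keep, int & n_past) {
    if (shift_count >= MAX_SHIFTS) {
        return false;
    }
    llama_kv_cache_seq_rm   (ctx, 0, n_keep, n_keep + SHIFT_SIZE);
    llama_kv_cache_seq_shift(ctx, 0, n_keep + SHIFT_SIZE, n_past, -SHIFT_SIZE);
    n_past -= SHIFT_SIZE;
    shift_count++;
    return true;
}
```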
Prerequisites
Please answer the following questions for yourself before submitting an issue.
I'm calling Llama.dll directly through C# interop, using YaRN and CUDA, running on Windows. This may not be specific to YaRN; I didn't see it while using NTK scaling, but I also switched to YaRN the same day I pulled down the code that introduced it.
I've been trying to track down the cause of this issue for the last few weeks. It seems as though once the tokens within the cache have been shifted by a certain amount from their starting positions, the model just starts to spew garbage.
I'm currently running at a 2x (0.5) scale, with an 8192 context.
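For reference, this is my reading of that configuration through the C API. The field and enum names follow the YaRN-era llama.cpp context params (some have since been renamed), and the yarn_orig_ctx value is an assumption:

```cpp
#include "llama.h"

llama_context_params cparams = llama_context_default_params();
cparams.n_ctx             = 8192;                    // extended context
cparams.rope_scaling_type = LLAMA_ROPE_SCALING_YARN; // enum name of that era
cparams.rope_freq_scale   = 0.5f;                    // the "2x (0.5)" scale
cparams.yarn_orig_ctx     = 4096;                    // assumed training context
```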
Here is what I've managed to reasonably identify so far.
I've uploaded two files, containing text generated with shifting and text generated without shifting. It's far from a perfect test, since modifying the context itself naturally changes the results. I made a slight modification to main.cpp to gather the shifted text: instead of shifting by half the context size when reaching full context, I set it to shift by 256 once it reached 2048. For both examples I seeded the same ~2000 token prompt, the opening of a book.
The reason for modifying main.cpp to shrink the shift window is partly that I only have a 3090, and partly that it's much more difficult to identify the problem with larger "windows", since the degradation only "steps up" when each shift occurs. A 1000-position shift will largely appear to produce coherent text until it is shifted the extra 1000 positions, at which point it turns to gibberish immediately. Not very helpful visually.
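For clarity, here is roughly what that modification looks like, based on the context-shift logic in examples/main/main.cpp at the time; SHIFT_TRIGGER and SHIFT_AMOUNT are my names for the values described above:

```cpp
// Stock main.cpp discards half of the non-kept context when it fills; the
// modified version discards a fixed 256 tokens once n_past reaches 2048, so
// each shift is small and the per-shift breakdown is easier to see.
const int SHIFT_TRIGGER = 2048;
const int SHIFT_AMOUNT  = 256;

if (n_past >= SHIFT_TRIGGER) {
    const int n_keep    = params.n_keep;
    const int n_discard = SHIFT_AMOUNT; // stock code: (n_past - n_keep - 1)/2

    // Remove the discarded span, then shift the cells behind it back into place.
    llama_kv_cache_seq_rm   (ctx, 0, n_keep + 1, n_keep + n_discard + 1);
    llama_kv_cache_seq_shift(ctx, 0, n_keep + 1 + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```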
NoShift.txt
Shift.txt
In the "No Shift" example, you can see that the text stays coherent throughout the generation. It does eventually end up looping, however it is almost pristine from a spelling and grammar perspective. The text stays logical and consistent up until the point it loops
The "Shift" example, does a similar kind of repeat, however it very quickly becomes incoherent while doing so. I have inserted the phrase "-Shift-" into the text (along with a newline) for clarity.
From the very first shift, it decides to complete the word "but" with "terfield", which is already nonsensical, though it quickly recovers. This kind of stumble is incredibly common at the point of shifting: the model loses coherency momentarily and then recovers. Note the second shift, where it decides to continue
The porch creaked under our
with the text
old bus rides for Josh
which again makes no sense. By the third shift, the model is having a much harder time "recovering". For the first half of the block, the text is filled with obvious errors.
Maybe it'll make lots of new friends here, Josh." "No way!" he cried angrily. "It's for the house was right about one thing. This house is gross,"
As subsequent shifts occur, the model eventually begins slapping chunks of words together from other areas within the context. It loses all sense of coherency:
"They look so much alike," Mr. Dawes. "Let's us. I couldn't decide if that was a work," Dad said, smoing at Dawes. "They look so much alike," Mr. Dawes told Mom," Dad said. mer and maybe a rec room too." "They’d like that—wouldn't you, Amanda?" I'd met since we got here. "I guessed it was because of all the tall, old trees. "I really want to go home," Josh said
One thing I've noticed about the lack of coherency caused by KV shifting is that most of the "gibberish" consists of chunks of sentences repeated verbatim from elsewhere in the context. In the example above, the text
maybe a rec room too
originates from within the prompt:
We’ll have room for a den and maybe a rec room
Another example of an almost copy-and-paste is the chunk
work," Dad said, smoing at Dawes.
which comes from within the prompt, found as
"It just needs some work, Josh," Dad said, smiling at Mr. Dawes.
This is more apparent in the attached example because the model had already started repeating these phrases at that point; however, even in a "multi turn" conversation, where the model is not given the ability to repeat itself before fully losing coherency, this same behavior is exhibited. When I manually guide the narrative through conversation, the model will break down and perform the exact same "large chunk copy and paste" out of nowhere that you see above.
Aside from being able to (conditionally) recreate the behavior using the stock main.exe, I'm fairly positive at this point that the issue isn't the result of cache management on the client side. I've rewritten the cache management code in my application three times now over the past few weeks, convinced I was doing something wrong. I went as far as adding the token value to each KV cell and then pulling the cache over with every token evaluation to compare, cell by cell, that the KV cache state matched what I expected, and in every case it does (see the sketch below).
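As an illustration of that verification step, the sketch below mirrors it with a client-side shadow of the cache. All structures and names here are mine, not llama.cpp's; in the report above, the comparison runs against the real cache state pulled over through interop:

```cpp
#include <vector>

// One shadow cell per KV slot: the token that produced it and its current
// (post-shift) position.
struct ShadowCell {
    int  token = -1;
    int  pos   = -1;
    bool used  = false;
};

struct ShadowCache {
    std::vector<ShadowCell> cells;

    explicit ShadowCache(int n_ctx) : cells(n_ctx) {}

    // Mirror of a removal over positions [p0, p1).
    void rm(int p0, int p1) {
        for (auto & c : cells) {
            if (c.used && c.pos >= p0 && c.pos < p1) {
                c = ShadowCell{};
            }
        }
    }

    // Mirror of a shift of positions [p0, p1) by delta.
    void shift(int p0, int p1, int delta) {
        for (auto & c : cells) {
            if (c.used && c.pos >= p0 && c.pos < p1) {
                c.pos += delta;
            }
        }
    }
};

// After every cache operation, the real cache is compared cell by cell
// against the shadow; a mismatch would implicate the client-side
// bookkeeping rather than llama.cpp itself.
```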
I can provide any other examples required to diagnose and resolve this issue. After weeks of debugging, I've resigned myself to the fact that it's above my pay grade.