Possible issue with kv shifting #4097

Closed · 4 tasks done
MrJackSpade opened this issue Nov 16, 2023 · 4 comments

Comments


MrJackSpade commented Nov 16, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

I'm leveraging the Llama.dll directly through C# interop. I'm using YARN and CUDA, and running on Windows. This may not be specific to YARN; I didn't see it while using NTK scaling, but I also switched to YARN the same day I pulled down the code that introduced it.

I've been trying to track down the cause of this issue for the last few weeks. It seems as though once the tokens within the cache have shifted by a certain amount from their starting positions, the model just starts to spew garbage.

I'm currently running with 2x scaling (rope-freq-scale 0.5) at a context size of 8192.

Here is what I've managed to reasonably identify so far (a sketch of the calls I mean by "shifting" follows this list).

  1. The problem does not occur until the cache begins to shift. I have filled the context a number of times from 0 - ~8000 with no loss of coherency.
  2. The position at which the cache begins to shift does not matter. Following the first test, I started shifting the cache at ~2000 tokens (full) instead of 8000, leaving all other settings unchanged. After a handful of small (100-position) shifts, the model started generating garbage, the same behavior I saw when it started shifting at ~8000.
  3. The size of each shift does not seem to matter (much). Initially I was testing with frequent 100-position shifts; the model usually starts to lose coherency between 10 and 15 shifts. When I switched to 1000-position shifts, the model turned to garbage after the third shift. The breakdown appears to start once the cells have been moved a total of N positions from their starting point, not after N individual shifts.
  4. The garbage is not the result of any kind of feedback loop in generation. In the tests below I let the model do its own thing, but I get the same results when chatting with the model in an "interactive" style. Even when attempting to steer the model away from errors, provide new content, etc., the text always loses coherency at a certain point.
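
For context, this is what I mean by "shifting". It is a sketch, not my actual interop code, of the two llama.h calls the operation boils down to as of the build I'm on; the variable names are just placeholders:

    // discard the oldest n_discard cells after the kept prefix
    llama_kv_cache_seq_rm   (ctx, 0, n_keep, n_keep + n_discard);
    // slide the remaining cells back by n_discard positions; as far as I
    // understand, the positional delta is applied to the cached K values
    // (via RoPE) on the next decode
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);
    n_past -= n_discard;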

I've uploaded two files containing text generated with shifting and text generated without shifting. It's far from a perfect test, since modifying the context itself necessarily changes the results. To gather the shifted text I made a slight modification to main.cpp: instead of shifting by half the context size when the context fills, I set it to shift by 256 once it reached 2048. For both examples I seeded the same ~2000-token prompt, the opening of a book.

I modified main.exe to shift earlier and in smaller steps both because I only have a 3090, and because the problem is much harder to spot with larger "windows", since the degradation only "steps up" when each shift occurs. With a 1000-position shift the text largely appears to be coherent until the extra 1000 positions have been applied, at which point it turns to gibberish immediately. Not very helpful visually.
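
Roughly what that modification looks like, as a sketch against the main.cpp of that time rather than my exact diff (the stock code computes n_discard as about half of the non-kept context):

    if (n_past >= 2048) {
        // stock main.cpp uses roughly:
        //   const int n_left    = n_past - params.n_keep - 1;
        //   const int n_discard = n_left/2;
        const int n_discard = 256;

        // drop the oldest 256 cells after the kept prefix ...
        llama_kv_cache_seq_rm   (ctx, 0, params.n_keep + 1, params.n_keep + 1 + n_discard);
        // ... and slide everything after them back by 256 positions
        llama_kv_cache_seq_shift(ctx, 0, params.n_keep + 1 + n_discard, n_past, -n_discard);

        n_past -= n_discard;
    }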

llama.cpp\out\build\x64-Release\bin\main.exe -t 20 -mg 1 -ngl 20 --no-mmap --mlock --no-penalize-nl  --seed 0 --temp 0 --file "prompt.txt" -c 8192 -n -1 --yarn-orig-ctx 4096 --yarn-ext-factor 1 --yarn-attn-factor 1 --rope-freq-scale 0.5 --rope-freq-base 10000 --rope-scaling yarn --keep -1 -m "Airoboros-l2-70b-3.1.2.Q5_K_M.gguf"

NoShift.txt

Shift.txt

In the "No Shift" example, you can see that the text stays coherent throughout the generation. It does eventually end up looping, however it is almost pristine from a spelling and grammar perspective. The text stays logical and consistent up until the point it loops

The "Shift" example, does a similar kind of repeat, however it very quickly becomes incoherent while doing so. I have inserted the phrase "-Shift-" into the text (along with a newline) for clarity.

From the very first shift, it completes the word "but" with "terfield", which is already nonsensical, though it quickly recovers. This kind of stumble is incredibly common at the point of shifting: coherency is lost momentarily and then recovered. Note the second shift, where it continues "The porch creaked under our" with the text "old bus rides for Josh", which again makes no sense.

By the third shift, the model has a much harder time "recovering". For the first half of the block the text is filled with obvious errors: Maybe it'll make lots of new friends here, Josh." "No way!" he cried angrily. "It's for the house was right about one thing. This house is gross," As subsequent shifts occur, the model eventually begins slapping together chunks of words from other areas within the context. It loses all sense of coherency.

"They look so much alike," Mr. Dawes. "Let's us. I couldn't decide if that was a work," Dad said, smoing at Dawes. "They look so much alike," Mr. Dawes told Mom," Dad said. mer and maybe a rec room too." "They’d like that—wouldn't you, Amanda?" I'd met since we got here. "I guessed it was because of all the tall, old trees. "I really want to go home," Josh said

One thing I've noticed about the loss of coherency caused by kv shifting is that most of the "gibberish" consists of chunks of sentences repeated verbatim from elsewhere in the context. In the example above, the text "maybe a rec room too" originates from within the prompt: "We'll have room for a den and maybe a rec room." Another near copy-and-paste is the chunk "work," Dad said, smoing at Dawes.", which comes from the prompt text "It just needs some work, Josh," Dad said, smiling at Mr. Dawes." This is more apparent in the attached example because the model had already started repeating these phrases by that point, but even in a "multi turn" conversation, where the model is not given the chance to repeat itself before losing coherency entirely, the same behavior shows up. When I manually guide the narrative through conversation, the model breaks down and performs the exact same "large chunk copy and paste" out of nowhere that you see above.

Aside from being able to (conditionally) recreate the behavior using the stock main.exe, I'm fairly positive at this point that the issue isn't the result of cache management on the client side. I've rewritten the cache management code in my application three times over the past few weeks, convinced I was doing something wrong. I went as far as adding the token value to the kv cell and pulling the cache over with every token evaluation to verify, cell by cell, that the kv cache state matched what I expected; in every case it does.

I can provide any other examples required to diagnose and resolve this issue. After weeks of debugging I've resigned myself to the fact that it's above my pay grade.

MrJackSpade (Author)

Not specific to YARN. I pulled down the original branch that included the shifting; same issue. The first shift mostly retains coherency, but subsequent shifts of the same cells cause it to rapidly break down.

MrJackSpade changed the title from "Possible issue with kv shifting and (possibly) YARN" to "Possible issue with kv shifting" on Nov 16, 2023
MrJackSpade (Author)

Closing due to invalid test case.

MrJackSpade closed this as not planned on Nov 17, 2023
@leachim66

Why was this thread closed?

I'm encountering exactly the same problems after context shifting:

A few shifts by small amounts work quite well, but the more shifts there are and the greater the shift distance, the weirder the results get.

Evaluating the same prompt on a fresh instance returns the expected results.

Remark: I'm writing a text editor with a llama.cpp-based auto-completion feature, and naturally users move through the text while editing.

A few small forward or backward shifts of part of the context work fine, both for completions referring to unshifted parts of the prompt text and for completions referring to shifted parts.

But over time the completions get weirder until they produce pure nonsense.
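
For illustration, a sketch of how a single edit might map onto the llama.h calls of that era; the two function names are the public API, but the surrounding variables are just placeholders:

    // the user deleted n_del tokens whose cells sit at positions [p0, p0 + n_del)
    llama_kv_cache_seq_rm   (ctx, 0, p0, p0 + n_del);
    // shift the tail back so the remaining positions stay contiguous
    llama_kv_cache_seq_shift(ctx, 0, p0 + n_del, n_past, -n_del);
    n_past -= n_del;
    // an insertion is the mirror image: shift the tail forward by +n_ins,
    // then decode the newly inserted tokens at positions [p0, p0 + n_ins)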

MrJackSpade (Author)

I closed the issue because there was a problem with the test case I had set up; the underlying issue was unresolved, though.

Unfortunately, because I'm running almost entirely on CPU, it takes a long time to run the tests required to be certain I have a valid test case, and without one I couldn't (in good conscience) push to have the issue fixed when I was the only one hitting it.

I assume (possibly incorrectly) that the root cause is accumulated imprecision in the floats used for the RoPE calculations, so to mitigate it I switched to 1000-position shifts with an F32 KV cache.
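
If anyone wants to reproduce that setup, a sketch against the llama.h of that time, assuming I remember the params correctly (the f16_kv flag was still part of llama_context_params back then; the main.exe equivalent should be the --memory-f32 flag):

    struct llama_context_params cparams = llama_context_default_params();
    cparams.f16_kv = false;  // keep the KV cache in F32 instead of F16
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);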

I would still really like a better solution to this, though, because having to cap the number of times I can shift a cache cell really limits what I can do.

Feel free to open a new issue and reference this one; I'll comment/react in support.
