RPC + Flash attention generation bug #7401

Closed
steampunque opened this issue May 19, 2024 · 6 comments

@steampunque

Build b2921, RPC backend, 4070 + 4070 + 1070, fully offloaded Mixtral.

I was testing out new quants with RPC and discovered a problem when running the LAMBADA bench with Flash Attention enabled: after ~100 prompts the generated results come back as non-ASCII tokens, and generation stays in that broken state permanently.
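A rough sketch of the kind of invocation involved (not my exact command; host names, ports, model path, and prompt are placeholders, and -fa is what toggles flash attention):

# on each remote box: start an RPC server for its local GPU
./rpc-server -H 0.0.0.0 -p 50052

# on the head node: fully offload across the RPC workers with flash attention on
./main -m mixtral-instruct.Q4_0.gguf \
       --rpc 192.168.1.10:50052,192.168.1.11:50052 \
       -ngl 99 -fa -p "..."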

Summary of my quants results:

QUANT   SIZE       ENTROPY     PP       TG        FATTN  LAMBADA  QA100  COMMENT

BASE:
Q3_K_M  21.00 GiB  (3.86 BPW)  178 tps  28 tps    0      0.732    x      NEW
Q4_K_M  24.62 GiB  (4.53 BPW)  171 tps  29 tps    0      0.732    x      OLD QUANT
Q4_0    24.63 GiB  (4.53 BPW)  163 tps  30.5 tps  0      0.731    x      NEW
Q4_K_S  24.91 GiB  (4.58 BPW)  172 tps  29 tps    0      0.746    x      NEW
Q4_K_M  26.49 GiB  (4.87 BPW)  150 tps  28 tps    0      0.740    x      NEW
Q4_1    27.32 GiB  (5.03 BPW)  168 tps  29 tps    0      0.733    x      NEW

INSTRUCT:
Q4_0    24.63 GiB  (4.53 BPW)  200 tps  30.5 tps  0      0.747    0.490  NEW
Q4_0    24.63 GiB  (4.53 BPW)  200 tps  31 tps    1      xxxxx    xxxxx  GEN CORRUPTS AFTER ~100 PROMPTS
Q4_K_S  24.91 GiB  (4.58 BPW)  172 tps  29 tps    0      0.745    0.500  NEW
Q4_K_S  24.91 GiB  (4.58 BPW)  172 tps  29 tps    1      xxxxx    xxxxx  GEN CORRUPTS AFTER ~100 PROMPTS
Q4_K_M  26.49 GiB  (4.87 BPW)  160 tps  27.5 tps  0      0.745    0.510  NEW
Q4_K_M  26.49 GiB  (4.87 BPW)  160 tps  28 tps    1      xxxxx    xxxxx  GEN CORRUPTS AFTER ~100 PROMPTS

The failure cases occur in the INSTRUCT tests when I turn flash attention on. This is the prompt history and point of failure:

PROMPT : he was small , even for a dwarf , and his poor taste in sorcerous robes contrasted awkwardly with d'jebee 's elegant attire ; her long , diaphanous gown and his chemical-stained , star-spangled robe clashed almost as much as her vacuous expression alongside his own visage , alive as it was with cunning and a twisted intelligence . d'jebee sighed with boredom . what is it , my love ? ' poldanyelz oozed with ersatz concern . i 'm bored , ' d'jebee complained undiplomatically . ` no one ever comes here . i never see anyone except you . ' a shuffling from the main arch alerted her to the inaccuracy of her
RESPONSE: statement . a figure ,
CORRECT RESPONSE: statement
LETTER T CORRECT T
79 29 108 0 0 .731
PROMPT : i found a book about napoleon 's life in our local library . one of the chapters described how he had visited a garden while traveling through italy . he went into the maze garden and could n't find his way out . his entourage went into the garden to retrieve him . is n't that incredible ? a superb military general , who had conquered most of the west at the time , got lost in a
RESPONSE: maze garden .</
CORRECT RESPONSE: garden
LETTER F CORRECT T
79 30 109 0 0 .724
PROMPT : members of the audience could be forgiven for thinking that a lone man sitting cross legged and smoking on stage with others in front of him singing curious songs looked a bit like a hippie . then the show got under way . sycko happily retired from the stage and master jeremiah entered to a rapturous applause looking resplendent in his frock and sacred top hat . `` ladies and gentlemen , dryvellers , dear guests , '' jeremiah said . it is the greatest pleasure to have you all here with us tonight and i 'm sure you 're going to have a wonderful time , in every sense of the word
RESPONSE: ▅▅▅▅▅
CORRECT RESPONSE: wonderful
LETTER F CORRECT T
79 31 110 0 0 .718
PROMPT : she 's a meddlesome , interfering brat , '' bowen said in amusement . but ' t is the truth we love her dearly and life would not be the same without her antics . '' genevieve grinned . `` is n't that the way with little sisters ? '' bowen rose and held out his hand to
RESPONSE: ▅▅▅▅▅
CORRECT RESPONSE: genevieve

The point of failure is extremely consistent; it fails at the same prompt reliably. The bogus response turns into repeating triads of 0xe2 0x96 0x85:

0880 64 20 68 65 6c 64 20 6f 75 74 20 68 69 73 20 68 d held out his h
0890 61 6e 64 20 74 6f 0a 52 45 53 50 4f 4e 53 45 3a and to.RESPONSE:
08a0 20 e2 96 85 e2 96 85 e2 96 85 e2 96 85 e2 96 85 ...............
08b0 0a 43 4f 52 52 45 43 54 20 52 45 53 50 4f 4e 53 .CORRECT RESPONS
08c0 45 3a 20 67 65 6e 65 76 69 65 76 65 0a ffffffff E: genevieve..

This error does not happen when flash attention is disabled. My guess is that it is related to multi-GPU splitting rather than the RPC layer, but I can only test with RPC since my local LAN boxes are SFFs.

@askmyteapot

Looks like the same issue I was having.
#7400

@steampunque
Author

> Looks like the same issue I was having. #7400

Yeah, I saw your issue just after I posted mine. I will do a quick test run with Mixtral CPU + GPU partial offload (no RPC) and see if it also breaks with flash attention.

@steampunque
Author

> Looks like the same issue I was having. #7400

I ran CPU + 12-layer GPU offload on Mixtral with flash attention on, and it looks to be working OK for me on the LAMBADA bench. It is still too coincidental that enabling flash attention produced bad outputs in both your use case and mine, so I think the same flash attention bug is still the most likely culprit; possibly memory pool resources get mismanaged under certain conditions.
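
For completeness, the non-RPC control run was a plain partial offload along these lines (model path and prompt are placeholders, not my exact command):

# CPU + 12 GPU layers, no RPC, flash attention enabled
./main -m mixtral-instruct.Q4_0.gguf -ngl 12 -fa -p "..."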

@steampunque
Author

Closing this issue as a duplicate of #7400. There is a problem with using a FORCE_MMQ build with CUDA and flash attention. There is no issue with RPC.
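
(For context: by FORCE_MMQ build I mean the CUDA build with the force-MMQ compile option turned on. With the Makefile build of this era that looks roughly like the following; option names are from memory of the build docs, so treat them as approximate.)

# CUDA build forcing the quantized mat-mul (MMQ) kernels
make clean
make LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 -j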

@steampunque
Author

Verified duplicate of #7400.

@steampunque
Author

This bug is now fixed, most likely by #7465 or #7479.
