-
Notifications
You must be signed in to change notification settings - Fork 13.6k
CUDA: fix crash on uneven context #16988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA: fix crash on uneven context #16988
Conversation
|
Should we a test for this as well? Maybe other backends also have the same issue |
Yes - good point. You can either add it, or leave a TODO in |
e5cc811 to
7c48209
Compare
|
I changed the logic for views in |
To me it would be a very unintuitive if, for example, the test parameters say |
|
How about this: instead of a boolean, specify some integer value as the view. The default is 0, which means no view, a non-zero value is used as dimension 0 of the view. |
The other way around. Use the value of |
7c48209 to
41735c2
Compare
| if (src0_nb[i] % (2*ts) != 0) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this disables mmf for batch_size = 1. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7456.65 ± 45.82 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 146.77 ± 0.08 |
after
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | pp512 | 7405.42 ± 53.04 |
| lfm2moe 8B.A1B F16 | 15.54 GiB | 8.34 B | CUDA | 99 | tg128 | 129.49 ± 0.68 |
Fixes #16976 .
The problem is that the CUDA kernel selection logic does not check strides, so it's trying to run kernels where the strides don't fit. The tests don't detect this because the strides are always constructed as
2*ne00.@ggerganov I didn't see a warning w.r.t. the KV cache having an inconvenient size, I think it would make sense to add one.