
Feature - Internal ggml precision GGML_TYPE_F16 support #1492

Closed
cmp-nct opened this issue May 17, 2023 · 9 comments

@cmp-nct
Contributor

cmp-nct commented May 17, 2023

It might be too much to ask for right now, given how deep it reaches into ggml, but in the long term I believe it is important to support 16-bit precision.
Especially as GPU support gains more and more traction in GGML, the 32-bit requirement is a significant performance burden while providing no benefit for the multiplications.
After all, the multiplications inside the GPU are all 16-bit, so converting src1 from 32-bit to 16-bit for every calculation costs quite noticeable performance.

@Green-Sky
Collaborator

This is already supported and in use? I'm not sure which parts you are referring to.

@cmp-nct
Contributor Author

cmp-nct commented May 17, 2023

> This is already supported and in use? I'm not sure which parts you are referring to.

matmul is designed for 32-bit only; the precision is hardcoded for the dst. That's why all src1 matmuls are 32-bit even in 4-bit quantized mode.
It's not just matmul and its internal counterparts: other parts of ggml also have no 16-bit representation, so they will assert false if you attempt it.
On the CUDA side it's a tiny change; you'd just need to check whether src1 is already 16-bit and skip the (slow) conversion to 16-bit.
In addition to the lower memory overhead, cuBLAS is 20+% faster on pure 16-bit compute; currently it does the slower mixed 16/32-bit computation.
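
As a rough illustration of the CUDA-side change described here (a sketch only: mul_mat_f16_sketch, src1_is_f16, src1_f16_buf and the convert_fp32_to_fp16_cuda placeholder are hypothetical names, not the actual ggml-cuda code), the idea is to pass src1 straight into a pure-F16 cublasGemmEx call when it already arrives as GGML_TYPE_F16:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch of one 2D mat-mul slice: src0 (weights) is already F16; src1 may be F16 or F32.
void mul_mat_f16_sketch(cublasHandle_t handle,
                        const half * src0_f16, int ne00, int ne01,  // src0: ne00 x ne01
                        const void * src1, bool src1_is_f16, int ne11,
                        half * src1_f16_buf,                        // scratch for converted src1
                        half * dst_f16) {
    const half * src1_f16;
    if (src1_is_f16) {
        // src1 is already 16-bit: use it directly, no conversion pass needed
        src1_f16 = (const half *) src1;
    } else {
        // current behaviour: convert src1 from F32 to F16 first
        // (placeholder for whatever conversion kernel/helper is used)
        // convert_fp32_to_fp16_cuda((const float *) src1, src1_f16_buf, (size_t) ne00 * ne11);
        src1_f16 = src1_f16_buf;
    }

    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // pure F16 GEMM: inputs, output and compute type are all 16-bit
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 ne01, ne11, ne00,
                 &alpha,
                 src0_f16, CUDA_R_16F, ne00,
                 src1_f16, CUDA_R_16F, ne00,
                 &beta,
                 dst_f16,  CUDA_R_16F, ne01,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```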

@sw
Collaborator

sw commented May 17, 2023

I believe this is what #959 was about.

@Green-Sky
Collaborator

@cmp-nct you mean like this? #1508

@ggerganov
Owner

> and skip the (slow) conversion to 16-bit

Is it really slow? My expectation is that it would be completely negligible.

@cmp-nct
Contributor Author

cmp-nct commented May 17, 2023

> and skip the (slow) conversion to 16-bit
>
> Is it really slow? My expectation is that it would be completely negligible.

I ran a test yesterday and saw significantly faster inference, but it was a hacked-together test.
From memory I had mat_mul times of 72 ms avg. with the change and 90 ms without, but I can't say for certain whether it was only that or whether I had more than one change in place.

With my recent upstream pull all my local code needs to be adapted again; I'll run a second test to confirm it once I have put the pieces back together.

@ggerganov: Do you know if 32-bit precision comes with a real quality benefit compared to half precision?

@ggerganov
Owner

> Do you know if 32-bit precision comes with a real quality benefit compared to half precision?

There is no measurable difference in perplexity between F16 and F32.

@cmp-nct
Contributor Author

cmp-nct commented May 18, 2023

I ran the test again and could not replicate the performance gain anymore; maybe I had two changes in place yesterday.
I believe you are right: the conversion (ggml_fp32_to_fp16_row) is not causing a relevant performance loss.
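
For reference, a minimal micro-benchmark sketch for timing this conversion in isolation (it assumes the ggml_fp32_to_fp16_row() declaration from ggml.h; the tensor size and repetition count are arbitrary example values):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

#include "ggml.h"

int main() {
    const int n    = 4096 * 512;  // e.g. one src1 slice: 4096-dim model, 512 tokens
    const int reps = 100;

    std::vector<float>       src(n);
    std::vector<ggml_fp16_t> dst(n);
    for (int i = 0; i < n; i++) {
        src[i] = (float) i / n;
    }

    // time only the F32 -> F16 row conversion, averaged over `reps` calls
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        ggml_fp32_to_fp16_row(src.data(), dst.data(), n);
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
    printf("fp32 -> fp16 conversion of %d values: %.3f ms per call\n", n, ms);
    return 0;
}
```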

@github-actions github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024