
Feature - Internal ggml precision GGML_TYPE_F16 support #1492

Closed
cmp-nct opened this issue May 17, 2023 · 9 comments

@cmp-nct
Contributor

cmp-nct commented May 17, 2023

It might be too much to ask for right now, given how deep it reaches into ggml, but in the long term I believe it is important to support 16-bit precision.
Especially as GPU support gains more and more traction in GGML, the 32-bit requirement is a significant performance burden while providing no benefit for the multiplications.
After all, the multiplications inside the GPU are all 16-bit, so converting src1 from 32-bit to 16-bit for every calculation costs quite noticeable performance.

@Green-Sky
Collaborator

This is already supported and in use? I'm not sure which parts you are referring to.

@cmp-nct
Contributor Author

cmp-nct commented May 17, 2023

> This is already supported and in use? I'm not sure which parts you are referring to.

matmul is designed for 32-bit only; the precision is hardcoded for the dst. That's why all src1 matmuls are 32-bit even in 4-bit quantized mode.
It's not just matmul and its internal counterparts: other parts of ggml also have no 16-bit representation, so they will assert false if you attempt it.
On the CUDA side it's a tiny change; you'd just need to check whether src1 is already 16-bit and skip the (slow) conversion to 16-bit.
In addition to the lower memory overhead, cuBLAS is 20+% faster on pure 16-bit compute; currently it does the slower mixed 16/32-bit computation.
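
As a rough illustration of the CUDA-side change described here (a sketch only: mul_mat_f16_sketch, src1_is_f16, src1_f16_buf and the convert_fp32_to_fp16_cuda placeholder are hypothetical names, not the actual ggml-cuda code), the idea is to pass src1 straight into a pure-F16 cublasGemmEx call when it already arrives as GGML_TYPE_F16:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch of one 2D mat-mul slice: src0 (weights) is already F16; src1 may be F16 or F32.
void mul_mat_f16_sketch(cublasHandle_t handle,
                        const half * src0_f16, int ne00, int ne01,  // src0: ne00 x ne01
                        const void * src1, bool src1_is_f16, int ne11,
                        half * src1_f16_buf,                        // scratch for converted src1
                        half * dst_f16) {
    const half * src1_f16;
    if (src1_is_f16) {
        // src1 is already 16-bit: use it directly, no conversion pass needed
        src1_f16 = (const half *) src1;
    } else {
        // current behaviour: convert src1 from F32 to F16 first
        // (placeholder for whatever conversion kernel/helper is used)
        // convert_fp32_to_fp16_cuda((const float *) src1, src1_f16_buf, (size_t) ne00 * ne11);
        src1_f16 = src1_f16_buf;
    }

    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // pure F16 GEMM: inputs, output and compute type are all 16-bit
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 ne01, ne11, ne00,
                 &alpha,
                 src0_f16, CUDA_R_16F, ne00,
                 src1_f16, CUDA_R_16F, ne00,
                 &beta,
                 dst_f16,  CUDA_R_16F, ne01,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```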

@sw
Collaborator

sw commented May 17, 2023

I believe this is what #959 was about.

@Green-Sky
Collaborator

@cmp-nct you mean like this? #1508

@ggerganov
Owner

> and skip the (slow) conversion to 16-bit

Is it really slow? My expectation is that it would be completely negligible.

@cmp-nct
Contributor Author

cmp-nct commented May 17, 2023

> and skip the (slow) conversion to 16-bit
>
> Is it really slow? My expectation is that it would be completely negligible.

I ran a test yesterday and saw significantly faster inference, but it was a hacked-together test.
From memory I had mat_mul times of 72 ms avg. with the change and 90 ms without, but I can't say for certain whether it was only that or whether I had more than one change in place.

With my recent upstream pull all my local code needs to be adapted again; I'll run a second test to confirm it once I have put the pieces back together.

@ggerganov: Do you know if 32-bit precision comes with a real quality benefit compared to half precision?

@ggerganov
Owner

> Do you know if 32-bit precision comes with a real quality benefit compared to half precision?

There is no measurable difference in perplexity between F16 and F32.

@cmp-nct
Contributor Author

cmp-nct commented May 18, 2023

I ran the test again and could not replicate the performance gain anymore; maybe I had two changes in place yesterday.
I believe you are right: the conversion (ggml_fp32_to_fp16_row) is not causing a relevant performance loss.
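
For reference, a minimal micro-benchmark sketch for timing this conversion in isolation (it assumes the ggml_fp32_to_fp16_row() declaration from ggml.h; the tensor size and repetition count are arbitrary example values):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

#include "ggml.h"

int main() {
    const int n    = 4096 * 512;  // e.g. one src1 slice: 4096-dim model, 512 tokens
    const int reps = 100;

    std::vector<float>       src(n);
    std::vector<ggml_fp16_t> dst(n);
    for (int i = 0; i < n; i++) {
        src[i] = (float) i / n;
    }

    // time only the F32 -> F16 row conversion, averaged over `reps` calls
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        ggml_fp32_to_fp16_row(src.data(), dst.data(), n);
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
    printf("fp32 -> fp16 conversion of %d values: %.3f ms per call\n", n, ms);
    return 0;
}
```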

@github-actions github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024