
Add support for float16 (half-precision floats) and related operations such as hgemm() #234

Open
jacobgorm opened this issue Jul 16, 2018 · 28 comments


@jacobgorm

I am using BLIS for neural networks on embedded platforms (mostly ARMv8a), and I would like to reap the potential memory savings as well as possibly some speedups from running with half-precision floats. Are there any plans to support these in BLIS?

@fgvanzee
Member

@jacobgorm Thanks for the suggestion. This is something that is in our medium-range plans. Of course, as you probably already know, the complicating factor is that there is no standard C language support for a float16 datatype, so any solution would necessarily not be portable. (In principle, we can add float16 operations, but it would take a non-trivial amount of work. Also, we would need to design things so that the user could disable the system-specific float16 support if it were not available.)

@fgvanzee fgvanzee changed the title It would be great if BLIS had HGEMM() using half-precision floats. Add support for float16 (half-precision floats) and related operations such as hgemm() Jul 16, 2018
@jeffhammond
Member

Some useful information can be found in mpi-forum/mpi-issues#65.

@fgvanzee
Member

@jeffhammond Thank you for taking the time to rustle up these links, Jeff. This will surely prove very useful.

@jeffhammond
Member

I recommend that BLIS support bfloat16 rather than float16. The latest machine-learning research suggests that float16 is inferior to bfloat16 for training because of the software and processing overhead of handling the limited numerical range of its 5-bit exponent.

In any case, implementing both float16 and bfloat16 on hardware that doesn't have native support is relatively easy. In both cases, you compute in float32. For float16, you can use the AVX vcvtph2ps instruction to convert from float16 storage to float32 and then do the compute as you would for float32 (the latency is 4-7 cycles in the documentation I've found online). For bfloat16, the conversion is trivial, because you just copy the bfloat16 data into the upper half of a float32 register and proceed as before.
It might be possible to reuse the float32 microkernel.
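
To make the "copy into the upper half" trick concrete, here is a minimal C sketch of scalar bfloat16/float32 conversion. The helper names are hypothetical (not part of BLIS or any vendor API), and the float32-to-bfloat16 direction simply truncates rather than rounding to nearest-even:

```c
#include <stdint.h>
#include <string.h>

/* Widen bfloat16 (stored as uint16_t) to float: a bfloat16 value is just the
   top 16 bits of an IEEE binary32, so shift it into the upper half. */
static inline float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));   /* bit-copy to avoid aliasing issues */
    return f;
}

/* Narrow float to bfloat16 by truncation: keep the sign, the 8-bit exponent,
   and the top 7 mantissa bits. */
static inline uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16);
}
```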

Google recommends the use of bfloat16 with TensorFlow, and it is relatively straightforward to see why: for these workloads, an 8-bit exponent like float32's is a better use of the bits than the 5-bit exponent used by IEEE float16.

Intel's public statement on bfloat16 is:

Over time, Intel will be extending bfloat16 support across our AI product lines, including Intel Xeon processors and Intel FPGAs. This is part of a cohesive and comprehensive strategy to bring leading AI training capabilities to our silicon portfolio.

Disclaimer: I work for Intel.


@fgvanzee
Member

@jeffhammond Once again, this was very helpful Jeff. Thank you.

I had never even heard of bfloat16 before today. I can see why it would be preferable (especially for ML/AI applications) given the trade-off between exponent and mantissa.

@poulson

poulson commented Jul 18, 2018

Yes, bfloat16 is all the rage right now for inference, i.e., for deciding which bucket to put something in. It's also worth mentioning the 8-bit integer quantization approach taken by https://github.com/google/gemmlowp.
Disclaimer: I sit next to the author at work.

@jeffhammond
Member

jeffhammond commented Jul 18, 2018 via email

@Maratyszcza
Contributor

ARMv8.2 defines instructions for FP16 (IEEE format) computation. These are natively supported in Cortex-A55 and Cortex-A75 cores, e.g. in the Snapdragon 845, with the same per-instruction throughput as FP32 and therefore twice the FLOPS.
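
As a rough illustration of what those instructions look like from C, here is a minimal NEON sketch, assuming a compiler targeting ARMv8.2-A with FP16 enabled (e.g. -march=armv8.2-a+fp16). The function name is made up and remainder elements are ignored for brevity:

```c
#include <arm_neon.h>

/* c[i] += a[i] * b[i] over n elements, 8 half-precision lanes at a time. */
void fp16_fma_arrays(const float16_t *a, const float16_t *b, float16_t *c, int n)
{
    for (int i = 0; i + 8 <= n; i += 8)
    {
        float16x8_t va = vld1q_f16(a + i);
        float16x8_t vb = vld1q_f16(b + i);
        float16x8_t vc = vld1q_f16(c + i);
        vc = vfmaq_f16(vc, va, vb);   /* fused multiply-add, fully in FP16 */
        vst1q_f16(c + i, vc);
    }
}
```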

@jacobgorm
Author

Hi again. Are you guys still considering adding half-precision support to BLIS? FWIW, there does seem to be a bit of a hole in the market for a portable linear-algebra library that supports this. I know of FBGEMM from Facebook, but it is x86-only and uses a scary JIT, and the last time I tested the ARM Compute Library's GEMM it was really slow compared to BLIS. CLBlast is nice, but only works with OpenCL.

@jeffhammond
Member

https://arxiv.org/pdf/1904.06376.pdf ("Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations") is relevant reading for anyone following this thread.

@jeffhammond
Member

@jacobgorm I have spoken to @dnparikh and @fgvanzee about this on a number of occasions and I am confident that this is a priority for them.

@jeffhammond
Member

@fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.

@jeffhammond
Member

Intel published the BF16 ISA in the April 2019 update (319433-036) of the Intel® Architecture Instruction Set Extensions and Future Features Programming Reference.

For those who don't want to search the 149-page PDF, there is an unofficial synopsis on AnandTech.

@fgvanzee
Member

fgvanzee commented May 2, 2019

@fgvanzee I'd like to recant my prior comment in #234 (comment). For quantum chemistry, float16 might end up being more interesting. We are still studying this but it is ideal to have both for our experiments.

I'm trying to imagine what could have changed (what observations you could have made) that would flip the polarity on this issue. (You need those extra three bits of mantissa after all?)

@rvdg
Collaborator

rvdg commented May 2, 2019 via email

@jeffhammond
Member

jeffhammond commented May 2, 2019 via email

@fgvanzee
Member

fgvanzee commented May 2, 2019

We don’t need the exponent bits, so why not use them for the mantissa?

Touché. Anyhow, I'm less concerned with what people want than with whether there is basic support for the datatype in either the compiler or the ISA (or both).

@jacobgorm
Author

Clang now has experimental _Float16 support, but only on ARM: https://clang.llvm.org/docs/LanguageExtensions.html

@rvdg
Collaborator

rvdg commented May 2, 2019 via email

@jeffhammond
Member

@jacobgorm https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point also says

__fp16 is supported on every target, as it is purely a storage format; see below.

and

__fp16 is a storage and interchange format only. This means that values of __fp16 are immediately promoted to (at least) float when used in arithmetic operations...

I would argue that BLIS should use a typedef so that either format can be supported as input data.
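
One way that could look, as a rough sketch only (the feature-test macros and the typedef name here are my own guesses, not anything BLIS defines):

```c
#include <stdint.h>

#if defined(__FLT16_MANT_DIG__)          /* compiler provides _Float16 arithmetic */
typedef _Float16 blis_half_t;
#elif defined(__ARM_FP16_FORMAT_IEEE)    /* __fp16 available as an IEEE storage format */
typedef __fp16 blis_half_t;
#else
typedef uint16_t blis_half_t;            /* raw 16-bit storage; convert explicitly */
#endif
```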

@jacobgorm
Author

@jeffhammond the advantage for the library developer of having _Float16 in the compiler is that it does not promote to float, which should make initial development easier. I agree that the external interface could just as well use __fp16.
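
To spell out the difference being discussed, a small sketch that assumes a compiler and target supporting both extensions; it is not tied to any BLIS interface:

```c
void half_semantics_demo(void)
{
    __fp16 a = 0.1, b = 0.2;
    __fp16 c = a * b;        /* a and b are promoted to float, multiplied as float,
                                and the result is converted back to __fp16 */

    _Float16 x = 0.1, y = 0.2;
    _Float16 z = x * y;      /* computed directly in half-precision arithmetic */

    (void)c; (void)z;        /* silence unused-variable warnings */
}
```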

@jeffhammond
Member

@jacobgorm Yes, of course, but since I work for Intel, I have an interest in implementing something that is not restricted to ARM architectures 😃 In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.

@fgvanzee
Member

fgvanzee commented May 3, 2019

In any case, since BLIS is going to do all the important math explicitly in the microkernel, compiler promotion shouldn't be a major issue.

Let's all remember that BLIS allows the user to do more than level-3 operations! My goal is full operation support for float16 (or bfloat16), even if the implementation is sub-optimal. So the issues around float16 and the compiler are very important to me (even if efficiency is not).

@jhogg41

jhogg41 commented May 12, 2019

So far as I'm aware, there isn't a standardized calling convention for _Float16 on Intel, or at least if there is, my version of clang doesn't have it yet. As such, we can't pass data by value, which makes things a little messy (and using __fp16 in the interface would imply we operate with __fp16 semantics rather than _Float16 semantics).

@amirgholami

I also wanted to request reduced-precision support. I think it would be valuable to add both IEEE 754 FP16 and bfloat16, as the former's limited range causes major issues for ML training.

P.S: There is also a new TF32 format from Nvidia:
https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

@jeffhammond
Member

@amirgholami BLIS doesn't support GPUs, but TF32 is just a form of 19-bit floating point stored in 32-bit data. In the absence of hardware support, there is no upside versus SGEMM. With hardware support, the implementation is going to be the same as SGEMM but with a different microkernel, except for the loss of accuracy in the results, of course.
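
For readers unfamiliar with the format: TF32 keeps float32's sign bit and 8-bit exponent but only 10 mantissa bits, so its storage behavior can be sketched in software by masking the low mantissa bits. The helper below is illustrative only and truncates rather than rounding as the hardware does:

```c
#include <stdint.h>
#include <string.h>

/* Emulate TF32 storage: clear the low 13 of the 23 binary32 mantissa bits,
   leaving sign + 8-bit exponent + 10-bit mantissa (19 significant bits). */
static inline float f32_to_tf32_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    bits &= 0xFFFFE000u;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```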

@amirgholami

amirgholami commented Aug 19, 2020

Hey @jeffhammond

Yes, I am aware that TF32 is supported on the Ampere architecture. I mentioned it as evidence that there is still a lot of active research on low-precision arithmetic. On that note, I should also add MSFP8 and MSFP11, which come from Microsoft and are used in their Brainwave FPGA project.

Aside from these relatively new formats, there are a lot of linear-algebra algorithms that have already incorporated FP16 or bfloat16 (for example as preconditioners), and it would be great if BLIS supported them.

P.S.: Regarding hardware support, Intel Cooper Lake, which was announced last month, supports bfloat16 arithmetic.

@AngryLoki
Contributor

The amd/blis fork adds an aocl_gemm addon, which provides bf16 support in gemm for BF16-capable CPUs and a set of s8/u8 gemm functions for VNNI-capable CPUs. It also adds support for ReLU/GeLU/Downscale/CLIP post-ops.

Merge of amd/blis changes is discussed in #770.
