Add support for utilizing F16C instructions on xarch #127094

Open
tannergooding wants to merge 2 commits into dotnet:main from tannergooding:half-simd


@tannergooding tannergooding commented Apr 17, 2026

Since #122649 had to be reverted due to the ABI concerns, this is a simpler initial change that works with the existing ABI and on hardware with AVX2 support (not just AVX512-FP16 capable hardware).

This should provide a nice win across most existing hardware, and we can follow up with a PR that does the same for the AVX512-FP16 instructions, which allow directly accelerated arithmetic operations rather than only handling conversions.

Before

; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M16314_IG01:  ;; offset=0x0000
       4883EC28             sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M16314_IG02:  ;; offset=0x0004
       0FB7C9               movzx    rcx, cx
       FF156BA74500         call     [System.Half:op_Explicit(System.Half):float]
       90                   nop      
						;; size=10 bbWeight=1 PerfScore 3.50

G_M16314_IG03:  ;; offset=0x000E
       4883C428             add      rsp, 40
       C3                   ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 19

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M32250_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M32250_IG02:  ;; offset=0x0000
       FF2572A74500         tail.jmp [System.Half:op_Explicit(float):System.Half]
						;; size=6 bbWeight=1 PerfScore 2.00
; Total bytes of code: 6

After

; Method Program:HalfToSingle(System.Half):float (FullOpts)
G_M15861_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15861_IG02:  ;; offset=0x0000
       0FB7C1               movzx    rax, cx
       C5F96EC0             vmovd    xmm0, eax
       C4E27913C0           vcvtph2ps xmm0, xmm0
						;; size=12 bbWeight=1 PerfScore 6.25

G_M15861_IG03:  ;; offset=0x000C
       C3                   ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 13

; Method Program:SingleToHalf(float):System.Half (FullOpts)
G_M15413_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M15413_IG02:  ;; offset=0x0000
       C4E3791DC000         vcvtps2ph xmm0, xmm0, 0
       C5F97EC0             vmovd    eax, xmm0
       0FB7C0               movzx    rax, ax
						;; size=13 bbWeight=1 PerfScore 6.25

G_M15413_IG03:  ;; offset=0x000D
       C3                   ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 14
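
As a rough illustration of what the single `vcvtph2ps` above computes per lane, here is a minimal portable C++ sketch of a binary16-to-binary32 conversion. It is not the code in `System.Half` or the JIT; the names `half_bits_to_float` and `half_to_float_examples` are hypothetical, and unlike the hardware instruction this sketch does not quiet signaling NaNs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative reference conversion from IEEE 754 binary16 bits to float.
// Hypothetical helper; not taken from the PR.
float half_bits_to_float(std::uint16_t h)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp  = (h >> 10) & 0x1Fu; // 5-bit biased exponent
    std::uint32_t mant       = h & 0x03FFu;       // 10-bit fraction

    std::uint32_t bits;
    if (exp == 0x1Fu) {
        // Infinity or NaN: all-ones exponent, payload shifted into place.
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {
        // Normal value: rebias the exponent from 15 to 127 (127 - 15 = 112).
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant != 0) {
        // Subnormal half: shift until the implicit leading 1 reaches bit 10,
        // lowering the exponent once per shift.
        std::uint32_t shift = 0;
        while ((mant & 0x0400u) == 0) {
            mant <<= 1;
            ++shift;
        }
        mant &= 0x03FFu; // drop the now-explicit leading 1
        bits = sign | ((113u - shift) << 23) | (mant << 13);
    } else {
        bits = sign; // signed zero
    }

    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// A few spot checks with exactly representable values.
void half_to_float_examples()
{
    assert(half_bits_to_float(0x3C00) == 1.0f);
    assert(half_bits_to_float(0xC000) == -2.0f);
    assert(half_bits_to_float(0x4248) == 3.140625f); // bit pattern of Half.Pi
}
```

Every half value is exactly representable as a float, so this direction needs no rounding; the branches exist only to handle the exponent rebias and the special encodings.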

Copilot AI review requested due to automatic review settings April 17, 2026 21:36
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 17, 2026
@tannergooding
Member Author

@EgorBot -intel -amd

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Benchmarks
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
    }

    float dataF32 = float.Pi;
    Half dataF16 = Half.Pi;

    [Benchmark]
    public float HalfToSingle() => (float)dataF16;

    [Benchmark]
    public Half SingleToHalf() => (Half)dataF32;
}

Contributor

Copilot AI left a comment


Pull request overview

This PR adds initial xarch JIT support to accelerate System.Half <-> float explicit conversions by recognizing Half.op_Explicit as a named intrinsic and lowering it to AVX2/F16C conversion instructions where available, without changing the existing ABI.

Changes:

  • Mark System.Half and the Half(float) / float(Half) explicit operators as [Intrinsic] so the JIT can recognize them.
  • Add a new named intrinsic (NI_System_Half_op_Explicit) and importer expansion that emits AVX2 conversion HW intrinsics (vcvtps2ph / vcvtph2ps) for xarch.
  • Update xarch HW intrinsic lists + containment/perf metadata to support the new conversion instructions.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/libraries/System.Private.CoreLib/src/System/Half.cs | Marks Half and key explicit operators as [Intrinsic] to enable JIT recognition. |
| src/coreclr/jit/namedintrinsiclist.h | Adds a named intrinsic ID for System.Half.op_Explicit. |
| src/coreclr/jit/importercalls.cpp | Recognizes Half.op_Explicit and expands it to AVX2 conversion intrinsic sequences on xarch. |
| src/coreclr/jit/importer.cpp | Adds helper routines to pack/unpack scalar Half values through SIMD nodes. |
| src/coreclr/jit/compiler.h | Declares helper routines and adds isSystemHalfClass type recognition. |
| src/coreclr/jit/hwintrinsiclistxarch.h | Adds AVX2 conversion intrinsics for half<->single vector conversions. |
| src/coreclr/jit/lowerxarch.cpp | Extends containment logic to support the new conversion/store patterns. |
| src/coreclr/jit/emitxarch.cpp | Adds perf characteristics entries for the new conversion instructions. |

Comment thread src/coreclr/jit/importer.cpp Outdated
Comment thread src/coreclr/jit/importercalls.cpp
Comment thread src/coreclr/jit/importercalls.cpp
@tannergooding
Member Author

CC. @dotnet/jit-contrib, @EgorBo, @kg for review

@dotnet/intel and @jkotas as an FYI on the alternative approach. AVX512-FP16 support can be done nearly identically; it's just a bigger PR. I'll pull the changes from #122649 after this lands. We can then look at the ABI handling and at ensuring Half is properly passed in a floating-point register in the future.

@tannergooding
Member Author

Benchmark (EgorBot/Benchmarks#132) is too small to get good results...

Realistically, HalfToSingle is roughly 4.16x faster and SingleToHalf is about 1.76x faster. The change goes from about 28 instructions w/ 3 memory accesses (HalfToSingle) and about 25 instructions w/ 0 memory accesses (SingleToHalf) to about 2 instructions with 0 memory accesses in both cases.
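
To give a feel for why a software path needs a couple dozen instructions where the hardware needs one, here is a minimal portable C++ sketch of a float-to-half conversion using round-to-nearest-even, the mode selected by the `0` immediate in `vcvtps2ph xmm0, xmm0, 0`. This is only an illustration under that assumption, not the managed `Half.op_Explicit` implementation; `float_to_half_bits` and `float_to_half_examples` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative float -> IEEE 754 binary16 conversion, round-to-nearest-even.
// Hypothetical helper; not taken from the PR.
std::uint16_t float_to_half_bits(float f)
{
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);

    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t mag  = x & 0x7FFFFFFFu;
    const std::uint32_t exp  = mag >> 23;         // 8-bit biased exponent
    const std::uint32_t mant = mag & 0x007FFFFFu; // 23-bit fraction

    if (mag > 0x7F800000u) {
        // NaN: keep the top payload bits and force the quiet bit.
        return static_cast<std::uint16_t>(sign | 0x7E00u | (mant >> 13));
    }
    if (exp > 142u) {
        // Infinity, or magnitude >= 2^16: overflows to infinity.
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }

    std::uint32_t half, rem, halfway;
    if (exp >= 113u) {
        // Normal half range: rebias the exponent (127 - 15 = 112) and keep
        // the top 10 fraction bits; the low 13 bits decide the rounding.
        half    = ((exp - 112u) << 10) | (mant >> 13);
        rem     = mant & 0x1FFFu;
        halfway = 0x1000u;
    } else {
        // Subnormal half (or zero): express the value in units of 2^-24.
        const std::uint32_t shift = 126u - exp;
        if (shift > 24u) {
            return static_cast<std::uint16_t>(sign); // rounds to signed zero
        }
        const std::uint32_t m24 = mant | 0x00800000u; // explicit leading 1
        half    = m24 >> shift;
        rem     = m24 & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1u);
    }

    // Round to nearest, ties to even. A carry out of the fraction bumps the
    // exponent, which also turns 0x7BFF + 1 into 0x7C00 (infinity).
    if (rem > halfway || (rem == halfway && (half & 1u))) {
        ++half;
    }
    return static_cast<std::uint16_t>(sign | half);
}

// A few spot checks, including the float.Pi value used in the benchmark.
void float_to_half_examples()
{
    assert(float_to_half_bits(1.0f) == 0x3C00);
    assert(float_to_half_bits(3.14159274f) == 0x4248); // Half.Pi bit pattern
    assert(float_to_half_bits(65520.0f) == 0x7C00);    // tie rounds up to +inf
}
```

The branching on exponent range plus the explicit tie handling is the kind of work that the single conversion instruction absorbs.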
