[GPU] implement f16 extension in GPU module #804

Merged
arshajii merged 4 commits into exaloop:develop from BI71317:pr-missing-f16-device
May 18, 2026

Conversation

BI71317 commented May 12, 2026

Contributor

fixes #803

MRE

import gpu

@gpu.kernel
def kernel_f16(x: float16, out):
    out[0] = (x + float16(1.0)) * float16(2.0)


def main():
    out = [float16(0.0)]
    kernel_f16(float16(1.5), out, grid=1, block=1)
    print(out[0])  # (1.5 + 1.0) * 2.0 = 5.0; prints 5

main()

Result

It seems to run well.

$ codon run float16_kernel_repro.py 
5
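
As a host-side cross-check of that value, here is a minimal Python sketch (not part of the PR) that mirrors the kernel body and rounds to IEEE 754 half precision after each operation. The helper names `to_f16` and `kernel_f16_reference` are hypothetical; the rounding relies on the `e` (binary16) format of Python's `struct` module.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def kernel_f16_reference(x: float) -> float:
    """Mirror the kernel body, rounding to half precision after every step."""
    one, two = to_f16(1.0), to_f16(2.0)
    total = to_f16(to_f16(x) + one)  # x + float16(1.0)
    return to_f16(total * two)       # (...) * float16(2.0)


result = kernel_f16_reference(1.5)
print(result)  # 5.0
```

All intermediate values here (1.5, 2.5, 5.0) are exactly representable in half precision, so the kernel result should match exactly rather than approximately.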

Test Suite

The current test suite in test\transform\kernels.codon has no coverage for kernel scalar type lowering, so I added that as well.

@test
def test_scalar_types():
    
    def check_exact(name, kernel, x, out, expected):
        kernel(x, out, grid=1, block=1)
        assert out[0] == expected
    
    @gpu.kernel
    def kernel_i8(x: i8, out):
        out[0] = (x + i8(3)) * i8(2)


    @gpu.kernel
    def kernel_i16(x: i16, out):
        out[0] = (x + i16(5)) * i16(3)

....
[Screenshot 2026-05-12 144458]
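
The `expected` values passed to `check_exact` can be derived on the host. The following is a minimal Python sketch, assuming the fixed-width integer types wrap around in two's complement; the helper names and the input values are hypothetical and do not come from the PR's test file.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Reduce an integer to a signed two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # Values with the top bit set represent negative numbers.
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def expected_i8(x: int) -> int:
    """Reference for kernel_i8: (x + i8(3)) * i8(2), assuming 8-bit wraparound."""
    return wrap_signed((x + 3) * 2, 8)


def expected_i16(x: int) -> int:
    """Reference for kernel_i16: (x + i16(5)) * i16(3), assuming 16-bit wraparound."""
    return wrap_signed((x + 5) * 3, 16)


print(expected_i8(4))    # 14
print(expected_i16(10))  # 45
```

Choosing inputs whose results stay in range, as in the two calls above, keeps the exact comparison independent of the wraparound assumption.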

@BI71317 BI71317 requested review from arshajii and inumanag as code owners May 12, 2026 06:02
@cla-bot cla-bot Bot added the cla-signed label May 12, 2026
@arshajii

Contributor

Thanks! I realized we also don't have this conversion for bfloat16 -- perhaps we can add it in this PR?

BI71317 commented May 15, 2026

Contributor Author

Hey, @arshajii!

What do you think about the complex64 and complex cases? Codon represents them as two float fields, so they may already be covered by the generic tuple path.

I’d like to clarify whether we want explicit complex scalar GPU support, or whether to keep it implicit through tuple handling.

BI71317 commented May 15, 2026

Contributor Author

Never mind.

Since complex types are already tuples, they seem to work fine without an extension class.
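
To illustrate the idea only, here is a minimal Python sketch of why a two-field value needs no dedicated conversion when tuples are handled field by field. `Complex64` and `pack_fields` are hypothetical stand-ins, not Codon's GPU module API, and the 32-bit field width is an assumption made for the example.

```python
import struct
from typing import NamedTuple


class Complex64(NamedTuple):
    """Hypothetical stand-in: a complex value stored as two float fields."""
    real: float
    imag: float


def pack_fields(value: tuple, field_format: str) -> bytes:
    """Pack every field of a tuple with the same scalar format, in order."""
    return b"".join(struct.pack(field_format, field) for field in value)


z = Complex64(1.5, -2.0)
payload = pack_fields(z, "<f")  # two 32-bit floats, 8 bytes in total
restored = Complex64(*struct.unpack("<ff", payload))
print(len(payload), restored == z)  # 8 True
```

A generic per-field path like this only needs each field's scalar type to be convertible, which is why the scalar f16 and bf16 cases are the ones that needed explicit support.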

BI71317 commented May 17, 2026

Contributor Author

...perhaps we can add it in this PR?

@arshajii Yeah for sure.

I also looked at other native types that don't have a GPU extension class, for example complex, but those types seem to work fine already without one.

So I only added the bfloat16 extension, the test suite, and the truncation helper modules for bfloat16.

CMakeLists.txt

# Codon runtime library
add_library(codonfloat STATIC
            codon/runtime/floatlib/extenddftf2.c
            codon/runtime/floatlib/fp_trunc.h
            codon/runtime/floatlib/truncdfhf2.c
            codon/runtime/floatlib/extendhfsf2.c
            codon/runtime/floatlib/int_endianness.h
            codon/runtime/floatlib/truncdfsf2.c
            codon/runtime/floatlib/extendhftf2.c
            codon/runtime/floatlib/int_lib.h
#            codon/runtime/floatlib/truncsfbf2.c
            codon/runtime/floatlib/extendsfdf2.c
            codon/runtime/floatlib/int_math.h
            codon/runtime/floatlib/truncsfhf2.c
            codon/runtime/floatlib/extendsftf2.c
            codon/runtime/floatlib/int_types.h
            codon/runtime/floatlib/trunctfdf2.c
            codon/runtime/floatlib/fp_extend.h
            codon/runtime/floatlib/int_util.h
            codon/runtime/floatlib/trunctfhf2.c
            codon/runtime/floatlib/fp_lib.h
#            codon/runtime/floatlib/truncdfbf2.c

truncdfbf2.c and truncsfbf2.c were commented out, so I uncommented them to enable truncation support for bfloat16.
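
For context on what such a truncation helper computes: bfloat16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of a float32. Below is a minimal Python sketch of a float32 to bfloat16 conversion with round-to-nearest, ties-to-even. It illustrates the general technique only; the function names are hypothetical, and matching the exact behavior of truncsfbf2.c is an assumption that was not checked against that file.

```python
import math
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f32_to_bf16_bits(x: float) -> int:
    """Convert to bfloat16 bits using round-to-nearest, ties-to-even."""
    bits = f32_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Set a mantissa bit so the NaN does not collapse into infinity.
        return (bits >> 16) | 0x0040
    # Add just under half of the dropped range, plus the kept LSB for ties.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen bfloat16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


pi_bf16 = bf16_bits_to_float(f32_to_bf16_bits(math.pi))
print(hex(f32_to_bf16_bits(1.0)), pi_bf16)  # 0x3f80 3.140625
```

Widening in the other direction is exact, since every bfloat16 value is also a float32 value; only the narrowing direction needs a rounding rule.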

[Screenshot 2026-05-17 140658]

It seems to work well.

BI71317 commented May 18, 2026

Contributor Author

Please refer to #811 for context.

To clarify the timeline in a bit more detail: when I opened this PR, kernels.codon ran successfully.

After receiving the request to add bf16 support, I reran the test suite and encountered an error. However, the failure does not appear to be related to either f16 or bf16, so I did not go into detail about it here.

Specifically, the failure occurs in the test_conversions test. If that case is excluded, all tests pass, including those for the added f16 and bf16 support.

@arshajii

Contributor

Thanks -- I'll merge this and then look into the ordered dict issue.

@arshajii arshajii merged commit 8f1f2f9 into exaloop:develop May 18, 2026
9 checks passed
@BI71317 BI71317 deleted the pr-missing-f16-device branch May 19, 2026 00:36
Development

Successfully merging this pull request may close these issues.

gpu: float16 lacks __to_gpu__ support for kernel argument passing

2 participants