[GPU] implement f16 extension in GPU module by BI71317 · Pull Request #804 · exaloop/codon

BI71317 · 2026-05-12T06:02:54Z

fixes #803

MRE

import gpu

@gpu.kernel
def kernel_f16(x: float16, out):
    out[0] = (x + float16(1.0)) * float16(2.0)


def main():
    out = [float16(0.0)]
    kernel_f16(float16(1.5), out, grid=1, block=1)
    print(out[0]) # 5.0f16

main()

Result

The kernel runs and prints the expected value.

$ codon run float16_kernel_repro.py 
5
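As a host-side cross-check of the expected value, here is a minimal sketch in plain Python (not Codon) that emulates the kernel's arithmetic with binary16 rounding after each operation. It relies on the standard library's `struct` half-precision format code `"e"`; the helper names `to_f16` and `kernel_f16_reference` are hypothetical and are not part of this PR.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def kernel_f16_reference(x: float) -> float:
    """Hypothetical host-side model of (x + 1.0) * 2.0 in float16."""
    one = to_f16(1.0)
    two = to_f16(2.0)
    total = to_f16(to_f16(x) + one)  # round after the addition
    return to_f16(total * two)       # round after the multiplication


print(kernel_f16_reference(1.5))  # 5.0
```

All intermediate values here (1.5, 2.5, 5.0) are exactly representable in float16, so the kernel result can be compared with `==` rather than with a tolerance.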

Test Suite

The current test suite in test\transform\kernels.codon has no coverage of scalar type lowering for kernel arguments, so I added a test for that as well.

@test
def test_scalar_types():
    
    def check_exact(name, kernel, x, out, expected):
        kernel(x, out, grid=1, block=1)
        assert out[0] == expected
    
    @gpu.kernel
    def kernel_i8(x: i8, out):
        out[0] = (x + i8(3)) * i8(2)


    @gpu.kernel
    def kernel_i16(x: i16, out):
        out[0] = (x + i16(5)) * i16(3)

....
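For the integer kernels, expected values for an exact comparison could be derived on the host. The sketch below is plain Python and assumes fixed-width integers wrap around in two's complement on overflow; the helper names `wrap_signed`, `expected_i8`, and `expected_i16` are hypothetical, and the inputs are illustrative rather than the ones used in the actual test.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Reduce an integer to a signed two's-complement value of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # If the sign bit is set, the value represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def expected_i8(x: int) -> int:
    """Host-side model of (x + i8(3)) * i8(2), wrapping after each step."""
    return wrap_signed(wrap_signed(x + 3, 8) * 2, 8)


def expected_i16(x: int) -> int:
    """Host-side model of (x + i16(5)) * i16(3), wrapping after each step."""
    return wrap_signed(wrap_signed(x + 5, 16) * 3, 16)


print(expected_i8(4))    # 14
print(expected_i8(100))  # -50 (206 does not fit in 8 signed bits)
print(expected_i16(10))  # 45
```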

arshajii · 2026-05-15T13:33:01Z

Thanks! I realized we also don't have this conversion for bfloat16 -- perhaps we can add it in this PR?

BI71317 · 2026-05-15T14:01:11Z

Hey, @arshajii!

What do you think about the complex64 and complex cases? Codon represents them as two float fields, so they may already be covered by the generic tuple path.

I’d like to clarify whether we want explicit complex scalar GPU support, or keep it implicit through tuple handling.

BI71317 · 2026-05-15T14:24:51Z

Never mind.

Since complex types are already tuples, they seem to work fine without an extension class.

BI71317 · 2026-05-17T05:22:17Z

...perhaps we can add it in this PR?

@arshajii Yeah for sure.

I also looked at other native types that don't have a GPU extension class, for example complex, but those types seem to work fine already without one.

So I only added the bfloat16 extension, its tests, and the truncation helper modules for bfloat16.

CMakeLists.txt

# Codon runtime library
add_library(codonfloat STATIC
            codon/runtime/floatlib/extenddftf2.c
            codon/runtime/floatlib/fp_trunc.h
            codon/runtime/floatlib/truncdfhf2.c
            codon/runtime/floatlib/extendhfsf2.c
            codon/runtime/floatlib/int_endianness.h
            codon/runtime/floatlib/truncdfsf2.c
            codon/runtime/floatlib/extendhftf2.c
            codon/runtime/floatlib/int_lib.h
#            codon/runtime/floatlib/truncsfbf2.c
            codon/runtime/floatlib/extendsfdf2.c
            codon/runtime/floatlib/int_math.h
            codon/runtime/floatlib/truncsfhf2.c
            codon/runtime/floatlib/extendsftf2.c
            codon/runtime/floatlib/int_types.h
            codon/runtime/floatlib/trunctfdf2.c
            codon/runtime/floatlib/fp_extend.h
            codon/runtime/floatlib/int_util.h
            codon/runtime/floatlib/trunctfhf2.c
            codon/runtime/floatlib/fp_lib.h
#            codon/runtime/floatlib/truncdfbf2.c

truncdfbf2.c and truncsfbf2.c were commented out, so I uncommented them to enable truncation support for bfloat16.

It seems to work well.
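To illustrate what a float32-to-bfloat16 truncation helper such as truncsfbf2 computes (in this naming scheme, `sf` is single precision and `bf` is bfloat16), here is a minimal Python sketch. It assumes round-to-nearest, ties-to-even, which has not been checked against the C sources in this PR; the function names are hypothetical.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Convert a float to bfloat16 bits (assumed round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep the top 16 bits and force a quiet NaN payload bit.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return (bits >> 16) | 0x0040
    # Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen bfloat16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


print(hex(f32_to_bf16_bits(1.5)))                     # 0x3fc0
print(bf16_bits_to_float(f32_to_bf16_bits(5.0)))      # 5.0
print(bf16_bits_to_float(f32_to_bf16_bits(1 + 2**-8)))  # 1.0 (tie rounds to even)
```

bfloat16 keeps float32's 8 exponent bits and only 7 explicit mantissa bits, so the conversion is a matter of rounding away the low 16 bits of the float32 encoding.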

BI71317 · 2026-05-18T01:59:27Z

Please refer to #811 for context.

To clarify the timeline in a bit more detail: when I opened this PR, kernels.codon was indeed executable at that time.

After receiving the request to add bf16 support, I reran the workload and encountered an error. However, the failure does not appear to be related to either f16 or bf16, so I did not go into detail about it here.

In practice, the failure occurs in the test_conversions workload. If that specific case is excluded, all tests pass, including the version with f16 and bf16 support added.

arshajii · 2026-05-18T21:00:10Z

Thanks -- I'll merge this and then look into the ordered dict issue.

BI71317 added 2 commits May 12, 2026 14:51

implemented f16 extension in gpu module

230feb3

add scalar type call in kernel test

4fc03f3

BI71317 requested review from arshajii and inumanag as code owners May 12, 2026 06:02

cla-bot Bot added the cla-signed label May 12, 2026

Remove empty lines between GPU kernel definitions

82cb07e

implemented bf16 extension and add test, trunc helper

94ad748

BI71317 mentioned this pull request May 18, 2026

[GPU] ordered Dict GPU conversion segfaults for Dict[int, List[float]] while unordered-dict works #811

Closed

arshajii merged commit 8f1f2f9 into exaloop:develop May 18, 2026
9 checks passed

BI71317 deleted the pr-missing-f16-device branch May 19, 2026 00:36

arshajii merged 4 commits into exaloop:develop from BI71317:pr-missing-f16-device

2 participants
