Add Futhark implementation #146

athas · 2022-11-15T12:23:20Z

I'm at SC22 and noticed that BabelStream is pretty popular, and I really like the idea of polyglot benchmark suites. So, here's an implementation in Futhark. I don't know if this is too obscure a language, but it's easy to invoke from C and C++ and so can use the same tooling as all the C++ implementations.

$ cmake -Bbuild -H. -DMODEL=futhark -DFUTHARK_BACKEND=opencl
...
$ cmake --build build
...
$ build/futhark-stream
BabelStream
Version: 4.0
Implementation: Futhark (OpencL)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1319634.622 0.00041     0.00141     0.00081
Mul         1338486.404 0.00040     0.00176     0.00081
Add         1355367.262 0.00059     0.00193     0.00098
Triad       1364525.941 0.00059     0.00196     0.00102
Dot         1315743.983 0.00041     0.00178     0.00079

Signed-off-by: Jeff Hammond <jehammond@nvidia.com>

Since Futhark compiles to C code, it can use the same main.cpp as the various C++ implementations.

Munksgaard · 2022-11-17T17:39:40Z

Perhaps the CI scripts should be updated as well?

athas · 2022-11-17T17:55:32Z

The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

tom91136 · 2022-11-21T02:27:21Z

> The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

We do have CI set up for the Rust, Julia, and Java variants, so it would be good to have it for Futhark too.
Since it uses CMake, it would follow the same CI steps as the other C++ implementations; see https://github.com/UoB-HPC/BabelStream/blob/main/src/ci-test-compile.sh and https://github.com/UoB-HPC/BabelStream/blob/main/.github/workflows/main.yaml.

athas · 2022-11-21T08:06:25Z

Alright, I'll add a CI step for Futhark.

athas · 2022-11-21T13:03:16Z

I have added a very simple Futhark action. It only tests a single version of cmake (the preinstalled one). If you want, I can also try to fit it into the C++ framework, but I don't think it's worth it (and might make it more difficult to test other Futhark backends).

tom91136 · 2022-12-22T17:25:10Z

Thanks for the PR, and sorry for the late reply. I'm taking a look now (running benchmarks, etc.). One thing I've noticed is the lack of a device enumeration API in the Futhark runtime (although apparently you can set devices, or even be presented with a dialog in the OpenCL case). This was a bit problematic, as I'm testing on machines with more than one OpenCL platform. As a workaround, we may have to implement device enumeration by replicating the logic in the OpenCL and CUDA models.

athas · 2022-12-22T17:33:15Z

It wouldn't be difficult to add. I can take a swing at it.

tom91136 · 2022-12-22T17:39:30Z

@athas I do have to say, the generated C code for multicore CPU is more readable than a good portion of the C libraries out there!

athas · 2022-12-22T17:45:01Z

Then the state of C libraries is more dire than I thought.

The Futhark-generated OpenCL/CUDA APIs allow one to select a device by index, but not to enumerate all devices (except through the menu). Would it be OK to only implement the selection, but not the enumeration?

athas · 2022-12-22T17:45:49Z

I can just copy the device enumeration code from the cuda and ocl implementations if you would prefer to have full functionality.

athas · 2022-12-22T17:52:06Z

I have added device selection now.
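
For illustration, selection by index could be sketched as follows in C++. This is a minimal sketch, not the code from this PR: the helper name `device_selector_string` is hypothetical, and the commented-out calls assume that the Futhark-generated OpenCL/CUDA library exposes `futhark_context_config_set_device` and that it accepts a string of the form `#k` to pick the k-th device (numbered from zero), as I understand the Futhark C API documentation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turn a zero-based device index into the "#k"
// selector string that (by assumption) the generated Futhark API accepts.
std::string device_selector_string(int index) {
  assert(index >= 0);  // device indices are numbered from zero
  return "#" + std::to_string(index);
}

// Assumed usage with a Futhark-generated OpenCL or CUDA library
// (left as a comment because it needs the generated header to compile):
//
//   struct futhark_context_config *cfg = futhark_context_config_new();
//   futhark_context_config_set_device(cfg, device_selector_string(1).c_str());
//   struct futhark_context *ctx = futhark_context_new(cfg);
```

This only covers selection; enumerating the available devices would still need separate OpenCL or CUDA queries, as discussed above.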

tom91136 · 2023-01-27T14:40:09Z

I've got benchmark results for a few platforms.
On an Nvidia A100, it's on par with the native CUDA/OpenCL implementations, which is excellent.
For the multicore backend, I think the runtime is lacking NUMA awareness.
On a local Ryzen 5900X machine (1 NUMA domain) with dual-channel DDR4 at 3400 MT/s, I'm seeing performance comparable to OpenMP:

>./build/omp-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        25610.437   0.33541     0.36010     0.34216
Mul         25402.003   0.33816     0.35810     0.34445
Add         29010.697   0.44414     0.47372     0.45097
Triad       28878.115   0.44618     0.47433     0.45268
Dot         44377.658   0.19356     0.21538     0.20067
>./build/futhark-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: Futhark (parallel CPU)
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        26107.059   0.32903     0.35351     0.33425
Mul         25928.510   0.33129     0.35788     0.33712
Add         29328.709   0.43933     0.46747     0.44469
Triad       28930.892   0.44537     0.47550     0.45077
Dot         44656.475   0.19236     0.20958     0.19728
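
As a reading aid for the tables above, here is a minimal C++ sketch of how the MBytes/sec column relates to the Min (sec) column. It assumes the usual BabelStream accounting, in which Copy, Mul, and Dot each move two arrays, Add and Triad move three, and 1 MB is 10^6 bytes. The names `Kernel`, `arrays_touched`, and `mbytes_per_sec` are illustrative and are not taken from the BabelStream sources.

```cpp
#include <cassert>
#include <cstddef>

// The five kernels reported in the output tables.
enum class Kernel { Copy, Mul, Add, Triad, Dot };

// Assumed accounting: Add and Triad read two arrays and write one (three
// in total); Copy and Mul read one and write one; Dot reads two.
constexpr int arrays_touched(Kernel k) {
  return (k == Kernel::Add || k == Kernel::Triad) ? 3 : 2;
}

// Bandwidth in MBytes/sec (1 MB = 10^6 bytes) for double-precision arrays
// of `elements` entries, given the fastest observed kernel time in seconds.
double mbytes_per_sec(Kernel k, std::size_t elements, double min_seconds) {
  assert(min_seconds > 0.0);
  const double bytes = static_cast<double>(arrays_touched(k)) *
                       static_cast<double>(sizeof(double)) *
                       static_cast<double>(elements);
  return 1.0e-6 * bytes / min_seconds;
}
```

For example, `mbytes_per_sec(Kernel::Copy, 536870912, 0.33541)` gives roughly 25610 MB/s, in line with the OpenMP Copy row; the small difference from the printed figure comes from the minimum time being rounded to five decimals in the table.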

Performance is still good compared to OpenMP on a dual-socket Intel Xeon Gold 6338 system (2 NUMA domains).

But on a dual-socket AMD EPYC 7713 system (8 NUMA domains), the performance is quite poor.

I didn't do the scaling here due to time constraints.

athas · 2023-01-27T14:41:45Z

Yes, the multicore backend is completely NUMA-ignorant. The GPU backends are much more mature.

src/futhark/model.cmake

src/futhark/FutharkStream.cpp

tom91136 · 2023-10-03T08:23:03Z

LGTM. I'm trying to validate the CI by merging this with develop. If you're OK with it, please check the "Allow edits by maintainers" box in the PR and I'll push the merge.

athas · 2023-10-03T09:00:53Z

Apparently it is impossible for me to do so for weird GitHub reasons. I can see about resolving the conflicts myself.

tom91136 · 2023-10-03T09:08:24Z

Thanks for the merge, I've approved it now, let's wait for CI.

tom91136

Minor changes to get the CI going again

.github/workflows/main.yaml

athas · 2023-10-03T11:46:12Z

CI fails due to an unrelated job running out of disk space. I wonder why that is not an issue on the develop branch.

tom91136 · 2023-10-03T12:59:26Z

No worries, it's probably because we ran out of cache due to how big the compilers are (we download and untar NVHPC in the setup). Thanks again and sorry about the slow turnaround.

jeffhammond and others added 4 commits November 15, 2022 06:17

accept NVHPC NVC++ as a CUDA compiler when it is so

2bd7c9c

Signed-off-by: Jeff Hammond <jehammond@nvidia.com>

Add Futhark.

312bc70

Since Futhark compiles to C code, it can use the same main.cpp as the various C++ implementations.

Fix CUDA support for Futhark.

cca14e3

Slight documentation.

a854b95

athas added 2 commits November 17, 2022 11:57

Add missing copyright notice.

4125b97

Also mention here.

582b51c

tom91136 requested review from tom91136 and tomdeakin November 21, 2022 02:22

Simplistic Futhark CI.

d398da6

Implement device selection.

b173b5a

tom91136 requested changes Jan 30, 2023

src/futhark/model.cmake (outdated, resolved)

src/futhark/FutharkStream.cpp (outdated, resolved)

athas added 2 commits January 30, 2023 17:52

Remove unnecessary #define.

d6d82b3

Use proper cmake flags.

01f69fa

tom91136 added this to the v5.0 milestone Jun 14, 2023

athas requested a review from tom91136 June 25, 2023 00:09

Merge branch 'develop' into futhark

092fb82

tom91136 requested changes Oct 3, 2023

.github/workflows/main.yaml (resolved)

Increment numbers.

0571ff8

tom91136 merged commit 92fed70 into UoB-HPC:develop Oct 3, 2023
4 of 5 checks passed

pranav-sivaraman pushed a commit to hpcgroup/BabelStream that referenced this pull request Dec 7, 2023

Add Futhark implementation (UoB-HPC#146)

428fb72

* Add Futhark.
