Add Futhark implementation #146

Merged
merged 12 commits into from
Oct 3, 2023

Conversation

athas
Contributor

@athas athas commented Nov 15, 2022

I'm at SC22 and noticed that BabelStream is pretty popular, and I really like the idea of polyglot benchmark suites. So, here's an implementation in Futhark. I don't know if this is too obscure a language, but it's easy to invoke from C and C++ and so can use the same tooling as all the C++ implementations.

$ cmake -Bbuild -H. -DMODEL=futhark -DFUTHARK_BACKEND=opencl
...
$ cmake --build build
...
$ build/futhark-stream
BabelStream
Version: 4.0
Implementation: Futhark (OpencL)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1319634.622 0.00041     0.00141     0.00081
Mul         1338486.404 0.00040     0.00176     0.00081
Add         1355367.262 0.00059     0.00193     0.00098
Triad       1364525.941 0.00059     0.00196     0.00102
Dot         1315743.983 0.00041     0.00178     0.00079
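
For anyone sanity-checking the table, the MBytes/sec column can be reproduced from the array size and the Min (sec) column. This is a minimal sketch in Python, assuming BabelStream's usual accounting (two arrays touched for Copy, Mul and Dot, three for Add and Triad) and an array length of 2^25 doubles, which is inferred from the 268.4 MB array size printed above. The helper name `bandwidth_mbytes_per_sec` is made up for illustration and is not part of BabelStream.

```python
# Arrays read or written per kernel, assuming BabelStream's usual accounting.
ARRAYS_TOUCHED = {"Copy": 2, "Mul": 2, "Add": 3, "Triad": 3, "Dot": 2}


def bandwidth_mbytes_per_sec(kernel, num_elements, min_seconds, bytes_per_element=8):
    """Bandwidth in MB/s (1 MB = 10^6 bytes), based on the fastest iteration."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * num_elements * bytes_per_element
    return 1.0e-6 * bytes_moved / min_seconds


# 2**25 doubles is 268.4 MB per array (inferred from the run above).
n = 2**25
for kernel, min_seconds in [("Copy", 0.00041), ("Triad", 0.00059)]:
    print(kernel, round(bandwidth_mbytes_per_sec(kernel, n, min_seconds)))
```

The values come out within about one percent of the printed column; the remaining gap is because Min (sec) is rounded to five decimal places in the output.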

jeffhammond and others added 4 commits November 15, 2022 06:17
Signed-off-by: Jeff Hammond <jehammond@nvidia.com>
Since Futhark compiles to C code, it can use the same main.cpp as the
various C++ implementations.
@Munksgaard

Perhaps the CI scripts should be updated as well?

@athas
Contributor Author

athas commented Nov 17, 2022

The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

@tom91136
Member

> The more exotic implementations also aren't tested in CI, so I intentionally didn't add that to keep the invasiveness of this PR low. But I can do so if desired, no problem. (Incidentally, someone really ought to make a GA action for Futhark.)

We do have CI set up for the Rust, Julia, and Java variants, so it would be good to have it for Futhark too.
Since it uses CMake, it would follow the same CI steps as all the other C++ implementations; see https://github.com/UoB-HPC/BabelStream/blob/main/src/ci-test-compile.sh and https://github.com/UoB-HPC/BabelStream/blob/main/.github/workflows/main.yaml.

@athas
Contributor Author

athas commented Nov 21, 2022

Alright, I'll add a CI step for Futhark.

@athas
Contributor Author

athas commented Nov 21, 2022

I have added a very simple Futhark action. It only tests a single version of cmake (the preinstalled one). If you want, I can also try to fit it into the C++ framework, but I don't think it's worth it (and might make it more difficult to test other Futhark backends).

@tom91136
Member

Thanks for the PR and sorry for the late reply. I'm taking a look now (running benchmarks, etc.). One thing I've noticed is the lack of a device enumeration API in the Futhark runtime (although apparently you can set devices, or even be presented with a dialog in the OpenCL case). This was a bit problematic as I'm testing on machines with more than one OpenCL platform. As a workaround, we may have to implement the device enumeration by replicating the logic in the OpenCL and CUDA models.

@athas
Contributor Author

athas commented Dec 22, 2022

It wouldn't be difficult to add. I can take a swing at it.

@tom91136
Member

@athas I do have to say, the generated C code for multicore CPU is more readable than a good portion of the C libraries out there!

@athas
Contributor Author

athas commented Dec 22, 2022

Then the state of C libraries is more dire than I thought.

The Futhark-generated OpenCL/CUDA APIs allow one to select a device by index, but not to enumerate all devices (except through the menu). Would it be OK to only implement the selection, but not the enumeration?

@athas
Contributor Author

athas commented Dec 22, 2022

I can just copy the device enumeration code from the cuda and ocl implementations if you would prefer to have full functionality.

@athas
Contributor Author

athas commented Dec 22, 2022

I have added device selection now.

@tom91136
Member

I've got benchmark results for a few platforms.
For Nvidia A100, it's on-par with the native CUDA/OpenCL implementation:

[Figure: BabelStream 4.0, A100 40G (array size = 4.3 GB)]

Which is excellent.
For the multicore backend, I think the runtime is lacking NUMA awareness.
On a local Ryzen 5900X machine (1 NUMA domain) with dual-channel DDR4 at 3400 MT/s, I'm seeing performance comparable to OpenMP:

>./build/omp-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        25610.437   0.33541     0.36010     0.34216
Mul         25402.003   0.33816     0.35810     0.34445
Add         29010.697   0.44414     0.47372     0.45097
Triad       28878.115   0.44618     0.47433     0.45268
Dot         44377.658   0.19356     0.21538     0.20067
>./build/futhark-stream --arraysize 536870912
BabelStream
Version: 4.0
Implementation: Futhark (parallel CPU)
Running kernels 100 times
Precision: double
Array size: 4295.0 MB (=4.3 GB)
Total size: 12884.9 MB (=12.9 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        26107.059   0.32903     0.35351     0.33425
Mul         25928.510   0.33129     0.35788     0.33712
Add         29328.709   0.43933     0.46747     0.44469
Triad       28930.892   0.44537     0.47550     0.45077
Dot         44656.475   0.19236     0.20958     0.19728
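
To put a number on "comparable", the two MBytes/sec columns above can be compared kernel by kernel. This is a minimal sketch in Python using only the figures from the two runs; the function name `relative_difference_percent` is made up for illustration.

```python
# MBytes/sec copied from the two Ryzen 5900X runs above.
openmp = {"Copy": 25610.437, "Mul": 25402.003, "Add": 29010.697,
          "Triad": 28878.115, "Dot": 44377.658}
futhark = {"Copy": 26107.059, "Mul": 25928.510, "Add": 29328.709,
           "Triad": 28930.892, "Dot": 44656.475}


def relative_difference_percent(candidate, baseline):
    """Per-kernel difference of candidate over baseline, in percent."""
    return {k: 100.0 * (candidate[k] - baseline[k]) / baseline[k] for k in baseline}


for kernel, pct in relative_difference_percent(futhark, openmp).items():
    print(f"{kernel:<6} {pct:+.2f}%")
```

On these numbers every kernel lands within roughly 2% of the OpenMP figure, with Futhark marginally ahead in this single run.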

Performance is still good on a dual socket Intel Xeon Gold 6338 (2 NUMA domains) system compared to OpenMP:

[Figure: BabelStream 4.0, Xeon Gold 6338 x2 (Triad kernel, array size = 4.3 GB)]

But for a dual socket AMD EPYC 7713 (8 NUMA domains), the performance is quite poor:

[Figure: BabelStream 4.0, EPYC 7713 (array size = 4.3 GB)]

I didn't do the scaling here due to time constraints.

@athas
Contributor Author

athas commented Jan 27, 2023

Yes, the multicore backend is completely NUMA-ignorant. The GPU backends are much more mature.

src/futhark/model.cmake (review thread: outdated, resolved)
src/futhark/FutharkStream.cpp (review thread: outdated, resolved)
@tom91136 tom91136 added this to the v5.0 milestone Jun 14, 2023
@athas athas requested a review from tom91136 June 25, 2023 00:09
@tom91136
Member

tom91136 commented Oct 3, 2023

LGTM. I'm trying to validate the CI by merging this with develop. If you're OK with it, please check the "Allow edits by maintainers" box in the PR and I'll push the merge.

@athas
Contributor Author

athas commented Oct 3, 2023

Apparently it is impossible for me to do so for weird GitHub reasons. I can see about resolving the conflicts myself.

@tom91136
Member

tom91136 commented Oct 3, 2023

Thanks for the merge, I've approved it now, let's wait for CI.

Member

@tom91136 tom91136 left a comment

Minor changes to get the CI going again

.github/workflows/main.yaml (review thread: resolved)
@athas
Contributor Author

athas commented Oct 3, 2023

CI fails due to an unrelated job running out of disk space. I wonder why that is not an issue on the develop branch.

@tom91136 tom91136 merged commit 92fed70 into UoB-HPC:develop Oct 3, 2023
4 of 5 checks passed
@tom91136
Member

tom91136 commented Oct 3, 2023

No worries, it's probably because we ran out of cache due to how big the compilers are (we download and untar NVHPC in the setup). Thanks again and sorry about the slow turnaround.

pranav-sivaraman pushed a commit to hpcgroup/BabelStream that referenced this pull request Dec 7, 2023

4 participants