Julia implementation #106
Conversation
tomdeakin
left a comment
Interesting to see this - I think it's pretty close. In general I'd like to make sure this is idiomatic Julia code and that we're not introducing any extra layers of abstraction in the benchmark.
JuliaStream.jl/src/StreamData.jl
Outdated
Why do we need to wrap up the data arrays in a whole new struct?
It's similar to what we have for Rust; there are no classes in Julia, only functions and data (structs).
The alternative is to pass everything as parameters:
```julia
function mul!(data::StreamData{T}) where {T}
```
becomes
```julia
function mul!(a::C, b::C, c::C, scalar::T, size::Int) where {T,C<:AbstractArray{T}}
```
The same applies when I copy everything back to the host:
```julia
const ROCData = StreamData{T,ROCArray{T}} where {T}
...
function read_data(data::ROCData{T})::VectorData{T} where {T}
  return VectorData{T}(data.a, data.b, data.c, data.scalar, data.size)
end
```
read_data would be very long, and it needs to be implemented for 5 different implementations.
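For reference, a minimal sketch of what the wrapper under discussion could look like; the field names are inferred from the read_data example above and the aliases mirror the ones already quoted, so treat this as illustrative rather than the PR's exact definition:

```julia
# Sketch only: field names inferred from the read_data example above.
struct StreamData{T,C<:AbstractArray{T}}
    a::C
    b::C
    c::C
    scalar::T
    size::Int
end

# Backend-specific aliases then pin the container type, e.g.
const VectorData = StreamData{T,Vector{T}} where {T}   # host arrays
const ROCData = StreamData{T,ROCArray{T}} where {T}    # AMDGPU.jl arrays (needs `using AMDGPU`)
```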
I think I prefer passing things as parameters.
Show the selected implementation and not a constant "threaded"
Ready for review again.

Thanks @tom91136. I think I prefer the parameter passing rather than making a structure just to hold the arrays. I think that in a larger code with more arrays, you're just going to have to pass things around rather than keep wrapping things up in bundles to pass to different functions. Ideally we're aiming to write BabelStream in a way that is representative of something much bigger.
JuliaStream.jl/src/CUDAStream.jl
Outdated
```julia
const DotBlocks = 256::Int

function devices()
  return !CUDA.functional(false) ? [] :
```
Minor nitpick: make the function type-stable
```diff
-  return !CUDA.functional(false) ? [] :
+  return !CUDA.functional(false) ? String[] :
```
JuliaStream.jl/src/AMDGPUStream.jl
Outdated
```julia
    map(repr, gpu_agents_in_repr_order())
  catch
    # probably unsupported
    []
```
Suggested change:
```diff
-    []
+    String[]
```
```julia
  end
end

function copy!(data::SharedArrayData{T}) where {T}
```
Note that here and in the functions below, where T isn't really used to constrain the signature nor in the body of the function, the parameter is redundant and
```diff
-function copy!(data::SharedArrayData{T}) where {T}
+function copy!(data::SharedArrayData)
```
would work just fine
JuliaStream.jl/src/Stream.jl
Outdated
```julia
if config.float
  type = Float32
else
  type = Float64
end
```
For the record, you can also do
```diff
-if config.float
-  type = Float32
-else
-  type = Float64
-end
+type = if config.float
+  Float32
+else
+  Float64
+end
```
or even
```diff
-if config.float
-  type = Float32
-else
-  type = Float64
-end
+type = config.float ? Float32 : Float64
```
if you want to be even more concise. Also, in the if below you can save a couple of benchmark = assignments in the same way.
😃 Is there something we can do to move this forward? I had a very quick look, but could do a more thorough review, if that helps.

@giordano Thanks for the review! This PR will be used for an upcoming submission to PMBS, so I have a few more local changes (I've added a functional oneAPI.jl and KA implementation) that I'm in the process of finalising. I'll incorporate your review and put up a final version for further review by the end of the week. @tomdeakin and I had a discussion on the parameter passing and I think we've settled on the current approach being acceptable.
Cool stuff! Let me know if I can help with anything AMDGPU.jl-related (I'm the author and maintainer of it).
JuliaStream.jl/src/AMDGPUStream.jl
Outdated
```julia
    return
  end
  AMDGPU.wait(
    soft = false, # soft wait causes HSA_REFCOUNT overflow issues
```
This shouldn't be an issue on AMDGPU.jl 0.2.11, where I've disabled HSA_REFCOUNT until I can figure out why it's broken.
Still, hard wait could be more efficient than soft (since it hypothetically can use interrupts), but I'd make sure to test with both.
I've tested it with and without soft wait, and for the very few times soft=true actually ran to completion (in most cases it dies after 70 iterations, so 5*70 kernels), the performance seems within the margin of error. I think I'll stick to soft=false for now.
Can you explain what you mean by:

> actually run to completion (in most cases it dies after 70 iterations, so 5*70 kernels)

If you have a stacktrace or other information, that would help greatly.
Isolate projects to avoid transitive dependency
Add parameter for passing devices
Incorporate further reviews
Update all dependencies
maleadt
left a comment
Looks good! I added a couple of small comments. Great to see that the performance is good 🙂
JuliaStream.jl/src/CUDAStream.jl
Outdated
```julia
end

function init_arrays!(data::CuData{T}, _, init::Tuple{T,T,T}) where {T}
  CUDA.fill!(data.a, init[1])
```
You shouldn't need the module prefix here, as fill! is just a Base method that dispatches correctly since you're passing a CuArray argument already. Only when that's not the case (e.g. with fill, which just takes an integer) do you need the CUDA-specific version.
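A quick illustration of that distinction (a small, self-contained sketch assuming CUDA.jl is loaded; not taken from the PR):

```julia
using CUDA

a = CuArray{Float32}(undef, 1024)
fill!(a, 1.0f0)             # Base.fill! dispatches on the CuArray argument, so no prefix is needed
b = CUDA.fill(1.0f0, 1024)  # fill has no array argument to dispatch on, so the CUDA-qualified
                            # version is needed to get a CuArray back
```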
JuliaStream.jl/src/CUDAStream.jl
Outdated
```julia
function copy!(data::CuData{T}, _) where {T}
  function kernel(a, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x # only blockIdx starts at 1
```
threadIdx also starts at 1, but our arrays are 1-indexed, so everything works out here.
```julia
    return
  end
  @cuda blocks = data.size ÷ TBSize threads = TBSize kernel(data.b, data.c, data.scalar)
  CUDA.synchronize()
```
I'm not familiar with the underlying architecture here, but do you really need to synchronize after every kernel? That's not great for performance. Our GPU-to-CPU copy already synchronizes, so typically you don't need anything else.
The original implementation issues cudaDeviceSynchronize(); after kernel submission (see line 161 in 8f9ca7b). Does synchronize in Julia do a lot more than cudaDeviceSynchronize()? From what I can see, synchronisation doesn't seem to add any more overhead compared to the native version.
It should have similar overhead.
JuliaStream.jl/src/CUDAStream.jl
Outdated
```julia
  end
  partial_sum = CuArray{T}(undef, DotBlocks)
  @cuda blocks = DotBlocks threads = TBSize kernel(data.a, data.b, data.size, partial_sum)
  CUDA.synchronize()
```
The sum below will run on the same stream as this kernel, so you don't need the synchronization.
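For illustration, a hedged sketch of how the tail of the dot kernel launch could look without the explicit synchronisation; it reuses the names from the excerpt above and is not necessarily the PR's final code:

```julia
    partial_sum = CuArray{T}(undef, DotBlocks)
    @cuda blocks = DotBlocks threads = TBSize kernel(data.a, data.b, data.size, partial_sum)
    # sum executes on the same stream as the kernel, and copying the scalar result
    # back to the host synchronises implicitly, so no CUDA.synchronize() is needed here
    return sum(partial_sum)
```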
JuliaStream.jl/src/oneAPIStream.jl
Outdated
```julia
end

function add!(data::oneData{T}, groupsize::Int) where {T}
  function kernel(a, b, c)
```
It's a very minor detail, but you might want to put the T typevar on the inner kernel function instead, e.g., by using a::AbstractArray{T}, ... as arguments. That way you enforce specialization of the kernel method, whereas the outer CPU method doesn't necessarily need to be specialized (and you can use eltype(data) if you need the element type there).
But these things only matter when you work with very complicated inputs.
Does this apply to all other GPU implementations or just oneAPI?
Ideally this would be an exemplar for writing kernels, so would you recommend I annotate every other GPU implementation like this?
Aside:
I thought about doing this with more precise types (e.g. ::CuDeviceVector{T} for CUDA) but felt like the actual type is an implementation detail of the kernel launch macro (@cuda, @oneapi, etc.), so I guess AbstractArray{T} is the best we can do.
This applies to all Julia code, not just the oneAPI GPU implementation. If you add a typevar (::T ... where T) you force the compiler to specialize. That's needed for the GPU kernel, because we don't want or support dynamic code for GPUs, but the host function (here, add!) may remain underspecialized if the compiler thinks that's useful.
And yeah, the AbstractArray suggestion is so that you don't have to deal with the exact device-side types, which are an implementation detail.
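For illustration, a hedged sketch of the suggestion applied to add!; the field names and the @oneapi launch keywords here are assumptions based on the excerpt above, not the PR's exact code:

```julia
function add!(data::oneData, groupsize::Int)
    # Typevar on the inner kernel forces specialization of the device code,
    # while the outer host method can stay unspecialized.
    function kernel(a::AbstractArray{T}, b::AbstractArray{T}, c::AbstractArray{T}) where {T}
        i = get_global_id()
        @inbounds c[i] = a[i] + b[i]
        return
    end
    @oneapi items = groupsize groups = data.size ÷ groupsize kernel(data.a, data.b, data.c)
    # If the element type is needed on the host side, use eltype(data.a).
end
```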
JuliaStream.jl/src/CUDAStream.jl
Outdated
```julia
    error("arraysize ($(arraysize)) must be divisible by $(TBSize)!")
  end

  # so CUDA's device is 0 indexed, so -1 from Julia
```
CUDA.jl's device! is 0-indexed too.
Oops, the comment was outdated as we switched to passing device instances around instead of indices.
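For reference, a small hedged example of the two ways to select a device with CUDA.jl (illustrative only):

```julia
using CUDA

CUDA.device!(0)                      # select a device by its 0-based index
CUDA.device!(first(CUDA.devices()))  # or pass a CuDevice instance directly
```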
Drop soft=false for AMDGPU as this option was removed
Update dependencies
Update dependencies
```julia
  # This doesn't do well on aarch64 because of the excessive Threads.threadid() ccall
  # and inhibited vectorisation from the lack of @fastmath
  # partial = zeros(T, Threads.nthreads())
  # Threads.@threads for i = 1:data.size
```
Normally I would do:
```julia
@threads for tid in 1:nthreads()
    region = ....
    acc = 0.0
    @simd for i in region
        acc += ...
    end
    partial[tid] = acc
end
```
Or using @tkf FLoops.jl/Folds
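As an illustration, a hedged, runnable version of that pattern for the dot kernel; the function name is hypothetical, and the PR's code uses static_par_ranged for the chunking instead:

```julia
using Base.Threads

function dot_threads(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    # one contiguous chunk of the index space per thread
    chunks = collect(Iterators.partition(eachindex(a, b), cld(length(a), nthreads())))
    partial = zeros(T, length(chunks))
    @threads for tid in 1:length(chunks)
        acc = zero(T)
        @simd for i in chunks[tid]
            acc += a[i] * b[i]
        end
        partial[tid] = acc
    end
    return sum(partial)
end
```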
Is that semantically equivalent to static_par_ranged? (minus the @simd bit)
I think so. Basically that's the canonical example of saying "I want one iteration per thread". Then you have to do the splitting manually, but you also don't have to reimplement the lower-level details xD
https://github.com/JuliaFolds/FLoops.jl#parallel-loop is an alternative
Yeah, it'd be better to use @threads for i in 1:nthreads() than touching internals like task.sticky = true and jl_set_task_tid.
The main difference with FLoops.jl is that it lets the Julia runtime do dynamic load-balancing. It doesn't matter if you are only micro-benchmarking a parallel dot product, but static scheduling like @threads for and static_par_ranged would fail when used inside other parallel functions (it has to wait for all worker threads to be available). So, if the aim is to benchmark practical and composable pieces of routines in Julia, it may be better to use dynamic scheduling. On the other hand, if you want to benchmark what you can squeeze out of the Julia runtime at the moment, static scheduling like @threads for and static_par_ranged may be fine.
Another difference with FLoops.jl is that, by default, it uses divide-and-conquer (DAC) scheduling (@spawn and wait) while @threads for and static_par_ranged use sequential scheduling. I find that DAC (in Julia) sometimes helps performance since it has larger parallelism (in the sense of the work/span ratio), especially when there are a lot of cores.
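For comparison, a hedged sketch of the FLoops.jl variant being described (the function name is hypothetical; the default executor provides the dynamic scheduling mentioned above):

```julia
using FLoops

function dot_floops(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    @floop for i in eachindex(a, b)
        # @reduce with an init expression; the runtime handles chunking and combining
        @reduce(acc = zero(T) + a[i] * b[i])
    end
    return acc
end
```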
JuliaStream.jl/src/ThreadedStream.jl
Outdated
```julia
  partial = Vector{T}(undef, Threads.nthreads())
  static_par_ranged(data.size, Threads.nthreads()) do group, startidx, endidx
    acc = zero(T)
    @fastmath for i = startidx:endidx
```
Use @simd instead of @fastmath; that makes sure that only the reduction variable gets reordered.
Thanks! I didn't really want to use @fastmath in a benchmark like STREAM, and I'm happy to report I get the same runtime with @simd on both ThunderX2 and A64fx!
On the A64FX node of Isambard, I've got a good speed-up with @simd when setting the environment variable JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512": JuliaLang/julia#40308 (comment). Note that this requires LLVM 12, so Julia v1.7.
Yeah @simd finds the reduction chain, and then adds a subset of the fast math flags onto that chain.
For the A64fx you might also want to set JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" see JuliaLang/julia#40308 (comment)
BTW, I didn't show in that comment that there is also good scaling with the size of the floating-point type (including half-precision Float16):
```
$ JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" julia -q

julia> using BenchmarkTools

julia> function sumsimd(x)
           s = zero(eltype(x))
           @simd for xi in x
               s += xi
           end
           s
       end
sumsimd (generic function with 1 method)

julia> @btime sumsimd(x) setup=(x = randn(Float64, 1_000_000))
  191.912 μs (0 allocations: 0 bytes)
1853.0335322487956

julia> @btime sumsimd(x) setup=(x = randn(Float32, 1_000_000))
  80.330 μs (0 allocations: 0 bytes)
400.9806f0

julia> @btime sumsimd(x) setup=(x = randn(Float16, 1_000_000))
  42.761 μs (0 allocations: 0 bytes)
Float16(1.872e3)
```

Related to this, you can watch this lightning talk at JuliaCon 2021 about the speedup from using Float16 on Isambard: https://www.youtube.com/watch?v=btHfZr2C0GA
Note that the speedup reported at JuliaCon requires JuliaLang/julia#40216
This PR adds the Julia implementation of BabelStream with the following implementations:
- PlainStream.jl - Single-threaded for loops
- ThreadedStream.jl - Threaded implementation with Threads.@threads macros
- DistributedStream.jl - Process-based parallelism with @distributed macros
- CUDAStream.jl - Direct port of BabelStream's native CUDA implementation using CUDA.jl
- AMDGPUStream.jl - Direct port of BabelStream's native HIP implementation using AMDGPU.jl

See README.md for details on build and run instructions.
Performance is surprisingly good across all supported hardware platforms.
All benchmarks use Julia 1.6.1; specific versions of each package are available in Manifest.toml.
AMDGPU.jl is currently still in heavy development, although the project reports most core features are working.
At the time of this PR, there are still a few issues that make it unsuitable for production use:
Finally, there isn't a process-based (e.g. MPI) implementation of BabelStream, so a comparison for DistributedStream.jl has been omitted. That said, performance seems to be significantly worse than ThreadedStream.jl due to the added serialisation overhead.

Future work

We should be able to include oneAPI.jl once it is ready for general use.
There's also OpenCL.jl but it simply wraps the OpenCL host API; kernels must still be written in OpenCL C, so this wouldn't be any different from BabelStream's OCLStream.