Conversation

@tom91136
Member

This PR adds a Julia port of BabelStream with the following implementations:

  • PlainStream.jl - Single-threaded implementation using plain for loops (see the sketch after this list)
  • ThreadedStream.jl - Threaded implementation using the Threads.@threads macro
  • DistributedStream.jl - Process-based parallelism using the @distributed macro
  • CUDAStream.jl - Direct port of BabelStream's native CUDA implementation using CUDA.jl
  • AMDGPUStream.jl - Direct port of BabelStream's native HIP implementation using AMDGPU.jl
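
A hedged sketch (not the PR's exact code) of how the first two variants express the triad kernel:

function triad_plain!(a, b, c, scalar)
    for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + scalar * c[i]
    end
end

function triad_threaded!(a, b, c, scalar)
    Threads.@threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + scalar * c[i]
    end
end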

See README.md for build and run instructions.

Performance is surprisingly good across all supported hardware platforms.
All benchmarks use Julia 1.6.1; the specific version of each package is recorded in Manifest.toml

[Benchmark plots: omp, cuda]

AMDGPU.jl is still in heavy development, although the project reports that most core features are working.
At the time of this PR, there are still a few issues that make it unsuitable for production use:

  • Kernel performance is inconsistent (tested on Radeon VII w/ ROCm 3.10)
  • The API for device selection is missing; the implementation had to resort to introspection
  • Rapid kernel submission seems to cause ref-count overflow issues if we attempt to wait for each kernel to complete, which is a bit counter-intuitive. The workaround is to use hardware-based events for synchronisation.

[Benchmark plot: hip]

Finally, there isn't a process-based (e.g. MPI) implementation of BabelStream, so a comparison for DistributedStream.jl has been omitted.
That said, its performance seems to be significantly worse than ThreadedStream.jl due to the added serialisation overhead.

Future work

We should be able to include oneAPI.jl once it is ready for general use.

There's also OpenCL.jl but it simply wraps the OpenCL host API; kernels must still be written in OpenCL C, so this wouldn't be any different from BabelStream's OCLStream.

@tom91136 tom91136 requested a review from tomdeakin June 10, 2021 04:15
@tomdeakin tomdeakin added this to the v4.0 milestone Jun 22, 2021

@tomdeakin tomdeakin left a comment

Interesting to see this - I think it's pretty close. In general I'd like to make sure this is idiomatic Julia code and we're not introducing any extra layers of abstraction in the benchmark.

@@ -0,0 +1,8 @@

Contributor

Why do we need to wrap up the data arrays in a whole new struct?

Member Author

It's similar to what we have for Rust; there are no classes in Julia, only functions and data (structs).
The alternative is to pass everything as parameters:

function mul!(data::StreamData{T}) where {T}

becomes

function mul!(a::C, b::C, c::C, scalar::T, size::Int) where {T,C<:AbstractArray{T}}

Same applies when I copy everything back to host:

const ROCData = StreamData{T,ROCArray{T}} where {T}
...
function read_data(data::ROCData{T})::VectorData{T} where {T}
  return VectorData{T}(data.a, data.b, data.c, data.scalar, data.size)
end

read_data would be very long, and it would need to be written out for all 5 implementations.
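
For reference, a sketch of the kind of wrapper under discussion (field names taken from the snippets above; the real definition lives in the PR):

struct StreamData{T,C<:AbstractArray{T}}
    a::C
    b::C
    c::C
    scalar::T
    size::Int
end

const VectorData = StreamData{T,Vector{T}} where {T}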

Contributor

I think I prefer passing things as parameters.

@tom91136
Member Author

Ready for review again.

@tomdeakin
Contributor

Thanks @tom91136. I think I prefer the parameter passing rather than making a structure just to hold the arrays. I think that in a larger code with more arrays, you're just going to have to pass things around rather than keep wrapping things up in bundles to pass to different functions. Ideally we're aiming to write BabelStream in a way that is representative of something much bigger.

const DotBlocks = 256::Int

function devices()
return !CUDA.functional(false) ? [] :
Contributor

Minor nitpick: make the function type-stable

Suggested change
return !CUDA.functional(false) ? [] :
return !CUDA.functional(false) ? String[] :
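
A minimal illustration of the nitpick with a hypothetical function (not from the PR): the bare [] literal is a Vector{Any}, so the two branches would return different types.

unstable(ok::Bool) = ok ? ["Device 0"] : []        # inferred as Union{Vector{Any}, Vector{String}}
stable(ok::Bool)   = ok ? ["Device 0"] : String[]  # always Vector{String}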

map(repr, gpu_agents_in_repr_order())
catch
# probably unsupported
[]
Contributor

Suggested change
[]
String[]

end
end

function copy!(data::SharedArrayData{T}) where {T}
Contributor

Note that here and in the functions below, where T isn't really used to constrain the signature or in the body of the function, the parameter is redundant and

Suggested change
function copy!(data::SharedArrayData{T}) where {T}
function copy!(data::SharedArrayData)

would work just fine
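
A small sketch of the distinction with a stand-in struct and names (the real SharedArrayData lives in the PR):

struct Data{T,C<:AbstractArray{T}}
    a::C
    c::C
end

init!(data::Data{T}) where {T} = fill!(data.a, zero(T))  # T is used (zero(T)), so the typevar earns its keep
mycopy!(data::Data) = copyto!(data.c, data.a)            # T never referenced, so the plain signature suffices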

Comment on lines 177 to 181
if config.float
type = Float32
else
type = Float64
end
Contributor

For the record, you can also do

Suggested change
if config.float
type = Float32
else
type = Float64
end
type = if config.float
Float32
else
Float64
end

or even

Suggested change
if config.float
type = Float32
else
type = Float64
end
type = config.float ? Float32 : Float64

if you want to be even more concise. Also, in the if below you can save a couple of benchmark = assignments in the same way.

@giordano
Contributor

Performance is surprisingly good across all supported hardware platforms.

😃

Is there something we can do to move this forward? I had a very quick look, but could do a more thorough review if that helps.

@tom91136
Member Author

@giordano Thanks for the review! This PR will be used for an upcoming submission to PMBS, so I have a few more local changes (I've added functional oneAPI.jl and KernelAbstractions.jl implementations) that I'm in the process of finalising. I'll incorporate your review and put up a final version for further review by the end of the week.
If you're interested, the PMBS submission will also include a compute bound benchmark written in Julia.

@tomdeakin and I had a discussion on the parameter passing and I think we've settled on the current approach being acceptable.


@jpsamaroo jpsamaroo left a comment

Cool stuff! Let me know if I can help with anything AMDGPU.jl-related (I'm the author and maintainer of it).

return
end
AMDGPU.wait(
soft = false, # soft wait causes HSA_REFCOUNT overflow issues

This shouldn't be an issue on AMDGPU.jl 0.2.11, where I've disabled HSA_REFCOUNT until I can figure out why it's broken.

Still, hard wait could be more efficient than soft (since it hypothetically can use interrupts), but I'd make sure to test with both.

Member Author

I've tested it with and without soft wait, and for the very few times soft=true actually ran to completion (in most cases it dies after 70 iterations, so 5*70 kernels), the performance seems within margin of error. I think I'll stick to soft=false for now.

Can you explain what you mean by:

actually run to completion (in most cases it dies after 70 iterations, so 5*70 kernels)

If you have a stacktrace or other information, that would help greatly.

Isolate projects to avoid transitive dependency
Add parameter for passing devices
Incorporate further reviews
Update all dependencies

@maleadt maleadt left a comment

Looks good! I added a couple of small comments. Great to see that the performance is good 🙂

end

function init_arrays!(data::CuData{T}, _, init::Tuple{T,T,T}) where {T}
CUDA.fill!(data.a, init[1])

You shouldn't need the module prefix here, as fill! is just a Base method that dispatches correctly since you're passing a CuArray argument already. Only when that's not the case (e.g. with fill, which just takes an integer) do you need the CUDA-specific version.
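
A small check of that point (illustrative, not from the PR):

using CUDA

a = CuArray{Float32}(undef, 1024)
fill!(a, 0.1f0)             # Base.fill! dispatches to the CUDA.jl method via the CuArray argument
b = CUDA.fill(0.2f0, 1024)  # prefix needed here: only a value and a size are passed, no array to dispatch on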


function copy!(data::CuData{T}, _) where {T}
function kernel(a, c)
i = (blockIdx().x - 1) * blockDim().x + threadIdx().x # only blockIdx starts at 1

threadIdx also starts at 1, but our arrays are 1-indexed, so everything works out here.

return
end
@cuda blocks = data.size ÷ TBSize threads = TBSize kernel(data.b, data.c, data.scalar)
CUDA.synchronize()

I'm not familiar with the underlying architecture here, but do you really need to synchronize after every kernel? That's not great for performance. Our GPU-to-CPU copy already synchronizes, so typically you don't need anything else.

Member Author

The original implementation issues cudaDeviceSynchronize(); after kernel submission (see the cudaDeviceSynchronize(); call in the native CUDA implementation), so this port must match that. Does synchronize in Julia do a lot more than cudaDeviceSynchronize()? From what I can see, synchronisation doesn't seem to add any more overhead compared to the native version.


It should have similar overhead.
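
For context, a hedged, self-contained sketch (names and sizes assumed, not the PR's code) of why the per-kernel synchronize is there when each launch is timed on the host:

using CUDA

const TBSize = 1024
n = 2^20
a = CUDA.fill(0.1f0, n)
c = CUDA.fill(0.0f0, n)

function copy_kernel!(a, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds c[i] = a[i]
    return
end

timings = Float64[]
for _ in 1:100
    t = @elapsed begin
        @cuda blocks = n ÷ TBSize threads = TBSize copy_kernel!(a, c)
        CUDA.synchronize()  # without this, @elapsed would only measure the asynchronous launch
    end
    push!(timings, t)
end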

end
partial_sum = CuArray{T}(undef, DotBlocks)
@cuda blocks = DotBlocks threads = TBSize kernel(data.a, data.b, data.size, partial_sum)
CUDA.synchronize()

The sum below will run on the same stream as this kernel, so you don't need the synchronization.
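
A self-contained illustration of that ordering (hypothetical kernel, not the PR's dot product): the reduction behind sum is queued on the same stream, and fetching the scalar result already synchronises.

using CUDA

function scale_kernel!(a, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds a[i] = a[i] * s
    return
end

a = CUDA.fill(1.0f0, 1024)
@cuda blocks = 4 threads = 256 scale_kernel!(a, 2.0f0)
total = sum(a)  # ordered after scale_kernel! on the same stream; returns 2048.0f0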

end

function add!(data::oneData{T}, groupsize::Int) where {T}
function kernel(a, b, c)

It's a very minor detail, but you might want to put the T typevar on the inner kernel function instead, e.g., by using a::AbstractArray{T}, ... as arguments. That way you enforce specialization of the kernel method, whereas the outer CPU method doesn't necessarily need to be specialized (and you can use eltype(data) if you need the element type there).

But these things only matter when you work with very complicated inputs.

Member Author

Does this apply to all other GPU implementations or just oneAPI?
Ideally this would be an exemplar for writing kernels, so would you recommend I annotate every other GPU implementation like this?

Aside:
I thought about doing this with more precise types (e.g. ::CuDeviceVector{T} for CUDA) but felt like the actual type is an implementation detail of the kernel launch macro (@cuda, @oneapi, etc.), so I guess AbstractArray{T} is the best we can do.


This applies to all Julia code, not just the oneAPI GPU implementation. If you add a typevar (::T ... where T) you force the compiler to specialize. That's needed for the GPU kernel, because we don't want or support dynamic code for GPUs, but the host function (here, add!) may remain underspecialized if the compiler thinks that's useful.

And yeah, the AbstractArray suggestion is so that you don't have to deal with the exact device-side types, which are an implementation detail.
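
A hedged sketch of that suggestion, written with the CUDA macros used elsewhere in this PR (field names mirror the StreamData discussion above; TBSize is assumed): the typevar sits on the inner kernel so the device code specializes, while the outer host function stays unconstrained.

using CUDA

const TBSize = 1024

function add!(data)
    function kernel(a::AbstractArray{T}, b::AbstractArray{T}, c::AbstractArray{T}) where {T}
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        @inbounds c[i] = a[i] + b[i]
        return
    end
    @cuda blocks = data.size ÷ TBSize threads = TBSize kernel(data.a, data.b, data.c)
    CUDA.synchronize()
end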

error("arraysize ($(arraysize)) must be divisible by $(TBSize)!")
end

# so CUDA's device is 0 indexed, so -1 from Julia

CUDA.jl's device! is 0-indexed too.

Member Author

Oops, the comment was outdated as we switched to passing device instances around instead of indices.

# This doesn't do well on aarch64 because of the excessive Threads.threadid() ccall
# and inhibited vectorisation from the lack of @fastmath
# partial = zeros(T, Threads.nthreads())
# Threads.@threads for i = 1:data.size

Normally I would do:

@threads for tid in 1:nthreads()
    region = ....
    acc = 0.0
    @simd for i in region
       acc += ...
    end
    partial[tid] = acc
end

Or use @tkf's FLoops.jl/Folds.jl


@tom91136 tom91136 Aug 28, 2021

Is that semantically equivalent to static_par_ranged? (minus the @simd bit)

I think so. Basically that's the canonical example of saying "I want one iteration per thread". Then you have to do the splitting manually, but you also don't have to reimplement the lower-level details xD

@tkf tkf Aug 28, 2021

Yeah, it'd be better to use @threads for i in 1:nthreads() than touching internals like task.sticky = true and jl_set_task_tid.

The main difference to FLoops.jl is that it lets the Julia runtime do dynamic load-balancing. It doesn't matter if you are only micro-benchmarking a parallel dot product, but static scheduling like @threads for and static_par_ranged would fail when used inside other parallel functions (it has to wait for all worker threads to be available). So, if the aim is to benchmark practical and composable pieces of routines in Julia, it may be better to use dynamic scheduling. On the other hand, if you want to benchmark what you can squeeze out of the Julia runtime at the moment, static scheduling like @threads for and static_par_ranged may be fine.

Another difference to FLoops.jl is that, by default, it uses divide-and-conquer (DAC) scheduling (@spawns and waits) while @threads for and static_par_ranged use sequential scheduling. I find that DAC (in Julia) sometimes helps performance since it has larger parallelism (in the sense of work/span ratio), especially when there are a lot of cores.
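
A minimal sketch of the FLoops.jl alternative mentioned above (array names assumed): @reduce handles the per-task accumulators and the final combine, and the default executor schedules tasks dynamically.

using FLoops

function dot_floop(a, b)
    @floop for i in eachindex(a, b)
        @reduce(acc += a[i] * b[i])
    end
    return acc
end

dot_floop(rand(1_000_000), rand(1_000_000))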

partial = Vector{T}(undef, Threads.nthreads())
static_par_ranged(data.size, Threads.nthreads()) do group, startidx, endidx
acc = zero(T)
@fastmath for i = startidx:endidx

Use @simd instead of @fastmath; that makes sure that only the reduction variable gets reordered.
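
A self-contained sketch of that suggestion (the chunking helper and names are illustrative, not the PR's static_par_ranged): @simd sits only on the per-thread reduction loop.

function dot_threaded(a::Vector{T}, b::Vector{T}) where {T}
    nt = Threads.nthreads()
    partial = Vector{T}(undef, nt)
    Threads.@threads for tid in 1:nt
        lo = div((tid - 1) * length(a), nt) + 1
        hi = div(tid * length(a), nt)
        acc = zero(T)
        @simd for i = lo:hi
            @inbounds acc += a[i] * b[i]
        end
        partial[tid] = acc
    end
    return sum(partial)
end

dot_threaded(rand(1_000_000), rand(1_000_000))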

Member Author

Thanks! I didn't really want to use @fastmath in a benchmark like STREAM, and I'm happy to report I get the same runtime with @simd on both ThunderX2 and A64fx!

Contributor

On the A64FX node of Isambard, I got a good speedup with @simd when setting the environment variable JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512": JuliaLang/julia#40308 (comment). Note that this requires LLVM 12, so Julia v1.7.

Yeah @simd finds the reduction chain, and then adds a subset of the fast math flags onto that chain.

For the A64fx you might also want to set JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512"; see JuliaLang/julia#40308 (comment)

Contributor

BTW, I didn't show in that comment that there is also good scaling with the size of the floating-point type (including half-precision Float16):

$ JULIA_LLVM_ARGS="-aarch64-sve-vector-bits-min=512" julia -q
julia> using BenchmarkTools

julia> function sumsimd(x)
           s = zero(eltype(x))
           @simd for xi in x
               s += xi
           end
           s
       end
sumsimd (generic function with 1 method)

julia> @btime sumsimd(x) setup=(x = randn(Float64, 1_000_000))
  191.912 μs (0 allocations: 0 bytes)
1853.0335322487956

julia> @btime sumsimd(x) setup=(x = randn(Float32, 1_000_000))
  80.330 μs (0 allocations: 0 bytes)
400.9806f0

julia> @btime sumsimd(x) setup=(x = randn(Float16, 1_000_000))
  42.761 μs (0 allocations: 0 bytes)
Float16(1.872e3)

Related to this, you can watch this lightning talk at JuliaCon 2021 about speedup on using Float16 on Isambard: https://www.youtube.com/watch?v=btHfZr2C0GA

Note that the speedup reported at JuliaCon requires JuliaLang/julia#40216
