This is strangely specific, but enabling multithreading in FastBroadcast slows this broadcast down by a factor of roughly 200 (from about 17 ns to about 3.1 μs):
julia> using BenchmarkTools, FastBroadcast, ForwardDiff
julia> N = 100; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
  17.873 ns (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
  16.712 ns (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
  3.101 μs (1 allocation: 48 bytes)
The same slowdown appears when extracting the partials instead of the value. Is there something special about the Dual struct layout, or about how these accessor functions are written, that makes multithreaded broadcasting so slow here?
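
One way to check whether Dual is really the culprit would be a control run on a plain Float64 vector. The following is only a sketch of such a control, not something measured in the session above: the names y and w and the choice of abs are placeholders introduced for illustration, and no timings are claimed for it.

```julia
using BenchmarkTools, FastBroadcast

# Control case (hypothetical): plain Float64 input, no ForwardDiff.Dual involved.
N = 100
y = randn(N)   # placeholder input, not from the original session
w = zeros(N)   # placeholder output buffer

# Report the thread count, since thread=true cannot help with a single thread.
@show Threads.nthreads()

@btime @.. $w = abs($y)              # serial FastBroadcast
@btime @.. thread=true $w = abs($y)  # threaded FastBroadcast
```

If the thread=true line turns out similarly slow here, that would point to a fixed per-call threading cost at N = 100 rather than to anything specific to ForwardDiff. If it stays fast, the Dual element type becomes the more likely suspect.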