I found it really interesting that the SIMD array was slower than the tuple one. After a bit of investigation, there was a hint that `x[i] &+= s` might actually be producing a copy of the SIMD value, e.g.:
```swift
// Possible copy?
// x[i] &+= s
var vec = x[i]
vec &+= s
x[i] = vec
```
So I changed the function types and benchmarks, adding a wrapper whose subscript uses the `_read` and `_modify` accessors to avoid a possible copy, to see the effect:
```swift
// Declarations are `public` (and `base` is `@usableFromInline`) so that
// the `@inlinable` attributes are accepted by the compiler.
public class List<T>: CustomStringConvertible {
    @usableFromInline
    var base: [T]

    public init(repeating value: T, count: Int) {
        base = Array(repeating: value, count: count)
    }

    @inlinable
    public var count: Int { base.count }

    @inlinable
    public subscript(_ i: Int) -> T {
        // Yield the element in place instead of going through get/set.
        _modify {
            yield &base[i]
        }
        _read {
            yield base[i]
        }
    }

    public var description: String { base.description }
}

@inline(__always)
func vec(x: inout List<(Int8, Int8, Int8, Int8)>, s: Int8) {
    for i in 0..<x.count {
        x[i].0 += s
        x[i].1 += s
        x[i].2 += s
        x[i].3 += s
    }
}

@inline(__always)
func vec1(x: inout List<SIMD4<Int8>>, s: Int8) {
    for i in 0..<x.count {
        x[i] &+= s
    }
}

// `count` is defined outside this snippet.
var a = List<(Int8, Int8, Int8, Int8)>(repeating: (1, 8, 5, 6), count: count)
var b = List<SIMD4<Int8>>(repeating: [1, 8, 5, 6], count: count)

let vecBench = MyBenchmark("vecTupleList", settings: [Iterations(1000)],
    closure: {
        vec(x: &a, s: 5)
    }, tearDownClosure: {
        // Decrement to avoid overflow across iterations, since we are using the same buffer.
        vec(x: &a, s: -5)
    })
defaultBenchmarkSuite.register(benchmark: vecBench)

let vec1Bench = MyBenchmark("vecSIMDList", settings: [Iterations(1000)],
    closure: {
        vec1(x: &b, s: 5)
    }, tearDownClosure: {
        // Decrement to avoid overflow across iterations, since we are using the same buffer.
        vec1(x: &b, s: -5)
    })
defaultBenchmarkSuite.register(benchmark: vec1Bench)
```
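For reference, here is a minimal sketch of the difference the wrapper is meant to probe. The types `AccessLog`, `GetSetList` and `ModifyList` are hypothetical and exist only for this illustration; they are not part of the benchmark, and the sketch says nothing about how `Array` implements its own subscript. It only shows that a compound assignment through a plain `get`/`set` subscript reads the element out and writes it back, while a `_modify` accessor is entered once and mutates the element in place.

```swift
// Hypothetical helper that counts which accessors were entered.
final class AccessLog {
    var gets = 0
    var sets = 0
    var reads = 0
    var modifies = 0
}

// Wrapper with a classic get/set subscript.
struct GetSetList {
    var base: [Int]
    let log = AccessLog()

    subscript(_ i: Int) -> Int {
        get {
            log.gets += 1
            return base[i]
        }
        set {
            log.sets += 1
            base[i] = newValue
        }
    }
}

// Wrapper with _read/_modify accessors, as in the List type above.
struct ModifyList {
    var base: [Int]
    let log = AccessLog()

    subscript(_ i: Int) -> Int {
        _read {
            log.reads += 1
            yield base[i]
        }
        _modify {
            log.modifies += 1
            yield &base[i]
        }
    }
}

var g = GetSetList(base: [1, 2, 3])
g[0] += 5   // getter, then setter: the element is read out and written back

var m = ModifyList(base: [1, 2, 3])
m[0] += 5   // a single _modify access: the element is mutated in place

print("get/set:", g.log.gets, g.log.sets)
print("_read/_modify:", m.log.reads, m.log.modifies)
```

Whether the read-and-write-back pattern costs anything after optimization depends on the element type and on what the optimizer can see, which is what the benchmark tries to measure.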
We can see that the SIMD version using a list with a `_read`/`_modify` subscript is now as fast as expected, although for the tuple type it didn't really change much. We can also see that the SIMD contiguous array performs as expected.
So here are the questions:
I note that there are some other things happening in the Array subscript, for example handling possible NSArray bridging and bounds checks. So is it expected to be slower (given that the contiguous array, which doesn't have all of that, is as fast as expected)?
Could there really be a copy of the SIMD vector element (as in the "Possible copy?" snippet) that causes a slowdown in that function? Is it something that should have been caught by the optimizer but wasn't?
Emitting x86-64 assembly for the SIMD Array function, we can see some call instructions, `call swift_isUniquelyReferenced_nonNull_native@PLT`, which I suspect are for COW. Could this be a factor in the final runtime performance?
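One way to probe the last question would be to compare against a variant that asks for the storage once and then loops over a buffer pointer. The sketch below assumes a hypothetical function `vec1Buffer` that is not part of the original benchmark. `Array.withUnsafeMutableBufferPointer` makes the array's storage unique before running the closure, so any copy-on-write work is done once per call rather than being a candidate for every element access. This is only a comparison point under that assumption, not a claim about what the optimizer does with the original loop.

```swift
// Hypothetical variant of vec1 for comparison; not part of the original benchmark.
func vec1Buffer(x: inout [SIMD4<Int8>], s: Int8) {
    // Uniqueness of the storage is established once, before the closure runs.
    x.withUnsafeMutableBufferPointer { buffer in
        for i in buffer.indices {
            buffer[i] &+= s
        }
    }
}

var values = [SIMD4<Int8>](repeating: [1, 8, 5, 6], count: 4)
let snapshot = values          // shares storage with `values` until the next mutation
vec1Buffer(x: &values, s: 5)   // storage is shared here, so it is copied before mutating

print(values[0], snapshot[0])
```

The `snapshot` copy is left untouched, which is the copy-on-write behavior that the uniqueness check exists to guarantee.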
Note: I'm opening this as a bug, but feel free to change it to a task or improvement if that fits better. I'm mostly trying to understand the details of why this may be happening =]
Adding the full benchmark code as a file attachment.
Attachment: Download
Environment
Xcode 12.4 (12D4e)
Benchmark tool: https://github.com/google/swift-benchmark
Additional Detail from JIRA
md5: 57f0fbd352711ede6de9a35e4c6c6346
Issue Description:
Using https://github.com/google/swift-benchmark
Given those functions and benchmarks of experimenting with SIMD:
The result we see
The results are the following