[SR-14354] Array using subscript showing slower in benchmarks #56713

Open
LucianoPAlmeida opened this issue Mar 14, 2021 · 2 comments

Comments

Collaborator

@LucianoPAlmeida LucianoPAlmeida commented Mar 14, 2021

Previous ID SR-14354
Radar rdar://problem/75452086
Original Reporter @LucianoPAlmeida
Type Bug

Attachment: Download

Environment

Xcode 12.4 (12D4e)
Benchmark tool: https://github.com/google/swift-benchmark

Additional Detail from JIRA
Votes 0
Component/s Compiler, Standard Library
Labels Bug
Assignee None
Priority Medium

md5: 57f0fbd352711ede6de9a35e4c6c6346

Issue Description:

Using https://github.com/google/swift-benchmark
Given the following functions and benchmarks from an experiment with SIMD:

let count = 1_000_000

var c = [SIMD4<Int8>](repeating: [1, 8, 5, 6], count: count)
var d = ContiguousArray<SIMD4<Int8>>(repeating: [1, 8, 5, 6], count: count)
var e = ContiguousArray<(Int8, Int8, Int8, Int8)>(repeating: (1, 8, 5, 6), count: count)
var f = [(Int8, Int8, Int8, Int8)](repeating: (1, 8, 5, 6), count: count)

@inline(__always)
func vecCont(x: inout ContiguousArray<(Int8, Int8, Int8, Int8)>, s: Int8) {
  for i in 0..<x.count {
    x[i].0 += s
    x[i].1 += s
    x[i].2 += s
    x[i].3 += s
  }
}

// vecArr is called by the vecTupleArray benchmark below but is not shown in this
// snippet; assumed here to mirror vecCont for a plain Array.
@inline(__always)
func vecArr(x: inout [(Int8, Int8, Int8, Int8)], s: Int8) {
  for i in 0..<x.count {
    x[i].0 += s
    x[i].1 += s
    x[i].2 += s
    x[i].3 += s
  }
}

@inline(__always)
func vec1Arr(x: inout [SIMD4<Int8>], s: Int8) {
  for i in 0..<x.count {
    x[i] &+= s
  }
}

@inline(__always)
func vec1ContArr(x: inout ContiguousArray<SIMD4<Int8>>, s: Int8) {
  for i in 0..<x.count {
    x[i] &+= s
  }
}
// Benchmarks
let vecTupleBench = MyBenchmark("vecTupleContArray", settings: [Iterations(1000)],
                           closure: {
                             vecCont(x: &e, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vecCont(x: &e, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vecTupleBench)

let vecTupleArrBench = MyBenchmark("vecTupleArray", settings: [Iterations(1000)],
                           closure: {
                             vecArr(x: &f, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vecArr(x: &f, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vecTupleArrBench)

let vec2Bench = MyBenchmark("vecSIMDArray", settings: [Iterations(1000)],
                           closure: {
                             vec1Arr(x: &c, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vec1Arr(x: &c, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vec2Bench)

let vec3Bench = MyBenchmark("vecSIMDContArray", settings: [Iterations(1000)],
                           closure: {
                             vec1ContArr(x: &d, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vec1ContArr(x: &d, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vec3Bench)

The results we see:

name              time     std        iterations
------------------------------------------------
vecTupleContArray 2.478 ms ±   3.54 %       1000
vecTupleArray     1.802 ms ±   8.39 %       1000
vecSIMDArray      2.239 ms ±   4.76 %       1000
vecSIMDContArray  0.631 ms ±   6.80 %       1000

I found it really interesting that the SIMD array is slower than the tuple array. After a bit of investigation, I found a hint that `x[i] &+= s` may actually be producing a copy of the SIMD value, e.g.

// Possible copy?
//  x[i] &+= s
var vec = x[i]
vec &+= s
x[i] = vec
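
The suspected read-modify-write pattern can be made observable with a small counting wrapper. This is only a sketch: `CountingBox` is a hypothetical type, not part of the benchmark, and it says nothing about what `Array`'s own subscript does after optimization. It only shows that a subscript offering nothing but `get`/`set` has to go through a temporary copy for a compound assignment.

```swift
// Hypothetical wrapper for illustration only (not from the benchmark).
// Its subscript has just get/set, so `box[i] &+= s` must read a copy,
// modify the copy, and write it back.
final class CountingBox {
  var storage: [SIMD4<Int8>]
  private(set) var gets = 0
  private(set) var sets = 0

  init(_ storage: [SIMD4<Int8>]) { self.storage = storage }

  subscript(i: Int) -> SIMD4<Int8> {
    get {
      gets += 1
      return storage[i]
    }
    set {
      sets += 1
      storage[i] = newValue
    }
  }
}

let box = CountingBox([SIMD4<Int8>(1, 8, 5, 6)])
let s: Int8 = 5
box[0] &+= s  // goes through one get followed by one set
print(box.gets, box.sets)
```

A subscript that provides `_modify` instead yields the element in place, which is what the wrapper in the next snippet relies on.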

So I changed the function types and benchmarks, adding a wrapper whose subscript uses the _read and _modify accessors to avoid a possible copy, to see the effect:

class List<T>: CustomStringConvertible {
  var base: [T]
  
  init(repeating value: T, count: Int) {
    base = Array(repeating: value, count: count)
  }

  @inlinable
  var count : Int { base.count }
  
  @inlinable
  subscript(_ i: Int) -> T {
    _modify {
      yield &base[i]
    }
    _read {
      yield base[i]
    }
  }
  var description: String { base.description }
}

@inline(__always)
func vec(x: inout List<(Int8, Int8, Int8, Int8)>, s: Int8) {
  for i in 0..<x.count {
    x[i].0 += s
    x[i].1 += s
    x[i].2 += s
    x[i].3 += s
  }
}

@inline(__always)
func vec1(x: inout List<SIMD4<Int8>>, s: Int8) {
  for i in 0..<x.count {
    x[i] &+= s
  }
}

var a = List<(Int8, Int8, Int8, Int8)>(repeating: (1, 8, 5, 6), count: count)
var b = List<SIMD4<Int8>>(repeating: [1, 8, 5, 6], count: count)

let vecBench = MyBenchmark("vecTupleList", settings: [Iterations(1000)],
                           closure: {
                             vec(x: &a, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vec(x: &a, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vecBench)

let vec1Bench = MyBenchmark("vecSIMDList", settings: [Iterations(1000)],
                           closure: {
                             vec1(x: &b, s: 5)
                           }, tearDownClosure: {
                             // Decrement to avoid overflow across iterations, since we are reusing the same buffer.
                             vec1(x: &b, s: -5)
                           })
defaultBenchmarkSuite.register(benchmark: vec1Bench)

The results are the following

name              time     std        iterations
------------------------------------------------
vecTupleList      1.708 ms ±   4.94 %       1000
vecSIMDList       0.354 ms ±   8.54 %       1000
vecTupleContArray 2.491 ms ±   3.36 %       1000
vecTupleArray     1.804 ms ±   4.15 %       1000
vecSIMDArray      2.251 ms ±   5.26 %       1000
vecSIMDContArray  0.632 ms ±   6.36 %       1000

Note that the SIMD version using a list with a _read/_modify subscript is now fast, as expected (0.354 ms, roughly 6x faster than vecSIMDArray at 2.251 ms), although for the tuple type it made little difference (1.708 ms vs. 1.804 ms). Note also that the SIMD contiguous array performs as expected.
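
Since the tuple loops go through the subscript four times per element, one further variant that might be worth measuring performs a single subscript access per element by passing the element `inout` to a helper. The names `addToAll` and `vecContSingleAccess` are hypothetical, and this variant was not part of the measurements above, so whether it changes the timings is an open question.

```swift
// Hypothetical helper (not from the benchmark): mutates all four lanes
// of one tuple through a single inout access.
@inline(__always)
func addToAll(_ t: inout (Int8, Int8, Int8, Int8), _ s: Int8) {
  t.0 += s
  t.1 += s
  t.2 += s
  t.3 += s
}

// Hypothetical variant of vecCont with one subscript access per element
// instead of four.
func vecContSingleAccess(x: inout ContiguousArray<(Int8, Int8, Int8, Int8)>, s: Int8) {
  for i in x.indices {
    addToAll(&x[i], s)
  }
}

var tuples = ContiguousArray<(Int8, Int8, Int8, Int8)>(repeating: (1, 8, 5, 6), count: 3)
vecContSingleAccess(x: &tuples, s: 5)
```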

So here are the questions

  • I note that there are some other things happening in the Array subscript, for example handling possible NSArray bridging and bounds checks. Is it therefore expected to be slower, given that ContiguousArray, which doesn't have all of that, is as fast as expected?

  • Could there really be a copy of the SIMD vector element (as in the "Possible copy?" snippet) that causes a slowdown in that function? Is this something the optimizer should have caught but didn't?

  • Emitting x86-64 assembly for the SIMD Array function, we can see some call instructions, `call swift_isUniquelyReferenced_nonNull_native@PLT`, which I suspect are for COW. Could this be a factor in the final runtime performance?
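
One way to probe the last question is to compare against a variant that mutates the elements through `withUnsafeMutableBufferPointer`. As far as I understand, the array makes its storage unique once before the closure runs, so the loop body itself should not need a uniqueness check per element; treat that as an assumption to verify in the emitted assembly rather than a confirmed explanation. The function name `vec1ArrUnsafe` is hypothetical and not part of the attached benchmark.

```swift
// Hypothetical variant of vec1Arr (not from the attached benchmark).
// The loop works on the raw buffer, so any uniqueness or bridging work
// is expected to happen once up front rather than on every element.
func vec1ArrUnsafe(x: inout [SIMD4<Int8>], s: Int8) {
  x.withUnsafeMutableBufferPointer { buffer in
    for i in buffer.indices {
      buffer[i] &+= s
    }
  }
}

var values = [SIMD4<Int8>](repeating: [1, 8, 5, 6], count: 4)
vec1ArrUnsafe(x: &values, s: 5)
```

If this variant lands near the vecSIMDContArray or vecSIMDList timings, that would point toward the per-element checks in the Array subscript path; if not, the copy hypothesis is the more likely one.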

Note: I'm opening this as a bug, but feel free to change it to a task or improvement if that is a better fit. I'm mostly trying to understand the details of why this may be happening =]

Adding the full benchmark code as a file attachment.

Collaborator Author

@LucianoPAlmeida LucianoPAlmeida commented Mar 14, 2021

@typesanitizer typesanitizer commented Mar 15, 2021

@swift-ci create

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022