
# SIMD examples


## Explicit SIMD vectorization

In [2]:
using SIMD
# Number of elements of type T that fit in a 256-bit (32-byte) register.
Base.@pure simdwidth(::Type{T}) where {T} = Int(256/8/sizeof(T))

println("simdwidth Float16: ", simdwidth(Float16))
println("simdwidth Float32: ", simdwidth(Float32))
println("simdwidth Float64: ", simdwidth(Float64))

println("\nsimdwidth Int16: ", simdwidth(Int16))
println("simdwidth Int32: ", simdwidth(Int32))
println("simdwidth Int64: ", simdwidth(Int64))

simdwidth Float16: 16
simdwidth Float32: 8
simdwidth Float64: 4

simdwidth Int16: 16
simdwidth Int32: 8
simdwidth Int64: 4


In [3]:
v = Vec{8,Float32}((1,2,3,4,5,6,7,8))

<8 x Float32>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

In [4]:
# 8 × Float64 is 512 bits, wider than the 256-bit registers assumed above. Will this return an error on this machine?
v = Vec{8,Float64}((1,2,3,4,5,6,7,8))

<8 x Float64>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

In [5]:
v = Vec{8,Float32}((1,2,3,4,5,6,7,8))

<8 x Float32>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

In [6]:
v_tuple = NTuple{8,Float32}(v)

(1.0f0, 2.0f0, 3.0f0, 4.0f0, 5.0f0, 6.0f0, 7.0f0, 8.0f0)

In [7]:
println(typeof(v_tuple))
println(typeof(v))

NTuple{8, Float32}
Vec{8, Float32}


#### Operations on SIMD.Vec types


SIMD.Vec types can contain elements from the following collection:
```
Bool Int{8,16,32,64,128} UInt{8,16,32,64,128} Float{16,32,64}
```

The following vector operations can be used:

```
+ - * / % ^ ! ~ & | ⊻ << >> >>> == != < <= > >=
```
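As a small illustration of the operators listed above, the sketch below applies a few of them to two 4-lane vectors. The variable names are chosen only for this example, and reading lanes back with `v[i]` assumes that `Vec` supports ordinary indexing.

```julia
using SIMD

a = Vec{4,Float32}((1, 2, 3, 4))
b = Vec{4,Float32}((4, 3, 2, 1))

s = a + b      # lane-wise sum
p = a * b      # lane-wise product
m = a < b      # lane-wise comparison, gives a Vec{4,Bool}

# Read the individual lanes back into ordinary arrays for inspection.
s_lanes = [s[i] for i in 1:4]
p_lanes = [p[i] for i in 1:4]
m_lanes = [m[i] for i in 1:4]
```

Each operator acts on all lanes at once, so `s`, `p` and `m` have the same number of lanes as their inputs.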


In [8]:
using BenchmarkTools

In [9]:
v1 = Vec{8,Float32}((1,2,3,4,5,6,7,8))
v2 = Vec{8,Float32}((1,2,3,4,5,6,7,8))

<8 x Float32>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

In [10]:
x1 = Array{Float32}([1,2,3,4,5,6,7,8])
x2 = Array{Float32}([1,2,3,4,5,6,7,8]);

In [11]:
@benchmark aux = v1 + v2

BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  20.266 ns …  2.256 μs  ┊ GC (min … max): 0.00% … 96.98%
 Time  (median):     21.652 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.759 ns ± 57.761 ns  ┊ GC (mean ± σ):  6.79% ±  2.90%

In [12]:
@benchmark aux = x1 + x2

BenchmarkTools.Trial: 10000 samples with 986 evaluations.
 Range (min … max):  51.885 ns …  1.382 μs  ┊ GC (min … max): 0.00% … 90.60%
 Time  (median):     54.210 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.950 ns ± 50.392 ns  ┊ GC (mean ± σ):  3.31% ±  4.06%

In [13]:

function make_n_sums_simd(v1,v2,n)
    n_v1 = length(v1)
    aux = Vec{n_v1,eltype(v1)}(tuple(zeros(n_v1)...))
    for i in 1:n
        aux += v1 + v2
    end
    return aux
end

function make_n_sums(v1,v2,n)
    aux = zeros(eltype(v1),length(v1))
    
    for i in 1:n
        aux += v1 + v2
    end
    return aux
end

make_n_sums (generic function with 1 method)
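Note that the array version `make_n_sums` allocates new arrays on every iteration, because `aux += v1 + v2` first builds `v1 + v2` and then a new `aux`. A minimal sketch of an in-place variant using fused broadcasting is shown below. The name `make_n_sums_inplace` is hypothetical and not part of the notebook, and it was not benchmarked here.

```julia
# Hypothetical variant (not in the notebook): same result as make_n_sums,
# but the accumulator is updated in place with fused broadcasting.
function make_n_sums_inplace(v1, v2, n)
    aux = zeros(eltype(v1), length(v1))
    for i in 1:n
        aux .+= v1 .+ v2   # one fused loop, no temporary arrays
    end
    return aux
end

x1 = Float32[1, 2, 3, 4, 5, 6, 7, 8]
x2 = Float32[1, 2, 3, 4, 5, 6, 7, 8]
result = make_n_sums_inplace(x1, x2, 100)
```

This separates the cost of memory allocation from the cost of the arithmetic itself when comparing against the `Vec` version.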

In [14]:
 make_n_sums(x1,x2,100)

8-element Vector{Float32}:
  200.0
  400.0
  600.0
  800.0
 1000.0
 1200.0
 1400.0
 1600.0

In [15]:
make_n_sums_simd(v1,v2,100)

<8 x Float32>[200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, 1600.0]

In [18]:
@benchmark make_n_sums_simd($v1,$v2,1000)

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.755 μs … 327.097 μs  ┊ GC (min … max): 0.00% … 98.16%
 Time  (median):     1.815 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.988 μs ±   3.299 μs  ┊ GC (mean ± σ):  1.62% ±  0.98%

In [19]:
@benchmark make_n_sums($x1,$x2,1000)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.617 μs …  1.507 ms  ┊ GC (min … max): 0.00% … 91.04%
 Time  (median):     75.562 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.296 μs ± 69.757 μs  ┊ GC (mean ± σ):  4.70% ±  5.72%

In [20]:
v1 = Vec{16,Float32}((1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
v2 = Vec{16,Float32}((1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))

<16 x Float32>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

In [21]:
x1 = Array{Float32}([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
x2 = Array{Float32}([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]);

In [22]:
make_n_sums(x1,x2,10)

16-element Vector{Float32}:
  20.0
  40.0
  60.0
  80.0
 100.0
 120.0
 140.0
 160.0
 180.0
 200.0
 220.0
 240.0
 260.0
 280.0
 300.0
 320.0

In [28]:
make_n_sums_simd(v1, v2, 10)

<16 x Float32>[20.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0, 260.0, 280.0, 300.0, 320.0]

In [29]:
@benchmark make_n_sums_simd($v1, $v2, 1000)

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  2.032 μs …  11.462 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.097 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.277 μs ± 671.205 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [30]:
@benchmark make_n_sums($x1, $x2, 1000)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.803 μs …  1.245 ms  ┊ GC (min … max): 0.00% … 91.11%
 Time  (median):     85.136 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.285 μs ± 67.983 μs  ┊ GC (mean ± σ):  4.73% ±  6.46%


#### Reduction operations

The following reduction operations can be used

```
all any maximum minimum sum prod
```



In [33]:
v = Vec{8,Float32}((1,2,3,4,5,6,7,8))

<8 x Float32>[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

In [34]:
#println("all(v): ", all(v))
println("sum(v): ",     sum(v))
println("maximum(v): ", maximum(v))
println("minimum(v): ", minimum(v))
println("prod(v): ",    prod(v))

sum(v): 36.0
maximum(v): 8.0
minimum(v): 1.0
prod(v): 40320.0
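The `all` and `any` reductions operate on Boolean vectors, which is presumably why `all(v)` is commented out in the cell above for a `Float32` vector. A minimal sketch follows, assuming that comparing two `Vec` values yields a `Vec{8,Bool}`. The variable names are illustrative only.

```julia
using SIMD

v = Vec{8,Float32}((1, 2, 3, 4, 5, 6, 7, 8))

positive = v > Vec{8,Float32}(0f0)   # lane-wise comparison against 0
large    = v > Vec{8,Float32}(7f0)   # lane-wise comparison against 7

all_positive = all(positive)   # true only if every lane is > 0
any_large    = any(large)      # true if at least one lane is > 7
all_large    = all(large)      # true only if every lane is > 7
```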


#### Accessing arrays: reading from and writing to Julia arrays
When using explicit SIMD vectorization, it is convenient to allocate arrays still as arrays of scalars, not as arrays of vectors. The vload and vstore functions allow reading vectors from and writing vectors into arrays, accessing several contiguous array elements.

In [35]:
arr = Vector{Float64}(100:200);

In [36]:
# The vload call reads a vector of size 4 from the array, i.e. it reads arr[i:i+3] (here i = 1)
xs = vload(Vec{4,Float64}, arr, 1)

<4 x Float64>[100.0, 101.0, 102.0, 103.0]

In [37]:
xs = 2*xs
# Similarly, the vstore call writes the vector xs to the four array elements arr[i:i+3] (here i = 1).
vstore(xs, arr, 1)

101-element Vector{Float64}:
 200.0
 202.0
 204.0
 206.0
 104.0
 105.0
 106.0
 107.0
 108.0
 109.0
 110.0
 111.0
 112.0
   ⋮
 189.0
 190.0
 191.0
 192.0
 193.0
 194.0
 195.0
 196.0
 197.0
 198.0
 199.0
 200.0
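The array above has 101 elements, which is not a multiple of the vector width 4, so a loop that uses only `vload` and `vstore` would either skip the last element or read past the end. One common way to handle this, sketched below, is to process full vectors first and finish the remainder with scalar code. The function name `scale!` is hypothetical and introduced only for this example.

```julia
using SIMD

# Hypothetical helper (not from the notebook): multiply every element of
# `arr` by `factor` in place, N elements at a time, with a scalar tail loop.
function scale!(arr::Vector{T}, factor::T, ::Type{Vec{N,T}}) where {N,T}
    n = length(arr)
    nvec = n - n % N                  # last index covered by full vectors
    @inbounds for i in 1:N:nvec
        xs = vload(Vec{N,T}, arr, i)  # reads arr[i:i+N-1]
        vstore(xs * factor, arr, i)   # writes arr[i:i+N-1]
    end
    @inbounds for i in nvec+1:n       # remaining elements, one at a time
        arr[i] *= factor
    end
    return arr
end

arr = Vector{Float64}(100:200)        # 101 elements
scale!(arr, 2.0, Vec{4,Float64})
```

The vector type is passed as an argument so that the width `N` is known at compile time.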

#### Making some easy functions

In [39]:
x1 = rand(Float32, 512)
x2 = rand(Float32, 512)
y = similar(x1)

function add!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = x1[i] + x2[i] 
    end
end

function simd_add!(y, x1,x2)
    @simd for i=1:length(x1)
        @inbounds y[i] = x1[i] + x2[i] 
    end
end

function simd_add_no_inbounds!(y, x1,x2)
    @simd for i=1:length(x1)
        y[i] = x1[i] + x2[i] 
    end
end


simd_add_no_inbounds! (generic function with 1 method)

In [40]:

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    # Assumes length(xs) is a multiple of N.
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv += yv 
        vstore(xv, y, i)
    end
end
# Default to 8 lanes when no vector type is given.
vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}) where {T} = vadd!(y, xs, ys, Vec{8,T})


function euclid!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = sqrt(x1[i] * x1[i] + x2[i] * x2[i])
    end
end

function veuclid!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    # Assumes length(xs) is a multiple of N.
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv = sqrt(xv*xv + yv*yv)
        vstore(xv, y, i)
    end
end
# Default to 8 lanes when no vector type is given.
veuclid!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}) where {T} = veuclid!(y, xs, ys, Vec{8,T})

In [41]:
@benchmark euclid!(y,x1,x2)

In [87]:
 @benchmark veuclid!(y,x1,x2)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     335.747 ns (0.00% GC)
  median time:      361.215 ns (0.00% GC)
  mean time:        393.591 ns (0.00% GC)
  maximum time:     1.545 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     221

In [88]:
 @benchmark add!(y,x1,x2)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     65.874 ns (0.00% GC)
  median time:      68.821 ns (0.00% GC)
  mean time:        83.308 ns (0.00% GC)
  maximum time:     196.249 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     976

In [89]:
@benchmark vadd!(y,x1,x2)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     146.237 ns (0.00% GC)
  median time:      148.293 ns (0.00% GC)
  mean time:        186.955 ns (0.00% GC)
  maximum time:     447.124 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     839

In [44]:
@benchmark simd_add!($y,$x1,$x2)

BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  23.123 ns … 139.196 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.304 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.856 ns ±   3.182 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [102]:
@benchmark simd_add_no_inbounds!(y,x1,x2)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     539.552 ns (0.00% GC)
  median time:      545.930 ns (0.00% GC)
  mean time:        611.648 ns (0.00% GC)
  maximum time:     1.293 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     194

### Median pooling

In [46]:
#https://discourse.julialang.org/t/make-this-code-fast-median-pooling/6405

In [8]:
@inline function median5_swap(a,b,c,d,e)
    # https://github.com/JeffreySarnoff/SortingNetworks.jl/blob/master/src/swapsort.jl
    a,b = minmax(a,b)
    c,d = minmax(c,d)
    a,c = minmax(a,c)
    b,d = minmax(b,d)
    c,e = minmax(e,c)
    max(c, min(e,b))
end

@inline median5(args...) = median5_swap(args...)

function medmedpool55!(out::AbstractMatrix, img::AbstractMatrix)
    @assert size(out, 1) >= size(img, 1) ÷ 5
    @assert size(out, 2) >= size(img, 2) ÷ 5
    @inbounds for j ∈ axes(out, 2)
        @simd for i ∈ axes(out, 1)
            x11 = img[5i-4, 5j-4]
            x21 = img[5i-3, 5j-4]
            x31 = img[5i-2, 5j-4]
            x41 = img[5i-1, 5j-4]
            x51 = img[5i-0, 5j-4]
            
            x12 = img[5i-4, 5j-3]
            x22 = img[5i-3, 5j-3]
            x32 = img[5i-2, 5j-3]
            x42 = img[5i-1, 5j-3]
            x52 = img[5i-0, 5j-3]
            
            x13 = img[5i-4, 5j-2]
            x23 = img[5i-3, 5j-2]
            x33 = img[5i-2, 5j-2]
            x43 = img[5i-1, 5j-2]
            x53 = img[5i-0, 5j-2]
            
            x14 = img[5i-4, 5j-1]
            x24 = img[5i-3, 5j-1]
            x34 = img[5i-2, 5j-1]
            x44 = img[5i-1, 5j-1]
            x54 = img[5i-0, 5j-1]
            
            x15 = img[5i-4, 5j-0]
            x25 = img[5i-3, 5j-0]
            x35 = img[5i-2, 5j-0]
            x45 = img[5i-1, 5j-0]
            x55 = img[5i-0, 5j-0]
            
            y1 = median5(x11,x12,x13,x14,x15)
            y2 = median5(x21,x22,x23,x24,x25)
            y3 = median5(x31,x32,x33,x34,x35)
            y4 = median5(x41,x42,x43,x44,x45)
            y5 = median5(x51,x52,x53,x54,x55)
            
            z = median5(y1,y2,y3,y4,y5)
            out[i,j] = z
        end
    end
    out
end

medmedpool55! (generic function with 1 method)
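The sorting-network median used above can be sanity-checked against a sort-based median on random inputs. The sketch below uses only Base; `median5_reference` and `agree` are hypothetical names introduced for this check, and inputs containing `NaN` are not considered.

```julia
# Sorting-network median of five, copied from the cell above.
@inline function median5_swap(a,b,c,d,e)
    a,b = minmax(a,b)
    c,d = minmax(c,d)
    a,c = minmax(a,c)
    b,d = minmax(b,d)
    c,e = minmax(e,c)
    max(c, min(e,b))
end

# Hypothetical reference implementation: sort and take the middle element.
median5_reference(args...) = sort(collect(args))[3]

# Compare both on random inputs; `agree` is true if they always match.
agree = all(1:10_000) do _
    xs = rand(Float32, 5)
    median5_swap(xs...) == median5_reference(xs...)
end
```

After the first four `minmax` steps, `a` holds the smallest and `d` the largest of the first four values, so the median of all five is the median of `b`, `c` and `e`, which the last two lines compute.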

In [5]:
using BenchmarkTools
imgs = randn(Float32, 1024,1024, 10)
img = view(imgs, :,:,1)
out = similar(img, size(img) .÷ 5)
@benchmark medmedpool55!(out, img)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.746 ms (0.00% GC)
  median time:      10.786 ms (0.00% GC)
  mean time:        11.430 ms (0.00% GC)
  maximum time:     47.553 ms (0.00% GC)
  --------------
  samples:          437
  evals/sample:     1

In [13]:
size(imgs),size([rand(T,N) for _ in 1:6])

((1024, 1024, 10), (6,))

In [None]:
Base.@pure simdwidth(::Type{T}) where {T} = Int(256/8/sizeof(T))

@inline function median3(a,b,c)
    max(min(a,b), min(c,max(a,b)))
end

@inline function median5(a,b,c,d,e)
    # https://stackoverflow.com/questions/480960/code-to-calculate-median-of-five-in-c-sharp
    f=max(min(a,b),min(c,d))
    g=min(max(a,b),max(c,d))
    median3(e,f,g)
end

@noinline function median5_vectors!(out, a,b,c,d,e)
    K = simdwidth(eltype(out))
    N = length(out)
    T = eltype(out)
    V = Vec{K,T}
    @assert mod(N,K) == 0

    @inbounds for i in 1:K:N
        va = vload(V,a, i)
        vb = vload(V,b, i)
        vc = vload(V,c, i)
        vd = vload(V,d, i)
        ve = vload(V,e, i)
        vo = median5(va,vb,vc,vd,ve)
        vstore(vo,out, i)
    end
    out
end

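using SIMD
using BenchmarkTools
T = Float32   # element type; T = UInt8 is another option to try
N = 10^6
N = N ÷ simdwidth(T) * simdwidth(T)   # round N down to a multiple of the SIMD width
out, a,b,c,d,e = [rand(T,N) for _ in 1:6]
@benchmark median5_vectors!($out, $a,$b,$c,$d,$e)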


In [47]:
simdwidth(Float16)

16

### Test views

In [304]:
X = ones(10,5)
x = Array(1:10);
X2 = x.+X

10×5 Array{Float64,2}:
  2.0   2.0   2.0   2.0   2.0
  3.0   3.0   3.0   3.0   3.0
  4.0   4.0   4.0   4.0   4.0
  5.0   5.0   5.0   5.0   5.0
  6.0   6.0   6.0   6.0   6.0
  7.0   7.0   7.0   7.0   7.0
  8.0   8.0   8.0   8.0   8.0
  9.0   9.0   9.0   9.0   9.0
 10.0  10.0  10.0  10.0  10.0
 11.0  11.0  11.0  11.0  11.0

In [305]:
v = view(X2,1,:)
for i in 1:size(X2,1)
    v = view(X2,i,:)
    X2[i,:] += v
end
X2

10×5 Array{Float64,2}:
  4.0   4.0   4.0   4.0   4.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0

In [302]:
X = ones(10,5)
x = Array(1:10);
X2 = x.+X

10×5 Array{Float64,2}:
  2.0   2.0   2.0   2.0   2.0
  3.0   3.0   3.0   3.0   3.0
  4.0   4.0   4.0   4.0   4.0
  5.0   5.0   5.0   5.0   5.0
  6.0   6.0   6.0   6.0   6.0
  7.0   7.0   7.0   7.0   7.0
  8.0   8.0   8.0   8.0   8.0
  9.0   9.0   9.0   9.0   9.0
 10.0  10.0  10.0  10.0  10.0
 11.0  11.0  11.0  11.0  11.0

In [303]:
v = view(X2,1,:)
for i in 1:size(X2,1)
    v .= view(X2,i,:)
    X2[i,:] += v
end

X2

10×5 Array{Float64,2}:
 11.0  11.0  11.0  11.0  11.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0
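The first row ends up as 11.0 in the result above because `v` is a view of row 1, so `v .= view(X2,i,:)` writes row `i` into row 1 of `X2` on every iteration. A minimal sketch of the same effect on a small matrix is shown below, together with an independent buffer made with `copy`. The names are illustrative only.

```julia
X = [1.0 1.0; 2.0 2.0; 3.0 3.0]

v = view(X, 1, :)         # v aliases row 1 of X
v .= view(X, 3, :)        # writes row 3 into row 1 through the view
row1 = X[1, :]            # row 1 of X has been overwritten

w = copy(view(X, 2, :))   # independent buffer with the contents of row 2
w .= 0.0                  # modifies only the buffer
row2 = X[2, :]            # row 2 of X is unchanged
```

Using a separate buffer for `v` avoids the overwrite, because writes to the buffer no longer reach `X2`.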

In [325]:
X = ones(10,5)
x = Array(1:10);
X2 = x.+X

v = deepcopy(view(X2,1,:))
for i in 1:size(X2,1)
    v .= view(X2,i,:)
    X2[i,:] += v
end
X2

10×5 Array{Float64,2}:
  4.0   4.0   4.0   4.0   4.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0

In [324]:
X = ones(10,5)
x = Array(1:10);
X2 = x.+X

v = zero(X2[1,:])
for i in 1:size(X2,1)
    v .= view(X2,i,:)
    X2[i,:] += v
end
X2

10×5 Array{Float64,2}:
  4.0   4.0   4.0   4.0   4.0
  6.0   6.0   6.0   6.0   6.0
  8.0   8.0   8.0   8.0   8.0
 10.0  10.0  10.0  10.0  10.0
 12.0  12.0  12.0  12.0  12.0
 14.0  14.0  14.0  14.0  14.0
 16.0  16.0  16.0  16.0  16.0
 18.0  18.0  18.0  18.0  18.0
 20.0  20.0  20.0  20.0  20.0
 22.0  22.0  22.0  22.0  22.0

In [315]:
@time v = deepcopy(view(X2,1,:))

  0.000056 seconds (37 allocations: 1.813 KiB)


5-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true}:
 4.0
 4.0
 4.0
 4.0
 4.0

In [313]:
@time auxiliar = view(X2,1,:)

  0.000043 seconds (21 allocations: 512 bytes)


5-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true}:
 4.0
 4.0
 4.0
 4.0
 4.0