# Vector sum in OpenCL

In [1]:
using OpenCL

srand(123)
a = rand(Float32, 5_000_000)
b = rand(Float32, 5_000_000);

Let us look at the available devices for this host.

In [2]:
cl.devices()

3-element Array{OpenCL.cl.Device,1}:
 OpenCL.Device(Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz on Apple @0x00000000ffffffff)
 OpenCL.Device(HD Graphics 4000 on Apple @0x0000000001024400)                         
 OpenCL.Device(GeForce GT 650M on Apple @0x0000000001022700)                          

To use a device we need to:

- Create a context: the environment that owns the buffers, programs, and kernels for a given device.
- Create a command queue: the channel through which the host submits work to the device.

In [3]:
device = cl.devices()[3]

OpenCL.Device(GeForce GT 650M on Apple @0x0000000001022700)

In [4]:
ctx = cl.Context(device)

OpenCL.Context(@0x00007fa0e5ebb920 on GeForce GT 650M)

In [5]:
queue = cl.CmdQueue(ctx)

OpenCL.CmdQueue(@0x00007fa0eac44770)

#### Sending information to the global memory

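We can copy data into the device's global memory by creating buffers in our context. The two inputs are read-only buffers initialized from the host arrays `a` and `b`; the output is a write-only buffer with the same number of elements.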

In [6]:
a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=b)
c_buff = cl.Buffer(Float32, ctx, :w, length(a))

Buffer{Float32}(@0x00007fa0ead2a8a0)

In [7]:
const sum_kernel = "
    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
";

In [8]:
p = cl.build!(cl.Program(ctx, source=sum_kernel))

OpenCL.Program(@0x00007fa0eabf22b0)
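The kernel source assigns one work item to each element: every work item reads its own index with `get_global_id(0)` and writes a single entry of `c`. As a rough mental model only, the same computation can be sketched in plain Julia as a serial loop. The name `sum_cpu!` and the small test arrays are hypothetical and are not part of OpenCL.jl.

```julia
# Serial stand-in for the OpenCL kernel: one loop iteration per work item.
# On the device these iterations are scheduled in parallel, not in order.
function sum_cpu!(c, a, b)
    for gid in 0:length(c)-1                    # get_global_id(0) is zero-based
        c[gid + 1] = a[gid + 1] + b[gid + 1]    # Julia arrays are one-based
    end
    return c
end

x = Float32[1, 2, 3]
y = Float32[10, 20, 30]
z = sum_cpu!(similar(x), x, y)
```

The loop bound plays the role of the global work size, which is why the kernel itself needs no explicit loop.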

In [9]:
k = cl.Kernel(p, "sum")

OpenCL.Kernel("sum" nargs=3)

In [18]:
?OpenCL.cl.CLObject

No documentation found.

**Summary:**

```
abstract type OpenCL.cl.CLObject <: Any
```

**Subtypes:**

```
OpenCL.cl.CLArray
OpenCL.cl.CLEvent
OpenCL.cl.CLMemObject
OpenCL.cl.CmdQueue
OpenCL.cl.Context
OpenCL.cl.Device
OpenCL.cl.Kernel
OpenCL.cl.Platform
OpenCL.cl.Program
```


In [11]:
queue(k, size(a), nothing, a_buff, b_buff, c_buff)

OpenCL.Event(@0x00007fa0ea9e87e0)

In [12]:
r = cl.read(queue, c_buff);
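Before timing anything, it is worth checking that the array read back from the device matches the CPU result. The helper below is a minimal sketch in plain Julia: `max_abs_diff` is a hypothetical name, and the small arrays are stand-ins for `a`, `b`, and the `r` returned by `cl.read`.

```julia
# Largest elementwise difference between a device result and the CPU sum.
max_abs_diff(r, a, b) = maximum(abs.(r .- (a .+ b)))

a_small = Float32[0.5, 1.5, 2.5]
b_small = Float32[1.0, 2.0, 3.0]
r_small = Float32[1.5, 3.5, 5.5]   # stand-in for what cl.read would return

err = max_abs_diff(r_small, a_small, b_small)
```

In the notebook the same check would be `max_abs_diff(r, a, b)`, which should be zero or very close to it for a correct kernel.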

In [13]:
using BenchmarkTools

In [14]:
@benchmark $a+$b

BenchmarkTools.Trial: 
  memory estimate:  19.07 MiB
  allocs estimate:  2
  --------------
  minimum time:     6.518 ms (0.00% GC)
  median time:      11.316 ms (0.00% GC)
  mean time:        12.263 ms (20.44% GC)
  maximum time:     80.918 ms (90.44% GC)
  --------------
  samples:          407
  evals/sample:     1

In [15]:
@benchmark queue(k, size(a), nothing, a_buff, b_buff, c_buff)

BenchmarkTools.Trial: 
  memory estimate:  992 bytes
  allocs estimate:  32
  --------------
  minimum time:     1.568 ms (0.00% GC)
  median time:      1.793 ms (0.00% GC)
  mean time:        1.832 ms (0.00% GC)
  maximum time:     4.586 ms (0.00% GC)
  --------------
  samples:          2688
  evals/sample:     1

In [16]:
@benchmark queue(k, size(a), nothing)

BenchmarkTools.Trial: 
  memory estimate:  448 bytes
  allocs estimate:  12
  --------------
  minimum time:     1.547 ms (0.00% GC)
  median time:      1.707 ms (0.00% GC)
  mean time:        1.727 ms (0.00% GC)
  maximum time:     2.135 ms (0.00% GC)
  --------------
  samples:          2865
  evals/sample:     1

#### Reading the data takes a lot of time

In [17]:
@benchmark begin
    queue(k, size(a), nothing, a_buff, b_buff, c_buff)
    r = cl.read(queue, c_buff)
end

BenchmarkTools.Trial: 
  memory estimate:  19.07 MiB
  allocs estimate:  37
  --------------
  minimum time:     23.153 ms (0.00% GC)
  median time:      27.725 ms (0.00% GC)
  mean time:        27.815 ms (7.45% GC)
  maximum time:     100.794 ms (75.73% GC)
  --------------
  samples:          180
  evals/sample:     1
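The 19.07 MiB memory estimate in this last benchmark matches the size of one result array, since `cl.read` allocates a fresh host array on every call. The arithmetic can be sketched as follows. The timing figures are the median values reported above, and the subtraction is only a rough estimate of the read cost, assuming the kernel call costs about the same in both benchmarks.

```julia
n = 5_000_000                    # elements in each vector
bytes = n * sizeof(Float32)      # 4 bytes per Float32
mib = bytes / 2^20               # size of one array in MiB

kernel_ms = 1.793                # median of the kernel-only benchmark
kernel_plus_read_ms = 27.725     # median of the kernel + cl.read benchmark
read_ms = kernel_plus_read_ms - kernel_ms   # rough share spent reading back

println("one array: ", bytes, " bytes (", mib, " MiB)")
println("approximate read cost: ", read_ms, " ms")
```

Under that assumption, reading the result back takes much longer than the kernel call itself, and the combined time is also well above the 11.316 ms median of the plain CPU sum `a + b`.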