# Exercise 9.0: "Inspect your HW capabilities"
Run the script "introspection.py" and investigate the following:

- Which platforms, devices, compute units are available on your computer?
- How much memory is available?

In [None]:
%run ..\introspection.py

# Exercise 9.1: "Chained Vector Addition"

1. Extend the vector addition program in **template.py** to do a chained addition: $C = A+B$ followed by $D = C+E$. Let E be a random vector like A and B.

In [5]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# A short template to test small kernels.
# 

import numpy as np
import pyopencl as cl

VEC_SIZE = 50000

# Create the context (containing platform and device information) and command queue.
context = cl.create_some_context()
cmd_queue = cl.CommandQueue(context)

# Create the host side data and empty arrays to hold the results.
a_host = np.random.rand(VEC_SIZE).astype(np.float32)
b_host = np.random.rand(VEC_SIZE).astype(np.float32)
e_host = np.random.rand(VEC_SIZE).astype(np.float32)
c_host = np.empty_like(a_host)
d_host = np.empty_like(a_host)

# Create a device side read-only memory buffer and copy the data from "hostbuf" into it.
# Create as many 
# You can find the other possible mem_flags values at
# https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateBuffer.html
mf = cl.mem_flags
a_device = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_host)
b_device = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_host)
e_device = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=e_host)
# c_device is written and then read again inside the kernel, so it must be READ_WRITE.
c_device = cl.Buffer(context, mf.READ_WRITE, a_host.nbytes)
d_device = cl.Buffer(context, mf.WRITE_ONLY, a_host.nbytes)

# Source of the kernel itself.
kernel_source = """
__kernel void sum(
    __global const float *a_device, 
    __global const float *b_device, 
    __global const float *e_device,
    __global       float *c_device,
    __global       float *d_device)
{
  int gid = get_global_id(0);
  c_device[gid] = a_device[gid] + b_device[gid];
  d_device[gid] = c_device[gid] + e_device[gid];
}
"""

# If you want to keep the kernel in a separate file uncomment this line and adjust the filename
#kernel_source = open("kernel.cl").read()

# Create a new program from the kernel and build the source.
prog = cl.Program(context, kernel_source).build()

# Execute the "sum" kernel in the program. Parameters are:
# 
#        Command queue         Work group size   Kernel param 1
#            ↓   Global grid size   ↓   Kernel param 0  ↓  Kernel param 2
#            ↓           ↓          ↓       ↓           ↓        ↓
prog.sum(cmd_queue, a_host.shape, None, a_device, b_device, e_device, c_device, d_device)

# Copy the result back from device to host.
cl.enqueue_copy(cmd_queue, c_host, c_device)
cl.enqueue_copy(cmd_queue, d_host, d_device)

# Check the results in the host array with Numpy.
print("1. All elements close?", np.allclose(c_host, (a_host + b_host)))
print("2. All elements close?", np.allclose(d_host, (c_host + e_host)))

1. All elements close? True
2. All elements close? True


# Exercise 9.2: "Matrix multiplication - part 1 (continued in next lecture)"
Consider the matrix multiplication
$$
C=AB
$$

where $A$ is an $n \times m$ matrix and $B$ is an $m \times p$ matrix. The resulting matrix $C$ is $n \times p$.

1. How many operations are involved in the multiplication?

### Answer
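Each of the $n \times p$ elements of $C$ is a dot product of length $m$, which takes $m$ multiplications and $m-1$ additions. In total that is $n \cdot m \cdot p$ multiplications and $n \cdot p \cdot (m-1)$ additions, i.e. $n \cdot p \cdot (2m-1)$ floating point operations.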
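
As a quick sanity check on this kind of count, the sketch below tallies the multiplications and additions needed for every element of $C$ and compares the totals with the closed-form expressions $n \cdot m \cdot p$ and $n \cdot p \cdot (m-1)$. The function names are made up for this illustration, and the loop only counts operations; it does not multiply any actual matrices.

```python
def count_ops_closed_form(n, m, p):
    """Return (multiplications, additions) for an (n x m) times (m x p) product."""
    return n * m * p, n * p * (m - 1)


def count_ops_by_loop(n, m, p):
    """Count the same operations by visiting every element of C."""
    mults = 0
    adds = 0
    for i in range(n):
        for j in range(p):
            # One element of C is a dot product of length m:
            # m multiplications ...
            mults += m
            # ... and m - 1 additions to sum the m products.
            adds += m - 1
    return mults, adds


print(count_ops_closed_form(2, 3, 4))
print(count_ops_by_loop(2, 3, 4))
```

Both calls should report the same pair of numbers for any $m \geq 1$.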

2. Assume that all three matrices are of the data type float (IEEE754, aka Binary32, 4 bytes floating point). How much storage is needed to perform the operation?

### Answer
To perform the operation, the matrices A, B and C all have to be stored. Each element takes 4 bytes, which gives:
$$
4(n \cdot m + m \cdot p + n \cdot p) \text{ bytes}
$$
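
To get a feeling for the numbers, the storage formula can be evaluated for concrete sizes. The helper below is a hypothetical name introduced only for this example; `itemsize=4` assumes 4-byte floats as stated in the question.

```python
def matmul_storage_bytes(n, m, p, itemsize=4):
    """Bytes needed to hold A (n x m), B (m x p) and C (n x p)."""
    return itemsize * (n * m + m * p + n * p)


# Square 10000 x 10000 matrices of 4-byte floats.
total = matmul_storage_bytes(10000, 10000, 10000)
print(total, "bytes =", total / 1e9, "GB")
```

This counts only the raw element data, not any bookkeeping overhead of the container that holds it.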

3. Create a naive implementation of the matrix multiplication in Python/NumPy. Check your results by comparing to NumPy's built-in matrix multiplication (`A@B`). Does the memory use of your implementation match the requirement calculated above? (Check in the variable explorer in Spyder.)


In [23]:
### Naive Python Version
import numpy as np
from sys import getsizeof

N = 10000
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)
# Allocate C directly as float32 to avoid a temporary float64 array.
C = np.zeros((N, N), dtype=np.float32)

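# Naive triple loop: C[i, j] = sum over k of A[i, k] * B[k, j].
# This is O(N^3) in pure Python; reduce N (e.g. to 100) to run it in reasonable time.
for i in range(N):
    for j in range(N):
        acc = 0.0
        for k in range(N):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc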

print("Size of A: {} Bytes".format(getsizeof(A)))
print("Size of B: {} Bytes".format(getsizeof(B)))
print("Size of C: {} Bytes".format(getsizeof(C)))
print("Total size: {} Bytes".format(getsizeof(A) + getsizeof(B) + getsizeof(C)))

Size of A: 400000128 Bytes
Size of B: 400000128 Bytes
Size of C: 400000128 Bytes
Total size: 1200000384 Bytes


In [22]:
### Using Numpy's built-in matrix multiplication (A@B)
import numpy as np
from sys import getsizeof

N = 10000
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)

C = A@B

print("Size of A: {} Bytes".format(getsizeof(A)))
print("Size of B: {} Bytes".format(getsizeof(B)))
print("Size of C: {} Bytes".format(getsizeof(C)))
print("Total size: {} Bytes".format(getsizeof(A) + getsizeof(B) + getsizeof(C)))

Size of A: 400000128 Bytes
Size of B: 400000128 Bytes
Size of C: 400000128 Bytes
Total size: 1200000384 Bytes


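Memory use is the same for both methods, and it matches the formula above: for $n = m = p = 10000$ the formula gives $4 \cdot 3 \cdot 10^8 = 1\,200\,000\,000$ bytes. `getsizeof` reports 128 bytes more per array (384 bytes in total), which is the overhead of the NumPy array object itself.
Execution time decreases drastically for sufficiently large N when using the `@` operator.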

4. Consider the dependencies in the operation. Which rows and columns of A and B are required to compute $C_{i,j}$? Can you use this information to compute multiple elements of C in parallel?
    - Consider how the operation can be performed if there is not enough memory in the system to contain all three matrices.

### Answer
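For computing $C_{i,j}$, row $i$ of A is needed as well as column $j$ of B. No element of C depends on any other element of C, so all elements of C can be computed in parallel.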
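
The independence of the elements can be illustrated with a small sketch that hands every $(i, j)$ pair to a thread pool as its own task. This uses only the Python standard library and plain nested lists; the function names are invented for the example. Threads in CPython will not make this faster; the point is only that the tasks share no results, which is what would allow each $(i, j)$ to map to one work-item on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def element(A, B, i, j):
    """C[i][j] needs only row i of A and column j of B."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))


def matmul_parallel(A, B, workers=4):
    n, p = len(A), len(B[0])
    indices = list(product(range(n), range(p)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every (i, j) task is independent, so the execution order does not matter.
        values = list(pool.map(lambda ij: element(A, B, *ij), indices))
    # pool.map returns results in input order, so rows can be rebuilt by slicing.
    return [values[r * p:(r + 1) * p] for r in range(n)]


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_parallel(A, B))
```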

If there is not enough memory in the system to hold all three matrices, one could load each row of A and each column of B only when it is needed (for example from disk) and overwrite it when done. The memory requirement is then the size of C plus one row of A and one column of B.
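
A minimal sketch of this idea, assuming the rows of A and the columns of B can be fetched one at a time through two loader callbacks. The callbacks and the in-memory "disk" lists are stand-ins made up for the example; in a real setting they would read from a file.

```python
def matmul_streaming(load_row_of_a, load_col_of_b, n, p):
    """Compute C = A B while holding only one row of A and one column of B."""
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row = load_row_of_a(i)      # replaces the previously loaded row
        for j in range(p):
            col = load_col_of_b(j)  # replaces the previously loaded column
            C[i][j] = sum(a * b for a, b in zip(row, col))
    return C


# Stand-ins for slow storage.
A_on_disk = [[1.0, 2.0], [3.0, 4.0]]
B_on_disk = [[5.0, 6.0], [7.0, 8.0]]

C = matmul_streaming(
    lambda i: A_on_disk[i],
    lambda j: [row[j] for row in B_on_disk],
    n=2,
    p=2,
)
print(C)
```

The trade-off is extra loading: every column of B is fetched once per row of A, so memory is saved at the cost of more I/O.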

5. Create a simple GPU kernel that multiplies two vectors element-wise. Afterwards, create a kernel that calculates the dot product of two vectors. You can use the attached 'template.py' as a starting point.


In [38]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# A short template to test small kernels.
# 

import numpy as np
import pyopencl as cl

VEC_SIZE = 50000

# Create the context (containing platform and device information) and command queue.
context = cl.create_some_context()
cmd_queue = cl.CommandQueue(context)

# Create the host side data and empty arrays to hold the results.
a_host = np.random.rand(VEC_SIZE).astype(np.float32)
b_host = np.random.rand(VEC_SIZE).astype(np.float32)
mul_host = np.empty_like(a_host)
dot_host = np.empty_like(a_host)

# Create a device side read-only memory buffer and copy the data from "hostbuf" into it.
# Create as many 
# You can find the other possible mem_flags values at
# https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateBuffer.html
mf = cl.mem_flags
a_device = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_host)
b_device = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_host)
mul_device = cl.Buffer(context, mf.WRITE_ONLY, a_host.nbytes)
dot_device = cl.Buffer(context, mf.WRITE_ONLY, a_host.nbytes)

# Source of the kernel itself.
kernel_mul = """
__kernel void mul(
    __global const float *a_device, 
    __global const float *b_device, 
    __global       float *result_device)
{
  int gid = get_global_id(0);
  result_device[gid] = a_device[gid] * b_device[gid];
}
"""

kernel_dot = """
__kernel void dot_calc(
  __global const float *a_device,
  __global const float *b_device,
  __global       float *dot_device)
{
  int gid = get_global_id(0);
  dot_device[gid] = a_device[gid] * b_device[gid];
}
"""

# If you want to keep the kernel in a separate file uncomment this line and adjust the filename
#kernel_source = open("kernel.cl").read()

# Create a new program from the kernel and build the source.
prog_mul = cl.Program(context, kernel_mul).build()
prog_dot = cl.Program(context, kernel_dot).build()

# Execute the "mul" kernel in the program. Parameters are:
# 
#        Command queue         Work group size   Kernel param 1
#            ↓   Global grid size   ↓   Kernel param 0  ↓  Kernel param 2
#            ↓           ↓          ↓       ↓           ↓        ↓
prog_mul.mul(cmd_queue, a_host.shape, None, a_device, b_device, mul_device)
prog_dot.dot_calc(cmd_queue, a_host.shape, None, a_device, b_device, dot_device)

# Copy the result back from device to host.
cl.enqueue_copy(cmd_queue, mul_host, mul_device)
cl.enqueue_copy(cmd_queue, dot_host, dot_device)

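# Check the results in the host arrays with Numpy.
# The "dot_calc" kernel only produces the per-element products, so the final
# summation of the dot product is done here on the host.
dot_result = dot_host.sum(dtype=np.float64)
dot_reference = np.dot(a_host.astype(np.float64), b_host.astype(np.float64))
print("Element-wise product close?", np.allclose(mul_host, (a_host * b_host)))
print("Dot product close?", np.isclose(dot_result, dot_reference))

Element-wise product close? True
Dot product close? True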
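
A dot product is an element-wise multiplication followed by a sum over all products, so a kernel that only multiplies is missing the summation step. On a GPU that sum is commonly done as a tree reduction, where the number of active work-items is halved in every step. The sketch below simulates that pattern in plain Python to show the idea; it is not OpenCL code, the function names are invented for the example, and a real kernel would additionally need local memory and barriers between the steps.

```python
def tree_reduce_sum(values):
    """Sum a list the way a work-group reduction would: halve the active range each step."""
    data = list(values)
    # Pad with zeros to the next power of two so every step pairs elements cleanly.
    size = 1
    while size < len(data):
        size *= 2
    data += [0.0] * (size - len(data))

    stride = size // 2
    while stride > 0:
        # On a GPU, all iterations of this inner loop would run at once,
        # one per work-item, followed by a barrier before the next step.
        for idx in range(stride):
            data[idx] += data[idx + stride]
        stride //= 2
    return data[0]


def dot(a, b):
    products = [x * y for x, y in zip(a, b)]  # the element-wise multiplication step
    return tree_reduce_sum(products)          # the summation step


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [6.0, 7.0, 8.0, 9.0, 10.0]
print(dot(a, b))
```

With $N$ elements the reduction needs about $\log_2 N$ steps instead of $N-1$ sequential additions, which is why it suits parallel hardware.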
