# Exercise - Hadamard matrix multiplication timing!

In this exercise we are going to use OpenCL event profiling to time the kernel execution for the Hadamard (elementwise) multiplication problem, where the values in matrices **D** and **E** at coordinates (i0,i1) are multiplied together to set the value at coordinates (i0,i1) in matrix **F**.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/elementwise_multiplication.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Elementwise multiplication of matrices D and E to get F.</figcaption>
</figure>

The source code is located in [mat_elementwise.cpp](mat_elementwise.cpp) and the kernel is in [kernels_elementwise.c](kernels_elementwise.c). Matrices **D** and **E** are read in from disk and matrix **F** is produced as the output. Your task is to measure how long it takes to execute the kernel using the OpenCL profiling interface.

## Constructing the inputs and solution

As before, we construct input matrices and call them **D** and **E**.

In [1]:
import numpy as np

from matplotlib import pyplot as plt

%matplotlib widget

# Matrices D, E, F are of size (NROWS_F, NCOLS_F)
NROWS_F = 520
NCOLS_F = 1032

# Data type
dtype = np.float32

# Make up the arrays D and E
D = np.random.random(size = (NROWS_F, NCOLS_F)).astype(dtype)
E = np.random.random(size = (NROWS_F, NCOLS_F)).astype(dtype)

# Make up the answer using Hadamard multiplication
F = D*E

# Write out the arrays as binary files
D.tofile("array_D.dat")
E.tofile("array_E.dat")

## Run the code

In [2]:
!make; ./mat_elementwise.exe

make: Nothing to be done for 'all'.
	               name: NVIDIA GeForce RTX 3060 
	 global memory size: 12635 MB
	    max buffer size: 3158 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024


As you can see, the program does not yet report how long the kernel ran for.

## Check the output

In [3]:
# Import axes machinery
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Read in the output from OpenCL
F_ocl = np.fromfile("array_F.dat", dtype=dtype).reshape((NROWS_F, NCOLS_F))

# Make plots
fig, axes = plt.subplots(3, 1, figsize=(6,8), sharex=True, sharey=True)

# Data to plot
data = [F, F_ocl, np.abs(F-F_ocl)]

# Labels to plot
labels = ["Numpy", "OpenCL", "Absolute residual"]

for n, value in enumerate(data):
    # Plot the graph
    ax = axes[n]
    im = ax.imshow(value)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.1)

    # Set labels on things
    ax.set_xlabel("Dimension 1 (columns)")
    ax.set_ylabel("Dimension 0 (rows)")
    ax.set_title(labels[n])

    # Put a color bar on the plot
    plt.colorbar(mappable=im, cax=cax)

fig.tight_layout()
plt.show()

## Tasks

Your task is to time the kernel execution using OpenCL events and the command queue's profiling functionality.

* Modify the options to the helper function **h_create_command_queues** to enable profiling.
* Call the helper function **h_get_event_time_ms** to print out the kernel execution time (in milliseconds).
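
For orientation, OpenCL exposes event timings through `clGetEventProfilingInfo`, which returns the `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` counters in nanoseconds. These are only available when the command queue was created with `CL_QUEUE_PROFILING_ENABLE`. The Python sketch below illustrates only the arithmetic of turning such a pair of timestamps into milliseconds. The function name `elapsed_ms` and the sample timestamps are made up for illustration; they are not part of the course helper library, whose internals are not shown here.

```python
def elapsed_ms(start_ns: int, end_ns: int) -> float:
    """Convert a pair of nanosecond timestamps to elapsed milliseconds."""
    if end_ns < start_ns:
        raise ValueError("end timestamp precedes start timestamp")
    # 1 ms = 1,000,000 ns
    return (end_ns - start_ns) * 1.0e-6


# Hypothetical timestamps (not measured), 37 microseconds apart
start_ns = 1_000_000_000
end_ns = start_ns + 37_000

print(f'Time for event "Kernel execution": {elapsed_ms(start_ns, end_ns):.3f} ms')
```

In the C++ source the two timestamps would come from the event attached to the kernel launch, and the event has to be complete before they are queried.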

### Bonus task

* Use the helper function **h_get_event_time_ms** to measure the time and IO rate of the uploads and downloads to the compute device.
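
An IO rate is the number of bytes moved divided by the elapsed time of the transfer. As a rough sketch of that arithmetic in Python, assuming 1 MB = 10^6 bytes (the exercise code may use a different convention) and using a made-up transfer time, with `io_rate_mb_per_s` being a hypothetical name rather than a course helper:

```python
def io_rate_mb_per_s(nbytes: int, time_ms: float) -> float:
    """Transfer rate in MB/s, assuming 1 MB = 10**6 bytes."""
    if time_ms <= 0.0:
        raise ValueError("time_ms must be positive")
    megabytes = nbytes / 1.0e6
    seconds = time_ms * 1.0e-3
    return megabytes / seconds


# One float32 matrix of shape (520, 1032), as used in this exercise;
# each float32 element occupies 4 bytes
nbytes = 520 * 1032 * 4

# Hypothetical transfer time in milliseconds (not a measurement)
time_ms = 0.2

rate = io_rate_mb_per_s(nbytes, time_ms)
print(f"{nbytes} bytes in {time_ms} ms: {rate:.2f} MB/s")
```

The same calculation applies to both uploads and downloads; only the event being timed changes.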

### Answers

You can of course always look at the answer in [mat_elementwise_answers.cpp](mat_elementwise_answers.cpp) and run it below.

In [6]:
!make; ./mat_elementwise_answers.exe

make: Nothing to be done for 'all'.
	               name: NVIDIA GeForce RTX 3060 
	 global memory size: 12635 MB
	    max buffer size: 3158 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
Time for event "Uploading Buffer D": 0.212 ms (10128.34 MB/s)
Time for event "Uploading Buffer E": 0.210 ms (10242.79 MB/s)
Time for event "Kernel execution": 0.037 ms
Time for event "Downloading Buffer F": 0.210 ms (10242.79 MB/s)
