# Performance 

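* Mapping a buffer into host memory can give faster host-device copies than explicit read and write calls.
* Pinned (page-locked) host memory transfers faster to and from the device.
* On CPUs, allocate more work to each work-item.
* Use more work-items per workgroup.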
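
The last two points both change the launch geometry of a kernel. As a rough sketch (the function `launch_geometry` and its `work_per_item` parameter are hypothetical names, not part of `py_helper` or OpenCL), the global size and the number of workgroups for a given problem size might be worked out as follows. It rounds the global size up to a whole number of workgroups, which OpenCL 1.x requires and which is common practice elsewhere.

```python
import math

def launch_geometry(problem_size, local_size, work_per_item=1):
    """Return (global_size, n_groups) for a kernel launch.

    problem_size  : number of elements along each dimension
    local_size    : workgroup size along each dimension
    work_per_item : hypothetical knob, the number of elements that one
                    work-item handles along each dimension
    """
    global_size = []
    n_groups = []
    for n, local in zip(problem_size, local_size):
        # Work-items needed when each one handles work_per_item elements
        n_items = math.ceil(n / work_per_item)
        # Round up to a whole number of workgroups
        groups = math.ceil(n_items / local)
        n_groups.append(groups)
        global_size.append(groups * local)
    return tuple(global_size), tuple(n_groups)

# Size of C in this notebook: (NROWS_C, NCOLS_C) = (512, 1024)
gpu_like = launch_geometry((512, 1024), (8, 8))
cpu_like = launch_geometry((512, 1024), (8, 8), work_per_item=16)
print("one element per work-item :", gpu_like)
print("16x16 block per work-item :", cpu_like)
```

With `work_per_item=16` each work-item covers a 16x16 block of C, so far fewer work-items are launched. Whether that helps on a given CPU is something to measure rather than assume.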

### Research papers

* [Using OpenCL on CPUs](https://www.hindawi.com/journals/sp/2015/859491/)

In [1]:
import os
import sys
import numpy as np
sys.path.insert(0, os.path.abspath("../include"))

import py_helper

%matplotlib widget

dtype=np.float32
NCOLS_A = 512
NROWS_C = 512
NCOLS_C = 1024

# A is of size (NROWS_C, NCOLS_A)
# B is of size (NCOLS_A, NCOLS_C)    
# C is of size (NROWS_C, NCOLS_C)

mat_mul=py_helper.MatMul(NCOLS_A, NROWS_C, NCOLS_C, dtype)
mat_mul.make_data()

local_opt=py_helper.LocalOpt2D(2)
local_opt.make_data()

In [2]:
!make; 
local_opt.run_problem(["./mat_mult.exe","-cpu", "0"])

make: Nothing to be done for 'all'.
	               name: AMD Ryzen Threadripper 2950X 16-Core Processor  
	 global memory size: 135049 MB
	    max buffer size: 33762 MB
	     max local size: (8192,8192,8192)
	     max work-items: 8192
None
None


In [3]:
mat_mul.check_data()
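
The kind of comparison such a check makes can be sketched with NumPy alone. This is not the implementation of `py_helper.MatMul.check_data`, which is not shown here. The random inputs and the blocked product standing in for the device result are assumptions made so the example is self-contained.

```python
import numpy as np

dtype = np.float32
NCOLS_A, NROWS_C, NCOLS_C = 512, 512, 1024

# Assumed test data: uniform random values in [0, 1)
rng = np.random.default_rng(0)
A = rng.random((NROWS_C, NCOLS_A), dtype=dtype)   # (NROWS_C, NCOLS_A)
B = rng.random((NCOLS_A, NCOLS_C), dtype=dtype)   # (NCOLS_A, NCOLS_C)

# Reference answer from NumPy, of size (NROWS_C, NCOLS_C)
C = A @ B

# Stand-in for the device result: the same product accumulated
# in blocks of 64 along the shared dimension
C_device = np.zeros((NROWS_C, NCOLS_C), dtype=dtype)
for start in range(0, NCOLS_A, 64):
    stop = start + 64
    C_device += A[:, start:stop] @ B[start:stop, :]

# Single-precision sums differ slightly with summation order,
# so compare with a tolerance rather than exact equality
max_residual = np.abs(C - C_device).max()
matches = np.allclose(C, C_device, rtol=1e-4)
print("shape:", C.shape, "max residual:", max_residual, "match:", matches)
```

A tolerance is needed because the device and NumPy generally add the products in a different order, and floating point addition is not associative.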

### Access to vector elements

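Memory in a buffer can be interpreted as OpenCL vector types of up to 16 elements (for example `float4`, `float8`, or `float16`). This is a potential source of optimisation because a vector load or store moves several elements in one operation, and the compiler can map arithmetic on vectors onto the SIMD units of the device.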

<figure style="margin-left:auto; margin-right:auto; width:100%;">
    <img style="vertical-align:middle" src="../images/matrix_multiplication_shmem_transp.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Using local memory in matrix multiplication.</figcaption>
</figure>
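
The idea of viewing the same memory as vectors can be illustrated on the host with NumPy. This is only an analogy to what an OpenCL kernel does with a type such as `float8`, not the kernel used in this notebook. The width of 8 and the random data are assumptions for the sketch.

```python
import numpy as np

VEC_WIDTH = 8  # analogous to float8

rng = np.random.default_rng(0)
row = rng.random(512, dtype=np.float32)  # stands in for a row of A
col = rng.random(512, dtype=np.float32)  # stands in for a column of B

# Scalar style: multiply and accumulate one element per step
scalar_sum = np.float32(0.0)
for a, b in zip(row, col):
    scalar_sum += a * b

# Vector style: view the same memory as 64 vectors of 8 elements
row_vec = row.reshape(-1, VEC_WIDTH)
col_vec = col.reshape(-1, VEC_WIDTH)
print("shares memory with the original:", np.shares_memory(row, row_vec))

# Accumulate 8 partial sums per step, then reduce them at the end
partial = np.zeros(VEC_WIDTH, dtype=np.float32)
for a8, b8 in zip(row_vec, col_vec):
    partial += a8 * b8
vector_sum = partial.sum()

print("results agree:", np.isclose(scalar_sum, vector_sum, rtol=1e-4))
```

The vector loop takes 64 steps instead of 512. The two sums agree only to within single-precision rounding because the additions happen in a different order.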

In [7]:
!make; ./mat_mult_local_transpose_vector.exe -gpu 1

make: Nothing to be done for 'all'.
	               name: NVIDIA GeForce RTX 3060 
	 global memory size: 12635 MB
	    max buffer size: 3158 MB
	     max local size: (1024,1024,64)
	     max work-items: 1024
Time for event "Running transpose kernel": 0.015 ms
Time for event "Running mat_mul_transp_vec kernel": 1.101 ms


In [8]:
# Import plotting and axes machinery
from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Read in the output from OpenCL
C_ocl = np.fromfile("array_C.dat", dtype=dtype).reshape((NROWS_C, NCOLS_C))

# Make plots
fig, axes = plt.subplots(3, 1, figsize=(6,8), sharex=True, sharey=True)

# Data to plot. C is assumed to hold the NumPy reference
# answer for the product, defined earlier in the session
data = [C, C_ocl, np.abs(C-C_ocl)]

# Labels to plot
labels = ["Numpy", "OpenCL", "Absolute residual"]

for n, value in enumerate(data):
    # Plot the graph
    ax = axes[n]
    im = ax.imshow(value)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.1)

    # Set labels on things
    ax.set_xlabel("Dimension 1 (columns)")
    ax.set_ylabel("Dimension 0 (rows)")
    ax.set_title(labels[n])

    # Put a color bar on the plot
    plt.colorbar(mappable=im, cax=cax)

fig.tight_layout()
plt.show()