# Working with images

The goal of this notebook is to familiarize you with the part of the OpenCL API that deals with images. Until now we've worked with buffers; images are handled a bit differently.

In the exercises you will implement a filter that can be applied to an image. We're going to define a relatively small filter - a fixed size 3x3 matrix. This matrix will be applied to every pixel in the image. You can think about it as a transform, which processes our image. Some common examples include:
* image blurring with [Box blur](https://en.wikipedia.org/wiki/Box_blur)
* edge detection with [Sobel Filter](https://en.wikipedia.org/wiki/Sobel_operator)

NOTE: The term ```kernel``` can be confusing because its meaning depends on the context:
- in GPU programming, a kernel is a program that runs on the GPU - it can also be referred to as a shader
- in AI, a kernel means the filter applied to a tensor, matrix or image - so in this notebook we will use the term filter

## Setup

Load libraries and extensions.

In [None]:
import pyopencl as cl
import numpy as np

%load_ext pyopencl.ipython_ext

Create context and queue.

In [None]:
platform = cl.get_platforms()[0]

ctx = cl.Context(
    dev_type=cl.device_type.ALL, 
    properties=[(cl.context_properties.PLATFORM, platform)])    

queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    
devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

## Image copy exercise

Let's start with the host code that reads an image from a file into a NumPy array. NumPy is a very popular Python library for working with arrays and matrices, and it makes many operations simple and convenient.

To read the image into a NumPy array we will use OpenCV, a popular computer vision library. We are going to prefix variable names with:
* h_ or host_ - to indicate that the variable refers to an object in host (CPU) memory
* d_ or device_ - to indicate that the variable refers to device (GPU) memory

NOTE: The image is assumed to be accessed from the "Bee Hive" cluster. Change the path if you are working in a different environment.

In [None]:
import cv2
import numpy as np

#image_path = "/mnt/images/your_image.png"
image_path = "C:\\projects\\images\\desktop_2x1080p.png"

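h_image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
if h_image is None: # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError(image_path)
if h_image.ndim == 3 and h_image.shape[2] == 3:
    h_image = cv2.cvtColor(h_image, cv2.COLOR_BGR2BGRA) # add alpha channel - needed for GPU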

Below we'll define a helper function for profiling GPU kernels. Here, profiling means measuring execution time. Since kernels execute asynchronously on the GPU, we cannot simply time the call on the host with Python's "time" module.

In PyOpenCL we can easily retrieve two timestamps collected from the GPU clock (the start and the end of execution, in nanoseconds) and compute the time the GPU spent executing the kernel. This requires a queue created with PROFILING_ENABLE, as we did in Setup.

Additionally, we'll execute the kernel "n" times and report statistics over those runs.

In [None]:
def profile_gpu(function, n, queue, global_size, local_size, *args):
    times = np.zeros(n)
    # two warm-up runs, excluded from the statistics
    function(queue, global_size, local_size, *args).wait()
    function(queue, global_size, local_size, *args).wait()
    
    for i in range(n):
        e = function(queue, global_size, local_size, *args)
        e.wait()
        # profiling timestamps are in nanoseconds; convert to milliseconds
        elapsed = (e.profile.end - e.profile.start) * 1e-6
        times[i] = elapsed

    avg_ms = np.mean(times)
    median_ms = np.median(times)
    variance = np.var(times)
    std = np.std(times)
    min_ms = np.min(times)

    print(f"{function.function_name} took minimum {min_ms:.4f} ms, average {avg_ms:.4f} ms, median {median_ms:.4f} ms, variance {variance:.4f} ms^2, standard deviation {std:.4f} ms.")
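The unit handling inside `profile_gpu` can be illustrated without a GPU. The sketch below assumes made-up (start, end) timestamp pairs in nanoseconds, the unit used by OpenCL event profiling, and uses only the standard library. `summarize_ns` and `fake_runs` are illustrative names, not part of PyOpenCL.

```python
import statistics

def summarize_ns(timestamps_ns):
    """Turn (start, end) nanosecond pairs into millisecond statistics."""
    times_ms = [(end - start) * 1e-6 for start, end in timestamps_ns]
    return {
        "min_ms": min(times_ms),
        "avg_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
        "std_ms": statistics.pstdev(times_ms),  # population std, like np.std
    }

# Made-up timestamps: three runs taking 2 ms, 4 ms and 6 ms.
fake_runs = [(0, 2_000_000), (10_000_000, 14_000_000), (20_000_000, 26_000_000)]
stats = summarize_ns(fake_runs)
print(stats)
```

Reporting the minimum and the median next to the average is useful because a single slow run can skew the average noticeably.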

For convenience, we'll define a function to display an image in a Jupyter notebook.

In [None]:
import IPython.display

def show_image(bgra_image):
    # encode to PNG in memory and hand the raw bytes to the notebook
    _, ret = cv2.imencode('.png', bgra_image)
    i = IPython.display.Image(data=ret.tobytes())
    IPython.display.display(i)

Here is what the image looks like. This is an example of a desktop screenshot, a very common type of content that we encode in DisplayLink.

In [None]:
show_image(h_image)

So far the loaded image has been stored in CPU memory. To process it on the GPU we need to transfer it there. Additionally, we'll extract the image dimensions from the shape property.

In [None]:
height = h_image.shape[0]
width = h_image.shape[1]
nr_channels = h_image.shape[2]

d_image = cl.image_from_array(ctx, h_image, nr_channels)
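One detail that is easy to trip over: NumPy stores the image as (height, width, channels), while the OpenCL image shapes used in this notebook are given as (width, height). The helper below is a hypothetical illustration of that mapping, not a PyOpenCL function, and the example dimensions are assumed.

```python
def cl_shape_from_numpy(numpy_shape):
    """Map a NumPy (height, width, channels) shape to an OpenCL (width, height) shape."""
    height, width = numpy_shape[0], numpy_shape[1]
    return (width, height)

# Assumed example: two 1080p screens side by side, 4 channels.
print(cl_shape_from_numpy((1080, 3840, 4)))  # prints (3840, 1080)
```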

To copy the image we also need to create a second image into which the first one will be copied.

In [None]:
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)

d_output = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, fmt, shape=(width, height))

### Write a copy kernel

Your first task is to write an OpenCL kernel that copies pixels from one image to another. Modify the code below to read the pixels and write them back into the output image.

* To read a pixel from an image you can use the [read_image](https://man.opencl.org/read_imagei2d.html) function.
* To write a pixel use [write_image](https://man.opencl.org/write_image2d.html).
* Because our images use the UNSIGNED_INT8 channel type, pick the unsigned integer variants: read_imageui and write_imageui.
* To retrieve a thread ID you can use the [get_global_id](https://man.opencl.org/get_global_id.html) function.
* Anything else can be found in the [OpenCL 3.0 Reference Guide](https://www.khronos.org/files/opencl30-reference-guide.pdf)

In [None]:
%%cl_kernel

__kernel void copy_image(read_only image2d_t image, write_only image2d_t output)
{    
    // Your code goes here
}

Modify this section to adjust the execution configuration - the number and size of work groups. This can be solved in many ways.

In [None]:
local_work_size = (8,8)
# FIX the execution configuration
global_work_size = (256,256)
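One common approach, sketched here under the assumption that each work item handles exactly one pixel, is to round each image dimension up to a multiple of the local size. `round_up` and `global_size_for` are hypothetical helpers, and the dimensions in the example are made up. If rounding adds extra work items, the kernel has to skip coordinates that fall outside the image.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

def global_size_for(width, height, local_size):
    """Global work size covering a width x height image, one work item per pixel."""
    return (round_up(width, local_size[0]), round_up(height, local_size[1]))

print(global_size_for(3840, 1080, (8, 8)))  # both dimensions already divisible by 8
print(global_size_for(1000, 750, (8, 8)))   # second dimension is rounded up to 752
```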

The code below schedules the kernel execution and reads the data back to the CPU. This code section is set up correctly, so you should not modify it.

In [None]:
# keep this cell unchanged

# recreate empty output buffer
d_output = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, fmt, shape=(width, height))

# schedule 'copy_image' kernel execution onto a GPU
profile_gpu(copy_image, 20, queue,
            global_work_size, 
            local_work_size,
            d_image,
            d_output)

# initialize CPU output buffer
h_output = np.zeros_like(h_image)

# schedule copy from GPU resource to CPU memory
_ = cl.enqueue_copy(queue=queue,
                dest=h_output,
                src=d_output,
                origin=(0, 0),
                region=(width, height)
               )

For debugging purposes you can modify the lines that print pixel values in the cell below. 

The copied image is also displayed in the cell output so you can visually assess whether it was copied correctly.

In [None]:
display(h_image[:2, :2, :])
display(h_output[:2, :2, :])
print(f"Input and output are the same: {np.array_equal(h_image, h_output)}")

# display image in the cell below
show_image(h_output)
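If the comparison reports a mismatch, it helps to know where the images differ. Below is a minimal sketch using NumPy. `first_mismatch` is an illustrative helper, and the two small arrays are stand-ins for `h_image` and `h_output`.

```python
import numpy as np

def first_mismatch(expected, actual):
    """Return (row, column) of the first differing pixel, or None if the images match."""
    differs = np.any(expected != actual, axis=-1)  # True where any channel differs
    positions = np.argwhere(differs)
    return tuple(int(v) for v in positions[0]) if len(positions) else None

# Tiny stand-in images: 2x3 pixels, 4 channels.
a = np.zeros((2, 3, 4), dtype=np.uint8)
b = a.copy()
b[1, 2, 0] = 255  # corrupt one channel of one pixel

print(first_mismatch(a, a))  # prints None
print(first_mismatch(a, b))  # prints (1, 2)
```

A mismatch that starts in the first row often points at the kernel itself, while one that starts further into the image may indicate that the global work size does not cover the whole image.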

Refer to the [solution](./open_day/copy_image_solution.txt) if you get stuck.

## Implement Box Blur

Now that you have written a copy kernel we have a good basis to implement a convolutional filter. 
* by filter we mean a mathematical operation on the pixels. 
* by convolutional we mean that conceptually a window slides across the image and the operation is applied at each position.

Your task is to implement a GPU kernel that blurs the image. A simple Box Blur reads the N x N pixels around a pixel and averages them. 
Compared to the copy kernel from the previous exercise, in addition to reading a pixel value you will also have to read the pixels surrounding it.

Extend the copy kernel written previously:
* read 3x3 pixels - 9 pixels in total. One in the middle and 8 surrounding it.
* take the mean of the 9 pixels. Pay attention to data types.
* write the mean pixel to the output image.
* we'll pass a sampler to the read_imageui function so that we don't have to think about handling the edges of the image: with CLK_ADDRESS_CLAMP, reads outside the image return a border color instead of failing.

You can refer to the [Implementation](https://en.wikipedia.org/wiki/Box_blur#Implementation) section of the Wikipedia article on Box blur. 

In [None]:
%%cl_kernel

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void box_blur(read_only image2d_t image, write_only image2d_t output)
{
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    uint4 avg = (uint4)(0);

    // Your code goes here
    
    avg /= 9;
    write_imageui(output, coord, avg);
}

Similarly to the copy image example, the code below schedules kernel execution. Feel free to experiment with local and global work sizes.

In [None]:
local_work_size = (8,8)
global_work_size = (width, height)

# keep below code cell unchanged

# recreate empty output buffer
d_blur_output = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, fmt, shape=(width, height))

# schedule 'box_blur' kernel execution onto a GPU
profile_gpu(box_blur, 50, queue,
            global_work_size, 
            local_work_size,
            d_image,
            d_blur_output)

# initialize CPU output buffer
h_blur_output = np.zeros_like(h_image)

# schedule copy from GPU resource to CPU memory
_ = cl.enqueue_copy(queue=queue,
                dest=h_blur_output,
                src=d_blur_output,
                origin=(0, 0),
                region=(width, height)
               )

To help with debugging we display the upper-left 512x512 pixels of the original image and of the blurred image. Does the second one look blurred?

Click [blurred_image.bmp](./blurred_image.bmp) to view the full image in a separate tab to check how it looks.

Refer to the [solution](./box_blur_solution.c) if you get stuck.

In [None]:
# display image in the cell below
show_image(h_image[:512, :512, :])
show_image(h_blur_output[:512, :512, :])
_ = cv2.imwrite("blurred_image.bmp", h_blur_output)
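To check the blur numerically rather than by eye, a CPU reference can help. The sketch below makes two assumptions: that the kernel sums all nine reads before the `avg /= 9` step, and that reads outside the image return zeros, which is what CLK_ADDRESS_CLAMP yields for an RGBA image. `box_blur_reference` is an illustrative name, not part of the exercise files.

```python
import numpy as np

def box_blur_reference(image):
    """3x3 box blur of a (height, width, channels) uint8 image on the CPU.

    Pixels outside the image count as zero, and the sum is divided by 9 with
    integer (truncating) division, as `avg /= 9` does on a uint4.
    """
    height, width = image.shape[:2]
    padded = np.pad(image.astype(np.uint32), ((1, 1), (1, 1), (0, 0)))
    total = np.zeros(image.shape, dtype=np.uint32)
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + height, dx:dx + width]
    return (total // 9).astype(np.uint8)

# Tiny check: a single bright pixel spreads evenly over its 3x3 neighbourhood.
img = np.zeros((3, 3, 4), dtype=np.uint8)
img[1, 1] = 90
blurred = box_blur_reference(img)
print(blurred[:, :, 0])  # every entry is 10
```

With the full arrays, `np.array_equal(box_blur_reference(h_image), h_blur_output)` would be the corresponding comparison. If your kernel treats the image edges differently, expect differences in the outermost rows and columns only.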

## Implement edge detection - Sobel filter

This exercise will demonstrate how to implement an arbitrary filter that is passed from the host to the GPU. We will extend the Box Blur code to load a 3x3 filter matrix into the GPU and perform calculations with it. As the example we will use an edge detection filter - the Sobel filter.

The filter weights are passed to the kernel as an additional argument. The values are taken from [here](https://en.wikipedia.org/wiki/Sobel_operator#Formulation) and they are:

$$
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}
$$

The Sobel filter works on [luminance](https://en.wikipedia.org/wiki/Luma_(video)) - pixel intensity. To calculate it you can use a simple version of the formula:

$$
\frac{Red + 2 Green + Blue}{4}
$$

You can follow the steps below (or do it your own way):
* load pixel values for all 9 pixels
* convert from RGB space to luma using the formula above
* multiply each luma value by the corresponding filter weight - filter weights are floats so make sure the calculations are done in floats
* since values can become negative, take the absolute value of the accumulated weighted sum and clamp it to the valid pixel range [0, 255]
* write back the output value at the correct output coordinates - the output image format was changed to contain only 1 LUMINANCE channel
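The steps above can be sketched on the CPU with NumPy, which gives a reference to compare the GPU output against. This is an illustration under stated assumptions, not the exercise solution: `sobel_reference` is a made-up name, pixels outside the image are treated as zero, and all arithmetic is done in float32. A kernel that computes luma with integer arithmetic may differ from it by small rounding errors.

```python
import numpy as np

def sobel_reference(image, weights):
    """Apply a 3x3 filter to the luma of a (height, width, 4) uint8 image."""
    height, width = image.shape[:2]
    pixels = image.astype(np.float32)
    # Simplified luma: (R + 2G + B) / 4. Channels 0 and 2 carry the same weight,
    # so BGRA and RGBA ordering give the same result.
    luma = (pixels[:, :, 0] + 2.0 * pixels[:, :, 1] + pixels[:, :, 2]) / 4.0
    padded = np.pad(luma, 1)  # treat pixels outside the image as zero
    acc = np.zeros((height, width), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            acc += weights[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return np.minimum(np.abs(acc), 255.0).astype(np.uint8)

weights = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)

# Tiny test image: dark left half, bright right half (a vertical edge).
img = np.zeros((4, 4, 4), dtype=np.uint8)
img[:, 2:, :3] = 100
edges = sobel_reference(img, weights)
print(edges)
```

In this small example the last column also lights up, because the zero border next to the bright pixels acts as an edge of its own.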

In [None]:
%%cl_kernel

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void sobel_filter(read_only image2d_t image, 
                           const __global float* my_filter, 
                           write_only image2d_t output)
{    
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    int2 image_size = (int2)(get_image_width(image), get_image_height(image));
    
    float acc = 0;

    // your code goes here
    
    acc = fabs(acc);
    acc = fmin(acc, 255.0f);
    
    write_imageui(output, coord, (uint4)(acc, 0, 0, 0));
}

The code below schedules execution on the GPU. Notice that we have added an extra buffer that stores the values of the 3x3 filter.
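The kernel sees the filter as a flat `float*`, so a 2D neighbour offset has to be turned into a 1D index. A NumPy array built from nested lists is stored row by row (C order), which suggests the mapping sketched below. The loop variables `dy` and `dx` are illustrative names for the row and column offset from the centre pixel.

```python
import numpy as np

h_filter = np.array(
    [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]],
    dtype=np.float32)

flat = h_filter.ravel()  # row-major order, the order in which the values reach the buffer
for dy in (-1, 0, 1):
    for dx in (-1, 0, 1):
        index = (dy + 1) * 3 + (dx + 1)
        # the flat index addresses the same weight as the 2D position
        assert flat[index] == h_filter[dy + 1, dx + 1]
print(flat)
```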

In [None]:
local_work_size = (8,8)
global_work_size = (width, height)
h_filter = np.array(
    [[-1, 0, 1], 
     [-2, 0, 2], 
     [-1, 0, 1]], 
    dtype=np.float32)

# keep below code cell unchanged

# recreate empty output buffer
sobel_format = cl.ImageFormat(cl.channel_order.LUMINANCE, cl.channel_type.UNSIGNED_INT8)
d_output = cl.Image(ctx, cl.mem_flags.WRITE_ONLY, sobel_format, shape=(width, height))
d_filter = cl.Buffer(ctx, flags=cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=h_filter)

# schedule 'sobel_filter' kernel execution onto a GPU
profile_gpu(sobel_filter, 50, queue,
            global_work_size, 
            local_work_size,
            d_image,
            d_filter,
            d_output)

# initialize CPU output buffer - the shape is (height, width) - a single LUMINANCE channel
h_filter_output = np.zeros_like(h_image[:,:,0])

# schedule copy from GPU resource to CPU memory
_ = cl.enqueue_copy(queue=queue,
                dest=h_filter_output,
                src=d_output,
                origin=(0, 0),
                region=(width, height)
               )

show_image(h_filter_output)
_ = cv2.imwrite("sobel_filter.bmp", h_filter_output)

Click [sobel_filter.bmp](./sobel_filter.bmp) to view the full image in a separate tab to check how it looks.

Refer to the [solution](./sobel_filter_solution.c) if you get stuck.

If you are interested, you can try out other types of filters. [Here](https://en.wikipedia.org/wiki/Kernel_(image_processing)) is a Wikipedia list of popular 3x3 filters.
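As a starting point, a few well-known 3x3 filters can be written as NumPy arrays in the same layout as `h_filter`. This is only a sketch: the dictionary name `other_filters` is made up, and because the kernel in this exercise works on luma and takes the absolute value of the result, the blur and sharpen filters would produce a greyscale image here.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)

other_filters = {
    # Transposed Sobel: responds to horizontal edges instead of vertical ones.
    # ascontiguousarray keeps the row-major layout expected when copying to a buffer.
    "sobel_y": np.ascontiguousarray(sobel_x.T),
    # Box blur expressed as weights: every pixel in the window contributes 1/9.
    "box_blur": np.full((3, 3), 1.0 / 9.0, dtype=np.float32),
    # Sharpen: boosts the centre pixel relative to its 4 direct neighbours.
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32),
}

for name, weights in other_filters.items():
    print(name, weights.sum())
```

Assigning one of these arrays to `h_filter` and re-running the cell above should be enough to try it out. Weights that sum to zero respond only to changes in intensity, while weights that sum to one preserve overall brightness.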