# Point-in-Polygon

The goal of this exercise is to teach you about the different memory spaces available in CUDA.

To complete this exercise you need to do the following:

**Step 1.** Carefully read the entire notebook before you continue. Make sure you understand everything, then run all the cells once from top to bottom.

**Step 2.** Change both the kernel and the Python code to store the vertices in constant memory space and only use the vertices in constant memory within the kernel.

Hint 1: In the CUDA source, declare a float2 array of size VERTICES as a global variable at file scope, outside the kernel function. Choose a unique name and use the ``__constant__`` qualifier to declare this variable as residing in constant memory space.
Make sure the kernel reads the vertices from the constant memory array instead of the ``vertices`` array in global memory that it currently uses. You can leave the original global memory array as an unused kernel argument; if you change the kernel arguments, you have to change the host code as well.

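Hint 2: To copy the data into constant memory you first need a reference to the symbol you declared. With CuPy, ``get_global()`` on the compiled ``cp.RawModule`` returns a pointer to a named global symbol, which the commented-out lines in the last cell wrap in a ``cp.ndarray`` so that the vertices can be copied into it. In PyCuda the equivalent is memcpy_htod() combined with get_global; [see the PyCuda documentation on get_global](https://documen.tician.de/pycuda/driver.html#pycuda.driver.Module.get_global).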
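
Before moving the vertices, it can help to check that they fit in constant memory at all. The sketch below is plain Python and assumes the commonly documented 64 KB constant memory limit of CUDA devices and a `float2` made of two 4-byte floats. The names `FLOAT2_BYTES` and `CONSTANT_MEMORY_BYTES` are made up for this illustration.

```python
import struct

VERTICES = 600                         # same value as the #define in pnpoly.cu
FLOAT2_BYTES = struct.calcsize("2f")   # two 32-bit floats per vertex
CONSTANT_MEMORY_BYTES = 64 * 1024      # assumed constant memory limit

bytes_needed = VERTICES * FLOAT2_BYTES
print(bytes_needed, "bytes needed,", CONSTANT_MEMORY_BYTES, "bytes available")
```

Under these assumptions the polygon takes up only a small fraction of the available constant memory.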

As usual we start with some imports and initializing CuPy.

In [None]:
import numpy as np
import cupy as cp

In [None]:
# Initialize cupy and create a device context
device = cp.cuda.Device(0)
device.use();

The next cell defines our CUDA kernel. Running the cell writes its contents to a file named pnpoly.cu.

This kernel implements the crossing number algorithm for determining whether a point resides on the inside or on the outside of a polygon in the 2D plane. The polygon is defined as a set of vertices and the points are simply x,y coordinates. The result is a bitmap that indicates for each point if it's in or out.
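
To make the algorithm concrete, here is a minimal CPU sketch of the same crossing number test for a single point, written in plain Python. The function name `point_in_polygon` and the unit square are made up for this illustration. Unlike the kernel, the sketch computes the slope only after the vertical check, because Python raises an error on division by zero for horizontal edges.

```python
def point_in_polygon(px, py, vertices):
    """Crossing number test: return 1 if (px, py) is inside, else 0."""
    c = 0
    k = len(vertices) - 1                  # start with the edge from the last vertex to the first
    for j in range(len(vertices)):
        vjx, vjy = vertices[j]
        vkx, vky = vertices[k]
        # is p between vj and vk vertically?
        if (vjy > py) != (vky > py):
            slope = (vkx - vjx) / (vky - vjy)
            # does a ray from p in the positive x-direction cross the edge vk-vj?
            if px < slope * (py - vjy) + vjx:
                c = 1 - c                  # flip between even (out) and odd (in)
        k = j
    return c

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
inside = point_in_polygon(0.5, 0.5, square)
outside = point_in_polygon(1.5, 0.5, square)
print(inside, outside)
```

The GPU kernel runs this same loop once per point, with each thread handling one point.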

In [None]:
%%writefile pnpoly.cu

#define VERTICES 600

extern "C"
__global__ void cn_pnpoly(int *bitmap, float2 *points, float2 *vertices, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        int c = 0;
        float2 p = points[i];

        int k = VERTICES-1;

        for (int j=0; j<VERTICES; k = j++) {    // edge from vk to vj
            float2 vj = vertices[j];
            float2 vk = vertices[k];

            float slope = (vk.x-vj.x) / (vk.y-vj.y);

            if ( (  (vj.y>p.y) != (vk.y>p.y)) &&            //if p is between vj and vk vertically
                    (p.x < slope * (p.y-vj.y) + vj.x) ) {   //if p.x crosses the line vk-vj when moved in positive x-direction
                c = !c;
            }
        }

        bitmap[i] = c; // 0 if even (out), and 1 if odd (in)
    }

}

The next step is to prepare the input and output data structures for our kernel.
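
The kernel reads both `points` and `vertices` through `float2` pointers, so a flat array of 32-bit floats is interpreted as consecutive (x, y) pairs. This is why `2*size` floats are generated for `size` points. A small sketch of that layout using only the standard library, with made-up values:

```python
import struct

flat = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]       # x0, y0, x1, y1, x2, y2
raw = struct.pack(f"{len(flat)}f", *flat)   # the bytes a float32 array would hold

# read the buffer back 8 bytes at a time, the way a float2 pointer would
pairs = list(struct.iter_unpack("2f", raw))
print(pairs)
```

Each tuple corresponds to one `float2` element, with the first value as `.x` and the second as `.y`.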

In [None]:
# download the vertices
!wget https://github.com/benvanwerkhoven/gpu-course/raw/master/pnpoly/vertices.dat
#set the number of points; the polygon has 600 vertices (VERTICES in the kernel)
size = np.int32(2e7)

#generate/read input data
points = np.random.randn(2*size).astype(np.float32)
vertices = np.fromfile("vertices.dat", dtype=np.float32)
bitmap = np.zeros(size, dtype=np.int32)

Now we set up GPU memory for the input and output data as well as the argument list and launch parameters.
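
The grid size is the number of points divided by the threads per block, rounded up so that every point gets a thread. The sketch below shows that rounding with integer arithmetic. The helper name `blocks_needed` and the second example size are made up for this illustration.

```python
def blocks_needed(n, threads_per_block):
    """Number of thread blocks needed to cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block

threads_per_block = 256

# the notebook's problem size happens to divide evenly
full = blocks_needed(20_000_000, threads_per_block)

# a size that does not divide evenly leaves some threads without a point
small = blocks_needed(1000, threads_per_block)
spare = small * threads_per_block - 1000
print(full, small, spare)
```

Threads without a point are why the kernel guards its body with `if (i < n)`.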

In [None]:
#allocate device memory and copy to GPU
d_vertices = cp.array(vertices)
d_points = cp.array(points)
d_bitmap = cp.array(bitmap)

#kernel arguments
gpu_args = [d_bitmap, d_points, d_vertices, size]

#setup thread block sizes
threads = (256, 1, 1)
grid = (int(np.ceil(size/float(threads[0]))), 1)

#create events for time measurement
start = cp.cuda.Event()
end = cp.cuda.Event()

Now before we turn to our CUDA kernel we first run a reference kernel to compute the reference output, which allows us to check if the result from our kernel is correct. **It is recommended to only run this cell once, before you make any modifications.**

In [None]:
#compile and run the reference kernel
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!wget -O pnpoly_reference.cu https://github.com/benvanwerkhoven/gpu-course/raw/master/pnpoly/pnpoly_reference_kernel.cu
with open('pnpoly_reference.cu', 'r') as f:
    kernel_string = f.read()
module = cp.RawModule(code=kernel_string, options=())
#compute the reference answer using the reference kernel
d_reference = cp.zeros_like(d_bitmap)
reference_kernel = module.get_function("cn_pnpoly_reference_kernel")
ref_args = [d_reference, d_points, d_vertices, size]
device.synchronize()
start.record()
reference_kernel(grid, threads, ref_args)
end.record()
device.synchronize()
reference = cp.asnumpy(d_reference)
print("reference kernel took", cp.cuda.get_elapsed_time(start, end), "ms.")

Now we are ready to compile and run our kernel and see if the result is correct.

Note that this cell prints PASSED even when you haven't made any modifications, so PASSED alone does not show that the kernel uses constant memory. The goal here is to make sure that the kernel reads the vertices from constant memory. If you re-run this cell after your modifications it should still print PASSED and hopefully run slightly faster, although the speedup differs per GPU.

In [None]:
#read kernel into string
with open('pnpoly.cu', 'r') as f:
    kernel_string = f.read()

#compile the kernels
module = cp.RawModule(code=kernel_string, options=())
pnpoly_kernel = module.get_function("cn_pnpoly")

# HINT: obtain a reference to the constant memory symbol
#symbol = ....

# need to copy vertices to the constant memory, uncomment next two lines
#constant_mem = cp.ndarray(vertices.shape, vertices.dtype, symbol)
#cp.copyto(constant_mem, d_vertices)

#make sure all previous operations have completed
device.synchronize()

#run the kernel and measure time using events
start.record()
pnpoly_kernel(grid, threads, gpu_args)
end.record()
device.synchronize()

print("cn_pnpoly took", cp.cuda.get_elapsed_time(start, end), "ms.")
bitmap = cp.asnumpy(d_bitmap)

print("PASSED" if np.sum(np.absolute(bitmap - reference)) == 0 else "FAILED")