# Basic Texture Sampling
This notebook focuses on texture memory, a powerful tool in GPU programming.
Texture memory has its own unique caching mechanism, specifically optimized for accessing 2D data like images and grid data.
When combined with 2D thread blocks, which are frequently used in image processing, texture memory enables efficient data access and sampling.
Texture fetches address data by texture coordinates and come with built-in interpolation and configurable addressing modes.
These characteristics make texture memory particularly effective for stencil pattern computations, where each pixel's value depends on its surrounding neighbors.

In [None]:
import os
import math
import io

import numpy as np
import cupy as cp

from PIL import Image as PILImage
from IPython.display import Image, display

img_src = np.array([[192, 127], [63, 255]], dtype=np.uint8)

First, we define a 2x2 NumPy array called `img_src`.
This small array serves as our source image for texture sampling.
Its four elements (`192`, `127`, `63`, `255`) are 8-bit pixel intensities.
This tiny, checkerboard-like pattern will be sampled and reconstructed into a larger image by a CUDA kernel, which we'll discuss later.
This pattern is key to visually understanding the various behaviors of texture sampling.

Next, we read the CUDA kernel source code file `02_basic_texture_sampling_1.cu` into a string.
We then print the source code and pass it to CuPy's `RawModule`.
`RawModule` compiles the CUDA code into a form that can run on the GPU, so that we can later look up the kernel function by name and launch it.

In [None]:
dn = os.getcwd()
fpfn = os.path.join(dn, '02_basic_texture_sampling_1.cu')
with open(fpfn, 'r') as f:
  cuda_source = f.read()

print(cuda_source)
module = cp.RawModule(code=cuda_source)

The `scale` kernel calculates the coordinates of the output pixel assigned to each thread and then samples the texture at those coordinates.
This kernel is designed to be invoked using 2D thread blocks, which are common in image processing, allowing each thread to process a corresponding 2D position in the output image.
It transforms the output image pixel coordinates into normalized `[0, 1)` texture coordinates, and then scales them further into the `[-1.5, 1.5)` range.
This scaling ensures that regions far outside the original 2x2 texture's `[0, 1)` range are also sampled, clearly visualizing the effects of various addressing modes (edge handling) that we'll demonstrate later.
Finally, it uses the scaled x, y coordinates to fetch a pixel value from the texture via `tex2D<unsigned char>(tex, x, y)` and writes it to the output array.
The addressing and filtering modes configured in the texture object are automatically applied during this sampling process.
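
The coordinate mapping described above can be emulated on the CPU with NumPy. This is only a sketch: the kernel source is read from a file and not reproduced here, so the helper name `output_to_texture_coords` and the exact formula (index divided by size, with no pixel-center offset) are assumptions, chosen to match the `[0, 1)` and `[-1.5, 1.5)` ranges mentioned in the text.

```python
import numpy as np

def output_to_texture_coords(w, h, lo=-1.5, hi=1.5):
    """Map every output pixel index to a scaled texture coordinate (CPU emulation)."""
    # Pixel index -> normalized [0, 1). Assumption: no +0.5 pixel-center offset.
    u = np.arange(w, dtype=np.float64) / w
    v = np.arange(h, dtype=np.float64) / h
    # Normalized [0, 1) -> [lo, hi)
    x = lo + (hi - lo) * u
    y = lo + (hi - lo) * v
    # meshgrid returns two (h, w) arrays: the x and y coordinate of each pixel
    return np.meshgrid(x, y)

xs, ys = output_to_texture_coords(256, 256)
print(xs.shape, xs.min(), xs.max())
```

Because the scaled range is three times as wide as the texture's own `[0, 1)` range, only the middle third of the output in each direction samples the texture directly; everything else is resolved by the addressing mode.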

The texture memory space on the GPU resides in device memory and is cached by the texture cache.
Therefore, a data read from texture memory (texture fetch) only incurs a device memory read on a cache miss; otherwise, it's a fast read from the texture cache.
The texture cache is specifically optimized for 2D spatial locality, maximizing access performance for 2D data.

Provided the texture fetches have locality, texture memory can sustain high bandwidth even for access patterns that global or constant memory handle poorly.
Additionally, address calculations are performed by dedicated units outside the kernel, which takes that work off both the kernel code and the programmer.
For these reasons, texture fetching can be a faster alternative to reading device memory through global or constant memory, especially for the access patterns found in image processing.

In [None]:
def create_to(img_src, boader):
  # One 8-bit unsigned channel; the four ints are the bit widths of x, y, z, w.
  channel_format_descriptor = cp.cuda.texture.ChannelFormatDescriptor(8, 0, 0, 0, cp.cuda.runtime.cudaChannelFormatKindUnsigned)

  # Allocate a CUDA array (width, height) and copy the host image into it.
  img_src_gpu = cp.cuda.texture.CUDAarray(channel_format_descriptor, img_src.shape[1], img_src.shape[0])
  img_src_gpu.copy_from(img_src)

  # The CUDA array is the resource backing the texture.
  resource_descriptor = cp.cuda.texture.ResourceDescriptor(
    cp.cuda.runtime.cudaResourceTypeArray,
    cuArr = img_src_gpu)

  # Same addressing mode on both axes, point sampling, raw element reads,
  # normalized coordinates.
  texture_descriptor = cp.cuda.texture.TextureDescriptor(
    addressModes = (boader, boader),
    filterMode = cp.cuda.runtime.cudaFilterModePoint,
    readMode = cp.cuda.runtime.cudaReadModeElementType,
    normalizedCoords = 1)

  texture_object = cp.cuda.texture.TextureObject(resource_descriptor, texture_descriptor)
  return texture_object

The `create_to` function takes CPU image data (`img_src`) and a desired addressing mode (`boader`) as input, then creates and returns a texture object on the GPU.
A texture object is a GPU resource that CUDA kernels use for efficient image data sampling.

First, we use `ChannelFormatDescriptor` to declare that each pixel in the texture is a single 8-bit unsigned integer channel (the arguments `8, 0, 0, 0` are the bit widths of the x, y, z, and w channels).
Next, we create a `cp.cuda.texture.CUDAarray` based on this format and the original image's dimensions, then copy the `img_src` data from the CPU to this `CUDAarray` on the GPU.
The `CUDAarray` is a specialized GPU memory region for texture sampling and serves as the source data for the texture object.

Following this, we specify that the `CUDAarray` created earlier will be used as the texture's resource via `ResourceDescriptor`.
Then, we configure the `TextureDescriptor`, which is the most crucial part for defining the texture's behavior.
Here, we apply the specified addressing mode (`boader`) to both the U and V directions (`addressModes`), set `filterMode` to `cudaFilterModePoint` (nearest-neighbor sampling, with no interpolation), and set `readMode` to `cudaReadModeElementType` so that fetches return the raw element type.
`normalizedCoords` is set to 1, so the kernel addresses the texture with normalized coordinates: the range `[0.0, 1.0)` covers the whole texture regardless of its size in pixels.
For more details on these settings, please refer to [the Texture Fetching section of the NVIDIA CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-fetching).

Finally, we construct a `cp.cuda.texture.TextureObject` using these descriptors and return it from the function. This texture object is what the CUDA kernel will use to sample image data through the `tex2D` function.

In [None]:
w, h = 256, 256
img_dst_gpu = cp.empty((h, w), dtype=cp.uint8)
assert img_dst_gpu.flags.c_contiguous

border_str = 'wrap', 'clamp', 'mirror', 'border'
border_kind = cp.cuda.runtime.cudaAddressModeWrap,\
  cp.cuda.runtime.cudaAddressModeClamp,\
  cp.cuda.runtime.cudaAddressModeMirror,\
  cp.cuda.runtime.cudaAddressModeBorder

Here, we prepare for image generation and texture sampling.
First, we set the output image size to `256x256` pixels via `w` and `h`, and allocate `img_dst_gpu` as a CuPy array on the GPU to hold the results.
Note that `cp.empty` only allocates memory; the contents are undefined until the kernel writes them.
The line `assert img_dst_gpu.flags.c_contiguous` is particularly important when handling 2D arrays.
It checks that `img_dst_gpu` has a row-major contiguous layout in GPU memory, like a C array.
A raw kernel receives only a pointer and typically computes element offsets by hand, so it relies on this layout to write each pixel to the right place.
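
As a rough illustration of what the contiguity check guards against, the sketch below uses NumPy as a CPU stand-in (CuPy arrays expose the same `flags` attribute). The flat offset `y * w + x` is the usual row-major convention; that the `scale` kernel indexes its output this way is an assumption, since its source is not shown here.

```python
import numpy as np

a = np.zeros((4, 6), dtype=np.uint8)   # shape is (h, w), row-major by default
h, w = a.shape

# In a C-contiguous array, element (y, x) lives at flat offset y * w + x.
y, x = 2, 3
a[y, x] = 7
print(a.flags.c_contiguous, a.ravel()[y * w + x])

# A transposed view shares the same buffer but swaps the strides,
# so the y * w + x rule no longer describes its memory layout.
t = a.T
print(t.flags.c_contiguous)

# Copying into a fresh C-contiguous buffer restores the expected layout.
fixed = np.ascontiguousarray(t)
print(fixed.flags.c_contiguous)
```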

Next, we define the texture addressing modes (edge handling) to compare.
An addressing mode determines what a fetch returns when the texture coordinates fall outside the texture's own range.
`border_str` holds a label for each mode (`wrap`, `clamp`, `mirror`, `border`), and `border_kind` holds the corresponding CUDA runtime constants exposed by CuPy, in the same order.
The subsequent loop uses both tuples to create a texture object for each mode and run the kernel.
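
To make the four modes concrete, here is a CPU-only sketch of how each one resolves an out-of-range normalized coordinate under point sampling. It follows the rules given in the Texture Fetching section of the CUDA C++ Programming Guide (wrap keeps the fractional part, mirror flips on odd periods, clamp uses the nearest edge texel, border returns zero). The helpers `resolve_axis` and `sample_point` are hypothetical names for this illustration and are not part of CuPy or CUDA.

```python
import math
import numpy as np

def resolve_axis(c, n, mode):
    """Texel index for normalized coordinate c on an axis of n texels.

    Returns None when the border value (zero) should be used instead.
    """
    if mode == 'wrap':
        c = c - math.floor(c)                 # keep only the fractional part
    elif mode == 'mirror':
        whole = math.floor(c)
        frac = c - whole
        c = frac if whole % 2 == 0 else 1.0 - frac
    elif mode == 'border' and not (0.0 <= c < 1.0):
        return None
    # 'clamp' (and any value that lands exactly on 1.0) maps to the edge texel.
    return min(max(math.floor(c * n), 0), n - 1)

def sample_point(tex, x, y, mode):
    """Nearest-neighbor fetch from a 2D array at normalized (x, y)."""
    ix = resolve_axis(x, tex.shape[1], mode)
    iy = resolve_axis(y, tex.shape[0], mode)
    if ix is None or iy is None:
        return 0
    return int(tex[iy, ix])

tex = np.array([[192, 127], [63, 255]], dtype=np.uint8)
for mode in ('wrap', 'clamp', 'mirror', 'border'):
    print(mode, sample_point(tex, 1.25, 0.25, mode))
```

The sample point `(1.25, 0.25)` lies to the right of the texture, so each mode has to decide which texel in the top row, if any, stands in for it.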

In [None]:
results_imgs = {}
for s, k in zip(border_str, border_kind):
  texture_object = create_to(img_src, k)
  gpu_func = module.get_function('scale')
  sz_block = 32, 32
  sz_grid = math.ceil(img_dst_gpu.shape[1] / sz_block[0]), math.ceil(img_dst_gpu.shape[0] / sz_block[1])
  gpu_func(
      block=sz_block,
      grid=sz_grid,
      args=(
          img_dst_gpu, texture_object, w, h
      )
  )
  results_imgs[s] = img_dst_gpu.get()


This iterative process generates images on the GPU corresponding to each of the defined addressing modes (`wrap`, `clamp`, `mirror`, `border`).

We begin by initializing an empty dictionary, `results_imgs`, to store the results.
Then, we iterate through each element of `border_str` and `border_kind`, performing the appropriate processing for each mode.
In each iteration of the loop, we use the current mode to generate a texture object via the `create_to` function.
Next, we retrieve the compiled `scale` kernel function (`gpu_func`) and calculate the thread block and grid sizes for the launch.
Rounding the grid size up with `math.ceil` ensures that every output pixel is covered by a thread, even when the image size is not a multiple of the block size.
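
The grid-size arithmetic can be checked on its own. In this sketch, `launch_dims` is a hypothetical helper that repeats the ceiling division used in the loop and also reports how many threads are launched in total along each axis.

```python
import math

def launch_dims(w, h, block=(32, 32)):
    """Return (grid, threads) for a w x h image and a 2D block size."""
    # Round up so that partial blocks at the right and bottom edges are included.
    grid = (math.ceil(w / block[0]), math.ceil(h / block[1]))
    # Total threads launched along x and y.
    threads = (grid[0] * block[0], grid[1] * block[1])
    return grid, threads

print(launch_dims(256, 256))   # image size is a multiple of the block size
print(launch_dims(250, 100))   # image size is not a multiple of the block size
```

For the `256x256` image used here the threads match the pixels exactly. In the second case more threads are launched than there are pixels, which is why kernels of this kind usually compare their coordinates against the width and height before writing; whether `scale` does so depends on its source, which is not reproduced in this section.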

After setting up, we call `gpu_func` to execute the CUDA kernel, passing the calculated block and grid sizes, the output GPU array, the generated texture object, and the output image's width and height as arguments.
Once the kernel execution completes, we download the resulting image data (`img_dst_gpu`) from GPU memory to CPU memory and store it in the `results_imgs` dictionary, using the corresponding addressing mode string as the key.
Repeating this process for all addressing modes prepares all the image data needed to compare how different edge handling affects the final image.

In [None]:
for s in border_str:
    pil_img = PILImage.fromarray(results_imgs[s])
    img_byte_arr = io.BytesIO()
    pil_img.save(img_byte_arr, format='PNG')
    img_byte_arr = img_byte_arr.getvalue()
    print('border type: {}\n'.format(s))
    display(Image(data=img_byte_arr))