### Produce DH matrix on the fly using Numba (not requiring D and H ahead of time)

Based on previous experimentation, producing D and H separately and multiplying the two to produce DH is fast, but memory intensive: H scales with the desired high resolution image size and with how much of the blur kernel is needed per row of H. I attempted to produce DH manually by first creating a stacked blur kernel, flattening it, and striding each 'row' of the kernel over each row of H. This works, but is computationally slow as I had to use a standard for loop to process each row of DH. Threading in Python is awkward: there are ways to do it, but they don't play nicely with other libraries. In my case, the SciPy sparse matrices are not recognised by libraries such as Numba. Furthermore, the SciPy sparse formats that allow insertion at sliced indices, i.e. dictionary of keys (dok) and list of lists (lil), are very slow when inserting.

My next approach is to produce DH directly with the help of Numba, where I intend to produce three NumPy arrays representing the rows/cols/vals of a SciPy sparse coordinate matrix (coo), where X(row, col) -> val. Numba can at least understand NumPy arrays, so I should be able to populate the three arrays in parallel and use them to produce one coo sparse matrix of DH.
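
As a reminder of the target layout, here is a minimal sketch of a coo matrix built from three parallel arrays. The sizes and values are made up for illustration and are unrelated to the real DH matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three parallel arrays: entry k of the matrix is X[rows[k], cols[k]] = vals[k].
rows = np.array([0, 0, 1, 2], dtype=np.int32)
cols = np.array([0, 2, 1, 3], dtype=np.int32)
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

# Toy 3x4 matrix; the notebook builds DH the same way, just with far larger buffers.
toy = coo_matrix((vals, (rows, cols)), shape=(3, 4))
dense = toy.toarray()
print(dense)
```

Because each entry is independent of the others, the three arrays can be filled in any order, which is what makes this layout a candidate for parallel population.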

Tasks to perform:
- [x] First confirm whether Numba will populate three small arrays in parallel to produce the coo matrix, a small toy example is sufficient...
- [x] Extend the system to accommodate the stacked blur kernel, including strided kernel rows (the matrix will have redundant data present but that's okay for now)
- [x] Then extend the system to perform trimming at the edges where kernel entries are out of bounds of the DH matrix
- [x] There is an outstanding issue where the kernel_offset value needs to change dynamically depending on the high/low resolution and the dimension of the stacked kernel... consider using Excel to inspect different configurations to see if a pattern can be found
- [ ] Seems to be a slight disagreement between dh vs d@h, looks like a precision issue so double check precisions across the system
- [ ] Producing the coo matrix from row/col/val arrays is thrashing memory at higher low/high resolutions and blur kernel sizes. This appears to be because the coo_matrix constructor internally creates a copy of the row/col/val buffers instead of simply referencing the provided ones, which is wasteful. One possible contributor: the row/col buffers are np.uintc, whereas SciPy stores sparse indices as int32 or int64, so they may have to be converted. Consider performing batching instead and accumulating coo_matrices over time (say low_res_dim // low_res_dim*^2)
- [ ] Adding to above, consider changing the func to produce a csr or bsr matrix instead of coo arrays. This will save on both memory ahead of time, and should avoid the memory and runtime requirements needed to change format.
- [ ] Seems like it's faster to just use NumPy instead of any Numba features (njit, parallel, etc.). Why is this...

Notes:
* The number of entries in each row of DH can differ, so a 'double sweep' approach is needed: the first sweep calculates how many entries each row will hold and stores the running totals in a row lookup table. This is needed because each Numba thread must know which part of the row/col/val arrays it is allowed to touch. We can assume one row per thread, so the row index determines where to read from the lookup table: the thread for a given row touches the indices from lookup[thread_id] up to (but not including) lookup[thread_id + 1].
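
A minimal sketch of the double sweep on a 1D toy problem, assuming each row holds `width` consecutive columns starting at `first_cols[i]`, clipped to `[0, n_cols)`. The function name and parameters are hypothetical, and the import falls back to plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # plain-Python fallback so the sketch still runs without Numba
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def two_sweep_rows(first_cols, width, n_cols):
    n_rows = first_cols.shape[0]

    # Sweep 1: count the in-bounds entries of every row (rows are independent).
    counts = np.zeros(n_rows, dtype=np.int64)
    for i in prange(n_rows):
        lo = max(first_cols[i], 0)
        hi = min(first_cols[i] + width, n_cols)
        counts[i] = max(hi - lo, 0)

    # Serial prefix sum: row i owns buffer indices lookup[i]:lookup[i + 1].
    lookup = np.zeros(n_rows + 1, dtype=np.int64)
    for i in range(n_rows):
        lookup[i + 1] = lookup[i] + counts[i]

    rows = np.empty(lookup[n_rows], dtype=np.int64)
    cols = np.empty(lookup[n_rows], dtype=np.int64)

    # Sweep 2: every iteration writes only inside its own slice, so no two
    # iterations touch the same buffer index.
    for i in prange(n_rows):
        lo = max(first_cols[i], 0)
        for k in range(lookup[i], lookup[i + 1]):
            rows[k] = i
            cols[k] = lo + (k - lookup[i])

    return lookup, rows, cols


lookup, rows, cols = two_sweep_rows(np.array([-1, 1, 3], dtype=np.int64), 3, 4)
print(lookup, rows, cols)
```

The prefix sum is kept serial because each entry depends on the previous one; only the counting and filling sweeps are candidates for `prange`.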

In [1]:
%matplotlib inline
%load_ext line_profiler
%load_ext memory_profiler

from numba import njit, prange
from linear_system_super_resolution import *

In [2]:
%%file dh_matrix_analysis.py

from numba import njit, prange
from linear_system_super_resolution import *

### ===================================================================================================================
### Full approach
### ===================================================================================================================

def generate_dh_matrix(kernel, high_res_dim, downsample):
    
    low_res_dim = high_res_dim // downsample
    
    # %lprun -f calculate_d_origins d_origins = calculate_d_origins(high_res_dim, downsample)
    d_origins = calculate_d_origins(high_res_dim, downsample)
    
    # %lprun -f produce_stacked_kernel stacked_kernel = produce_stacked_kernel(kernel, downsample)
    stacked_kernel = produce_stacked_kernel(kernel, downsample)
    # show_image(stacked_kernel, "Stacked")
    
    # Fill in entries for sparse DH matrix
    # %lprun -f populate_dh_buffers row_buffer, col_buffer, val_buffer = populate_dh_buffers(d_origins, stacked_kernel, high_res_dim // downsample, high_res_dim, kernel.shape[0])
    row_buffer, col_buffer, val_buffer = populate_dh_buffers(d_origins, stacked_kernel, low_res_dim, high_res_dim, kernel.shape[0])
    
    return coo_matrix((val_buffer, (row_buffer, col_buffer)), shape=(low_res_dim**2, high_res_dim**2), dtype=np.float32)

# @njit(parallel = True)
def populate_dh_buffers(d_origins, kernel, low_res_dim, high_res_dim, original_kernel_samples):
    
    kernel_samples = kernel.shape[0]
    kernel_offset = ((original_kernel_samples - 1) // 2) * (high_res_dim + 1)
    kernel = kernel.flatten()
    
    buffer_strides = np.zeros(low_res_dim**2, dtype=np.uintc)
    left_clip = np.zeros(low_res_dim**2, dtype=np.uintc)
    right_clip = np.zeros(low_res_dim**2, dtype=np.uintc)
    
    # First pass to determine how big to make row/col/val buffers and the stride used for each thread to populate each respective row of DH
    for i in prange(low_res_dim**2):
        repeated_range = np.repeat(np.arange(kernel_samples, dtype=np.intc), kernel_samples)
        cols = repeated_range.reshape(-1, kernel_samples).T.flatten() + (repeated_range * high_res_dim) + d_origins[i] - kernel_offset
        left_clip[i] = cols[cols < 0].shape[0]
        right_clip[i] = cols[cols >= high_res_dim**2].shape[0]
    
    samples_per_clipped_dh_row = kernel_samples**2 - (left_clip + right_clip)
    total_samples = samples_per_clipped_dh_row.sum()
    
    # Below for the Numba-annotated func: cumsum doesn't accept a dtype argument there...
    # buffer_strides = np.append([0], np.cumsum(samples_per_clipped_dh_row))
    # Below for regular func, cumsum requires dtype...
    buffer_strides = np.append([0], np.cumsum(samples_per_clipped_dh_row, dtype=np.uintc)) # prepend 0 to allow for ranges
    
    row_buffer = np.zeros(total_samples, dtype=np.uintc)
    col_buffer = np.zeros(total_samples, dtype=np.uintc)
    val_buffer = np.zeros(total_samples, dtype=np.float32)
    
    # Second pass to populate the row/col/val buffers using predetermined strides and clipping parameters
    for i in prange(low_res_dim**2):
        repeated_range = np.repeat(np.arange(kernel_samples, dtype=np.intc), kernel_samples)
        cols = repeated_range.reshape(-1, kernel_samples).T.flatten() + (repeated_range * high_res_dim) + d_origins[i] - kernel_offset
        row_buffer[buffer_strides[i] : buffer_strides[i+1]] = i
        col_buffer[buffer_strides[i] : buffer_strides[i+1]] = cols[left_clip[i] : kernel_samples**2 - right_clip[i]]
        val_buffer[buffer_strides[i] : buffer_strides[i+1]] = kernel[left_clip[i] : kernel_samples**2 - right_clip[i]]
        
    return row_buffer, col_buffer, val_buffer

### ===================================================================================================================
### 
### ===================================================================================================================




### ===================================================================================================================
### Batched approach
### ===================================================================================================================

def generate_dh_matrix_batched(kernel, high_res_dim, downsample, dh_matrix_batch_size):
    
    low_res_dim = high_res_dim // downsample
    
    # %lprun -f calculate_d_origins d_origins = calculate_d_origins(high_res_dim, downsample)
    d_origins = calculate_d_origins(high_res_dim, downsample)
    
    # %lprun -f produce_stacked_kernel stacked_kernel = produce_stacked_kernel(kernel, downsample)
    stacked_kernel = produce_stacked_kernel(kernel, downsample)
    # show_image(stacked_kernel, "Stacked")
    
    # Placeholder matrix to be populated over time
    m = coo_matrix((low_res_dim**2, high_res_dim**2), dtype=np.float32)

    # Fill in entries for sparse DH matrix, one batch of rows at a time
    # (dh_matrix_batch_size must divide low_res_dim**2, otherwise the trailing rows are skipped)
    for b in range(low_res_dim**2 // dh_matrix_batch_size):
        # %lprun -f populate_dh_buffers_batched row_buffer, col_buffer, val_buffer = populate_dh_buffers_batched(d_origins, stacked_kernel, low_res_dim, high_res_dim, kernel.shape[0], dh_matrix_batch_size, b * dh_matrix_batch_size)
        row_buffer, col_buffer, val_buffer = populate_dh_buffers_batched(d_origins, stacked_kernel, low_res_dim, high_res_dim, kernel.shape[0], dh_matrix_batch_size, b * dh_matrix_batch_size)
        # Adding two coo matrices returns a csr matrix, so m is csr after the first batch
        m += coo_matrix((val_buffer, (row_buffer, col_buffer)), shape=(low_res_dim**2, high_res_dim**2), dtype=np.float32)
        # show_image(m.todense(), f"Matrix (Batch {b})")
    
    return m

@njit(parallel = True)
def populate_dh_buffers_batched(d_origins, kernel, low_res_dim, high_res_dim, original_kernel_samples, batch_size, batch_offset):
    
    kernel_samples = kernel.shape[0]
    kernel_offset = ((original_kernel_samples - 1) // 2) * (high_res_dim + 1)
    kernel = kernel.flatten()
    
    buffer_strides = np.zeros(batch_size, dtype=np.uintc)
    left_clip = np.zeros(batch_size, dtype=np.uintc)
    right_clip = np.zeros(batch_size, dtype=np.uintc)
    
    # First pass to determine how big to make row/col/val buffers and the stride used for each thread to populate each respective row of DH
    for i in prange(batch_size):
        repeated_range = np.repeat(np.arange(kernel_samples).astype(np.intc), kernel_samples)
        cols = repeated_range.reshape(-1, kernel_samples).T.flatten() + (repeated_range * high_res_dim) + d_origins[i+batch_offset] - kernel_offset
        left_clip[i] = cols[cols < 0].shape[0]
        right_clip[i] = cols[cols >= high_res_dim**2].shape[0]
    
    samples_per_clipped_dh_row = kernel_samples**2 - (left_clip + right_clip)
    total_samples = samples_per_clipped_dh_row.sum()
    
    # Below for the Numba-annotated func: cumsum doesn't accept a dtype argument there...
    buffer_strides = np.append([0], np.cumsum(samples_per_clipped_dh_row))
    # Below for regular func, cumsum requires dtype...
    # buffer_strides = np.append([0], np.cumsum(samples_per_clipped_dh_row, dtype=np.uintc)) # prepend 0 to allow for ranges
    
    row_buffer = np.zeros(total_samples, dtype=np.uintc)
    col_buffer = np.zeros(total_samples, dtype=np.uintc)
    val_buffer = np.zeros(total_samples, dtype=np.float32)
    
    # Second pass to populate the row/col/val buffers using predetermined strides and clipping parameters
    for i in prange(batch_size):
        repeated_range = np.repeat(np.arange(kernel_samples).astype(np.intc), kernel_samples)
        cols = repeated_range.reshape(-1, kernel_samples).T.flatten() + (repeated_range * high_res_dim) + d_origins[i+batch_offset] - kernel_offset
        row_buffer[buffer_strides[i] : buffer_strides[i+1]] = i+batch_offset
        col_buffer[buffer_strides[i] : buffer_strides[i+1]] = cols[left_clip[i] : kernel_samples**2 - right_clip[i]]
        val_buffer[buffer_strides[i] : buffer_strides[i+1]] = kernel[left_clip[i] : kernel_samples**2 - right_clip[i]]
        
    return row_buffer, col_buffer, val_buffer

### ===================================================================================================================
### Shared helpers
### ===================================================================================================================

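# Sums downsample**2 copies of the kernel, one per (r, c) shift inside a downsample x downsample block
def produce_stacked_kernel(kernel, downsample):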
    
    kernel_dim = kernel.shape[0]
    stacked = np.zeros((kernel_dim + downsample - 1, kernel_dim + downsample - 1), dtype=np.float32)
    for r in range(downsample):
        for c in range(downsample):
            stacked[r:r+kernel_dim, c:c+kernel_dim] += kernel
    
    return stacked

# Calculates the first column index per row of D if we were to use D as a genuine downsampling matrix
def calculate_d_origins(high_res_dim, downsample):
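    low_res_dim = high_res_dim // downsample
    # Column offsets within one low-res row, repeated for every low-res row
    col_offsets = np.tile(np.arange(0, high_res_dim, downsample, dtype=np.uintc), low_res_dim)
    # Flat index of the high-res row each low-res row starts at, repeated for every low-res column
    row_offsets = np.repeat(np.arange(low_res_dim, dtype=np.uintc) * high_res_dim * downsample, low_res_dim)
    return col_offsets + row_offsets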


Overwriting dh_matrix_analysis.py
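
For intuition about the two helpers, here is a small worked case. The function bodies are copied from the cell above so the block runs on its own; the 3x3 all-ones kernel and the 4x4 image size are assumptions chosen to keep the numbers readable.

```python
import numpy as np


def produce_stacked_kernel(kernel, downsample):
    # Same logic as in dh_matrix_analysis.py
    kernel_dim = kernel.shape[0]
    stacked = np.zeros((kernel_dim + downsample - 1, kernel_dim + downsample - 1), dtype=np.float32)
    for r in range(downsample):
        for c in range(downsample):
            stacked[r:r + kernel_dim, c:c + kernel_dim] += kernel
    return stacked


def calculate_d_origins(high_res_dim, downsample):
    # Same logic as in dh_matrix_analysis.py
    low_res_dim = high_res_dim // downsample
    return (np.tile(np.arange(0, high_res_dim, downsample, dtype=np.uintc), low_res_dim)
            + np.repeat(np.arange(low_res_dim, dtype=np.uintc) * high_res_dim * downsample, low_res_dim))


# A 3x3 kernel of ones stacked for downsample=2 grows to 4x4; the centre entries
# are covered by all four shifted copies, the corners by only one.
stacked = produce_stacked_kernel(np.ones((3, 3), dtype=np.float32), 2)
print(stacked)

# For a 4x4 high-res image and downsample=2 there are four low-res pixels, whose
# top-left high-res pixels sit at flat indices 0, 2, 8 and 10.
origins = calculate_d_origins(4, 2)
print(origins)
```

The stacked kernel sums to `downsample**2` times the sum of the original kernel, which is a quick sanity check when changing either function.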


In [3]:
from dh_matrix_analysis import *
# Params
high_res_dim = 800
low_res_dim = 400
downsample = high_res_dim // low_res_dim

# Blur kernel
kernel_samples = 99
kernel = gaussian_2d(kernel_samples)
# show_image(kernel, "Gaussian")

dh_matrix_batch_size = (low_res_dim**2) // 16 # needs to be a factor of low_res_dim**2 to ensure full coverage
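
The factor requirement comes from the floor division in the batch loop of `generate_dh_matrix_batched`. If that ever becomes inconvenient, one option is to iterate over explicit row ranges and let the last batch be shorter. The helper below is a hypothetical sketch, not part of the notebook's module.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges covering range(n_rows); the last may be shorter."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# 10 rows in batches of 4: the final batch holds the 2 leftover rows.
ranges = list(batch_ranges(10, 4))
print(ranges)
```

Since `populate_dh_buffers_batched` already takes `batch_size` and `batch_offset`, each range could presumably be passed as `stop - start` and `start`.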

In [4]:
# %lprun -f generate_dh_matrix m = generate_dh_matrix(kernel, high_res_dim, downsample)
# %mprun -f generate_dh_matrix m = generate_dh_matrix(kernel, high_res_dim, downsample)

# m = generate_dh_matrix(kernel, high_res_dim, downsample)

# Around 1.5 mins for coo (no numba)
# Around 1.3 mins for csr (no numba)
# Around 5 mins for bsr (no numba)
%lprun -f generate_dh_matrix_batched m = generate_dh_matrix_batched(kernel, high_res_dim, downsample, dh_matrix_batch_size)

Timer unit: 1e-06 s

Total time: 65.0961 s
File: /supy_res/notebooks/dh_matrix_analysis.py
Function: generate_dh_matrix_batched at line 77

Line #      Hits         Time  Per Hit   % Time  Line Contents
    77                                           def generate_dh_matrix_batched(kernel, high_res_dim, downsample, dh_matrix_batch_size):
    78                                               
    79         1          2.0      2.0      0.0      low_res_dim = high_res_dim // downsample
    80                                               
    81                                               # %lprun -f calculate_d_origins d_origins = calculate_d_origins(high_res_dim, downsample)
    82         1       1781.0   1781.0      0.0      d_origins = calculate_d_origins(high_res_dim, downsample)
    83                                               
    84                                               # %lprun -f produce_stacked_kernel stacked_kernel = produce_stacked_kernel(kernel, downsample)
  

In [5]:
print(matrix_memory(m))
print(type(m))
# 7378.160004 for bsr

12400.640004
<class 'scipy.sparse._csr.csr_matrix'>

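`matrix_memory` comes from `linear_system_super_resolution` and is not shown here. The fractional part of the printed value (.640004) is consistent with a 160001-entry int32 row-pointer array (400² + 1 entries at 4 bytes each is 640004 bytes), which suggests the figure is in megabytes. A hypothetical stand-in that measures a CSR matrix that way might look like this:

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_memory_mb(matrix):
    """Size of the three CSR buffers in megabytes (hypothetical stand-in for matrix_memory)."""
    csr = matrix.tocsr()
    return (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6


# 4x4 float32 identity: 4 values + 4 int32 column indices + 5 int32 row pointers = 52 bytes.
toy_mb = csr_memory_mb(csr_matrix(np.eye(4, dtype=np.float32)))
print(toy_mb)
```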

In [6]:
# show_sparse_matrix(m, "Matrix")
# show_image(m.todense(), "Matrix")

In [7]:
# # # Original method of producing decimation matrix D
# d = generate_d_matrix(high_res_dim, downsample)
# # # show_image(d.todense(), "D Matrix")

# h = generate_h_matrix(high_res_dim, kernel)
# # # show_image(h.todense(), "H Matrix")

# dh = d @ h
# # show_sparse_matrix(dh, "DH Matrix")
# # show_image(dh.todense(), "DH Mat")

# # show_image((dh.todense() - m.todense()), "Diff")
# print(dh.power(2.0).sum())
# print(m.power(2.0).sum())

# # print((dh.power(2.0)).sum() - (m.power(2.0)).sum())
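
For the outstanding dh vs d@h check, comparing sums of squares can hide differences, because two different matrices can share the same sum. A sketch of an entrywise comparison follows; the helper name and the toy matrices are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def max_abs_diff(x, y):
    """Largest absolute entrywise difference between two sparse matrices of equal shape."""
    diff = (x - y).tocoo()
    return float(np.abs(diff.data).max()) if diff.nnz else 0.0


a = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]], dtype=np.float64))
b = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]], dtype=np.float32))  # same values, lower precision
c = csr_matrix(np.array([[2.0, 0.0], [0.0, 1.0]], dtype=np.float64))  # different matrix, same sum of squares

print(max_abs_diff(a, b))                     # identical values despite the dtype difference
print(max_abs_diff(a, c))                     # entries differ by 1
print(a.power(2).sum(), c.power(2).sum())     # yet the sums of squares agree
```

On the real matrices, a small but non-zero `max_abs_diff(dh, m)` would point towards float32 rounding, while a large one would point towards misplaced entries.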