The codebase currently handles use_gpu = True by sending the whole lookup table (LUT) to the GPU when first loaded. This will not work for large LUTs.
Additionally, operations like LUT = lut[mask] require extra memory allocation, because boolean masking produces a copy rather than a view.
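For illustration, and assuming the LUT behaves like a NumPy array: boolean masking allocates a fresh copy of the selected elements, whereas basic slicing returns a view sharing the original buffer:

```python
import numpy as np

lut = np.arange(1_000_000, dtype=np.float32)

# Boolean masking allocates a new array holding the selected elements.
mask = lut > 500_000
masked = lut[mask]
print(np.shares_memory(masked, lut))   # False: masking copied the data

# Basic slicing returns a view onto the same buffer, with no extra allocation.
sliced = lut[500_001:]
print(np.shares_memory(sliced, lut))   # True: slicing shares memory with lut
```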
Possible options:
- replace boolean masking with slicing, or other operations that return a view of the array
- keep everything on the CPU until blockLookUp is called -- the function would then need a use_gpu boolean argument (see the sketch after this list)
- dask for distributed arrays?
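A minimal sketch of the second option, assuming CuPy as the GPU array library. The signature shown here (a contiguous start/stop range plus a use_gpu flag) is an assumption for illustration, not the existing blockLookUp interface:

```python
import numpy as np

try:
    import cupy as cp   # optional GPU backend (assumption: CuPy is the GPU library)
except ImportError:
    cp = None

def blockLookUp(lut_cpu, start, stop, use_gpu=False):
    """Return the [start, stop) block of the LUT, optionally on the GPU.

    The full LUT stays in host memory; only the requested block is copied
    to the device when use_gpu is True.
    """
    block = lut_cpu[start:stop]        # basic slice: a view, no host-side copy
    if use_gpu:
        if cp is None:
            raise RuntimeError("use_gpu=True but CuPy is not installed")
        return cp.asarray(block)       # transfers only this block to the GPU
    return block

# Usage: the whole LUT lives on the CPU; a single block moves to the GPU on demand.
lut = np.arange(10_000_000, dtype=np.float32)
gpu_block = blockLookUp(lut, 0, 4096, use_gpu=(cp is not None))
```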