## working with `unyt` and `dask` 

In this notebook, we present an attempt to make `dask` arrays work with `unyt`.

In general, the `dask` collections (`dask.array`, `dask.dataframe`, `dask.bag`) are all subclasses of the general `DaskMethodsMixin` class. The dask documentation provides an overview of how to add [custom collections](https://docs.dask.org/en/latest/custom-collections.html#example-dask-collection), so we could construct a `unyt_array` collection from scratch. But since we want all of the functionality of both `dask.array` and `unyt_array`, we should instead be able to extend `dask.array` to include `unyt` behavior.

So let's give that a try:

In [1]:
from unyt.array import unyt_array

from dask.array.core import Array, finalize
import numpy as np

def unyt_from_dask(dask_object,
              units = None,
              registry = None,
              dtype = None,
              bypass_validation = False,
              input_units = None,
              name = None):
    (cls, args) = dask_object.__reduce__()
    da = unyt_dask_array(*args)
    da._attach_units(units, registry, dtype, bypass_validation, input_units, name)
    return da

class unyt_dask_array(Array):
    def __init__(self, dask, name, chunks, dtype=None, meta=None, shape=None):
        self.units = None
        self.unyt_name = None
        self.dask_name = name
        self.factor = 1.

    def _attach_units(self,units = None,
              registry = None,
              dtype = None,
              bypass_validation = False,
              input_units = None,
              name = None):
        x_np = np.array([1.])
        self._unyt_array = unyt_array(x_np, units, registry, dtype, bypass_validation, input_units, name)
        self.units = self._unyt_array.units
        self.unyt_name = self._unyt_array.name

    def to(self, units, equivalence=None, **kwargs):
        # tracks any time units are converted with a running conversion factor
        # that gets applied after calling dask methods
        init_val = self._unyt_array.value[0]
        self._unyt_array = self._unyt_array.to(units, equivalence, **kwargs)
        self.factor = self.factor * self._unyt_array.value[0] / init_val
        self.units = units
        self.unyt_name = self._unyt_array.name

    def min(self, axis=None, keepdims=False, split_every=None, out=None):
        result = np.array(super().min(axis, keepdims, split_every, out))
        return unyt_array(result*self.factor, self.units)

    def max(self, axis=None, keepdims=False, split_every=None, out=None):
        result = np.array(super().max(axis, keepdims, split_every, out))
        return unyt_array(result*self.factor, self.units)

    # def __dask_postcompute__(self):
    #     # a dask hook to catch after .compute(), likely useful here...
    #     # https://docs.dask.org/en/latest/custom-collections.html#example-dask-collection
    #     return finalize, ()


Above, we define two objects: the `unyt_from_dask` function and the `unyt_dask_array` class. Let's focus on the new class first. 

In our new subclass, `unyt_dask_array(Array)`, `Array` is the core array class of `dask`. This class only has a `__new__` constructor, and so in the `__init__` here:


```python
def __init__(self, dask, name, chunks, dtype=None, meta=None, shape=None):
        self.units = None
        self.unyt_name = None
        self.dask_name = name
        self.factor = 1.
```

All of these arguments are the ones required by the base `Array.__new__` constructor. When we instantiate `unyt_dask_array`, the superclass's `__new__` is called with those arguments automatically before Python proceeds to `unyt_dask_array.__init__()`.
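
The interplay between `__new__` and `__init__` can be seen in a small sketch that does not need `dask` at all. The `Base` and `Child` classes below are hypothetical stand-ins for `Array` and `unyt_dask_array`; only the Python object model is being illustrated, not the real `dask` internals.

```python
class Base:
    """Stand-in for a class that, like dask's Array, defines only __new__."""

    def __new__(cls, dask, name, chunks, dtype=None):
        self = super().__new__(cls)
        # the base class stores its state here rather than in __init__
        self.graph = dask
        self.name = name
        self.chunks = chunks
        self.dtype = dtype
        return self


class Child(Base):
    """Stand-in for unyt_dask_array: adds extra attributes in __init__."""

    def __init__(self, dask, name, chunks, dtype=None):
        # Base.__new__ has already run with these same arguments
        self.units = None
        self.factor = 1.0


c = Child({"k": 1}, "my-array", ((2, 2),), dtype="float64")
print(type(c).__name__, c.name, c.chunks, c.factor)
```

Because Python passes the same arguments to both `__new__` and `__init__`, the subclass `__init__` has to accept the full base-class signature even though it ignores most of it.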

These arguments are all related to the details of how dask constructs its graphs and chunks, but we want to be able to instantiate our `unyt_dask_array` more simply. Thus, the convenience function `unyt_from_dask` constructs our new `Array` subclass from an existing `dask` array without having to know the details of how `dask` works.
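
The rebuild-through-`__reduce__` pattern used by `unyt_from_dask` can be sketched with plain Python classes. `Plain`, `WithUnits` and `with_units_from_plain` are hypothetical names for illustration only; the sketch assumes, as the function above does, that `__reduce__` returns a class together with the arguments needed to construct it.

```python
class Plain:
    """Toy stand-in for an existing dask array."""

    def __init__(self, data, name):
        self.data = data
        self.name = name

    def __reduce__(self):
        # (callable, args) pair that pickle would use to rebuild the object
        return (Plain, (self.data, self.name))


class WithUnits(Plain):
    """Toy stand-in for the units-aware subclass."""

    def __init__(self, data, name):
        super().__init__(data, name)
        self.units = None


def with_units_from_plain(obj, units):
    _cls, args = obj.__reduce__()  # ignore the original class
    new = WithUnits(*args)         # rebuild as the subclass instead
    new.units = units
    return new


p = Plain([1.0, 2.0], "lengths")
w = with_units_from_plain(p, "m")
print(type(w).__name__, w.data, w.units)
```

The caller only supplies an existing object and the units; the constructor arguments are recovered from the object itself.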

So, for example, we can do: 

In [2]:
import numpy as np 
import unyt; import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
x_da = unyt_from_dask(x, unyt.m)
x_da

| | Array | Chunk |
|---|---|---|
| Bytes | 800.00 MB | 8.00 MB |
| Shape | (10000, 10000) | (1000, 1000) |
| Count | 100 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |


which behaves as a `dask` array, e.g.:

In [3]:
x_da.sum()

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Count | 239 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |


In [4]:
x_da.sum().compute()

49998923.76015304

but we also have units attached, as with a `unyt_array`:

In [5]:
x_da.units

m

Now, in order to return the results of our `dask` array operations as `unyt_array` objects, we need to modify the superclass methods. For example, the new `unyt_dask_array` class has the following methods, each of which calls the corresponding superclass method, converts the result to a standard `ndarray`, and returns a normal `unyt_array`:

```python
    def min(self, axis=None, keepdims=False, split_every=None, out=None):
        result = np.array(super().min(axis, keepdims, split_every, out))
        return unyt_array(result*self.factor, self.units)

    def max(self, axis=None, keepdims=False, split_every=None, out=None):
        result = np.array(super().max(axis, keepdims, split_every, out))
        return unyt_array(result*self.factor, self.units)
```    

So when we do:

In [6]:
x_da.min()

unyt_array(3.87564647e-09, 'm')

we get our standard `unyt_array`. 

We're also using a bit of trickery to deal with unit conversions in this class within the `unyt_dask_array.to` method, copied here:

```python
def to(self, units, equivalence=None, **kwargs):
        # tracks any time units are converted with a running conversion factor
        # that gets applied after calling dask methods
        init_val = self._unyt_array.value[0]
        self._unyt_array = self._unyt_array.to(units, equivalence, **kwargs)
        self.factor = self.factor * self._unyt_array.value[0] / init_val
        self.units = units
        self.unyt_name = self._unyt_array.name
```        

Within our `unyt_dask_array`, we initialize `self._unyt_array` as a `unyt_array` with a single value of `1.`. Any time a conversion occurs, we apply the conversion to this `self._unyt_array` and track a cumulative conversion factor. Then, whenever we return calculations from `dask` into memory, we simply multiply by this conversion factor and attach the appropriate units. There is likely a more elegant solution here, but the basic idea is to track the units separately from the dask functionality, as the dask operations on each chunk are generally independent of the unit scaling.
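
The bookkeeping can be illustrated without `dask` or `unyt`. The following is a minimal sketch, not the class above: `LazyLength` and the `TO_METRES` table are hypothetical, and it assumes conversions are purely multiplicative with a positive factor, so that scaling after a `min` or `max` gives the same answer as scaling before it.

```python
# hypothetical table of scale factors relative to metres; not unyt's registry
TO_METRES = {"m": 1.0, "km": 1.0e3, "nm": 1.0e-9}


class LazyLength:
    """Toy lazy quantity: the raw values stay in their original units."""

    def __init__(self, values, units):
        self._values = list(values)  # never rescaled
        self.units = units
        self.factor = 1.0            # cumulative conversion factor

    def to(self, units):
        # update only the bookkeeping; the stored values are untouched
        self.factor *= TO_METRES[self.units] / TO_METRES[units]
        self.units = units

    def min(self):
        # reduce first, then apply the accumulated factor once
        return min(self._values) * self.factor, self.units


x = LazyLength([2.0, 0.5, 3.0], "m")
x.to("km")
x.to("nm")
value, units = x.min()
print(value, units)
```

Chained conversions only multiply the running factor, so the cost of a conversion does not depend on the size of the underlying array.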

So here's an example conversion:

In [7]:
x_da.to(unyt.km)

In [8]:
x_da.units

km

In [9]:
x_da.min()

unyt_array(3.87564647e-12, 'km')

In [10]:
x_da.to(unyt.nanometer)

In [11]:
x_da.min()

unyt_array(3.87564647, 'nm')

from which we see our units changing appropriately. 

Now, because our `unyt_dask_array` class is built off of a `dask` collection, it will work with `dask`'s parallel scheduling.

So let's spin up a client:

In [12]:
from dask.distributed import Client
client = Client(threads_per_worker=2, n_workers=2)

In [13]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:42103 | Workers: 2 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 4 |
| | Memory: 33.51 GB |


and re-instantiate our arrays:

In [14]:
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x_da = unyt_from_dask(x, unyt.m)
x_da

| | Array | Chunk |
|---|---|---|
| Bytes | 800.00 MB | 8.00 MB |
| Shape | (10000, 10000) | (1000, 1000) |
| Count | 100 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |


Now when we compute our properties, `dask` will compute values from chunks independently. So when we take a min:

In [15]:
x_da.min()

unyt_array(3.03822589e-09, 'm')

on our `dask` dashboard, we can see the distributed tasks complete via the Task Graph:


![TaskStream](resources/unyt_dask_taskgraph.png)
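
The chunk-wise reduction can be mimicked with the standard library. This is only a sketch of the idea, using a thread pool in place of the `dask` scheduler; the `chunked` helper, the sample values and the `factor` are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


values = [0.7, 0.2, 0.9, 0.4, 0.05, 0.6, 0.3, 0.8]
factor = 1.0e-3  # e.g. m -> km, tracked separately from the data

with ThreadPoolExecutor(max_workers=2) as pool:
    # each chunk is reduced independently, possibly in parallel
    partial_minima = list(pool.map(min, chunked(values, 3)))

# combine the per-chunk results, then apply the unit factor once
result = min(partial_minima) * factor
print(partial_minima, result)
```

The unit factor never enters the per-chunk work, which is why the units can be tracked outside the task graph.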


In the `min` and `max` methods above, we convert the result of the superclass call to a standard `ndarray`, since the results will generally be small enough to be held in memory, even when returning an array using the `axis` argument:

In [17]:
x_da.max(axis=0)

unyt_array([9.99882084e+08, 9.99968623e+08, 9.99857613e+08, ...,
            9.99746090e+08, 9.99904782e+08, 9.99953299e+08], 'nm')

So in general, this method shows a fairly straightforward approach for adding dask support to `unyt`. It could likely be improved by more automated methods of generating the subclass, since we want to avoid having to manually override all the superclass methods. There is likely a clever way to do this (maybe with decorators?).
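
One possible shape for the decorator idea is sketched below, under stated assumptions: `attach_units`, `ToyArray` and `ToyUnitArray` are hypothetical names, the toy base class stands in for `dask.array.Array`, and a `(value, units)` tuple stands in for the `unyt_array` that a real implementation would return.

```python
import functools


def attach_units(*method_names):
    """Class decorator: wrap the named reductions so results carry units."""

    def decorate(cls):
        for method_name in method_names:
            base_method = getattr(cls, method_name)  # inherited implementation

            def make_wrapper(base_method):
                @functools.wraps(base_method)
                def wrapper(self, *args, **kwargs):
                    result = base_method(self, *args, **kwargs)
                    # apply the running factor and attach the units
                    return result * self.factor, self.units
                return wrapper

            setattr(cls, method_name, make_wrapper(base_method))
        return cls

    return decorate


class ToyArray:
    """Hypothetical stand-in for dask.array.Array."""

    def __init__(self, values):
        self.values = list(values)

    def min(self):
        return min(self.values)

    def max(self):
        return max(self.values)


@attach_units("min", "max")
class ToyUnitArray(ToyArray):
    def __init__(self, values, units, factor=1.0):
        super().__init__(values)
        self.units = units
        self.factor = factor


a = ToyUnitArray([1.0, 4.0, 2.5], "m", factor=2.0)
print(a.min(), a.max())
```

With this layout, supporting another reduction would mean adding its name to the decorator call rather than writing a new override by hand.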