# Understand use of np.digitize with dask and xarray arrays by means of dask array map_blocks

Aiko Voigt, KIT, Aug 10, 2020

In [1]:
import numpy as np
import dask.array as da
import xarray as xr

Create a numpy array and corresponding versions as a dask array, an unchunked xarray DataArray, and a chunked xarray DataArray

In [52]:
np_array = np.random.randn(1000)

da_array = da.from_array(np_array)
xr_array = xr.DataArray(np_array, dims=['ind'], coords={'ind': np.arange(1000)})
xr_array_chunked = xr_array.chunk({'ind': 10})

Bins for digitize

In [5]:
mybins = np.linspace(-1, 1, 10)

## np.digitize

Pure numpy world works

In [7]:
np_array_npdigit = np.digitize(np_array, bins=mybins)
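As a reminder of what the returned indices mean, here is a minimal sketch on a small hand-made array. The names `edges` and `values` are illustrative and not part of the notebook. Values below the first edge get index 0 and values at or above the last edge get `len(bins)`, so with the 10 edges in `mybins` the indices run from 0 to 10.

```python
import numpy as np

# Four bin edges: three interior bins plus two open-ended outer bins.
edges = np.array([-1.0, 0.0, 1.0, 2.0])
values = np.array([-5.0, -1.0, -0.5, 0.0, 1.5, 2.0, 7.0])

# Default right=False: index i means edges[i-1] <= value < edges[i].
idx = np.digitize(values, bins=edges)
print(idx)  # [0 1 1 2 3 4 4]

# right=True: index i means edges[i-1] < value <= edges[i].
idx_right = np.digitize(values, bins=edges, right=True)
print(idx_right)  # [0 0 1 1 3 3 4]
```

The two calls differ only for values that sit exactly on a bin edge.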

## da.digitize

Pure dask world works

In [8]:
da_array_dadigit = da.digitize(da_array, bins=mybins)

da.digitize on a numpy array does not work, because np_array has no map_blocks method, which da.digitize uses to wrap np.digitize

In [10]:
np_array_dadigit = da.digitize(np_array, bins=mybins)

AttributeError: 'numpy.ndarray' object has no attribute 'map_blocks'
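The straightforward way around this error is to wrap the numpy array in a dask array before calling da.digitize. The sketch below is self-contained and re-creates `np_array` and `mybins`; the chunk size of 250 is an arbitrary choice for illustration.

```python
import numpy as np
import dask.array as da

np_array = np.random.randn(1000)
mybins = np.linspace(-1, 1, 10)

# Wrap the numpy array so that it has a map_blocks method.
wrapped = da.from_array(np_array, chunks=250)
lazy_idx = da.digitize(wrapped, bins=mybins)

# The result is lazy; compute() turns it into a numpy array.
result = lazy_idx.compute()

# The dask result should agree with plain numpy.
print(np.array_equal(result, np.digitize(np_array, bins=mybins)))
```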

One can reproduce what da.digitize does by isolating its crucial line, the map_blocks call, and applying it directly to the dask version of the array, but I assume that has no performance benefit over da.digitize itself

In [25]:
# right must be a boolean; the string 'right' would be truthy and act like right=True
np_array_dadigit = da_array.map_blocks(np.digitize, dtype='int64', bins=mybins, right=False)

| | Array | Chunk |
|---|---|---|
| Bytes | 8.00 kB | 8.00 kB |
| Shape | (1000,) | (1000,) |
| Count | 2 Tasks | 1 Chunks |
| Type | int64 | numpy.ndarray |
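The repr above reports a single chunk, so map_blocks calls np.digitize exactly once and there is nothing to parallelize. A minimal sketch of the same call on an array split into several chunks follows; the chunk size of 100 is an illustrative assumption.

```python
import numpy as np
import dask.array as da

np_array = np.random.randn(1000)
mybins = np.linspace(-1, 1, 10)

# Ten chunks of 100 elements each instead of one chunk of 1000.
chunked = da.from_array(np_array, chunks=100)

# np.digitize is applied to each chunk independently; bins and right
# are forwarded to every call as keyword arguments.
idx = chunked.map_blocks(np.digitize, dtype="int64", bins=mybins, right=False)

print(idx.numblocks)
print(np.array_equal(idx.compute(), np.digitize(np_array, bins=mybins)))
```

Because np.digitize works element by element, the result does not depend on where the chunk boundaries fall.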


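da.digitize does not work on an xarray DataArray, whether it is chunked or not. A DataArray has its own map_blocks method, but that method does not accept the dtype keyword that da.digitize passes.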

In [14]:
# unchunked xarray DataArray
da.digitize(xr_array, bins=mybins)

TypeError: map_blocks() got an unexpected keyword argument 'dtype'

In [15]:
# chunked xarray DataArray
da.digitize(xr_array_chunked, bins=mybins)

TypeError: map_blocks() got an unexpected keyword argument 'dtype'

One can make da.digitize work on an xarray DataArray by converting it to a dask array with da.from_array. For the chunked DataArray, this throws a warning. I am not sure whether the warning is at all problematic, though.

In [16]:
da.digitize(da.from_array(xr_array), bins=mybins)

| | Array | Chunk |
|---|---|---|
| Bytes | 8.00 kB | 8.00 kB |
| Shape | (1000,) | (1000,) |
| Count | 3 Tasks | 1 Chunks |
| Type | int64 | numpy.ndarray |


In [18]:
da.digitize(da.from_array(xr_array_chunked), bins=mybins)



| | Array | Chunk |
|---|---|---|
| Bytes | 8.00 kB | 8.00 kB |
| Shape | (1000,) | (1000,) |
| Count | 3 Tasks | 1 Chunks |
| Type | int64 | numpy.ndarray |
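Note that the result for the chunked DataArray again has a single chunk of 1000 elements, so the original chunking of 10 is lost when going through da.from_array. A possible alternative, sketched here as a suggestion rather than something tested in the cells above, is to hand da.digitize the dask array that already backs the chunked DataArray via its `.data` attribute, and then put the result back into a DataArray with `copy(data=...)`.

```python
import numpy as np
import dask.array as da
import xarray as xr

np_array = np.random.randn(1000)
mybins = np.linspace(-1, 1, 10)
xr_array_chunked = xr.DataArray(
    np_array, dims=["ind"], coords={"ind": np.arange(1000)}
).chunk({"ind": 10})

# .data is the underlying dask array, with the chunking of the DataArray.
idx_dask = da.digitize(xr_array_chunked.data, bins=mybins)

# Reattach the dims and coords of the original DataArray.
idx_xr = xr_array_chunked.copy(data=idx_dask)

print(idx_dask.numblocks)
print(idx_xr.dims)
```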


In [24]:
xr.map_blocks(np.digitize, np_array, [mybins])

array([ 1,  3, 10,  2,  8, 10,  1,  8,  6,  7,  2,  0,  5,  2,  7,  3, 10,
        1,  0,  0,  5,  7,  1,  7,  6,  6, 10,  0,  4,  7,  5, 10,  5, 10,
       10,  3, 10,  9,  4, 10,  2,  5,  6,  0,  5, 10,  0,  1,  4,  2,  3,
        3,  5, 10,  3,  2,  8,  7,  5,  0,  6,  6,  0,  2,  4,  7, 10, 10,
        9,  0,  4,  4, 10,  8, 10,  5,  6,  4, 10,  7,  5,  5, 10,  6, 10,
        0,  3,  3,  3,  6,  3,  3, 10,  9,  6, 10,  3, 10,  2,  1,  0, 10,
        2,  3,  0,  1,  5,  0,  3,  1, 10,  1, 10,  8,  5,  9,  8,  0,  4,
       10,  7, 10,  2,  7,  5,  3, 10, 10,  8,  4,  9,  7,  0,  8,  8,  6,
        5, 10,  2,  1,  4,  4, 10,  0,  0,  3,  3,  7,  0,  3,  3, 10,  2,
        5, 10,  6,  7,  6,  2,  7,  3,  4,  2,  7,  4,  4,  1,  6, 10,  0,
        9,  4,  0,  3,  7,  4,  5,  0,  2,  7,  7, 10, 10,  0,  9,  3,  5,
        6,  6,  4,  9,  0, 10,  4, 10,  3,  3,  7,  3, 10,  3,  0,  2,  5,
        9,  2,  5,  4,  2,  8,  5,  2, 10, 10,  4,  2,  5,  0,  2,  0,  0,
        9,  8, 10, 10, 10

## xarray world

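One can use xr.map_blocks to wrap np.digitize for the unchunked xarray DataArray. As far as I can tell, this only works because xr.map_blocks calls the function directly when the input is not backed by dask, so the result is a plain numpy array. For the chunked xarray DataArray it does not work, because the wrapped function must then return an xarray object.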

In [23]:
xr_array_xrdigit = xr.map_blocks(np.digitize, xr_array, [mybins])

In [25]:
# would work if we unchunk by adding ".values"
# maybe way of calling map_blocks is not correct?
xr_array_chunked_xrdigit = xr.map_blocks(np.digitize, xr_array_chunked, [mybins])

TypeError: Function must return an xarray DataArray or Dataset. Instead it returned <class 'numpy.ndarray'>
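The error message suggests a fix: give xr.map_blocks a function that returns a DataArray rather than a numpy array. The sketch below uses a small wrapper, `digitize_block`, which is a hypothetical helper introduced here and not part of xarray. It relies on xarray being able to infer the output from the wrapper without an explicit `template` argument.

```python
import numpy as np
import xarray as xr

np_array = np.random.randn(1000)
mybins = np.linspace(-1, 1, 10)
xr_array_chunked = xr.DataArray(
    np_array, dims=["ind"], coords={"ind": np.arange(1000)}
).chunk({"ind": 10})


def digitize_block(block, bins):
    """Hypothetical helper: digitize one block and keep it a DataArray."""
    # block is a DataArray holding one chunk; reuse its dims and coords.
    return block.copy(data=np.digitize(block.values, bins))


# bins is forwarded to every call of digitize_block through kwargs.
result = xr.map_blocks(digitize_block, xr_array_chunked, kwargs={"bins": mybins})

print(type(result))
print(np.array_equal(result.values, np.digitize(np_array, bins=mybins)))
```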

One could also use xr.apply_ufunc here, but the call below is not quite correct yet.

In [58]:
xr.apply_ufunc(np.digitize, xr_array_chunked, mybins, output_dtypes=[np.int64], dask='parallelized', 
               input_core_dims=[[], []],#"ind"], ["ind"]],  # list with one entry per arg
               output_core_dims=[[]],#["ind"]],
               exclude_dims=set(("ind",)) )  # dimensions allowed to change size. Must be set!

ValueError: each dimension in `exclude_dims` must also be a core dimension in the function signature
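The ValueError arises because "ind" is listed in `exclude_dims` while the core dimension lists are empty. Since np.digitize acts element by element along "ind", a call without any core or excluded dimensions seems sufficient. The sketch below is one possible form of such a call, not necessarily the only correct one. It passes the bins through `kwargs`, so that they are handed to np.digitize unchanged instead of being treated as a second data argument.

```python
import numpy as np
import xarray as xr

np_array = np.random.randn(1000)
mybins = np.linspace(-1, 1, 10)
xr_array_chunked = xr.DataArray(
    np_array, dims=["ind"], coords={"ind": np.arange(1000)}
).chunk({"ind": 10})

result = xr.apply_ufunc(
    np.digitize,
    xr_array_chunked,
    kwargs={"bins": mybins},   # forwarded to np.digitize as bins=mybins
    dask="parallelized",       # apply the function chunk by chunk
    output_dtypes=[np.int64],  # dtype of the lazily computed result
)

print(result.dims)
print(np.array_equal(result.values, np.digitize(np_array, bins=mybins)))
```

The result is a DataArray with the same "ind" dimension and coordinate as the input.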