## Manual *yt* selections with dask arrays

This notebook is an initial field test of returning dask arrays when accessing fields in a *yt* dataset.

It uses the https://github.com/chrishavlin/yt/tree/dask_init_particle branch with a few small modifications. First, in `BaseIOHandler._read_particle_selection`, it does not call compute on the dask arrays in the field dictionary, so that delayed arrays are returned. Second, the following code in `data_selection_objects.YTSelectionContainer`:

```python
        for f, v in read_particles.items():
            self.field_data[f] = self.ds.arr(v, units=finfos[f].units)
            self.field_data[f].convert_to_units(finfos[f].output_units)            
```    

is replaced with 

```python
        from unyt import dask_array

        for f, v in read_particles.items():
            da_f = dask_array.unyt_from_dask(v, units=finfos[f].units, registry=self.ds.unit_registry)
            self.field_data[f] = da_f.to(finfos[f].output_units)
```      

This will result in returning `unyt_dask` arrays!

In [1]:
import yt

In [2]:
ds = yt.load_sample("snapshot_033")
ad = ds.all_data()

yt : [INFO     ] 2021-06-08 11:02:20,522 Files located at /home/chris/hdd/data/yt_data/yt_sample_sets/snapshot_033.tar.gz.untar/snapshot_033/snap_033.
yt : [INFO     ] 2021-06-08 11:02:20,524 Default to loading snap_033.0.hdf5 for snapshot_033 dataset
yt : [INFO     ] 2021-06-08 11:02:20,679 Parameters: current_time              = 4.343952725460923e+17 s
yt : [INFO     ] 2021-06-08 11:02:20,680 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-06-08 11:02:20,681 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-06-08 11:02:20,682 Parameters: domain_right_edge         = [25. 25. 25.]
yt : [INFO     ] 2021-06-08 11:02:20,683 Parameters: cosmological_simulation   = 1
yt : [INFO     ] 2021-06-08 11:02:20,684 Parameters: current_redshift          = -4.811891664902035e-05
yt : [INFO     ] 2021-06-08 11:02:20,684 Parameters: omega_lambda              = 0.762
yt : [INFO     ] 2021-06-08 11:02:20,685 Parameters: omega_matter              = 0.238
yt : [

In [3]:
den = ad[("PartType4","Density")]  # will use hmsl = 0 
den

| | Array | Chunk |
|---|---|---|
| Bytes | 1.25 MB | 166.18 kB |
| Shape | (155926,) | (20772,) |
| Count | 40 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
| Units | `code_mass/code_length**3` | `code_mass/code_length**3` |


cool! we have our `unyt_dask` array! Can do dask and unyt things:

In [4]:
den = den.to('kg/m**3')
den

| | Array | Chunk |
|---|---|---|
| Bytes | 1.25 MB | 166.18 kB |
| Shape | (155926,) | (20772,) |
| Count | 48 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
| Units | `kg/m**3` | `kg/m**3` |


In [5]:
den.mean()

| | Array | Chunk |
|---|---|---|
| Bytes | 8 B | 8 B |
| Shape | () | () |
| Count | 59 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |
| Units | `kg/m**3` | `kg/m**3` |


In [6]:
den.mean().compute()

unyt_quantity(4.28121508e-21, 'kg/m**3')

Ok, that's kinda neat. 

### slicing instead of selection objects?? 

Now, this is happening on the `all_data()` selection object. If we wanted to do a sphere selection, we could of course do:

In [7]:
sp = ds.sphere(ds.domain_center,ds.quan(5,'code_length'))


In [8]:
%%time
den_sp = sp[("PartType4","Density")]

CPU times: user 3.07 s, sys: 30.6 ms, total: 3.1 s
Wall time: 3.08 s


In [9]:
den_sp

| | Array | Chunk |
|---|---|---|
| Bytes | 233.64 kB | 34.34 kB |
| Shape | (29205,) | (4293,) |
| Count | 40 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
| Units | `code_mass/code_length**3` | `code_mass/code_length**3` |


and what yt does behind the scenes is apply the selection object to each chunk of the dask array, so that we only return the values within the sphere. Note that the initial instantiation of `den_sp` actually takes a bit of time, because creating the dask array requires knowing the length of each chunk that will be concatenated into our total dask array. So even though we get a delayed array, there is an initial embedded compute to get the expected lengths.
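
To illustrate why the chunk lengths are needed up front, here is a minimal sketch using plain dask rather than the yt branch. The `read_chunk` function and the lengths are made up for the example; the point is only that `da.from_delayed` has to be told each chunk's shape before any data is read.

```python
import dask
import dask.array as da
import numpy as np


def read_chunk(n):
    # hypothetical reader: stands in for reading one data file from disk
    return np.arange(n, dtype="float64")


# the per-chunk lengths must be known before the array can be assembled
chunk_lengths = [3, 5, 2]

pieces = [
    da.from_delayed(dask.delayed(read_chunk)(n), shape=(n,), dtype="float64")
    for n in chunk_lengths
]
arr = da.concatenate(pieces)

print(arr.shape)   # total length is known without reading any data
print(arr.chunks)  # one dask chunk per "file"
```

In the selection case the lengths depend on how many particles pass the selector, which is presumably why an early pass over the data is needed there.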

**Ok, that's all well and good**, but since our dask array doesn't actually hold the array in memory until we call compute, we can actually do our selections with array-slicing syntax, and dask will go and slice by each chunk, kind of similar to how the yt native selection objects work. 
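
As a rough sketch of that idea with plain dask arrays (random stand-in data, not the yt dataset; all names here are made up for illustration), a boolean mask built lazily from coordinates can be used to index another dask array with the same chunking:

```python
import dask.array as da
import numpy as np

# stand-in data: random positions in a 25-unit box and one value per point
rng = np.random.default_rng(0)
xyz_np = rng.uniform(0.0, 25.0, size=(1000, 3))
vals_np = rng.uniform(size=1000)

xyz = da.from_array(xyz_np, chunks=(250, 3))
vals = da.from_array(vals_np, chunks=250)

center = np.array([12.5, 12.5, 12.5])
radius = 5.0

# both lines are lazy: nothing is evaluated until compute() is called
dist = da.sqrt(((xyz - center) ** 2).sum(axis=1))
selected = vals[dist <= radius]

print(selected.shape)  # the length is unknown (nan) until computed
in_memory = selected.compute()
```

The unknown shape is the price of skipping the up-front length calculation.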

Let's pull out the coordinates from all the data:

In [10]:
xyz = ad[("PartType4","Coordinates")]
xyz

| | Array | Chunk |
|---|---|---|
| Bytes | 3.74 MB | 498.53 kB |
| Shape | (155926, 3) | (20772, 3) |
| Count | 40 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |
| Units | code_length | code_length |


and manually calculate a distance from the center. As it turns out, it seems that there's a bug in the new unyt dask arrays, where the array becomes a normal dask array when slicing. So we'll do these operations in a unyt-less way:

In [11]:
import numpy as np 

C = ds.domain_center.value
R = float(ds.quan(5,'code_length').value)

In [12]:
dist = np.sqrt( (xyz[:,0] - C[0])**2 + (xyz[:,1] - C[1])**2 + (xyz[:,2]- C[2])**2 )
dist

| | Array | Chunk |
|---|---|---|
| Bytes | 1.25 MB | 166.18 kB |
| Shape | (155926,) | (20772,) |
| Count | 136 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |


and now we get a new `unyt_dask` array for density (so we get back to the initial units) and mask out our sphere:

In [13]:
den = ad[("PartType4","Density")]  
den_sp_manual = den[dist <= R]
den_sp_manual

| | Array | Chunk |
|---|---|---|
| Bytes | unknown | unknown |
| Shape | (nan,) | (nan,) |
| Count | 200 Tasks | 8 Chunks |
| Type | float64 | numpy.ndarray |


Let's pull our density into memory for our manually sliced sphere:

In [14]:
%%time
den_in_mem = den_sp_manual.compute()

CPU times: user 45.4 ms, sys: 8.81 ms, total: 54.3 ms
Wall time: 48.8 ms


and now for our yt-natively selected sphere:

In [15]:
%%time
den_sp_selector = den_sp.compute()

CPU times: user 21.8 ms, sys: 273 µs, total: 22.1 ms
Wall time: 16.9 ms


do our arrays match?

In [16]:
den_in_mem

array([ 7156342.   , 15433073.   ,  2540943.   , ...,    47703.96 ,
          37973.906,    36136.465], dtype=float32)

In [17]:
den_sp_selector.value

array([ 7156342.   , 15433073.   ,  2540943.   , ...,    47703.96 ,
          37973.906,    36136.465], dtype=float32)

In [18]:
den_in_mem.shape

(29205,)

In [19]:
den_sp_selector.shape

(29205,)

In [20]:
np.all(den_in_mem == den_sp_selector.value)

True

yes! we get the same selection!

One thing that the yt native selection object does that the manual dask array method does not do is limit the chunks that are checked. The dataset indexing records the spatial regions covered by each chunk, so if a large-scale chunk does not intersect the selection object, yt doesn't bother checking that chunk and saves some computation. The dask-slicing approach checks every chunk, so it does some extra work, but it should be possible to add some indexing logic to avoid checking chunks that cannot contribute.
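
A minimal sketch of what such indexing logic could look like, assuming each dask chunk comes with a known bounding interval (a 1D toy case with made-up slabs, not yt's actual bitmap index):

```python
import dask.array as da
import numpy as np

# four chunks of 100 points, each confined to its own slab in x (hypothetical layout)
rng = np.random.default_rng(1)
slabs = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0), (15.0, 20.0)]
x_np = np.concatenate([rng.uniform(lo, hi, 100) for lo, hi in slabs])
x = da.from_array(x_np, chunks=100)

# a 1D "selection": keep points with |x - center| <= radius
center, radius = 7.0, 1.5

# coarse index: which slabs can intersect the selection at all?
hit = [
    i for i, (lo, hi) in enumerate(slabs)
    if lo <= center + radius and hi >= center - radius
]

culled = x.blocks[hit]                          # drop chunks that cannot contribute
mask = da.absolute(culled - center) <= radius   # fine-grained check on what is left
result = culled[mask].compute()

print(hit, culled.numblocks)
```

Only the chunks listed in `hit` are ever touched by the fine-grained mask.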

A further complication is that some particle types use a "smoothing length" that may be a bit harder to adapt to a slicing syntax.


All that said, the dask slicing method is faster in this case because the pre-allocation is much faster for `all_data` (because it just reads an attribute from the hdf file). To emphasize this, here are all the above operations collected together for the standard way:

In [22]:
%%timeit
sp = ds.sphere(ds.domain_center,ds.quan(5,'code_length'))
den_sp = sp[("PartType4","Density")]
den_sp_selector = den_sp.compute()

3.18 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and the dask-slicing method:

In [23]:
%%timeit
ad = ds.all_data()
xyz = ad[("PartType4","Coordinates")]

C = ds.domain_center.value
R = float(ds.quan(5,'code_length').value)
dist = np.sqrt( (xyz[:,0] - C[0])**2 + (xyz[:,1] - C[1])**2 + (xyz[:,2]- C[2])**2 )

den = ad[("PartType4","Density")]  
den_sp_manual = den[dist <= R]
den_in_mem = den_sp_manual.compute()


48.6 ms ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


That's an impressive speedup... but one caveat is that the slowness of the native approach could be coming from an inefficient pickling protocol. The selection object must be pickled and passed to the selection routines for dask to do its thing, so I suspect that improving how that works could speed up the native approach.
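
One way to start checking that suspicion would be to time a pickle round trip of the object that gets shipped to the workers. The sketch below uses a `SimpleNamespace` as a hypothetical stand-in, since a real yt selector carries more state and may pickle very differently:

```python
import pickle
import time
from types import SimpleNamespace

# hypothetical stand-in for a selection object (not a real yt selector)
fake_selector = SimpleNamespace(center=(12.5, 12.5, 12.5), radius=5.0)

start = time.perf_counter()
payload = pickle.dumps(fake_selector, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)
elapsed = time.perf_counter() - start

print(f"{len(payload)} bytes, round trip took {elapsed:.2e} s")
```

Swapping in the actual selection object would show whether serialization is a meaningful share of the ~3 s.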

A final note: by "native" I actually mean "daskified-native", as the branch I'm working on has dask functionality within the particle reader. 

### leveraging the morton index with dask arrays for selections?



In [4]:
import yt
ds = yt.load_sample("snapshot_033")
sp = ds.sphere(ds.domain_center,ds.quan(5,'code_length'))


yt : [INFO     ] 2021-06-08 10:56:27,651 Files located at /home/chris/hdd/data/yt_data/yt_sample_sets/snapshot_033.tar.gz.untar/snapshot_033/snap_033.
yt : [INFO     ] 2021-06-08 10:56:27,652 Default to loading snap_033.0.hdf5 for snapshot_033 dataset
yt : [INFO     ] 2021-06-08 10:56:27,775 Parameters: current_time              = 4.343952725460923e+17 s
yt : [INFO     ] 2021-06-08 10:56:27,776 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-06-08 10:56:27,777 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-06-08 10:56:27,778 Parameters: domain_right_edge         = [25. 25. 25.]
yt : [INFO     ] 2021-06-08 10:56:27,779 Parameters: cosmological_simulation   = 1
yt : [INFO     ] 2021-06-08 10:56:27,779 Parameters: current_redshift          = -4.811891664902035e-05
yt : [INFO     ] 2021-06-08 10:56:27,780 Parameters: omega_lambda              = 0.762
yt : [INFO     ] 2021-06-08 10:56:27,780 Parameters: omega_matter              = 0.238
yt : [

In [5]:
ds.index._identify_base_chunk(sp)

In [5]:
# dfi, file_masks, addfi = self.regions.identify_file_masks(
#                         dobj.selector
#                     )
dfi, file_masks, addfi = ds.index.regions.identify_file_masks(
                        sp.selector
                    )

In [6]:
file_masks

array([<yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f95960b12f0>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f95960b1330>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f95960ab630>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f9596009c70>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f95960092f0>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f9596009f30>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f9596009fb0>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f9596009eb0>,
       <yt.utilities.lib.ewah_bool_wrap.BoolArrayCollection object at 0x7f95960091b0>],
      dtype=object)

In [7]:
len(file_masks)

9

In [8]:
len(ds.index.data_files)

12

In [9]:
#                 for i in range(nfiles):
#                     domain_id = i + 1
#                     dobj._chunk_info[i] = ParticleContainer(
#                         base_region,
#                         base_selector,
#                         [self.data_files[dfi[i]]],
#                         domain_id=domain_id,
#                     )

full_mask = []
ds.index.data_files[dfi[0]]

<yt.data_objects.static_output.ParticleFile at 0x7f95a0152f50>

In [41]:
dfi

array([ 0,  1,  2,  4,  6,  8,  9, 10, 11], dtype=uint32)

In [20]:
import numpy as np

file_np_mask = np.array([False, ]*len(ds.index.data_files))
file_np_mask[dfi] = True

In [21]:
file_np_mask

array([ True,  True,  True, False,  True, False,  True, False,  True,
        True,  True,  True])

In [22]:
from dask import array as da

In [23]:
test_array = da.random.random((12000), chunks=1000)
test_array

| | Array | Chunk |
|---|---|---|
| Bytes | 96.00 kB | 8.00 kB |
| Shape | (12000,) | (1000,) |
| Count | 12 Tasks | 12 Chunks |
| Type | float64 | numpy.ndarray |


In [42]:
test_array.blocks?

Type:        property
String form: <property object at 0x7f95a5a0bdd0>
Docstring:
Slice an array by blocks

This allows blockwise slicing of a Dask array.  You can perform normal
Numpy-style slicing but now rather than slice elements of the array you
slice along blocks so, for example, ``x.blocks[0, ::2]`` produces a new
dask array with every other block in the first row of blocks.

You can index blocks in any way that could index a numpy array of shape
equal to the number of blocks in each dimension, (available as
array.numblocks).  The dimension of the output array will be the same
as the dimension of this array, even if integer indices are passed.
This does not support slicing with ``np.newaxis`` or multiple lists.

Examples
--------
>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.block

In [45]:
test_array.numblocks

(12,)

In [43]:
index_limited = test_array.blocks[dfi]
index_limited

| | Array | Chunk |
|---|---|---|
| Bytes | 72.00 kB | 8.00 kB |
| Shape | (9000,) | (1000,) |
| Count | 21 Tasks | 9 Chunks |
| Type | float64 | numpy.ndarray |


those would be the data files hit by the selector.

for each chunk, could get the selector mask

```python
mask = selector.select_points(pos[:, 0], pos[:, 1], pos[:, 2], hsml)
```

maybe with map_blocks?
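
A minimal sketch of how that might work with plain dask, using a made-up `select_points` function in place of the real selector method (the signature mirrors the call above, but the selection rule itself is invented for the example):

```python
import dask.array as da
import numpy as np

RADIUS = 0.5  # radius of a hypothetical sphere selection centred on the origin


def select_points(x, y, z, hsml):
    # stand-in for selector.select_points: keep particles whose smoothing
    # sphere touches the selection sphere
    return np.sqrt(x**2 + y**2 + z**2) - hsml <= RADIUS


rng = np.random.default_rng(2)
pos_np = rng.uniform(-1.0, 1.0, size=(900, 3))
hsml_np = rng.uniform(0.0, 0.1, size=900)

pos = da.from_array(pos_np, chunks=(300, 3))
hsml = da.from_array(hsml_np, chunks=300)

# select_points is called once per chunk, with matching blocks of each argument
mask = da.map_blocks(
    select_points, pos[:, 0], pos[:, 1], pos[:, 2], hsml, dtype=bool
)

print(mask.chunks)
```

All four inputs need the same chunking along the particle axis for the blocks to line up.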



# applying a selector to a daskified all_data



In [1]:
import yt 

ds = yt.load_sample("snapshot_033")
ad = ds.all_data()

yt : [INFO     ] 2021-06-08 14:19:33,427 Files located at /home/chris/hdd/data/yt_data/yt_sample_sets/snapshot_033.tar.gz.untar/snapshot_033/snap_033.
yt : [INFO     ] 2021-06-08 14:19:33,428 Default to loading snap_033.0.hdf5 for snapshot_033 dataset
yt : [INFO     ] 2021-06-08 14:19:33,539 Parameters: current_time              = 4.343952725460923e+17 s
yt : [INFO     ] 2021-06-08 14:19:33,540 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-06-08 14:19:33,540 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-06-08 14:19:33,541 Parameters: domain_right_edge         = [25. 25. 25.]
yt : [INFO     ] 2021-06-08 14:19:33,542 Parameters: cosmological_simulation   = 1
yt : [INFO     ] 2021-06-08 14:19:33,542 Parameters: current_redshift          = -4.811891664902035e-05
yt : [INFO     ] 2021-06-08 14:19:33,542 Parameters: omega_lambda              = 0.762
yt : [INFO     ] 2021-06-08 14:19:33,543 Parameters: omega_matter              = 0.238
yt : [

In [2]:
def get_all_data(ds, field, all_data):
    # instantiate the dask arrays that store the field as well as the particle coordinates and hmsl
    # each dask array will have the same number of dask chunks, each dask chunk points to a single yt index.data_files object
    ptype = field[0]
    coordfield = (ptype, "Coordinates")  # needed for final mask
    hmlsfield = (ptype, "smoothing_length")  # needed for final mask
    fields = [field, coordfield, hmlsfield]
    # use the all_data argument rather than relying on a global `ad`
    all_data.get_data(fields=fields)  # will instantiate all the dask arrays

    # pull out the references
    data = all_data[field]
    pos = all_data[coordfield]
    hmsl = all_data[hmlsfield]

    return data, pos, hmsl
    
  

In [4]:
field = ("PartType0","Density")

data, pos, hmsl = get_all_data(ds, field, ad)

({'PartType0': 262144}, {'PartType0': 419}, {'PartType0': 255819}, {'PartType0': 0}, {'PartType0': 251598}, {'PartType0': 0}, {'PartType0': 244445}, {'PartType0': 0}, {'PartType0': 239908}, {'PartType0': 233206}, {'PartType0': 227868}, {'PartType0': 225819})


In [26]:
data

| | Array | Chunk |
|---|---|---|
| Bytes | 15.53 MB | 2.10 MB |
| Shape | (1941226,) | (262144,) |
| Count | 45 Tasks | 9 Chunks |
| Type | float64 | numpy.ndarray |
| Units | `code_mass/code_length**3` | `code_mass/code_length**3` |


In [22]:
sp = ds.sphere(ds.domain_center,ds.quan(1,'code_length'))

dfi, file_masks, addfi = ds.index.regions.identify_file_masks(
                        sp.selector
                    )     
dfi

array([ 0,  2,  4,  6,  8,  9, 11], dtype=uint32)

In [23]:
# ok, but the all_data call returns non-zero chunks only, so not all chunks are there and we get an index error
data.blocks[dfi]

IndexError: Index out of bounds 9

need to adjust dfi to account for removal of chunks with zero particles, let's build a dict to map from the full index to the index when zero-particle data_files are removed from all_data:

In [24]:
offset = 0
mapping = {}
for i in range(len(ds.index.data_files)):
    if ds.index.data_files[i].total_particles[field[0]] > 0:
        mapping[i] = i - offset
    else:
        offset += 1
        
# dfi_new = np.array(dfi_new)        
mapping

{0: 0, 1: 1, 2: 2, 4: 3, 6: 4, 8: 5, 9: 6, 10: 7, 11: 8}

In [29]:
import numpy as np
dfi_new = np.array([mapping[i] for i in dfi if i in mapping]) # if it's not in mapping, it's a zero
dfi_new

array([0, 2, 3, 4, 5, 6, 8])

In [27]:
len(dfi)

7

In [28]:
len(dfi_new)

7

In [30]:
data.blocks[dfi_new]

| | Array | Chunk |
|---|---|---|
| Bytes | 13.70 MB | 2.10 MB |
| Shape | (1712939,) | (262144,) |
| Count | 52 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


## and now let's collect this in a more sensible way

In [41]:
import numpy as np

def sanitize_file_mask(ds, dfi, ptype):
    # the initial read of data from all_data() already culls data_file objects with no particles, but
    # ds.index.regions.identify_file_masks only limits based on bitmap hits. So we need to offset the
    # ds.index.regions.identify_file_masks output or we will try to access elements out of range.
    # ptype is passed in explicitly instead of being read from a global `field`.
    offset = 0
    mapping = {}
    for i, data_file in enumerate(ds.index.data_files):
        if data_file.total_particles[ptype] > 0:
            mapping[i] = i - offset
        else:
            offset += 1

    # indices missing from mapping belong to data_files with zero particles of this ptype
    return np.array([mapping[i] for i in dfi if i in mapping], dtype=int)


def get_all_data(ds, field, all_data):
    # instantiate the dask arrays that store the field as well as the particle coordinates and hmsl
    # each dask array will have the same number of dask chunks, each dask chunk points to a single yt index.data_files object
    ptype = field[0]
    coordfield = (ptype, "Coordinates")  # needed for final mask
    hmlsfield = (ptype, "smoothing_length")  # needed for final mask
    fields = [field, coordfield, hmlsfield]
    all_data.get_data(fields=fields)  # will instantiate all the dask arrays

    # pull out the references
    data = all_data[field]
    pos = all_data[coordfield]
    hmsl = all_data[hmlsfield]

    return data, pos, hmsl


def get_limited_data(ds, field, all_data, selector):
    # returns dask arrays with dask-chunks culled by the bitmap file masks
    # note: `selector` is a yt selection object (e.g. a sphere); its .selector attribute is used below

    # get dask arrays where each dask-chunk refers to a yt index.data_files object
    data, pos, hmsl = get_all_data(ds, field, all_data)

    # get our list of files that are hit by the selector
    dfi, file_masks, addfi = ds.index.regions.identify_file_masks(
        selector.selector
    )
    # accounts for the fact that all_data already removes some data_files
    dfi = sanitize_file_mask(ds, dfi, field[0])

    # cull the dask chunks to remove data_files not hit by selector
    data = data.blocks[dfi]
    pos = pos.blocks[dfi]
    hmsl = hmsl.blocks[dfi]

    return data, pos, hmsl

In [3]:

import yt 

ds = yt.load_sample("snapshot_033")
ad = ds.all_data()

yt : [INFO     ] 2021-06-08 15:03:07,367 Files located at /home/chris/hdd/data/yt_data/yt_sample_sets/snapshot_033.tar.gz.untar/snapshot_033/snap_033.
yt : [INFO     ] 2021-06-08 15:03:07,368 Default to loading snap_033.0.hdf5 for snapshot_033 dataset
yt : [INFO     ] 2021-06-08 15:03:07,489 Parameters: current_time              = 4.343952725460923e+17 s
yt : [INFO     ] 2021-06-08 15:03:07,490 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-06-08 15:03:07,490 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-06-08 15:03:07,491 Parameters: domain_right_edge         = [25. 25. 25.]
yt : [INFO     ] 2021-06-08 15:03:07,492 Parameters: cosmological_simulation   = 1
yt : [INFO     ] 2021-06-08 15:03:07,492 Parameters: current_redshift          = -4.811891664902035e-05
yt : [INFO     ] 2021-06-08 15:03:07,493 Parameters: omega_lambda              = 0.762
yt : [INFO     ] 2021-06-08 15:03:07,493 Parameters: omega_matter              = 0.238
yt : [

In [4]:
field = ("PartType0","Density")
sp = ds.sphere(ds.domain_center,ds.quan(1,'code_length'))
data, pos, hmsl = get_limited_data(ds, field, ad, sp)

In [5]:
data

| | Array | Chunk |
|---|---|---|
| Bytes | 13.70 MB | 2.10 MB |
| Shape | (1712939,) | (262144,) |
| Count | 52 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


In [6]:
pos

| | Array | Chunk |
|---|---|---|
| Bytes | 41.11 MB | 6.29 MB |
| Shape | (1712939, 3) | (262144, 3) |
| Count | 52 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


In [7]:
pos[:, 1]

| | Array | Chunk |
|---|---|---|
| Bytes | 13.70 MB | 2.10 MB |
| Shape | (1712939,) | (262144,) |
| Count | 59 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


In [8]:
hmsl

| | Array | Chunk |
|---|---|---|
| Bytes | 13.70 MB | 2.10 MB |
| Shape | (1712939,) | (262144,) |
| Count | 61 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


## applying a selector object to mask the data 

so those dask chunks are the data_files hit by the selector AND with non-zero particle counts for this ptype. now we need to apply the selector to get a mask for each chunk using `sp.selector.select_points`.

We can do this using `dask.array.map_blocks`! But remember that the first argument passed to the mapped function is a block of the array itself. Normally, we'd call `sp.selector.select_points(pos_x, pos_y, pos_z, hmsl)`. So to use `map_blocks`, we can pull out the x-slice dask array and call `map_blocks` on it, passing the remaining arrays as extra arguments:

In [22]:
pos_x = pos[:,0]
pos_y = pos[:,1]
pos_z = pos[:,2]

mask_by_chunk = pos_x.map_blocks(sp.selector.select_points, pos_y, pos_z, hmsl, meta=np.array((), dtype=bool))
mask_by_chunk

| | Array | Chunk |
|---|---|---|
| Bytes | 1.71 MB | 262.14 kB |
| Shape | (1712939,) | (262144,) |
| Count | 77 Tasks | 7 Chunks |
| Type | bool | numpy.ndarray |


which will call `sp.selector.select_points(pos_x, pos_y, pos_z, hmsl)` for each chunk. Now let's apply our mask and finally compute it!

In [30]:
masked_data = data[mask_by_chunk]
masked_data

| | Array | Chunk |
|---|---|---|
| Bytes | unknown | unknown |
| Shape | (nan,) | (nan,) |
| Count | 189 Tasks | 7 Chunks |
| Type | float64 | numpy.ndarray |


In [31]:
final_data = masked_data.compute()

In [32]:
final_data.shape

(3221,)

In [28]:
final_data

array([30.862764 , 35.818253 , 29.61068  , ...,  3.996618 ,  3.162214 ,
        6.6422577], dtype=float32)

In [36]:
normal_path = sp[field].compute().value # this is a unyt-dask array, just get a nd array. 
normal_path

array([30.862764 , 35.818253 , 29.61068  , ...,  3.996618 ,  3.162214 ,
        6.6422577], dtype=float32)

In [37]:
np.all(normal_path==final_data)

True

## so what?

OK, so this is neat and all, but what of it? One of the most obvious implications of this, to me at least, is that yt's frontend `io` class could be refactored so that one only has to specify how to read a single `data_file`. Right now, `_read_particle_selection` applies selections directly on read. Many of the frontends end up converting the chunk iterator to a list and then looping. But the above workflow would allow someone to write a frontend without having to understand how the selection works. In the case of the gadget hdf io, the current `io._read_particle_fields`:

In [39]:
def _read_particle_fields(self, chunks, ptf, selector):
        # Now we have all the sizes, and we can allocate
        data_files = set()
        for chunk in chunks:
            for obj in chunk.objs:
                data_files.update(obj.data_files)
        for data_file in sorted(data_files, key=lambda x: (x.filename, x.start)):
            si, ei = data_file.start, data_file.end
            f = h5py.File(data_file.filename, mode="r")
            for ptype, field_list in sorted(ptf.items()):
                if data_file.total_particles[ptype] == 0:
                    continue
                g = f[f"/{ptype}"]
                if getattr(selector, "is_all_data", False):
                    mask = slice(None, None, None)
                    mask_sum = data_file.total_particles[ptype]
                    hsmls = None
                else:
                    coords = g["Coordinates"][si:ei].astype("float64")
                    if ptype == "PartType0":
                        hsmls = self._get_smoothing_length(
                            data_file, g["Coordinates"].dtype, g["Coordinates"].shape
                        ).astype("float64")
                    else:
                        hsmls = 0.0
                    mask = selector.select_points(
                        coords[:, 0], coords[:, 1], coords[:, 2], hsmls
                    )
                    if mask is not None:
                        mask_sum = mask.sum()
                    del coords
                if mask is None:
                    continue
                for field in field_list:

                    if field in ("Mass", "Masses") and ptype not in self.var_mass:
                        data = np.empty(mask_sum, dtype="float64")
                        ind = self._known_ptypes.index(ptype)
                        data[:] = self.ds["Massarr"][ind]
                    elif field in self._element_names:
                        rfield = "ElementAbundance/" + field
                        data = g[rfield][si:ei][mask, ...]
                    elif field.startswith("Metallicity_"):
                        col = int(field.rsplit("_", 1)[-1])
                        data = g["Metallicity"][si:ei, col][mask]
                    elif field.startswith("GFM_Metals_"):
                        col = int(field.rsplit("_", 1)[-1])
                        data = g["GFM_Metals"][si:ei, col][mask]
                    elif field.startswith("Chemistry_"):
                        col = int(field.rsplit("_", 1)[-1])
                        data = g["ChemistryAbundances"][si:ei, col][mask]
                    elif field == "smoothing_length":
                        # This is for frontends which do not store
                        # the smoothing length on-disk, so we do not
                        # attempt to read them, but instead assume
                        # that they are calculated in _get_smoothing_length.
                        if hsmls is None:
                            hsmls = self._get_smoothing_length(
                                data_file,
                                g["Coordinates"].dtype,
                                g["Coordinates"].shape,
                            ).astype("float64")
                        data = hsmls[mask]
                    else:
                        data = g[field][si:ei][mask, ...]

                    yield (ptype, field), data
            f.close()

could be replaced with a function that only defines how to read all the data from a single data_file

In [40]:
import h5py
import numpy as np


def _read_particle_fields(self, data_file, ptf):
    # read every requested field for a single data_file; no selector or mask is applied here
    si, ei = data_file.start, data_file.end
    with h5py.File(data_file.filename, mode="r") as f:
        for ptype, field_list in sorted(ptf.items()):
            if data_file.total_particles[ptype] == 0:
                continue

            g = f[f"/{ptype}"]

            for field in field_list:

                if field in ("Mass", "Masses") and ptype not in self.var_mass:
                    # constant-mass particle types: one value per particle in this file
                    data = np.empty(data_file.total_particles[ptype], dtype="float64")
                    ind = self._known_ptypes.index(ptype)
                    data[:] = self.ds["Massarr"][ind]
                elif field in self._element_names:
                    rfield = "ElementAbundance/" + field
                    data = g[rfield][si:ei]
                elif field.startswith("Metallicity_"):
                    col = int(field.rsplit("_", 1)[-1])
                    data = g["Metallicity"][si:ei, col]
                elif field.startswith("GFM_Metals_"):
                    col = int(field.rsplit("_", 1)[-1])
                    data = g["GFM_Metals"][si:ei, col]
                elif field.startswith("Chemistry_"):
                    col = int(field.rsplit("_", 1)[-1])
                    data = g["ChemistryAbundances"][si:ei, col]
                elif field == "smoothing_length":
                    # This is for frontends which do not store
                    # the smoothing length on-disk, so we do not
                    # attempt to read them, but instead assume
                    # that they are calculated in _get_smoothing_length.
                    data = self._get_smoothing_length(
                        data_file,
                        g["Coordinates"].dtype,
                        g["Coordinates"].shape,
                    ).astype("float64")
                else:
                    data = g[field][si:ei]

                yield (ptype, field), data

IMO, this makes writing a new frontend easier for someone less familiar with yt and better separates the `index` from the `io` classes into their intended purposes. 
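
To make that separation concrete, here is a toy sketch of the idea. It is not yt code: `FAKE_FILES`, `read_data_file` and `build_field` are all made up for the example. The "frontend" part only knows how to read one field from one file, while a generic layer builds the dask array and applies the selection afterwards.

```python
import dask
import dask.array as da
import numpy as np

# hypothetical "files": in a real frontend these would be reads from disk
FAKE_FILES = {
    "file_0": {"x": np.array([0.1, 0.9, 0.4]), "Density": np.array([1.0, 2.0, 3.0])},
    "file_1": {"x": np.array([0.6, 0.2]), "Density": np.array([4.0, 5.0])},
}


def read_data_file(name, field):
    # the only piece a frontend author would write: one field from one file
    return FAKE_FILES[name][field]


def build_field(field):
    # generic layer: one lazy dask chunk per file
    pieces = []
    for name, contents in FAKE_FILES.items():
        n = len(contents[field])  # in practice this would come from file metadata
        delayed_read = dask.delayed(read_data_file)(name, field)
        pieces.append(da.from_delayed(delayed_read, shape=(n,), dtype="float64"))
    return da.concatenate(pieces)


x = build_field("x")
density = build_field("Density")

# the selection is applied outside the reader
selected = density[x <= 0.5].compute()
print(selected)
```

The reader never sees a selector, which is the property argued for above.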