This notebook demonstrates a more streamlined dask loader for particle data that could be implemented directly in `BaseIOHandler._read_particle_selection()`. In short: we can build dask dataframes from delayed reads of chunks!

In [1]:
import yt 
from dask import dataframe as ddf, delayed
import pandas as pd
import numpy as np

In [61]:
ds = yt.load_sample("snapshot_033") 

yt : [INFO     ] 2020-10-20 16:20:20,642 Files located at /home/chavlin/hdd/data/yt_data/yt_sample_sets/snapshot_033.tar.gz.untar/snapshot_033/snap_033.
yt : [INFO     ] 2020-10-20 16:20:20,643 Default to loading snap_033.0.hdf5 for snapshot_033 dataset
yt : [INFO     ] 2020-10-20 16:20:20,699 Parameters: current_time              = 4.343952725460923e+17 s
yt : [INFO     ] 2020-10-20 16:20:20,700 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2020-10-20 16:20:20,700 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2020-10-20 16:20:20,700 Parameters: domain_right_edge         = [25. 25. 25.]
yt : [INFO     ] 2020-10-20 16:20:20,701 Parameters: cosmological_simulation   = 1
yt : [INFO     ] 2020-10-20 16:20:20,701 Parameters: current_redshift          = -4.811891664902035e-05
yt : [INFO     ] 2020-10-20 16:20:20,701 Parameters: omega_lambda              = 0.762
yt : [INFO     ] 2020-10-20 16:20:20,701 Parameters: omega_matter              = 0.238

A quick mock chunk assembler:

In [62]:
class MockChunkObject:
    def __init__(self, data_file):
        self.data_files = [data_file]
        
class MockChunk:
    def __init__(self, data_file):
        self.objs = [MockChunkObject(data_file)]
        
chunks = [MockChunk(data_file) for data_file in ds.index.data_files]

yt : [INFO     ] 2020-10-20 16:20:22,079 Allocating for 4.194e+06 particles
Loading particle index: 100%|██████████| 12/12 [00:00<00:00, 213.98it/s]


Now let's choose some particle types and fields:

In [63]:
ptf = {'PartType0':['Density','Nitrogen'],
       'PartType4':['Oxygen','MetallicityWeightedRedshift']}

And a selection:

In [86]:
sp = ds.sphere('max',5.) 
selector = sp.selector

OK, so we're going to use `dask.dataframe.from_delayed` to create a dask dataframe. Since fields of different particle types within a single chunk may not have the same length, we will read in by particle type. Additionally, since some chunks may contain no particles at all, we need to pass `from_delayed` a `meta` schema declaring the dtype of each field; otherwise aggregation operations across chunks may fail. So let's create a `meta` dictionary for each particle type:

In [87]:
# build the meta dictionary for each chunk's dataframe so that empty chunks 
# don't cause problems. 
ptype_meta = {}
for ptype,flds in ptf.items():
    meta_dict = {}
    for fld in flds: 
        meta_dict[fld] = pd.Series([],dtype=np.float64)
    ptype_meta[ptype] = meta_dict

Now we need to create a function to delay -- this function needs to return a dataframe for a single chunk and particle type with columns for each field: 

In [88]:
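def _read_single_ptype(chunk, this_ptf, selector, meta_dict):
    # read into a pandas dataframe so that we can use dask.dataframe.from_delayed!
    # start from the empty, typed frame so chunks without particles keep the schema
    chunk_results = pd.DataFrame(meta_dict)
    # note: relies on the global `ds` loaded above
    for field_r, vals in ds.index.io._read_particle_fields([chunk], this_ptf, selector):
        # field_r is a (particle type, field name) tuple
        chunk_results[field_r[1]] = vals

    return chunk_results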
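
The per-chunk pattern can be tried without yt. The sketch below is a variation rather than the function above: it collects the columns in a dict first and falls back to the empty `meta` frame when nothing is read. `fake_read_particle_fields` and `frame_from_reader` are hypothetical names, and the sketch assumes the reader yields `((ptype, field), values)` pairs and yields nothing for an empty chunk.

```python
import numpy as np
import pandas as pd


def fake_read_particle_fields(n_particles):
    """Hypothetical stand-in for ds.index.io._read_particle_fields."""
    if n_particles == 0:
        return  # assumption: an empty chunk yields nothing
    rng = np.random.default_rng(0)
    for field_name in ("Oxygen", "MetallicityWeightedRedshift"):
        yield ("PartType4", field_name), rng.random(n_particles)


def frame_from_reader(reader, meta_dict):
    # collect one column per (ptype, field) pair yielded by the reader
    columns = {field[1]: vals for field, vals in reader}
    if not columns:
        # nothing was read: return the empty, typed frame
        return pd.DataFrame(meta_dict)
    return pd.DataFrame(columns)


toy_meta = {
    "Oxygen": pd.Series([], dtype=np.float64),
    "MetallicityWeightedRedshift": pd.Series([], dtype=np.float64),
}
full_frame = frame_from_reader(fake_read_particle_fields(4), toy_meta)
empty_frame = frame_from_reader(fake_read_particle_fields(0), toy_meta)
```

Both frames end up with the same columns and dtypes, which is what `from_delayed` needs from every partition.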

Now we're ready to create our delayed dataframes! We'll store each particle type in a dict:

In [89]:
ptypes = list(ptf.keys()) 
delayed_dfs = {}
for ptype in ptypes: 
    # build a dataframe from delayed for each particle type
    this_ptf = {ptype:ptf[ptype]}
    delayed_chunks = [
        delayed(_read_single_ptype)(ch, this_ptf, selector, ptype_meta[ptype])
        for ch in chunks
    ]
    delayed_dfs[ptype] = ddf.from_delayed(delayed_chunks,meta=ptype_meta[ptype])
    


And now we've got a dict of delayed dask dataframes:

In [90]:
delayed_dfs

{'PartType0': Dask DataFrame Structure:
                 Density Nitrogen
 npartitions=12                  
                 float64  float64
                     ...      ...
 ...                 ...      ...
                     ...      ...
                     ...      ...
 Dask Name: from-delayed, 24 tasks,
 'PartType4': Dask DataFrame Structure:
                  Oxygen MetallicityWeightedRedshift
 npartitions=12                                     
                 float64                     float64
                     ...                         ...
 ...                 ...                         ...
                     ...                         ...
                     ...                         ...
 Dask Name: from-delayed, 24 tasks}

We can pull out a single delayed frame:

In [91]:
df4 = delayed_dfs['PartType4']
df4

Dask DataFrame Structure:
                 Oxygen MetallicityWeightedRedshift
npartitions=12
                float64                     float64
                    ...                         ...
...                 ...                         ...
                    ...                         ...
                    ...                         ...


Up to this point, everything is still delayed. When we perform operations like `mean()` and `sum()`, dask handles all the cross-chunk aggregation. For example:

In [92]:
df4.MetallicityWeightedRedshift.mean()

dd.Scalar<series-..., dtype=float64>

This result is still delayed, so let's compute it:

In [93]:
df4.MetallicityWeightedRedshift.mean().compute()

1.963667205141687
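
Conceptually, a cross-chunk mean reduces each chunk to a (sum, count) pair and then combines the pairs. The sketch below shows that idea with plain NumPy on toy arrays. It illustrates the principle only, and dask's actual reduction may differ in detail.

```python
import numpy as np

# toy chunks of unequal length, one of them empty
toy_chunks = [np.array([1.0, 2.0, 3.0]), np.array([]), np.array([4.0, 6.0])]

# per-chunk partial results: (sum, number of values)
partials = [(chunk.sum(), chunk.size) for chunk in toy_chunks]

# combine the partials into a global mean
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
chunked_mean = total / count

# reference: mean over all values loaded at once
direct_mean = np.concatenate(toy_chunks).mean()
```

Both routes give the same value, but the chunked one never needs all the data in memory at once.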

In [94]:
df4.MetallicityWeightedRedshift.values.compute()

array([1.01397729, 1.55316567, 2.71978283, ..., 0.        , 2.81968474,
       1.96937144])

Cool! This won't work in parallel yet, though, because dask will try to pickle `ds.index.io._read_particle_fields`. The selector object can be pickled (with the changes on `pickleableSelects`), but `ds.index.io._read_particle_fields` likely needs some pickle fixes of its own.
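
One rough way to find out which pieces survive pickling is to try it on each object. The helper below is a hypothetical sketch using the standard library only. `ToyHandler` is a made-up class that holds a lock, standing in for an IO handler with unpicklable state. Dask's own serialization is more capable than plain `pickle`, so treat this as a first check only.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


class ToyHandler:
    """Made-up handler whose lock attribute cannot be pickled."""

    def __init__(self):
        self.lock = threading.Lock()


# plain data pickles fine; the handler with a lock does not
plain_ok = is_picklable({"center": [0.5, 0.5, 0.5], "radius": 5.0})
handler_ok = is_picklable(ToyHandler())
```

Running a check like this on the selector, the chunks and the IO handler separately could help narrow down where the pickle fixes are needed.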

`BaseIOHandler._read_particle_selection` is expected to return a dictionary of arrays, which we can read into memory with:

In [95]:
rv = {}
for ptype in ptypes:
    df = delayed_dfs[ptype]
    for col in df.columns:
        # compute each column into an in-memory numpy array
        rv[(ptype, col)] = df[col].values.compute()

In [96]:
rv

{('PartType0',
  'Density'): array([1.18035096e+08, 7.38237760e+08, 6.64672064e+08, ...,
        8.06066418e+00, 7.77778673e+00, 7.48051739e+00]),
 ('PartType0',
  'Nitrogen'): array([0.00184869, 0.00545696, 0.00762129, ..., 0.        , 0.00028541,
        0.        ]),
 ('PartType4',
  'Oxygen'): array([0.0438178 , 0.01489225, 0.01456669, ..., 0.        , 0.00435656,
        0.00267334]),
 ('PartType4',
  'MetallicityWeightedRedshift'): array([1.01397729, 1.55316567, 2.71978283, ..., 0.        , 2.81968474,
        1.96937144])}