WIP: dask + file formats implementation #222
this includes a class for reading binary files (of a certain format), and a function to convert that file to a dask array
@rainwoodman -- take a look at this new potential syntax for "file formats" and the dask interface. This was mostly to explore dask some more -- there's an example main in the file that I've tested by reading RunPB DM snapshots on Cori; it takes just a few seconds. A few initial thoughts:
this shows a potential API that seems to work well --> need to use dask.delayed to delay the column functions
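A minimal sketch of what delaying the column functions could look like, assuming a per-bunch reader and a column function that acts on the raw bunch. The names `read_chunk`, `position` and `scale` are hypothetical and not taken from this PR; only `dask.delayed`, `da.from_delayed` and `da.concatenate` are real dask calls.

```python
import numpy as np
import dask
import dask.array as da


def read_chunk(start, stop):
    """Hypothetical reader: stands in for reading rows [start, stop) from a file."""
    return np.arange(start, stop, dtype="f8")[:, None] * np.ones(3)


def position(data, scale):
    """Hypothetical column function: rescale the raw coordinates."""
    return data * scale


nrows, bunchsize = 10, 4
chunks = []
for start in range(0, nrows, bunchsize):
    stop = min(start + bunchsize, nrows)
    raw = dask.delayed(read_chunk)(start, stop)   # nothing is read yet
    col = dask.delayed(position)(raw, 2.0)        # column function is delayed too
    # The shape and dtype must be stated up front because the value is lazy.
    chunks.append(da.from_delayed(col, shape=(stop - start, 3), dtype="f8"))

Position = da.concatenate(chunks, axis=0)
result = Position.compute()                       # the read and transform run here
```

Each bunch becomes one chunk of the final array, so the reader and the column function only run when `compute()` is called.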
s.add_argument("bunchsize", type=int, help="number of particles to read per rank in a bunch")
def Position(self, data):
You sure it is OK to mix the names of columns and methods?
I am not sure, but for the moment, I couldn't think of anything
I am leaning towards having the datasource/FileTypePlugin return a list/dict of lists of dask.arrays / dask.delayed objects for all columns, normalized to a given set of units (partitions scheduled to this rank). Then we will have a Transformer that does RSD, RA-DEC->XYZ, and other renaming stuff, and returns a dict of lists of dask arrays / dask.delayed objects. A painter will take a transformer's output dictionary, pick the columns according to the painter's documented protocol, compute them, and call ParticleMesh to paint them.
We can call the Transformer PointDataSource, and the first object, which normalizes files, FileTypePlugin?
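A rough sketch of the three stages described above, under several assumptions: the function names are hypothetical, velocities are taken to be already in position units, and `np.histogramdd` stands in for the ParticleMesh painting step, which is not reproduced here.

```python
import numpy as np
import dask.array as da

BOXSIZE = 10.0  # assumed box size, for illustration only


def read_columns(n=8):
    """Stand-in for the FileTypePlugin stage: lazy columns in fixed units."""
    pos = np.linspace(0.0, BOXSIZE, n * 3, endpoint=False).reshape(n, 3)
    vel = np.ones((n, 3))
    return {
        "Position": da.from_array(pos, chunks=(4, 3)),
        "Velocity": da.from_array(vel, chunks=(4, 3)),
    }


def apply_rsd(columns, los=2):
    """Stand-in for the Transformer stage: shift along the line of sight."""
    los_vec = np.zeros(3)
    los_vec[los] = 1.0
    out = dict(columns)
    # Still lazy: this only extends the dask graph, with a periodic wrap.
    out["Position"] = (columns["Position"] + columns["Velocity"] * los_vec) % BOXSIZE
    return out


def paint(columns, nmesh=2):
    """Stand-in for the painter: pick its column, compute it, bin it on a grid."""
    pos = columns["Position"].compute()
    mesh, _ = np.histogramdd(pos, bins=nmesh, range=[(0.0, BOXSIZE)] * 3)
    return mesh


mesh = paint(apply_rsd(read_columns()))
```

Only the painter calls `compute()`, so the reading and the transformation are evaluated once, at the point where the data is actually needed.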
Yes, I am thinking along the same lines... something to read and return data as read from file, and then something to transform it. A few thoughts:
The sliceable file format straight to dask array is very, very nice. I've got it working for csv files and binary files. It should be relatively easy to add HDF5, Bigfile, and FITS file formats too, and then we have a pretty good representation. HDF5 and Bigfile can already be converted straight to dask arrays.
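For illustration, a minimal sketch of the "sliceable file straight to dask array" idea, assuming a flat binary file of float64 rows. The `BinaryFile` class is hypothetical and is not the class from this PR; it only provides what `da.from_array` relies on (`shape`, `dtype`, `ndim` and numpy-style slicing).

```python
import os
import tempfile

import numpy as np
import dask.array as da


class BinaryFile:
    """Hypothetical sliceable view of a flat binary file holding (N, ncol) rows."""

    def __init__(self, path, ncol=3, dtype="f8"):
        self.path = path
        self.dtype = np.dtype(dtype)
        nrow = os.path.getsize(path) // (self.dtype.itemsize * ncol)
        self.shape = (nrow, ncol)
        self.ndim = 2

    def __getitem__(self, index):
        # Map the file on demand and copy out only the requested rows.
        mm = np.memmap(self.path, dtype=self.dtype, mode="r", shape=self.shape)
        return np.array(mm[index])


# Write a small test file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "snapshot.bin")
np.arange(30, dtype="f8").reshape(10, 3).tofile(path)

f = BinaryFile(path)
# name=False gives the array a random name instead of hashing the file object.
pos = da.from_array(f, chunks=(4, 3), name=False)
subset = pos[2:7].compute()  # only the chunks overlapping rows 2..6 are read
```

Any format that can expose the same small interface (csv, HDF5, FITS) could be wrapped the same way.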
Also, I was brainstorming names for the module to hold the file formats. I kind of like
Does it make sense to have hermes as a separate package that nbodykit depends on? The name is already taken though.
On Wed, Aug 10, 2016 at 11:32 PM, Nick Hand notifications@github.com
closing this in favor of #225
we'll use this PR to explore using dask and using file formats