
WIP: dask + file formats implementation #222

Closed
wants to merge 19 commits

Conversation

@nickhand (Member) commented Aug 9, 2016

We'll use this PR to explore using dask and file formats.

This includes a class for reading binary files (of a certain format) and a function to convert such a file to a dask array.
@nickhand (Member Author) commented Aug 9, 2016

@rainwoodman -- take a look at this new potential syntax for "file formats" and dask interface

This was mostly to explore dask some more -- there's an example main in the file that I've tested reading RunPB DM snapshots on Cori. Takes just a few seconds.

A few initial thoughts:

  • I think the dask array is the right choice over data frames:
    • The lack of vector data types is really ugly to get around
    • The interface to the DataFrame is just not as intuitive as the array syntax -- I've used pandas a lot, and interacting with DataFrames still isn't as easy as working with simple arrays
    • Using dask arrays is more in line with our current interface, where the "read" functions in different datasources are all operating on structured arrays
  • I think one existing problem is that the DataSource right now is handling both the reading of the file and the transformation of what is read, which are really two different tasks
  • I think we want several builtin FileFormat classes like this BinaryFile class and then maybe the DataSource should specify which file format(s) it can read?
  • I think it should be possible to define a fixed interface for different file types (binary, hdf5, csv, bigfile) and functions to convert them to dask arrays

s.add_argument("bunchsize", type=int, help="number of particles to read per rank in a bunch")


def Position(self, data):
@rainwoodman (Member):
You sure it is OK to mix the names of columns and methods?

@nickhand (Member Author):
I am not sure, but for the moment, I couldn't think of anything

@rainwoodman (Member) commented:

I am leaning towards having the datasource/FileTypePlugin return a dict of dask.arrays / dask.delayed for all columns, normalized to a given set of units (with partitions scheduled to this rank). Then we will have a Transformer that does RSD, RA/DEC -> XYZ, and other renaming, and returns a dict of dask arrays / dask.delayed.

A painter will take a transformer's output dictionary, pick the columns according to the painter's documented protocol, compute them, and call ParticleMesh to paint them.
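This proposed pipeline can be sketched roughly as below. All names (`file_plugin`, `transformer`, `painter`) are hypothetical stand-ins for the FileTypePlugin / Transformer / painter roles, not real nbodykit API:

```python
import numpy as np
import dask.array as da

def file_plugin(n=1000):
    """Stand-in FileTypePlugin: columns as lazy dask arrays in fixed units."""
    rng = np.random.RandomState(42)
    return {
        'Position': da.from_array(rng.uniform(0, 100, (n, 3)), chunks=500),
        'Velocity': da.from_array(rng.normal(0, 1, (n, 3)), chunks=500),
    }

def transformer(columns, los=np.array([0, 0, 1.])):
    """Stand-in Transformer: add RSD along an axis-aligned line of sight.

    Elementwise v * [0, 0, 1] keeps only the z component, i.e. (v . z) z."""
    out = dict(columns)
    out['Position'] = columns['Position'] + columns['Velocity'] * los
    return out                     # still lazy: nothing read or computed yet

def painter(columns):
    """Stand-in painter: pick its documented columns and compute them."""
    return columns['Position'].compute()   # work happens only here

data = painter(transformer(file_plugin()))
print(data.shape)   # (1000, 3)
```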

@rainwoodman (Member) commented:

Shall we call the Transformer PointDataSource, and the first object that normalizes files FileTypePlugin?
Does this proposal look sane?

@nickhand (Member Author) commented:

Yes, I am thinking along the same lines: something to read and return the data as it appears in the file, and then something to transform it.

A few thoughts:

  • I think we should be able to pass along delayed objects. When reading is expensive, you want to use dask.compute(*delayed_columns) to compute all columns at once, so just keep that in mind
  • I think the "columns as delayed functions" was the easiest interface with dask...dask arrays are still limited for our purposes (you can't mutate a dask array), but the delayed feature works nicely
  • We should think about how things like ZeldovichSim and UniformBox fit into this. Perhaps we use a FileSource and SimulationSource and the transformation can handle any "Source" of data?
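The first two points can be illustrated with a small sketch (names are illustrative, not the PR's API): each column is a `dask.delayed` read, and a single `dask.compute` call evaluates them together, so the shared expensive read happens only once instead of once per column:

```python
import numpy as np
from dask import delayed, compute

calls = []   # track how often the expensive read actually runs

@delayed
def read_file():
    """Stand-in for an expensive read of the whole record array."""
    calls.append(1)
    dtype = np.dtype([('Position', ('f4', 3)), ('Mass', 'f4')])
    return np.ones(10, dtype=dtype)

raw = read_file()
columns = {
    'Position': delayed(lambda d: d['Position'])(raw),
    'Mass': delayed(lambda d: d['Mass'])(raw),
}

# computing columns one by one would run read_file once each;
# a single compute(*delayed_columns) merges the graphs and shares the read
pos, mass = compute(columns['Position'], columns['Mass'])
print(len(calls))   # 1 -- the file was only read once
```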

@nickhand (Member Author) commented:

The sliceable file format straight to dask array is very nice. I've got it working for csv and binary files.

Should be relatively easy to add HDF5, bigfile, and FITS file formats too, and then we'll have pretty good coverage. HDF5 and bigfile can already be converted straight to dask arrays.
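For HDF5 the straight conversion looks roughly like this (the file name and dataset path are made up for illustration): an `h5py` dataset already supports `.shape`, `.dtype`, and slicing, which is all `da.from_array` needs:

```python
import h5py
import numpy as np
import dask.array as da

# write a small example file (hypothetical name and dataset)
with h5py.File('example.h5', 'w') as h5f:
    h5f.create_dataset('Position', data=np.zeros((1000, 3), dtype='f4'))

h5f = h5py.File('example.h5', 'r')

# the h5py Dataset is sliceable, so it converts directly
d = da.from_array(h5f['Position'], chunks=(250, 3))

# reads happen lazily, chunk by chunk, only when we compute
print(d.mean().compute())   # 0.0
```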

@nickhand (Member Author) commented:

Also, I was brainstorming names for the module holding the file formats. I kind of like nbodykit.hermes -- the idea being something like the module passing data quickly from disk to the algorithms...

@rainwoodman (Member) commented:

Does it make sense to make hermes a separate package that nbodykit depends on?

The name is already taken though.


@nickhand (Member Author) commented:

Closing this in favor of #225.

@nickhand nickhand closed this Aug 13, 2016
@nickhand nickhand deleted the dask-backend branch September 7, 2016 18:39