
WIP: dask + file formats implementation #222

Closed
wants to merge 19 commits

Conversation

@nickhand (Member) commented Aug 9, 2016

We'll use this PR to explore using dask and file formats.

This includes a class for reading binary files (of a certain format) and a function to convert such a file to a dask array.
@nickhand (Member Author) commented Aug 9, 2016

@rainwoodman -- take a look at this new potential syntax for "file formats" and dask interface

This was mostly to explore dask some more -- there's an example main in the file that I've tested reading RunPB DM snapshots on Cori. Takes just a few seconds.

A few initial thoughts:

  • I think the dask array is the right choice over data frames:
    • The lack of vector data types is really ugly to get around
    • The interface to the DataFrame is just not as intuitive as the array syntax -- I've used pandas a lot, and interacting with DataFrames still isn't as easy as working with simple arrays
    • Using dask arrays is more in line with our current interface, where the "read" functions in different datasources are all operating on structured arrays
  • I think one existing problem is that the DataSource right now is handling both the reading of the file and the transformation of what is read, which are really two different tasks
  • I think we want several builtin FileFormat classes like this BinaryFile class and then maybe the DataSource should specify which file format(s) it can read?
  • I think it should be possible to define a fixed interface for different file types (binary, hdf5, csv, bigfile) and functions to convert them to dask arrays

s.add_argument("bunchsize", type=int, help="number of particles to read per rank in a bunch")


def Position(self, data):
@rainwoodman (Member):
You sure it is OK to mix the names of columns and methods?

@nickhand (Member Author):
I am not sure, but for the moment, I couldn't think of anything

@rainwoodman (Member) commented:

I am leaning towards having the datasource/FileTypePlugin return a dict of dask.arrays / dask.delayed for all columns, normalized to a given set of units (with partitions scheduled to this rank). Then we will have a Transformer that does RSD, RA/DEC -> XYZ, and other renaming, and returns a dict of dask arrays / dask.delayed.

A painter will take a transformer's output dictionary, pick the columns according to the painter's documented protocol, compute them, and call ParticleMesh to paint them.
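This proposed pipeline can be sketched roughly as below. All names (`file_plugin`, `transformer`, `painter`) are hypothetical stand-ins for the FileTypePlugin / Transformer / painter roles, not real nbodykit API:

```python
import numpy as np
import dask.array as da

def file_plugin(n=1000):
    """Stand-in FileTypePlugin: columns as lazy dask arrays in fixed units."""
    rng = np.random.RandomState(42)
    return {
        'Position': da.from_array(rng.uniform(0, 100, (n, 3)), chunks=500),
        'Velocity': da.from_array(rng.normal(0, 1, (n, 3)), chunks=500),
    }

def transformer(columns, los=np.array([0, 0, 1.])):
    """Stand-in Transformer: add RSD along an axis-aligned line of sight.

    Elementwise v * [0, 0, 1] keeps only the z component, i.e. (v . z) z."""
    out = dict(columns)
    out['Position'] = columns['Position'] + columns['Velocity'] * los
    return out                     # still lazy: nothing read or computed yet

def painter(columns):
    """Stand-in painter: pick its documented columns and compute them."""
    return columns['Position'].compute()   # work happens only here

data = painter(transformer(file_plugin()))
print(data.shape)   # (1000, 3)
```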

@rainwoodman (Member) commented:

Shall we call the Transformer PointDataSource, and the first object that normalizes files FileTypePlugin?
Does this proposal look sane?

@nickhand (Member Author) commented:

Yes, I am thinking along the same lines: something to read and return the data as it appears in the file, and then something to transform it.

A few thoughts:

  • I think we should be able to pass along delayed objects. When reading is expensive, you want to use dask.compute(*delayed_columns) to compute all columns at once, so just keep that in mind
  • I think the "columns as delayed functions" was the easiest interface with dask...dask arrays are still limited for our purposes (you can't mutate a dask array), but the delayed feature works nicely
  • We should think about how things like ZeldovichSim and UniformBox fit into this. Perhaps we use a FileSource and SimulationSource and the transformation can handle any "Source" of data?
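The first two points can be illustrated with a small sketch (names are illustrative, not the PR's API): each column is a `dask.delayed` read, and a single `dask.compute` call evaluates them together, so the shared expensive read happens only once instead of once per column:

```python
import numpy as np
from dask import delayed, compute

calls = []   # track how often the expensive read actually runs

@delayed
def read_file():
    """Stand-in for an expensive read of the whole record array."""
    calls.append(1)
    dtype = np.dtype([('Position', ('f4', 3)), ('Mass', 'f4')])
    return np.ones(10, dtype=dtype)

raw = read_file()
columns = {
    'Position': delayed(lambda d: d['Position'])(raw),
    'Mass': delayed(lambda d: d['Mass'])(raw),
}

# computing columns one by one would run read_file once each;
# a single compute(*delayed_columns) merges the graphs and shares the read
pos, mass = compute(columns['Position'], columns['Mass'])
print(len(calls))   # 1 -- the file was only read once
```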

@nickhand (Member Author) commented:

The sliceable file format straight to dask array is very nice. I've got it working for csv and binary files.

Should be relatively easy to add HDF5, bigfile, and FITS file formats too, and then we'll have pretty good coverage. HDF5 and bigfile can already be converted straight to dask arrays.
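For HDF5 the straight conversion looks roughly like this (the file name and dataset path are made up for illustration): an `h5py` dataset already supports `.shape`, `.dtype`, and slicing, which is all `da.from_array` needs:

```python
import h5py
import numpy as np
import dask.array as da

# write a small example file (hypothetical name and dataset)
with h5py.File('example.h5', 'w') as h5f:
    h5f.create_dataset('Position', data=np.zeros((1000, 3), dtype='f4'))

h5f = h5py.File('example.h5', 'r')

# the h5py Dataset is sliceable, so it converts directly
d = da.from_array(h5f['Position'], chunks=(250, 3))

# reads happen lazily, chunk by chunk, only when we compute
print(d.mean().compute())   # 0.0
```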

@nickhand (Member Author) commented:

Also, I was brainstorming names for the module holding the file formats. I kind of like nbodykit.hermes -- the idea being something like the module passing data quickly from disk to the algorithms...

@rainwoodman (Member) commented:

Does it make sense to make hermes a separate package that nbodykit depends on?

The name is already taken though.


@nickhand (Member Author) commented:

Closing this in favor of #225.

@nickhand nickhand closed this Aug 13, 2016
@nickhand nickhand deleted the dask-backend branch September 7, 2016 18:39