
WIP: reorganizing, simplifying + io module #225

Merged: 80 commits, Aug 18, 2016

Conversation

nickhand
Member

The goals here are:

  • reorganize the code in a more logical way
  • simplify and clean up outdated and potentially unused code
  • make the plugin/extension-points framework more robust, i.e., handle subclasses better
  • add FileTypes and an IO module that interfaces nicely with dask

I will leave Transformations to a second PR.

this includes a class for reading binary files (of a certain format),
and a function to convert that file to a dask array
this shows a potential API that seems to work well --> need to use
dask.delayed to delay the column functions
I think this should be sufficient to handle more complex cases -- just need a byte offset for each dtype column
… they are there already

if you overwrite a module, python 2 sets all of its globals() to None :(
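The dask.delayed approach in the commit notes above can be sketched roughly as follows, assuming a hypothetical fixed-format binary file; the file name, dtype, byte offset, and row count are made up for illustration:

import numpy as np
import dask.array as da
from dask import delayed

def read_column(path, dtype, offset, count):
    # hypothetical reader: seek to the column's byte offset and read `count` items
    with open(path, 'rb') as ff:
        ff.seek(offset)
        return np.fromfile(ff, dtype=dtype, count=count)

# delaying the read means nothing touches disk until the dask array is computed
size = 1000  # assumed number of rows in the binary file
lazy = delayed(read_column)('particles.bin', 'f4', offset=0, count=size)
ra = da.from_delayed(lazy, shape=(size,), dtype='f4')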
@rainwoodman
Member

merge again?

@nickhand
Member Author

Okay, I think I am getting close to merging this. The remaining things I want to add first:

  1. HDF file: h5py and pandas support
  2. FITS file
  3. pickling support? Some of the files initialize other attributes beyond the required API, so this could be tricky

h5py and FITS don't read data until sliced, but the default pandas HDF needs to read all of the data into memory, which is a bit of an annoyance.

It turns out you can save pandas HDF in "table" format, which is readable by h5py and sliceable, but the default is "fixed", which forces the whole DataFrame to be read into memory. Maybe we just don't support "fixed" pandas HDF (or throw a warning). Dask does not support the "fixed" pandas HDF format either (since you can't chunk it).
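For reference, a minimal sketch of the difference, with a throwaway DataFrame and file name; the exact field layout PyTables writes for the "table" format can vary with pandas versions:

import pandas as pd
import h5py

df = pd.DataFrame({'ra': [1.0, 2.0], 'dec': [3.0, 4.0]})

# "table" format is an on-disk PyTables table that h5py can slice lazily;
# the default "fixed" format can only be read back whole through pandas
df.to_hdf('catalog.h5', key='catalog', format='table', data_columns=True)

with h5py.File('catalog.h5', 'r') as ff:
    dset = ff['catalog/table']   # structured dataset written by PyTables
    print(dset.dtype.names)      # column fields plus the index
    print(dset[:2])              # only this slice is read from disk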

@rainwoodman
Member

If we don't support fixed pandas HDF, does it mean we can read it as a regular HDF5?
HDF5 files are hairy because we can have two types of columns:

  1. Column as dataset (DataGroup.keys()), e.g. on NERSC: ~yfeng1/m779/yfeng1/imaginglss/dr2lss/FDRPP_ELG.hdf5. bigfile falls into this category.

  2. Column as a field of a dataset (Dataset.dtype.names), e.g. the current nbkit group/subsample catalogs. Most FITS files fall into this category.

It is not very clear to me how h5py resolves dataset paths. It probably can resolve the first case, but cannot deal with the second case.

The second case (columns as dtype fields) doesn't allow one to efficiently slice by column, so maybe the right thing to do is simply drop support for those. Also, if we switch to bigfile for the group and subsample catalogues (which already works to some extent?), then there is less motivation to support the second type of HDF5.
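For concreteness, a small sketch of the two layouts as seen through h5py, with made-up column names and sizes:

import numpy as np
import h5py

with h5py.File('example.h5', 'w') as ff:
    # type 1: each column is its own dataset under a group
    grp = ff.create_group('Type1')
    grp['ra'] = np.zeros(1000, dtype='f4')
    grp['dec'] = np.zeros(1000, dtype='f4')

    # type 2: one dataset with a compound dtype; columns are dtype fields
    dtype = np.dtype([('ra', 'f4'), ('dec', 'f4')])
    ff.create_dataset('Type2', data=np.zeros(1000, dtype=dtype))

with h5py.File('example.h5', 'r') as ff:
    print(list(ff['Type1'].keys()))   # ['dec', 'ra'] -- columns as datasets
    print(ff['Type2'].dtype.names)    # ('ra', 'dec') -- columns as dtype fields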

If we required all attributes to be saved in a special dictionary object, then we could easily pickle them.
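A rough sketch of that idea (the class name and attribute values here are hypothetical, not the PR's actual API):

class BinaryFile(object):
    """Hypothetical file type that keeps all derived state in one dict."""

    def __init__(self, path):
        self.path = path
        # everything beyond the required API lives in `attrs`
        self.attrs = {'dtype': 'f4', 'size': 1000}

    def __getstate__(self):
        # pickling only needs the path and the attrs dict
        return {'path': self.path, 'attrs': self.attrs}

    def __setstate__(self, state):
        self.__dict__.update(state)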

@nickhand
Member Author

  • yes, I think we can use h5py to load pandas HDF files that are saved in the "table" format, assuming the pandas API doesn't drastically change
  • the nice thing about h5py and FITS files is that the data is memory-mapped by default, so only the data needed for each "read" call is loaded at a time. I think this is true of most file types we would want to implement, but it is not true of the "fixed" format of pandas HDF.
  • h5py datasets can apparently be "chunked" when saving to disk, which allows for faster evaluation for the "type 2" case
  • I have the "all columns as a single dataset" case (type 2) coded up just by giving the file path and the dataset key, i.e., /path/to/group/dataset1. I think it wouldn't be much harder to handle type 1 by having the user pass the group key instead (/path/to/group) and then loading all valid datasets (or some subset, if the user specifies one)

I think I would vote for supporting type 1 and type 2 file types for HDF/FITS and then not supporting the "fixed" pandas HDF files.

One issue I'm seeing for the pickling is that CSVFile is actually built on a dask.dataframe attribute, which allows for much faster parsing of the different partitions. So we'll have to see how to handle that.
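For context, the partition-based parsing that dask.dataframe provides looks roughly like this; the file name, column names, and blocksize are illustrative, not the CSVFile API:

import dask.dataframe as dd

# dask parses the CSV in independent chunks ("partitions") rather than all at once
df = dd.read_csv('catalog.csv', names=['ra', 'dec', 'z'], blocksize=32 * 1024 * 1024)

print(df.npartitions)        # number of independently parsed chunks
part = df.get_partition(0)   # still lazy
ra = part['ra'].compute()    # only this partition is actually parsed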

@rainwoodman
Member

  • A few opinions on mmap: it is unclear, at best, whether there is any performance gain from mmap on a parallel file system like Lustre. For one, Lustre used to have a bug where reads through mmap were broken into very small pages and were much slower than you would expect (https://www.google.com/search?q=lustre+mmap+slow&ie=utf-8&oe=utf-8).

When we are closer to the hardware (local file system), mmap saves a few syscalls and some copying (http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io).
But in my opinion, the biggest gain of mmap is that it lets the user application reuse the OS virtual-memory manager to handle the overlapping/swapping involved in accessing a larger-than-memory file. Any performance gain is negligible once real processing of the data is involved -- unless one is just doing strange things such as benchmarking.

If we are calling a library, then the complexity is already hidden behind the library API, so we don't care much. As for the mmap magic in FITS, Rollin ran some benchmarks before and showed that it still has to read the full file anyway (due to the lack of a jump table for the starting points of blocks). mmap may have helped for particular use cases (why else would someone implement it?), but it is definitely not as useful as it sounds.

  • chunked: yes, but why not avoid the problem altogether when we have the flexibility to define the storage of a column?
  • what about type 1 and type 2 mixed together? We may want the transformer to be able to directly query any path like /DataGroup1/DataSet1/DtypeFieldName in an HDF5 file.

Let's not support pandas HDF5, but support mixed type 1 and type 2 HDF5? -- Basically, the column name passed into the read method could be a full path to a column (see the sketch after this list).

  • It is a bit strange that CSVFile relies on FileStack but BinaryFile doesn't. Also, it feels odd that __init__ calls fromfiles -- maybe it should be from_paths and __init__?
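A possible sketch of that read convention, where the column name may be either a dataset path or a dataset path plus a dtype field name (the helper name and signature are hypothetical):

import h5py

def read_column(path, colpath, start, stop):
    # hypothetical helper: `colpath` may name a dataset ("/Group/Dataset")
    # or a field of a compound dataset ("/Group/Dataset/FieldName")
    with h5py.File(path, 'r') as ff:
        if colpath in ff:
            # type 1: the column is its own dataset, so slice it directly
            return ff[colpath][start:stop]
        dsetpath, field = colpath.rsplit('/', 1)
        # type 2: read the row slice (all fields) and then select the field;
        # this also shows why type 2 cannot be sliced efficiently by column
        return ff[dsetpath][start:stop][field]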

@nickhand
Member Author

  • mixing type 1 and type 2 columns in the read function could potentially be difficult. To interface with dask, the dtype and size need to be known once the file is initialized. It might be possible, but would we need to walk the directory structure of the HDF5 file at initialization? I don't know if that's possible (see the sketch after this list)
  • there would need to be requirements on the arrays: same length, etc.
  • Maybe it is better to have HDFFile handle a single group or dataset key, and then write a FileUnion class that allows continuous access across several files (FileUnion is also how I imagine a binary file with a Gadget-like layout would be handled)
  • I agree that the CSVFile dependency on FileStack is not great -- I've been meaning to clean it up once I get the partitioning to work nicely
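On the first point, walking the file at initialization does look feasible with h5py; a minimal sketch (not the PR's code):

import h5py

def collect_datasets(path):
    # record the shape and dtype of every dataset in the file, so the
    # dask interface can know them up front when the file is initialized
    found = {}

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype)

    with h5py.File(path, 'r') as ff:
        ff.visititems(visitor)
    return found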

How concerned should we be about supporting arbitrary types that the user might want to read in? The user can always write their own plugins, but I think there is something to be said for having some built-in support for the use case of small, row-major FITS data, where it doesn't hurt to just scan through the file every time... It would be nice if we could somehow detect column-major vs. row-major and warn the user in the latter case, or something...

@nickhand
Member Author

and thanks for the mmap knowledge 👍...really interesting

@rainwoodman
Member

We can obtain a list of column paths from the transformation object, then request those columns from the file object. For each column path it is relatively easy to obtain the correct dtype. But I thought we would make each column a separate dask array anyway?
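A sketch of the per-column option, assuming the file's column "view" exposes shape, dtype, and slicing (the helper name and chunk size are made up):

import dask.array as da

def columns_as_dask(ff, chunksize=100000):
    # ff[col] is assumed to return a column "view" with .shape, .dtype, and
    # __getitem__, which is all dask needs to wrap it as a lazy array
    return {col: da.from_array(ff[col], chunks=chunksize) for col in ff.columns}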

@nickhand
Member Author

I was imagining passing the file itself directly into dask, since it has a well-defined shape and dtype. You could run into problems if the full column doesn't fit into memory (and thus can't be passed to dask), unless you are careful with the partitioning?

I think passing the file directly to dask, and then requesting columns from dask, is potentially the more elegant solution.

Also, at this point, requesting a specific string column from a FileType returns a "view" of the initial file with the single column, so it returns a FileType as well (see the logic here and here). So then you can mimic the behavior of a numpy array, i.e.,

# original file has three named fields
>>> ff.dtype
dtype([('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')])
>>> ff.shape
(1000,)
>>> ff.columns
['ra', 'dec', 'z']
>>> ff[:3]
array([(235.63442993164062, 59.39099884033203, 0.6225500106811523),
       (140.36181640625, -1.162310004234314, 0.5026500225067139),
       (129.96627807617188, 45.970130920410156, 0.4990200102329254)],
      dtype=(numpy.record, [('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')]))

# string indexing returns a new file instance that points to the original
>>> ra = ff['ra']
>>> ra
<CSVFile of shape (1000,)>
# slicing returns the underlying array
>>> ra[:3]
array([ 235.63442993,  140.36181641,  129.96627808], dtype=float32)
# indexing with a list returns a structured array
>>> ff[['ra']][:3]
array([(235.63442993164062,), (140.36181640625,), (129.96627807617188,)],
      dtype=(numpy.record, [('ra', '<f4')]))
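And a sketch of the "pass the file directly into dask" option, assuming dask.array can wrap any object exposing shape, dtype, and numpy-style slicing (the chunk size is arbitrary):

import dask.array as da

# the file object above has .shape, .dtype, and slicing, so dask can wrap it
# directly and read chunks only when they are computed
arr = da.from_array(ff, chunks=100000)
ra = arr['ra']            # field selection on the record dtype stays lazy
print(ra[:3].compute())   # only now are the needed rows read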

@rainwoodman
Member

I guess this is supposed to work. Can you have a 'path' field name in dask?

@rainwoodman
Member

Do you want me to merge #233 first or you merge this PR first?

@nickhand
Member Author

you can go ahead and merge #233

@rainwoodman
Member

Sure. I have a few other fixes in mind too... What about merging this PR soon and filing the new DataSource implementations under a new PR?

@nickhand
Member Author

Okay, I'll merge this now and we can go from there

@nickhand nickhand mentioned this pull request Aug 18, 2016
@nickhand nickhand merged commit 51081f0 into bccp:master Aug 18, 2016
@nickhand nickhand deleted the overhaul branch August 18, 2016 07:26