
WIP: reorganizing, simplifying + io module #225

Merged: 80 commits, Aug 18, 2016

Conversation

nickhand
Member

The goals here are:

  • reorganize the code in a more logical way
  • simplify and clean up outdated and potentially unused code
  • make the plugin/extension-points framework more robust, i.e., handle subclasses better
  • add FileTypes and an IO module that interfaces nicely with dask

I will leave Transformations to a second PR.

this includes a class for reading binary files (of a certain format),
and a function to convert that file to a dask array
this shows a potential API that seems to work well --> need to use
dask.delayed to delay the column functions
I think this should be sufficient to handle more complex cases -- just need a byte offset for each dtype column
… they are there already

if you overwrite a module, python 2 sets all of its globals() to None :(
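The dask.delayed approach in the commit notes above can be sketched roughly as follows, assuming a hypothetical fixed-format binary file; the file name, dtype, byte offset, and row count are made up for illustration:

import numpy as np
import dask.array as da
from dask import delayed

def read_column(path, dtype, offset, count):
    # hypothetical reader: seek to the column's byte offset and read `count` items
    with open(path, 'rb') as ff:
        ff.seek(offset)
        return np.fromfile(ff, dtype=dtype, count=count)

# delaying the read means nothing touches disk until the dask array is computed
size = 1000  # assumed number of rows in the binary file
lazy = delayed(read_column)('particles.bin', 'f4', offset=0, count=size)
ra = da.from_delayed(lazy, shape=(size,), dtype='f4')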
@rainwoodman
Member

merge again?

@nickhand
Member Author

Okay, I think I am getting close to merging this. The remaining things I want to add first:

  1. HDF file: h5py and pandas support
  2. FITS file
  3. pickling support? Some of the files initialize other attributes beyond the required API, so this could be tricky

h5py and FITS don't read data until sliced, but the default pandas HDF needs to read all of the data into memory, which is a bit of an annoyance.

It turns out you can save pandas HDF in "table" format, which is readable by h5py and sliceable, but the default is "fixed", which forces the whole DataFrame to be read into memory. Maybe we just don't support "fixed" pandas HDF (or throw a warning). Dask does not support the "fixed" pandas HDF format either (since you can't chunk it).
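For reference, a minimal sketch of the difference, with a throwaway DataFrame and file name; the exact field layout PyTables writes for the "table" format can vary with pandas versions:

import pandas as pd
import h5py

df = pd.DataFrame({'ra': [1.0, 2.0], 'dec': [3.0, 4.0]})

# "table" format is an on-disk PyTables table that h5py can slice lazily;
# the default "fixed" format can only be read back whole through pandas
df.to_hdf('catalog.h5', key='catalog', format='table', data_columns=True)

with h5py.File('catalog.h5', 'r') as ff:
    dset = ff['catalog/table']   # structured dataset written by PyTables
    print(dset.dtype.names)      # column fields plus the index
    print(dset[:2])              # only this slice is read from disk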

@rainwoodman
Member

If we don't support fixed pandas HDF, does it mean we can read it as a regular HDF5?
HDF5 files are hairy because we can have two types of columns:

  1. Column as dataset (DataGroup.keys()), e.g. on NERSC: ~yfeng1/m779/yfeng1/imaginglss/dr2lss/FDRPP_ELG.hdf5. bigfile falls into this category.

  2. Column as a field of a dataset (Dataset.dtype.names), e.g. the current nbkit group/subsample catalogs. Most FITS files fall into this category.

It is not very clear to me how h5py resolves dataset paths. It probably can resolve the first case, but cannot deal with the second case.

The second case (columns as dtype fields) doesn't allow one to efficiently slice by column, so maybe the right thing to do is simply drop support for those. Also, if we switch to bigfile for the group and subsample catalogues (which already works to some extent?), then there is less motivation to support the second type of HDF5.
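For concreteness, a small sketch of the two layouts as seen through h5py, with made-up column names and sizes:

import numpy as np
import h5py

with h5py.File('example.h5', 'w') as ff:
    # type 1: each column is its own dataset under a group
    grp = ff.create_group('Type1')
    grp['ra'] = np.zeros(1000, dtype='f4')
    grp['dec'] = np.zeros(1000, dtype='f4')

    # type 2: one dataset with a compound dtype; columns are dtype fields
    dtype = np.dtype([('ra', 'f4'), ('dec', 'f4')])
    ff.create_dataset('Type2', data=np.zeros(1000, dtype=dtype))

with h5py.File('example.h5', 'r') as ff:
    print(list(ff['Type1'].keys()))   # ['dec', 'ra'] -- columns as datasets
    print(ff['Type2'].dtype.names)    # ('ra', 'dec') -- columns as dtype fields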

If we required all attributes to be saved in a special dictionary object, then we could easily pickle them.
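A rough sketch of that idea (the class name and attribute values here are hypothetical, not the PR's actual API):

class BinaryFile(object):
    """Hypothetical file type that keeps all derived state in one dict."""

    def __init__(self, path):
        self.path = path
        # everything beyond the required API lives in `attrs`
        self.attrs = {'dtype': 'f4', 'size': 1000}

    def __getstate__(self):
        # pickling only needs the path and the attrs dict
        return {'path': self.path, 'attrs': self.attrs}

    def __setstate__(self, state):
        self.__dict__.update(state)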

@nickhand
Member Author

  • yes, I think we can use h5py to load pandas HDF files that are saved in the "table" format, assuming the pandas API doesn't drastically change
  • the nice thing about h5py and FITS files is that the data is memory-mapped by default, so only the data needed for each "read" call is loaded at a time. I think this is true of most file types we would want to implement, but it is not true of the "fixed" format of pandas HDF.
  • h5py datasets can apparently be "chunked" when saving to disk, which allows for faster evaluation for the "type 2" case
  • I have the "all columns as a single dataset" case (type 2) coded up just by giving the file path and the dataset key, i.e., /path/to/group/dataset1. I think it wouldn't be much harder to handle type 1 by having the user pass the group key instead (/path/to/group) and then loading all valid datasets (or some subset, if the user specifies one)

I think I would vote for supporting type 1 and type 2 file types for HDF/FITS and then not supporting the "fixed" pandas HDF files.

One issue I'm seeing for the pickling is that CSVFile is actually built on a dask.dataframe attribute, which allows for much faster parsing of the different partitions. So we'll have to see how to handle that.
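For context, the partition-based parsing that dask.dataframe provides looks roughly like this; the file name, column names, and blocksize are illustrative, not the CSVFile API:

import dask.dataframe as dd

# dask parses the CSV in independent chunks ("partitions") rather than all at once
df = dd.read_csv('catalog.csv', names=['ra', 'dec', 'z'], blocksize=32 * 1024 * 1024)

print(df.npartitions)        # number of independently parsed chunks
part = df.get_partition(0)   # still lazy
ra = part['ra'].compute()    # only this partition is actually parsed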

@rainwoodman
Member

  • A few opinions on mmap: it is unclear, at best, whether there is any performance gain from mmap on a parallel file system like Lustre. For one, Lustre used to have a bug where reads through mmap were broken into very small pages and were much slower than you would expect (https://www.google.com/search?q=lustre+mmap+slow&ie=utf-8&oe=utf-8).

When we are closer to the hardware (local file system), mmap saves a few syscalls and some copying (http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io).
But in my opinion, the biggest gain of mmap is that it lets the user application reuse the OS virtual-memory manager to handle the overlapping/swapping involved in accessing a larger-than-memory file. Any performance gain is negligible once real processing of the data is involved -- unless one is just doing strange things such as benchmarking.

If we are calling a library, then the complexity is already hidden behind the library API, so we don't care much. As for the mmap magic in FITS, Rollin ran some benchmarks before and showed that it still has to read the full file anyway (due to the lack of a jump table for the starting points of blocks). mmap may have helped for particular use cases (why else would someone implement it?), but it is definitely not as useful as it sounds.

  • chunked: yes, but why not avoid the problem altogether when we have the flexibility to define the storage of a column?
  • what about type 1 and type 2 mixed together? We may want the transformer to be able to directly query any path like /DataGroup1/DataSet1/DtypeFieldName in an HDF5 file.

Let's not support pandas HDF5, but support mixed type 1 and type 2 HDF5? -- Basically, the column name passed into the read method could be a full path to a column (see the sketch after this list).

  • It is a bit strange that CSVFile relies on FileStack but BinaryFile doesn't. Also, it feels odd that __init__ calls fromfiles -- maybe it should be from_paths and __init__?
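A possible sketch of that read convention, where the column name may be either a dataset path or a dataset path plus a dtype field name (the helper name and signature are hypothetical):

import h5py

def read_column(path, colpath, start, stop):
    # hypothetical helper: `colpath` may name a dataset ("/Group/Dataset")
    # or a field of a compound dataset ("/Group/Dataset/FieldName")
    with h5py.File(path, 'r') as ff:
        if colpath in ff:
            # type 1: the column is its own dataset, so slice it directly
            return ff[colpath][start:stop]
        dsetpath, field = colpath.rsplit('/', 1)
        # type 2: read the row slice (all fields) and then select the field;
        # this also shows why type 2 cannot be sliced efficiently by column
        return ff[dsetpath][start:stop][field]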

@nickhand
Member Author

  • mixing type 1 and type 2 columns in the read function could potentially be difficult. To interface with dask, the dtype and size need to be known once the file is initialized. It might be possible, but would we need to walk the directory structure of the HDF5 file at initialization? I don't know if that's possible (see the sketch after this list)
  • there would need to be requirements on the arrays: same length, etc.
  • Maybe it is better to have HDFFile handle a single group or dataset key, and then write a FileUnion class that allows continuous access across several files (FileUnion is also how I imagine a binary file with a Gadget-like layout would be handled)
  • I agree that the CSVFile dependency on FileStack is not great -- I've been meaning to clean it up once I get the partitioning to work nicely
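On the first point, walking the file at initialization does look feasible with h5py; a minimal sketch (not the PR's code):

import h5py

def collect_datasets(path):
    # record the shape and dtype of every dataset in the file, so the
    # dask interface can know them up front when the file is initialized
    found = {}

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype)

    with h5py.File(path, 'r') as ff:
        ff.visititems(visitor)
    return found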

How concerned should we be about supporting arbitrary types that the user might want to read in? The user can always write their own plugins, but I think there is something to be said for having some built-in support for the use case of small, row-major FITS data, where it doesn't hurt to just scan through the file every time... It would be nice if we could somehow detect column-major vs. row-major and warn the user in the latter case, or something...

@nickhand
Member Author

and thanks for the mmap knowledge 👍...really interesting

@rainwoodman
Member

We can obtain a list of column paths from the transformation object, then request those columns from the file object. For each column path it is relatively easy to obtain the correct dtype. But I thought we would make each column a separate dask array anyway?
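A sketch of the per-column option, assuming the file's column "view" exposes shape, dtype, and slicing (the helper name and chunk size are made up):

import dask.array as da

def columns_as_dask(ff, chunksize=100000):
    # ff[col] is assumed to return a column "view" with .shape, .dtype, and
    # __getitem__, which is all dask needs to wrap it as a lazy array
    return {col: da.from_array(ff[col], chunks=chunksize) for col in ff.columns}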

@nickhand
Member Author

I was imagining passing the file itself directly into dask, since it has a well-defined shape and dtype. You could run into problems if the full column doesn't fit into memory (and thus can't be passed to dask), unless you are careful with the partitioning?

I think passing the file directly to dask, and then requesting columns from dask, is potentially the more elegant solution.

Also, at this point, requesting a specific string column from a FileType returns a "view" of the initial file with the single column, so it returns a FileType as well (see the logic here and here). So then you can mimic the behavior of a numpy array, i.e.,

# original file has three named fields
>>> ff.dtype
dtype([('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')])
>>> ff.shape
(1000,)
>>> ff.columns
['ra', 'dec', 'z']
>>> ff[:3]
array([(235.63442993164062, 59.39099884033203, 0.6225500106811523),
       (140.36181640625, -1.162310004234314, 0.5026500225067139),
       (129.96627807617188, 45.970130920410156, 0.4990200102329254)],
      dtype=(numpy.record, [('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')]))

# string indexing returns a new file instance that points to the original
>>> ra = ff['ra']
>>> ra
<CSVFile of shape (1000,)>
# slicing returns the underlying array
>>> ra[:3]
array([ 235.63442993,  140.36181641,  129.96627808], dtype=float32)
# indexing with a list returns a structured array
>>> ff[['ra']][:3]
array([(235.63442993164062,), (140.36181640625,), (129.96627807617188,)],
      dtype=(numpy.record, [('ra', '<f4')]))
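And a sketch of the "pass the file directly into dask" option, assuming dask.array can wrap any object exposing shape, dtype, and numpy-style slicing (the chunk size is arbitrary):

import dask.array as da

# the file object above has .shape, .dtype, and slicing, so dask can wrap it
# directly and read chunks only when they are computed
arr = da.from_array(ff, chunks=100000)
ra = arr['ra']            # field selection on the record dtype stays lazy
print(ra[:3].compute())   # only now are the needed rows read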

@rainwoodman
Member

I guess this is supposed to work. Can you have a 'path' field name in dask?

@rainwoodman
Member

Do you want me to merge #233 first or you merge this PR first?

@nickhand
Member Author

you can go ahead and merge #233

@rainwoodman
Member

Sure. I have a few other fixes in mind too... What about merging this PR soon and filing the new DataSource implementations under a new PR?

@nickhand
Member Author

Okay, I'll merge this now and we can go from there

@nickhand nickhand mentioned this pull request Aug 18, 2016
@nickhand nickhand merged commit 51081f0 into bccp:master Aug 18, 2016
@nickhand nickhand deleted the overhaul branch August 18, 2016 07:26