WIP: reorganizing, simplifying + io module #225
Conversation
This includes a class for reading binary files (of a certain format), and a function to convert that file to a dask array.
This shows a potential API that seems to work well --> we need to use dask.delayed to delay the column functions.
I think this should be sufficient to handle more complex cases -- we just need a byte offset for each dtype column; see the sketch below.
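A minimal sketch of what the delayed column readers could look like, assuming a binary layout in which each column is stored contiguously at a known byte offset; the helper names (read_column, bin_to_dask) and the example dtype are hypothetical, not the actual API in this PR:

import numpy as np
import dask
import dask.array as da

def read_column(path, offset, count, dtype):
    # read a single contiguous column starting at its byte offset
    with open(path, 'rb') as ff:
        ff.seek(offset)
        return np.fromfile(ff, dtype=dtype, count=count)

def bin_to_dask(path, size, dtype):
    # one dask.delayed reader per column; nothing touches disk until compute()
    columns = {}
    offset = 0
    for name in dtype.names:
        dt = dtype[name]
        reader = dask.delayed(read_column)(path, offset, size, dt)
        columns[name] = da.from_delayed(reader, shape=(size,), dtype=dt)
        offset += size * dt.itemsize
    return columns

# e.g. a file with three float32 columns, stored one column after another:
# cols = bin_to_dask('data.bin', 1000, np.dtype([('ra', 'f4'), ('dec', 'f4'), ('z', 'f4')]))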
… they are there already. If you overwrite a module, Python 2 sets all of its globals() to None :(
merge again?
Okay, I think I am getting close to merging this. The remaining things I want to add first:
h5py and FITS don't read data until sliced, but the default pandas HDF format needs to read all of the data into memory, which is a bit of an annoyance. It turns out you can save pandas HDF in "table" format, which is readable by h5py and sliceable, but the default is "fixed", which forces the whole DataFrame to be read into memory. Maybe we just don't support "fixed" pandas HDF (or throw a warning). Dask does not support the "fixed" pandas HDF format either (since you can't chunk it); see the sketch below.
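For reference, a small sketch of the "table" workflow (file name and key are made up): pandas writes a PyTables table that h5py can open and slice lazily, which the default "fixed" layout does not allow:

import pandas as pd
import h5py

df = pd.DataFrame({'ra': [1.0, 2.0], 'dec': [3.0, 4.0]})

# format="table" produces a chunkable PyTables table; the default, "fixed", does not
df.to_hdf('cat.h5', key='catalog', format='table')

with h5py.File('cat.h5', 'r') as ff:
    tbl = ff['catalog/table']   # PyTables stores the rows under <key>/table
    first = tbl[:1]             # slicing reads only the requested rows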
If we don't support fixed pandas HDF, does that mean we can read it as a regular HDF5 file? There are two layouts: column as dataset (found via Group.keys()) and column as a field in a compound dataset (found via Dataset.dtype.names). It is not very clear to me how h5py resolves dataset paths: it probably can resolve the first case, but cannot deal with the second. The second case, column as dtype name, doesn't allow one to efficiently slice by columns, so maybe the right thing to do is to simply drop support for it. Also, if we switch to bigfile for the group catalog and subsample catalogues (which already works to some extent?), then there is less motivation to support the second type of HDF5. If we required all attributes to be saved in a special dictionary object, then we could easily pickle them. Both layouts are sketched below.
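To make the two layouts concrete, a hedged h5py sketch (the file names are made up):

import numpy as np
import h5py

# Type 1: each column is its own dataset, discoverable via Group.keys()
with h5py.File('type1.h5', 'w') as ff:
    ff['ra'] = np.zeros(10, dtype='f4')
    ff['dec'] = np.zeros(10, dtype='f4')

# Type 2: one compound dataset; the columns live in Dataset.dtype.names
with h5py.File('type2.h5', 'w') as ff:
    ff['catalog'] = np.zeros(10, dtype=[('ra', 'f4'), ('dec', 'f4')])

with h5py.File('type1.h5', 'r') as ff:
    ra = ff['ra'][:3]              # cheap: the column is contiguous on disk
with h5py.File('type2.h5', 'r') as ff:
    ra = ff['catalog']['ra'][:3]   # h5py reads the whole field, then slices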
I think I would vote for supporting the type 1 and type 2 file layouts for HDF/FITS, and then not supporting the "fixed" pandas HDF files. One issue I'm seeing for the pickling is that the CSVFile is actually built on a dask.dataframe attribute, which allows for much faster parsing of the different partitions. So we'll have to see how to handle that; one option is sketched below.
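One possible way to handle the pickling, sketched under the assumption that the dask.dataframe attribute can simply be rebuilt from the file path on unpickling (the attribute names here are made up):

import dask.dataframe as dd

class CSVFile(object):
    def __init__(self, path, **kws):
        self.path = path
        self.kws = kws
        self._df = dd.read_csv(path, **kws)   # the dask.dataframe attribute

    def __getstate__(self):
        # drop the dask.dataframe; keep only what is needed to rebuild it
        return {'path': self.path, 'kws': self.kws}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._df = dd.read_csv(self.path, **self.kws)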
When we are closer to the hardware (local file system), mmap saves a few syscalls and some copying (http://stackoverflow.com/questions/9817233/why-mmap-is-faster-than-sequential-io). If we are calling a library, then the complexity is already hidden behind the library API, so we don't care much. As for the mmap magic in FITS, Rollin did some benchmarks before and showed it actually still has to read the full file anyway (due to the lack of a jump table for the starting points of blocks). mmap may have helped for particular use cases (why else would someone implement it), but it is definitely not as useful as it sounds.
Let's not support pandas HDF5, but support mixed Type 1 and Type 2 HDF5? -- Basically, the column name passed into the read method can be a full path to a column; see the sketch below.
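A hedged sketch of that resolution (the helper name is hypothetical): try the name as a full dataset path first (Type 1), then fall back to splitting off a trailing field name (Type 2):

def read_column(ff, name):
    # `name` is either a dataset path ("grp/ra") or a path plus a field ("grp/cat/ra")
    if name in ff:
        return ff[name]            # Type 1: the column is its own dataset
    path, _, field = name.rpartition('/')
    return ff[path][field]         # Type 2: a field of a compound dataset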
How concerned should we be about supporting arbitrary types that the user might want to read in? The user can always write their own plugins, but I think there is something to be said for having built-in support for the use case of small row-major FITS data, where it doesn't hurt to just search through the file every time... It does seem nice if we could somehow detect column-major vs row-major and then warn the user in the latter case.
and thanks for the mmap knowledge 👍...really interesting
We can obtain a list of column paths from the transformation object, then request those columns from the file object. For each column path it is relatively easy to obtain the correct dtype. But I thought we would make each column a separate dask array anyway?
I think passing the file itself directly to dask, and then requesting columns from dask, is potentially the more elegant solution, since the file has a well-defined shape and dtype. Also, at this point, requesting a specific string column from a FileType returns a "view" of the initial file with the single column, so it returns a FileType as well (see the logic here and here). So you can mimic the behavior of a numpy array, i.e.:

# the original file has three named fields
>>> ff.dtype
dtype([('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')])
>>> ff.shape
(1000,)
>>> ff.columns
['ra', 'dec', 'z']
>>> ff[:3]
array([(235.63442993164062, 59.39099884033203, 0.6225500106811523),
       (140.36181640625, -1.162310004234314, 0.5026500225067139),
       (129.96627807617188, 45.970130920410156, 0.4990200102329254)],
      dtype=(numpy.record, [('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')]))

# string indexing returns a new file instance that points to the original
>>> ra = ff['ra']
>>> ra
<CSVFile of shape (1000,)>

# slicing returns the underlying array
>>> ra[:3]
array([ 235.63442993,  140.36181641,  129.96627808], dtype=float32)

# indexing with a list returns a structured array
>>> ff[['ra']][:3]
array([(235.63442993164062,), (140.36181640625,), (129.96627807617188,)],
      dtype=(numpy.record, [('ra', '<f4')]))
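Since the file object exposes shape, dtype, and slicing, it could in principle be handed straight to dask; a minimal sketch (the chunk size is arbitrary):

>>> import dask.array as da
>>> darr = da.from_array(ff, chunks=100)   # wraps anything with shape/dtype/__getitem__
>>> ra = darr['ra']                        # lazy field access on the structured dtype
>>> ra[:3].compute()
array([ 235.63442993,  140.36181641,  129.96627808], dtype=float32)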
I guess this is supposed to work. Can you have a 'path' field name in dask?
Do you want me to merge #233 first, or will you merge this PR first?
You can go ahead and merge #233.
Sure. I have a few other fixes in mind too... What about merging this PR soon and filing the new DataSource implementations under a new PR?
Okay, I'll merge this now and we can go from there |
The goals here are:
I will leave Transformations to a second PR