
Provide examples #9

Closed
SuperFluffy opened this issue Jan 26, 2016 · 28 comments

Comments

@SuperFluffy (Contributor) commented Jan 26, 2016:

What's the current state of this? Out of all the hdf5 bindings out there, yours seems to be the most actively developed.

Are you planning to provide some examples of how to interact with hdf5-rs, or do you think it is not ready for consumption yet?

@aldanor (Owner) commented Jan 26, 2016:

I'm planning to work on dataset reading/writing in February, at which point it will become fully usable. I've put quite a lot of effort into making the bindings thread-safe (which other wrapping libraries don't really care about); this includes locking operations with reentrant mutexes and providing some helper macros.

I guess for now you can look at the tests for each module (file/group) to see how it works.

There's one major stumbling point (or a problem of choice, rather) which has prevented me from implementing datasets earlier -- how exactly to deal with reading/writing structured datasets. I don't have a good idea of what the API should look like (given there's no proper runtime struct introspection in Rust).

@aldanor (Owner) commented Jan 26, 2016:

I was also thinking of using rust-ndarray as the in-memory backend of choice, now that it's actively developed: https://github.com/bluss/rust-ndarray

@mokasin commented Mar 8, 2016:

@aldanor Maybe I don't understand correctly what you are getting at, but why do you need runtime introspection? Wouldn't rust-serialize or serde help you do the job at compile time?

@aldanor (Owner) commented Mar 9, 2016:

@mokasin Because that's how HDF5 works:

  • datasets are homogeneous collections of data
  • you have to create datasets explicitly before pushing any data to them, which requires specifying the memory layout
  • you're unlikely to serialize data item-by-item; the whole point of using HDF5 is to be able to read/write large amounts of similarly-typed objects at once
  • you can kind of hack around serde/rustc-serialize to get field names etc. for compound types, but you won't get field offsets without further hacks; all in all, it's not worth it

As a matter of fact, I've written a little utility package which should help with all this: https://github.com/aldanor/typeinfo

@Enet4 commented Apr 6, 2016:

I was particularly interested in reading datasets from HDF5 files, for which so far there is no high-level API for Rust. Is there any progress on this particular feature?

@aldanor (Owner) commented Apr 6, 2016:

@Enet4 As a matter of fact, I'm sort of working on that right now because I need it too :)

There are a few stumbling blocks to which I have no particular solutions yet; for instance, how to represent variable-length types (strings in particular) -- and whether to support them at all.

@aldanor (Owner) commented Apr 6, 2016:

It's more of a matter of figuring things out than anything. If anyone in this thread has time/desire to discuss possible ways of handling compound / vlen datatypes in Rust -- this could speed things up 😉

@Enet4 commented Apr 6, 2016:

I'm a beginner at Rust, unfortunately. Still, I am interested in this and would like to keep in touch with any discussion that hopefully takes place, perhaps even managing to contribute my thoughts.

@aldanor (Owner) commented Apr 6, 2016:

The main question is what to do with types that are not Copy, or with variable-length datatypes in HDF5, if you want to end up with objects that have the expected Rust semantics -- e.g., a variable-length string as a field in a compound datatype.

For elements of a variable-length datatype, I think H5Dread will allocate structs like this:

struct hvl_t {
    len: size_t,
    p: *mut c_void,
}

However, it then has to be somehow translated into Rust structs with proper drop semantics etc.

@aldanor (Owner) commented Apr 6, 2016:

Wrapping structs like this in Rust is possible, but if there's a Drop impl, the Rust compiler will add an extra field, so the struct cannot have the same layout as in C (#[repr(C)]). This makes it incompatible with what HDF5 expects.

As of today, the only way to avoid that is to add the #![feature(unsafe_no_drop_flag)] crate attribute, which enables using #[unsafe_no_drop_flag]. This implies that hdf5-rs would no longer be compatible with stable Rust -- for some time, until the Rust compiler team implements "dynamic drop" semantics, which they are planning to do in the foreseeable future (issue 5016 on rust-lang/rust).

I wonder if that'd be an acceptable solution.

@aldanor (Owner) commented Apr 7, 2016:

@mokasin commented Apr 7, 2016:

IMHO: if this is the only feasible solution, I'd say go for it. A crate that only works on nightly is better than a non-working crate. And either this stabilises itself or someone finds a different solution later.

@Enet4 commented Apr 7, 2016:

Wrapping structs like this in Rust is possible, but if there's a Drop impl, the Rust compiler will add an extra field, so the struct cannot have the same layout as in C (#[repr(C)]). This makes it incompatible with what HDF5 expects.

Couldn't there be an implicit conversion between a type with normal Rust semantics and an internal, "unsafe" implementation whenever that is needed? Or am I just missing the point?

@aldanor (Owner) commented Apr 7, 2016:

@Enet4 Would you care to elaborate on what exactly you mean by "implicit conversion"? Could you provide an example?

Imagine a case where you read a dataset whose datatype is compound: one field is an int, another a variable-length string. H5Dread will write into a buffer that looks like this:

struct A {
    int x;     // normal field
    struct {   // vlen field
        void *p;
        size_t len;
    } y;
};

Now assume you have 1 billion of these (so calling 1 billion constructors is out of the question), taking up almost all of your RAM (so copying all of the data is out of the question), and you have a C pointer to this data which you received from HDF5. What would you do next to construct a zero-copy Rust view on this which behaves like a bunch of normal Rust objects and ensures proper cleanup?

For each of the variable-length elements, HDF5 will malloc() the memory and leave it to the user to free. This implies that the wrapper type for the y field in Rust will need to be droppable and should call free() when dropped. However, Drop is not compatible with #[repr(C)], as it essentially adds a third 8-byte element to the end of y and stores the drop flag there, indicating destructor status. The struct therefore no longer has the same layout as in C. You can hack around this by adjusting the field offsets etc., but, worse, you cannot safely memset this struct or pass it to the C library as-is, since the hidden drop byte may be affected.

As of today, the only way to have both Drop and #[repr(C)] while avoiding drop flags is #[unsafe_no_drop_flag], which requires the nightly channel.

The user code would look something like this, I think:

h5def! {                  // macro that enables extracting offset/type information at runtime
    pub struct A {
        pub x: u32,
        pub y: VLString,  // variable-length zero-copy string view on a malloc'd buffer
    }
}

@clamydo commented Apr 7, 2016:

Does this also mean that a user of this crate would have to drop the data himself?

@aldanor (Owner) commented Apr 7, 2016:

Does this also mean that a user of this crate would have to drop the data himself?

No, quite the opposite -- the whole point here is to be able to cast C-allocated buffers (potentially nested in weird ways) into Rust-managed structs that automatically free the C memory in accordance with normal drop semantics.

Hypothetical example:

{
    let data = dataset.read::<A>().unwrap();
} // nested heap-allocated elements owned by `data` dropped here, e.g. var-len strings

The #[unsafe_no_drop_flag] only tells Rust "don't track whether this object has been already dropped, I'll figure it out myself, and don't insert stupid drop flags into my structs".

I'll still have to double-check that the rust-ndarray crate is fine with all this (I think it should be), since that's what will be used as the in-memory data storage backend.

@aldanor (Owner) commented Apr 9, 2016:

I pondered this a bit more and prototyped a possible implementation for fixed-size strings / var-len strings / var-len arrays. I think the compromise is to only support fixed-length strings and fixed-size arrays on stable, and to additionally support variable-length strings/arrays on nightly. This way the library would still work on the stable/beta channels, and once the drop-flag business is solved upstream, everything including var-len datatypes would work on stable as well.

@aldanor (Owner) commented Jun 1, 2016:

A little update in case anyone's still interested :) Good news: I've been able to make the type system work (the types branch) and to read some structured data into an array backed by rust-ndarray, including strings and compound types. It will take a little bit of time to shape this into an API, but things seem to generally function. There are a few things about strings that I'm not sure about, but hopefully they will get sorted out (e.g. you can only read bytestrings into bytestrings and unicode into unicode; hdf5 refuses to do conversions...).

Here's an example:

import numpy as np
import h5py

f = h5py.File('foo.h5')
arr = np.core.rec.fromarrays([[1, 2, 3, 4], ['foo', 'bar', 'x', ''],
    [True, True, False, True]], names='a,b,c')
f['test'] = arr
f.close()

This now works (local dev version, this hasn't been pushed yet):

#[macro_use]
extern crate hdf5_rs;

use hdf5_rs::new_datatype;
use hdf5_rs::Container;
use hdf5_rs::FixedString;

fn main() {
    let f = hdf5_rs::File::open("foo.h5", "r").unwrap();
    let ds = f.dataset("/test").unwrap();

    h5def!(
        #[derive(Debug)]
        struct T {
            a: i64,
            b: FixedString<[u8; 3]>,
            c: bool,
        }
    );

    let arr = ds.read::<T>().unwrap();
    println!("{:?}", arr);
}

which prints (with whitespace reformatted):

[T { a: 1, b: "foo", c: true }, 
 T { a: 2, b: "bar", c: true }, 
 T { a: 3, b: "x", c: false }, 
 T { a: 4, b: "", c: true }] shape=[4], strides=[1]

@bromrector commented:

^ This looks good. Any plans to push this?

@andysureway-10x commented:

@aldanor May I ask what the status of the read/write functionality is now? I am working on a project heavily involving hdf5, and I really wish to have it all coded in Rust. Thanks!

@Enet4 commented Oct 27, 2016:

A tiny re-bump here. How would one read and write simple datasets of a scalar type at this time? Would I need to be aware of chunks? In my use case, I'm only using chunks to make the dataset resizable, and I would prefer an abstraction over them: treating the dataset as a contiguous, elastic n-dimensional array of data.

@aldanor (Owner) commented Nov 7, 2016:

@Enet4 @andysureway-10x Sorry for the delay, folks -- I fell off the face of the earth for a bit with PyCon and other stuff. I haven't had time to finish the types/read/write branch recently, but I hope to get to it (even if partially) reasonably soon :)

@Enet4 No, generally you won't need to be aware of chunks for reading; HDF5 takes care of that (unless you're trying to do something very smart). Resizable datasets are generally not a great idea in HDF5, in my experience; plus, they've only just added the functionality to reclaim space in 1.10 -- in previous versions you end up having to repack.

@Enet4 commented Nov 8, 2016:

Resizable datasets are generally not a great idea in HDF5, in my experience; plus, they've only just added the functionality to reclaim space in 1.10 -- in previous versions you end up having to repack.

I'm actually just stacking data onto a 4-dimensional dataset along a specific axis. I end up relying on a chunked dataset because I do not know in advance how many volumes I'll be stacking nor how large they are, and I may wish to grow it at a later time without creating copies (think GB scale). So I don't need to remove or modify existing data.

Nevertheless, I'm looking forward to this. Keep up the good work. :)

@aldanor (Owner) commented Nov 9, 2016:

By the way, with regard to strings, it looks like there will unfortunately have to be four different string types and not just two: something like FixedString(n), VarLenString, FixedAscii(n), VarLenAscii -- i.e. unicode/ascii crossed with fixed/varlen (h5py supports all of these except fixed-size utf-8, I believe). HDF5 will not convert between ascii and unicode implicitly; whether it will convert fixed to varlen I don't know yet and will have to check.

So you would basically have to use one of these four as struct fields in order to deal with strings.

As for attributes / dataset names, we could probably just use ASCII.

@Enet4 commented Nov 10, 2016:

I suppose that would be fine, as long as those elements and attributes can be trivially converted (even if explicitly) to string slices.

@aldanor (Owner) commented Nov 10, 2016:

@Enet4 Yes most traits you'd expect from a string type are already there: https://github.com/aldanor/hdf5-rs/blob/feature/types/src/types/fixed_string.rs#L106

@TeslasGhost commented:

@aldanor Sir, is there any chance of an updated example? The keywords and some of the datatypes, from what I can see, don't seem to be in the documentation. I attempted to use your example above, but FixedString and the h5def! macro don't seem to work... Appreciated! :)

@ghost mentioned this issue Jul 6, 2017
@aldanor (Owner) commented Jan 11, 2019:

An example has been added to the README.

aldanor closed this as completed Jan 11, 2019