
Mapping Rust tuples to HDF5 compound types #19

Closed
aldanor opened this issue Dec 17, 2018 · 11 comments

aldanor (Owner) commented Dec 17, 2018

While working on a test suite for dataset reading/writing, I've encountered a curious problem.

Facts:

  • We'd like Rust's tuple types to be mapped to HDF5 types automatically (by deriving H5Type).
  • We don't have any control over Rust tuple layout; it will be optimized by the compiler.

Here's an example (playground): the field offsets for the tuple (i8, u64, f32) are (8, 0, 12): Rust reorders the fields so that the largest one comes first, and it is rightfully free to do so.
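
For instance, a throwaway sketch to observe the actual offsets on whatever compiler you're running (the numbers in the comment below are what the playground reported; they are not guaranteed):

fn main() {
    let t: (i8, u64, f32) = (0, 0, 0.0);
    let base = &t as *const _ as usize;
    // Prints 8, 0, 12 here: the u64 gets moved to the front.
    println!(".0 at offset {}", &t.0 as *const _ as usize - base);
    println!(".1 at offset {}", &t.1 as *const _ as usize - base);
    println!(".2 at offset {}", &t.2 as *const _ as usize - base);
}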

Now, we want to bind this to an HDF5 compound datatype with fields named "0", "1" and "2". In HDF5, however, the offsets must be strictly increasing; providing a decreasing offset is treated as shrinking the datatype and will most likely yield an error.

So, we have a few options:

  • Compound datatype with fields ["1", "0", "2"] and offsets [0, 8, 12].

    This can be mapped directly to the Rust tuple and won't require any conversion (no-op). However, the fields are ordered in a weird way. This ordering also depends on the internals of the Rust compiler (although this part isn't likely to change).

  • Compound datatype with fields ["0", "1", "2"] and offsets [0, 8, 16].

    The fields now have the same order as in Rust, but the memory layout is different (and the data element is larger; see the sketch at the end of this comment). Because the memory layout is incompatible, every read/write of the dataset will require a soft conversion (an extra step and an extra memory allocation). This is pretty weird and confusing as well: you create a new dataset with this tuple type, and suddenly you're being asked to enable soft conversion. It would also be hard for a crate user to predict whether a given tuple type requires soft conversion for reading/writing, since that would require knowledge of compiler internals.

  • Compound datatype with fields ["0", "1", "2"] and offsets [0, 8, 12].

    This doesn't require soft conversion, and the fields in HDF5 are named 0-1-2, but field "0" now holds the tuple's element 1 and field "1" holds element 0, which is confusing.

  • ? (any other options I've missed?)

For reference: the tuple type binding implementation, if it's of any help (it's pretty hairy macro stuff).
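
For what it's worth, option 2's layout is exactly what a #[repr(C)] struct with the same field order would get; a minimal sketch (struct name made up):

#[repr(C)]
struct Tuple3 {
    f0: i8,  // offset 0
    f1: u64, // offset 8 (7 bytes of padding after f0)
    f2: f32, // offset 16
}
// As a Rust struct this has size 24 (20 bytes of data rounded up to the
// 8-byte struct alignment); the HDF5 compound itself needn't keep the
// tail padding, so the file type can be 20 bytes.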

aldanor (Owner) commented Dec 17, 2018

@Enet4, @pmarks ^ 😄

Enet4 commented Dec 18, 2018

I'm not quite sure how the native library handles this, but I hope the spec does not require fields to be packed (as in, no padding between fields). Otherwise, that would leave a lot of room for UB, and all hopes for no-ops would go down the drain.

  • Compound datatype with fields ["1", "0", "2"] and offsets [0, 8, 12].

This can be mapped directly to the Rust tuple and won't require any conversion (no-op). However, the fields are ordered in a weird way. This ordering also depends on the internals of the Rust compiler (although this part isn't likely to change).

I would prefer not to rely on this, since in theory it's something that can change between compiler versions, and we're dealing with persistent data objects meant to be portable across multiple systems through space and time (I know, that last part was exaggeratedly poetic).

  • Compound datatype with fields ["0", "1", "2"] and offsets [0, 8, 16].

The fields now have the same order as in Rust, however the memory layout is different (and the data element has a bigger size). However, due to incompatible memory layout, this will require a soft conversion each time the dataset is being read/written [...] – and this would require knowledge of compiler internals.

I'm probably overlooking something, but we can ensure this layout in user code with repr(C), so it's not like we're really requiring them to know about compiler internals. This would only require everyone to use tuple structs instead of actual tuples, but it's not like that's a bad thing either.
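
Something like this, say (a sketch; the struct name is made up, the derive being this crate's H5Type):

#[derive(H5Type)]
#[repr(C)]
struct Coord(f32, f32); // a plain (f32, f32) tuple, pinned to C layout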

aldanor (Owner) commented Dec 18, 2018

I'm not quite sure how the native library handles this, but I hope the spec does not require fields to be packed (as in, no padding between fields)

There aren't many requirements, mostly just common-sense restrictions: HDF5 doesn't require structs to be packed, and it doesn't require any particular fixed layout either. The offsets of consecutive fields should not decrease, and fields should not overlap.

I'm probably overlooking something, but we can ensure this layout in user code with repr(C), so it's not like we're really requiring them to know about compiler internals.

As a matter of fact, we're already requiring #[repr(C)] for user structs in #[derive(H5Type)].

I wonder if that's overly restrictive, since there's also #[repr(packed)]. Given that numpy uses packed layouts by default, it might be nice to support that too (although there are always automatic soft conversions...). Looking at the design docs for #[repr(packed)], though, and at the notes in the Nomicon, one may wonder whether packed repr should be supported at all... Some immediate side effects I can think of: some HDF5-supported types are Drop, like varlen arrays/strings, and those wouldn't work; some of the machinery in hdf5-types figures out struct field offsets by taking references and converting them to pointers, which wouldn't work with packed types either; etc.
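
To illustrate the reference problem, a throwaway sketch (taking a reference to a misaligned field of a packed struct is undefined behaviour, and newer compilers reject it outright):

#[repr(packed)]
struct Packed {
    a: u8,
    b: u64, // sits at offset 1, misaligned for u64
}

fn main() {
    let p = Packed { a: 1, b: 2 };
    // let r = &p.b; // UB: reference to a misaligned field
    let b = p.b; // copying the field out by value is fine
    println!("a = {}, b = {}", p.a, b);
}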

we can ensure this layout in user code with repr(C)

A little correction, not "we", but rather the users... as for "us", we can't really ensure it :)

This would only require everyone to use tuple structs instead of actual tuples, but it's not like that's a bad thing either.

Of course the users can use structs instead of tuples, and use #[repr(C)] (or #[repr(packed)] if we choose to support it). But this basically means we don't support writing plain tuples? That's not too nice :( (e.g., a user may wonder why they can't write a coord: (f32, f32) field to HDF5 without first creating a struct, which should actually be perfectly fine).

Another compromise, I guess, would be to support all tuples whose fields don't get reordered (this would be checked at runtime, something like <T as H5Type>::type_descriptor().is_ordered(); see the sketch below). This way you can still read/write things like (i32, u8) or (f32, f32, f32), but the pathological cases that get reordered by the compiler will be rejected when trying to construct an HDF5 datatype. So it would be "no-op or go away". (This, or option 2 out of the ones I listed above.)
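
A minimal sketch of what is_ordered() would check, given the field offsets the compiler actually assigned (hypothetical helper, not the real TypeDescriptor API):

// Accept the tuple only if fields "0", "1", ... sit at non-decreasing offsets.
fn is_ordered(offsets_in_field_order: &[usize]) -> bool {
    offsets_in_field_order.windows(2).all(|w| w[0] <= w[1])
}

// is_ordered(&[0, 4]) == true      -- e.g. (i32, u8): no reordering
// is_ordered(&[8, 0, 12]) == false -- e.g. (i8, u64, f32): gets rejected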

Enet4 commented Dec 18, 2018

A little correction, not "we", but rather the users... as for "us", we can't really ensure it :)

Right, that's what I meant. Although we can, in a way, check whether the type's layout is compatible at run-time (totally not ideal; O static_assert, where art thou).

Of course the users can use structs instead of tuples, and use #[repr(C)] (or #[repr(packed)] if we choose to support it). But this basically means we don't support writing plain tuples? That's not too nice :( (e.g., a user may wonder why they can't write a coord: (f32, f32) field to HDF5 without first creating a struct, which should actually be perfectly fine).

We'd have to explain that tuples have a very specific behaviour in Rust, and we cannot enforce how they are laid out in memory. Can't think of anything better without delving into obscure procedural macros.

The way I currently see things is that if there is to be a way to work with packed structs and tuples, conversions have to take place when reading and writing. Attempts at a no-op may seem promising, but all code relying on it would be standing on thin ice.

aldanor (Owner) commented Dec 18, 2018

to work with packed structs and tuples

Packed where? In memory or in file?

If it's packed in the file, HDF5 soft conversions would take care of it. As for packed in memory, I don't really wish to support that out of the box, so that we just strictly require repr(C) for memory types. If the users really want/need to, H5Type can always be implemented manually for weird user types.

We'd have to explain that tuples have a very specific behaviour in Rust, and we cannot enforce how they are laid out in memory. Can't think of anything better without delving into obscure procedural macros.

Yea. So with some (most?) tuples it will "just work", and the conversion will be a no-op. With other tuples it would require a soft conversion.

One important takeaway here is that perhaps H5Type::type_descriptor() alone is not enough. It should probably be changed to something like:

pub unsafe trait H5Type {
    // This is "file" descriptor; this is how the type will be stored in HDF5 if we create a new 
    // dataset with it. It may not be a direct memory mapping of the underlying type and 
    // may require conversion.
    fn type_descriptor() -> TypeDescriptor;

    // This is "memory" descriptor; it is a description for HDF5 internal routines of how our 
    // types are laid out in memory -- this can depend e.g. on Rust compiler version.
    // Datatypes based on this descriptor never get stored in files.
    // Defaults to being the same as file descriptor (true for most types), but can be overridden.
    fn memory_descriptor() -> TypeDescriptor {
        Self::type_descriptor()
    }
}

Hmm... I guess we could even add

    fn requires_conversion() -> Option<Conversion> {
        None
    }

This way, we can at least query at runtime whether a tuple type is "irregular" (some other types, like varlen arrays/strings, also always require soft conversion because they involve allocations).
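
Hypothetical usage of that query (the Conversion enum and its variants are made up here):

// If the type can't be read/written as a plain no-op, find out what's needed:
if let Some(conv) = <(i8, u64, f32) as H5Type>::requires_conversion() {
    println!("conversion required: {:?}", conv); // e.g. Conversion::Soft
}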

Detailed example:

// this would create a dataset with the "file type":
// file type = [("0", i8, 0), ("1", u64, 8), ("2", f32, 16)], sizeof=20
let ds = file.new_dataset::<(i8, u64, f32)>().create_anon(2)?;

// this would attempt to write from a memory buffer of the "memory type":
// memory type = [("1", u64, 0), ("0", i8, 8), ("2", f32, 12)], sizeof=16
ds.write(&[(1, 2, 3.), (4, 5, 6.)])?; // <-- this fails ("soft conversion required")

ds.as_writer().soft().write(&[(1, 2, 3.), (4, 5, 6.)])?; // <-- this works

On a side note, I really don't want to make soft conversions the default (like they are in h5py/HDF5); this being Rust, I'd prefer to be warned of any non-zero-cost actions, especially when they may be very obscure and non-obvious. (I honestly didn't know that clipping was the default policy when shrinking integer bytesize until I started digging into this...)

Enet4 commented Dec 18, 2018

to work with packed structs and tuples

Packed where? In memory or in file?

Yes, in file. We agree here that there must be a soft conversion in this case. 👍

So with some (most?) tuples it will "just work", and the conversion will be a no-op. With other tuples it would require a soft conversion.

For the case of tuples, however, it's hard for me to endorse the idea, even with the suggested workaround:

    // This is "memory" descriptor; it is a description for HDF5 internal routines of how our 
    // types are laid out in memory -- this can depend e.g. on Rust compiler version.
    // Datatypes based on this descriptor never get stored in files.
    // Defaults to being the same as file descriptor (true for most types), but can be overridden.
    fn memory_descriptor() -> TypeDescriptor {
        Self::type_descriptor()
    }

How would this one be implemented for the example tuple (i8, u64, f32)? And can this be done in a way that ensures consistency across multiple compiler versions (even new ones)?

aldanor (Owner) commented Dec 18, 2018

How would this one be implemented for the example tuple (i8, u64, f32)? And can this be done in a way that ensures consistency across multiple compiler versions (even new ones)?

You mean the other descriptor? Because the memory descriptor must literally describe what’s in memory. In this case it has to be option 1 from my first message.

As for the “in-file” descriptor, if you’re creating new datasets with this tuple, I see at least two ways:

  1. If the memory descriptor is ordered, use that as the file descriptor. Otherwise, convert it to repr(C); see below.

  2. Always convert it to repr(C); see below. This is more predictable, but I believe it would also leave fewer chances for a no-op conversion (need to experiment a bit to check).

By “converting to repr(C)”, I mean one of two things:

  1. Implement it manually; the C layout is well defined, so it can be done (see the sketch below). Or even steal it from rustc.
  2. Create a repr(C) tuple struct with the same fields (this can be done automatically in the macro) and steal the layout from there.
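
A minimal sketch of option 1, computing C-layout offsets from each member’s size and alignment (hypothetical helper, not the crate’s actual code):

// Compute #[repr(C)] field offsets and total size from (size, align) pairs
// given in declaration order.
fn c_layout(fields: &[(usize, usize)]) -> (Vec<usize>, usize) {
    let round_up = |x: usize, a: usize| (x + a - 1) / a * a;
    let (mut offset, mut max_align) = (0, 1);
    let mut offsets = Vec::with_capacity(fields.len());
    for &(size, align) in fields {
        offset = round_up(offset, align); // pad up to the field's alignment
        offsets.push(offset);
        offset += size;
        max_align = max_align.max(align);
    }
    // The total size is rounded up to the struct's own (maximum) alignment.
    (offsets, round_up(offset, max_align))
}

// c_layout(&[(1, 1), (8, 8), (4, 4)]) == (vec![0, 8, 16], 24)
// -- matches option 2 from the first message for (i8, u64, f32).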

aldanor (Owner) commented Dec 18, 2018

And can this be done in a way that ensures consistency across multiple compiler versions (even new ones)?

That’s my whole point: we don’t need to ensure consistency between memory descriptors. They map to the internal memory representation, whatever’s hardcoded in the current compiler version, and never get published anywhere; “file descriptors”, on the other hand, describe the type of new datasets that get stored in the file.

aldanor (Owner) commented Dec 26, 2018

@Enet4 So I'm almost done with it, it seems, and I think it will work. I had to implement the C alignment and layout logic manually due to current stupid limitations ("can't use outer type parameters"), but it wasn't too bad...

Another thing I noticed: currently, if you provide a one-element tuple like (T,), it is treated as just T, so a (u32,) will be serialized as a scalar field/dataset. I wonder if, for consistency, it would make more sense to make it a single-field compound type (with a single field called "0")?

// obviously, empty/unit tuples are banned as well since they make no sense

aldanor (Owner) commented Dec 26, 2018

Ok, done (changes pushed to the 2018 branch); summary:

  • In-memory descriptors (returned by H5Type::type_descriptor()) for compound types are now always sorted by offset, to prevent HDF5 errors when creating datatypes. This may mean that the order of fields ends up like [1, 0, 2].
  • Added TypeDescriptor::to_c_repr(), which changes the layout of compound types, restoring the original field order (i.e. [0, 1, 2]) and adjusting offsets and type size according to C layout rules. This also applies to any nested compound types, including compound types embedded within FixedArray or VarLenArray. The end result should be equivalent to using #[repr(C)] (if it were a struct to start with).
  • Added TypeDescriptor::to_packed_repr(), which changes the layout the same way #[repr(packed)] would.
  • The dataset builder now uses .to_c_repr() on the memory descriptor when instantiating a dataset.
  • Added a .packed() option to DatasetBuilder which allows storing datasets as packed (this is useful e.g. when working with NumPy, where all compound types are packed by default); see the usage sketch at the end of this comment.
  • All of the repr business only affects creating new datasets. Reading/writing only involves in-memory descriptors (the in-file descriptor is already provided by HDF5 and is locked).
  • Length-1 tuples are now stored as single-field compound types, to be consistent with structs and tuple structs (we can always revert this later if anyone wants to).

Will probably require some further tests to be added at a later point, but so far everything looks good.
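
For illustration, a usage sketch based on the above (method names as in this summary; exact signatures may differ):

// Creating datasets of a tuple type: the file datatype defaults to the
// C-repr layout; with .packed() it is stored packed instead.
let ds_c = file.new_dataset::<(i8, u64, f32)>().create_anon(100)?;
let ds_packed = file.new_dataset::<(i8, u64, f32)>().packed().create_anon(100)?;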
