-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed vs. variable length strings in snirf files #72
Comments
How does this correspond to the current specification for strings? I copy from the spec below for your convenience. My understanding of these low level types is poor.
I guess this is a MATLAB Homer implementation detail? HDF5 supports both fixed and variable length strings, so surely matlab does? How hard is it to make Matlab read variable length data exported from python (genuine, not smart arse question)? We can't be the first people to run in to this issue. Also just a note, the length for each matlab files isn't fixed at 14. It's always one more than the visible characters. Eg "Cm"=3, "Robert"=7. I guess the extra is the null character. If you wish to support just one, then I vote for variable length. Otherwise the spec will need to be updated to specify the length for each variable, or a global maximum, or calculate the length per variable. But I don't like setting a maximum string length. What if I want to encode a participant name and they have a long hyphenated name. And calculating a different fixed length for each variable is unnecessarily complicated and will discourage new users. My python code reads either format fine. And exporting in fixed is a trivial change as you describe (I already did it mne-tools/mne-nirs#364). So this isn't a technical blocker for me either way.
I think the most user friendly option is to support reading both fixed and variable length. This gives users the ability to easily read and write their own files themselves with minimum fuss. So I vote to support both, even if it's a small amount of extra work for us software developers. |
I 100% agree. I see this as a MATLAB implementation detail, but that is where it may be user-unfriendly. Users' code will need to handle both cases-- where the load API function returns the array compared to a 1x1 cell array {with the array inside}. Here's what my sleuthing through the HDF5 docs came up with to explain this: Variable length is a feature of HDF5 format, at the "Class" level of datatypes or, subtly, as a property avaialble to the String Class (bolded below), leading to several ways to store a string that agrees with our specification for null-terminated ASCII strings. From the HDF5 types docs section 5.1 Strings:
I've put in italics what I believe the Homer3 MATLAB HDF5 API does the way we've used it in Homer3, and in bold what the Python HDF5 library was doing (without the NumPy length option being set) EDIT: I like the solution of figuring out how to make MATLAB do the fourth alternative in bold and implementing all our Homer/AtlasViewer stuff that way. |
Thanks for sleuthing and summarising so clearly. What you say makes sense to me. I will leave you to evaluate the engineering effort to implement this option in MATLAB. If its not practically feasible then we can take the other routes. My preference is currently ordered:
Keep me posted @sstucker and let me know if you want me to test anything. I will happily implement whatever option is decided on, we just need to make a decision and document it in the spec. |
I don't like your first preference because it seems to open the door to incompatible code. Keeping it as vague as "String" could entice users to write files using one of four(!) options I list above. Maybe that is fine? But already it has given rise to two different valid files with a variable (aux names) that is exposed differently via a (naive to all this, basic attempt at "load string") use of the MATLAB HDF interface. To write a working reader a user would have to test on both possible file types and/or be aware of the HDF docs I posted above. It seems straightforward to use the variable length string interface in MATLAB, though.
@jayd1860 @dboas It would be possible and straightforward to rework the Homer3/AtlasViewer code to do this from now on (and remain backwards compatible of course) |
Cool, lets strike option one off the list then. Looks like we are converging to a decision (we both seem to prefer the variable string option). I will let the Homer camp mull over your comments. Ping me if needed. |
@sstucker what was the conclusion here? |
There is some ambiguity about how strings should be encoded in SNIRF files (see discussion here). There was talk about enforcing variable HDF5 length strings in SNIRF (fNIRS/snirf#72 (comment)) but that has not been confirmed. Files exported in SNIRF format by MNE-Python use variable length strings. Which is loaded in MATLAB as some sort of cell structure (excuse my rusty matlab skills). This PR aims to enable loading of SNIRF files from python with HDF5 variable length strings in to HOMER.
the pysnirf2 validator will raise a warning (but not invalidate a file) if a fixed length string is used. It is easy to tell the difference. We can elevate the error to a fatal level if we choose to make the breaking change. |
A draft PR #99 resolves this for good. |
Although I admit the actual HDF5 string formatting that represents the difference under the hood is a bit nebulous to me. At any rate, I have figured out how to create the same "variable length" outcome in both MATLAB (and therefore the base C interface) and Python. |
We have run into an issue in which strings saved to SNIRF files by Python interfaces appear differently after being loaded by the Homer3 interface in the MATLAB environment. These strings are loaded by the HDF5 code employed by the Homer3 SNIRF interface as 1x1 MATLAB cells containing the string itself. Strings saved by the Homer3 SNIRF interface do not exhibit this behavior.
Investigation w/ HDF5View suggets the issue is due to the strings being saved as fixed length vs variable length (see HDF5 datatypes guide ch6 section 5.1):
Data Type description in HDFView for SNIRF files generated by Python interface-- "variable" length
Data Type description in HDFView for SNIRF files generated by MATLAB Homer3 interface, fixed length = 14
Zahra Aghajan implemented a fix which used NumPy's interface to create a string with fixed maximum length:
np.array(length_unit.encode("UTF-8"), dtype="S10")
But really this is not a more correct string per the specification, it just appeases the Homer3 code.
Both strings passed the now-outdated Python validator.
Any thoughts? @dboas raised the point that probably the SNIRF specification does not fully specify an HDF5 file. Perhaps we should choose one string format and stick to that?
The text was updated successfully, but these errors were encountered: