Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed vs. variable length strings in snirf files #72

Closed
sstucker opened this issue Sep 2, 2021 · 9 comments · Fixed by #99
Closed

Fixed vs. variable length strings in snirf files #72

sstucker opened this issue Sep 2, 2021 · 9 comments · Fixed by #99

Comments

@sstucker
Copy link
Collaborator

sstucker commented Sep 2, 2021

We have run into an issue in which strings saved to SNIRF files by Python interfaces appear differently after being loaded by the Homer3 interface in the MATLAB environment. These strings are loaded by the HDF5 code employed by the Homer3 SNIRF interface as 1x1 MATLAB cells containing the string itself. Strings saved by the Homer3 SNIRF interface do not exhibit this behavior.

Investigation w/ HDF5View suggets the issue is due to the strings being saved as fixed length vs variable length (see HDF5 datatypes guide ch6 section 5.1):

Data Type description in HDFView for SNIRF files generated by Python interface-- "variable" length
image

Data Type description in HDFView for SNIRF files generated by MATLAB Homer3 interface, fixed length = 14
image

Zahra Aghajan implemented a fix which used NumPy's interface to create a string with fixed maximum length:
np.array(length_unit.encode("UTF-8"), dtype="S10")
But really this is not a more correct string per the specification, it just appeases the Homer3 code.

Both strings passed the now-outdated Python validator.

Any thoughts? @dboas raised the point that probably the SNIRF specification does not fully specify an HDF5 file. Perhaps we should choose one string format and stick to that?

@rob-luke
Copy link
Member

rob-luke commented Sep 3, 2021

How does this correspond to the current specification for strings? I copy from the spec below for your convenience. My understanding of these low level types is poor.

string: either a H5T.C_S1 (null terminated string) type or as ASCII encoded 8-bit char array or UNICODE UTF-16 array. Defined by the H5T.NATIVE_CHAR or H5T.H5T_NATIVE_B16 datatypes in H5T. (note, at this time HDF5 does not have a UTF16 native type, so H5T_NATIVE_B16 will need to be converted to/from unicode-16 within the read/write code).

I guess this is a MATLAB Homer implementation detail? HDF5 supports both fixed and variable length strings, so surely matlab does? How hard is it to make Matlab read variable length data exported from python (genuine, not smart arse question)? We can't be the first people to run in to this issue.

Also just a note, the length for each matlab files isn't fixed at 14. It's always one more than the visible characters. Eg "Cm"=3, "Robert"=7. I guess the extra is the null character.

If you wish to support just one, then I vote for variable length. Otherwise the spec will need to be updated to specify the length for each variable, or a global maximum, or calculate the length per variable. But I don't like setting a maximum string length. What if I want to encode a participant name and they have a long hyphenated name. And calculating a different fixed length for each variable is unnecessarily complicated and will discourage new users.

My python code reads either format fine. And exporting in fixed is a trivial change as you describe (I already did it mne-tools/mne-nirs#364). So this isn't a technical blocker for me either way.

Perhaps we should choose one string format and stick to that?

I think the most user friendly option is to support reading both fixed and variable length. This gives users the ability to easily read and write their own files themselves with minimum fuss. So I vote to support both, even if it's a small amount of extra work for us software developers.

@sstucker
Copy link
Collaborator Author

sstucker commented Sep 3, 2021

If you wish to support just one, then I vote for variable length.

I 100% agree.

I see this as a MATLAB implementation detail, but that is where it may be user-unfriendly. Users' code will need to handle both cases-- where the load API function returns the array compared to a 1x1 cell array {with the array inside}.

Here's what my sleuthing through the HDF5 docs came up with to explain this:

Variable length is a feature of HDF5 format, at the "Class" level of datatypes or, subtly, as a property avaialble to the String Class (bolded below), leading to several ways to store a string that agrees with our specification for null-terminated ASCII strings. From the HDF5 types docs section 5.1 Strings:

Text data is represented by arrays of characters, called strings. Many programming languages support different conventions for storing strings, which may be fixed or variable length, and may have different rules for padding unused storage. HDF5 can represent strings in several ways.

First, a dataset may have a dataset with datatype H5T_NATIVE_CHAR, with each character of the string as an element of the dataset. This will store an unstructured block of text data, but gives little indication of any structure in the text (Figure 34a).

A second alternative is to store the data using the datatype class H5T_STRING, with each element a fixed length (Figure 34b). In this approach, each element might be a word or a sentence, addressed by the dataspace. The dataset reserves space for the specified number of characters, although some strings may be shorter. This approach is simple and usually is fast to access, but can waste storage space if the length of the Strings varies.

A third alternative is to use a variable-length datatype (Figure 34c). This can be done using the standard mechanisms described above (e.g., using H5T_NATIVE_CHAR instead of H5T_NATIVE_INT in Figure 29 above). The program would use vl_t structures to write and read the data.

A fourth alternative is to use a special feature of the string datatype class, to set the size of the datatype to H5T_VARIABLE (Figure 34c). Figure 35 shows a declaration of a datatype of type H5T_C_S1, which is set to H5T_VARIABLE. The HDF5 Library automatically translates between this and the vl_t structure. (Note: the H5T_VARIABLE size can only be used with string datatypes.)

I've put in italics what I believe the Homer3 MATLAB HDF5 API does the way we've used it in Homer3, and in bold what the Python HDF5 library was doing (without the NumPy length option being set)

EDIT: I like the solution of figuring out how to make MATLAB do the fourth alternative in bold and implementing all our Homer/AtlasViewer stuff that way.

@rob-luke
Copy link
Member

rob-luke commented Sep 3, 2021

Thanks for sleuthing and summarising so clearly. What you say makes sense to me.

I will leave you to evaluate the engineering effort to implement this option in MATLAB. If its not practically feasible then we can take the other routes.

My preference is currently ordered:

  1. Support reading either variable or fixed lengths. Then people can write in whatever format they wish. But may be some juggling on the Homer side to read these and convert cells or something.
  2. Support variable length strings. For reasons described above. Having one option supported simplifies reading/writing. And can easily be specified in the spec.
  3. Support only fixed. This requires some adjustment to the python code, it's not hard, but it is ugly.

Keep me posted @sstucker and let me know if you want me to test anything. I will happily implement whatever option is decided on, we just need to make a decision and document it in the spec.

@sstucker
Copy link
Collaborator Author

sstucker commented Sep 3, 2021

I don't like your first preference because it seems to open the door to incompatible code. Keeping it as vague as "String" could entice users to write files using one of four(!) options I list above. Maybe that is fine? But already it has given rise to two different valid files with a variable (aux names) that is exposed differently via a (naive to all this, basic attempt at "load string") use of the MATLAB HDF interface. To write a working reader a user would have to test on both possible file types and/or be aware of the HDF docs I posted above.

It seems straightforward to use the variable length string interface in MATLAB, though.

% Generate cellstr (a cell array of strings)
% First element is '0123456789'
% Second element is 'abcdefghijklmnopqrstuvwxyz'
str = {'0':'9';'a':'z'};
% Generate a file
fid = H5F.create('vlen_string_test.h5','H5F_ACC_TRUNC',...
    'H5P_DEFAULT','H5P_DEFAULT');
% Set variable length string type
VLstr_type = H5T.copy('H5T_C_S1');
H5T.set_size(VLstr_type,'H5T_VARIABLE');
% Create a dataspace for cellstr
H5S_UNLIMITED = H5ML.get_constant_value('H5S_UNLIMITED');
dspace = H5S.create_simple(1,numel(str),H5S_UNLIMITED);
% Create a dataset plist for chunking
plist = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(plist,2); % 2 strings per chunk
% Create dataset
dset = H5D.create(fid,'VLstr',VLstr_type,dspace,plist);
% Write data
H5D.write(dset,VLstr_type,'H5S_ALL','H5S_ALL','H5P_DEFAULT',str);
% Close file & resources
H5P.close(plist);
H5T.close(VLstr_type);
H5S.close(dspace);
H5D.close(dset);
H5F.close(fid);
%% READ PORTION OF THIS EXAMPLE
% Open file
fid = H5F.open('vlen_string_test.h5','H5F_ACC_RDONLY','H5P_DEFAULT');
% Set variable length string type
VLstr_type = H5T.copy('H5T_C_S1');
H5T.set_size(VLstr_type,'H5T_VARIABLE');
% Open dataset
dset = H5D.open(fid,'VLstr');
% Read data
str_data = H5D.read(dset,VLstr_type,'H5S_ALL','H5S_ALL','H5P_DEFAULT');
% Close file & resources
H5T.close(VLstr_type);
H5D.close(dset);
H5F.close(fid);
% NOTE: Because variable strings are treated as 1-column cellstr, no 
% transpose is needed.
disp('Original cellstr:');
disp(str);
disp('Read cellstr from file:');
disp(str_data);

source

@jayd1860 @dboas It would be possible and straightforward to rework the Homer3/AtlasViewer code to do this from now on (and remain backwards compatible of course)

@rob-luke
Copy link
Member

rob-luke commented Sep 3, 2021

I don't like your first preference because it seems to open the door to incompatible code.

Cool, lets strike option one off the list then.

Looks like we are converging to a decision (we both seem to prefer the variable string option). I will let the Homer camp mull over your comments. Ping me if needed.

@rob-luke
Copy link
Member

rob-luke commented Oct 4, 2021

@sstucker what was the conclusion here?

sstucker pushed a commit to BUNPC/DataTree that referenced this issue Nov 20, 2021
There is some ambiguity about how strings should be encoded in SNIRF files (see discussion here). There was talk about enforcing variable HDF5 length strings in SNIRF (fNIRS/snirf#72 (comment)) but that has not been confirmed.

Files exported in SNIRF format by MNE-Python use variable length strings. Which is loaded in MATLAB as some sort of cell structure (excuse my rusty matlab skills).

This PR aims to enable loading of SNIRF files from python with HDF5 variable length strings in to HOMER.
@sstucker
Copy link
Collaborator Author

sstucker commented Jan 4, 2022

the pysnirf2 validator will raise a warning (but not invalidate a file) if a fixed length string is used. It is easy to tell the difference. We can elevate the error to a fatal level if we choose to make the breaking change.

@sstucker
Copy link
Collaborator Author

A draft PR #99 resolves this for good.

@sstucker
Copy link
Collaborator Author

sstucker commented Jan 12, 2022

Although I admit the actual HDF5 string formatting that represents the difference under the hood is a bit nebulous to me. At any rate, I have figured out how to create the same "variable length" outcome in both MATLAB (and therefore the base C interface) and Python.

@sstucker sstucker linked a pull request Jan 26, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants