Overhead of channel descriptor groups #103
This seems to be a fundamental problem with HDF5 itself (there are many). I see the following possible resolutions to the huge file size:
|
@samuelpowell, excellent issue you are raising. This is very important to resolve. I suspect you are going to say no because this is a text file, and so it will still be much larger than 28 bytes per channel, but it should be much less than the 3 KiB per channel that you say SNIRF has. I suspect this tsv file would be less than 100 bytes per channel. That would be more than 30x smaller, or about 7 MiB for you. Is that reasonable as an initial solution? Note that the BIDS spec has the property of inheritance, such that you wouldn't have to repeat this tsv file for every run; you could specify it once per subject, or even once per group if all runs/files in that subject/group use the same geometry. Of course you also need the optode description tsv file, but that is only the number of optodes long, not the number of channels. If this works, then it allows us to use the SNIRF and BIDS standards without having to change the SNIRF spec, other than recognizing in the SNIRF spec that if the probe information is missing then one should look for it in the tsv files. This would be much, much cleaner than the other options. Note that option 1 mentioned by @sstucker was debated quite a bit early on in the SNIRF specification by a number of people, and it was felt that the detailed descriptor was better than the matrix approach we used in the past in the .nirs file, for instance. BUT, we didn't appreciate the HDF5 overhead for such things. |
@sstucker @dboas thank you for your prompt responses.

Array of a compound type

Switching from a group per descriptor to an array of a compound type containing the same information reduces the file size in the above scenario from ~200MiB to ~2MiB. It also reduces the time taken to construct the enumeration from ~24 seconds down to ~1s (take the timings with a pinch of salt: it could be that the disparity arises because the large amount of data generated in the former case causes HDF5 to flush to disk; I haven't done any detailed profiling, so I don't know). If this approach to storing channel metadata is viable, I'd be pleased (a sketch of this layout follows this comment). Of course, I'm not party to the earlier discussions that you mentioned, David, so perhaps there are technical reasons of which I am unaware which preclude this approach.

Use of BIDS channel descriptor

Our motivation for evaluating SNIRF as a native output format (by which I mean written during acquisition by the instrument) is to make things simpler for our end users. Considerations include:

As soon as we start including additional metadata files, things become complex again, and it makes less sense. I make the distinction of a native format because the other option is simply to provide a conversion script such that users can employ SNIRF as an archival format, should they choose. In this scenario I don't particularly care about the metadata overhead - storage is cheap. But as a native format we have to consider its usage not just on a workstation or in the cloud, but also in embedded systems where processing, power, and storage are limited. |
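For concreteness, here is a minimal sketch of the array-of-compound-type layout described in the comment above, using the HDF5 C API. The particular field set, the int32 widths (7 × 4 = 28 bytes per channel, matching the figure quoted in the issue), and the dataset name measurementTable are assumptions for illustration, not part of the SNIRF specification.

/* Illustrative sketch (not SNIRF spec): store all channel descriptors as a
 * single 1-D dataset of a compound type, instead of one HDF5 group per
 * channel. Field set and the dataset name "measurementTable" are assumed. */
#include "hdf5.h"
#include <stdint.h>

typedef struct {
    int32_t sourceIndex;
    int32_t detectorIndex;
    int32_t wavelengthIndex;
    int32_t dataType;
    int32_t dataTypeIndex;
    int32_t sourceModule;      /* module indices mentioned in the issue */
    int32_t detectorModule;
} ChannelDesc;                 /* 7 x 4 bytes = 28 bytes per channel */

herr_t write_channel_table(hid_t data_gid, const ChannelDesc *desc, hsize_t nchan)
{
    /* Describe the in-memory struct layout to HDF5 */
    hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(ChannelDesc));
    H5Tinsert(ctype, "sourceIndex",     HOFFSET(ChannelDesc, sourceIndex),     H5T_NATIVE_INT32);
    H5Tinsert(ctype, "detectorIndex",   HOFFSET(ChannelDesc, detectorIndex),   H5T_NATIVE_INT32);
    H5Tinsert(ctype, "wavelengthIndex", HOFFSET(ChannelDesc, wavelengthIndex), H5T_NATIVE_INT32);
    H5Tinsert(ctype, "dataType",        HOFFSET(ChannelDesc, dataType),        H5T_NATIVE_INT32);
    H5Tinsert(ctype, "dataTypeIndex",   HOFFSET(ChannelDesc, dataTypeIndex),   H5T_NATIVE_INT32);
    H5Tinsert(ctype, "sourceModule",    HOFFSET(ChannelDesc, sourceModule),    H5T_NATIVE_INT32);
    H5Tinsert(ctype, "detectorModule",  HOFFSET(ChannelDesc, detectorModule),  H5T_NATIVE_INT32);

    /* One dataset holds the whole enumeration */
    hid_t space = H5Screate_simple(1, &nchan, NULL);
    hid_t dset  = H5Dcreate2(data_gid, "measurementTable", ctype, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    herr_t status = H5Dwrite(dset, ctype, H5S_ALL, H5S_ALL, H5P_DEFAULT, desc);

    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(ctype);
    return status;
}

At ~70k channels this table occupies roughly 2 MiB, consistent with the file size reported above, because the per-channel cost is the 28-byte record rather than the fixed overhead of an HDF5 group.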
@samuelpowell, I am surprised that the channel descriptor could add such a big overhead in HDF5. Maybe try out the JSON/binary JSON wrapper (JSNIRF) of the SNIRF format - see this spec and sample file. JSNIRF is meant to map losslessly to all SNIRF data fields, can be used as the "source-code format" of SNIRF (because JSON is human-readable), and can serve as a bridge to NoSQL (MongoDB/CouchDB) databases in the future. JSNIRF depends on SNIRF, but provides a much more portable/lightweight way to read/write, especially suited for lightweight hierarchical data. Check out this poster for details. Depending on the programming language of your code, I can provide examples of reading/writing such data. We have libraries for Python, MATLAB, C, C++ and JavaScript. |
Hi @fangq, your use of a structure of arrays rather than an array of structures would certainly keep the size of the metadata manageable (this is basically equivalent to the suggestion explored in the previous post), and I do see the value of the human-readable metadata. However, as noted, we are looking at formats from the perspective of online recording, where the channel data (and other auxiliary fields) are updated in real time. I appreciate that in some applications HDF5 is used as a sledgehammer to crack a nut, but for this particular case its use of a chunked data model, backed by a B-tree, is spot on. Please correct me if I'm wrong, but I don't see how incremental writing of data could work in the JSNIRF specification as it stands, so whilst we'd certainly look at its use as an archival format, I'm not sure it fits the bill for what we want right now. (Another consideration for commercial usage is that I would want to see your BJData format gain traction in a couple of high-quality, performant C/C++ libraries - I can see you are making good progress through e.g. nlohmann/json#3336.) I don't want to stray too much from the main purpose of this issue - but always happy to discuss further elsewhere. |
@samuelpowell , I was arguing to keep the array representation of the measurement list information, but several others were in support of the specification now adopted by SNIRF. The rationale was simply that storage was not an issue and the adopted specification was more explicit. This was a discussion at the very beginning of the SNIRF spec, circa 2016, with many from the fNIRS field and we didn’t have the HDF5 experience and weren’t thinking of the overhead of large measurement lists… Well, I might say I was since I was arguing for the array representation :) I am fine trying to get the array representation into SNIRF as an either / or proposition to what is presently there. I’d prefer that myself. And I suspect we can do it in a way that keeps backward compatibility. But I think we need to get buy-in from others who are now using SNIRF. Also, I think we need to better understand @fangq’s tools as they are likely quite useful given his history of such contributions and his present funding to develop this further. |
@dboas I understand, thank you. To clarify, the approach I tested above was to have an array of a compound type, where the compound type contains each of the fields (i.e. an array of structs). I don't know if this was the original proposal, or if you were considering a single group with arrays for each field (i.e. a struct of arrays). Either approach solves the problem and could be made optional as far as I can see. I'll leave this issue open for now, and in the meantime am discussing JSNIRF further with @fangq on that repo. |
@samuelpowell, if you can share a sample snirf file with a large inflation ratio (the file itself does not have to be big), I think it is worthwhile to use h5dump or other tools to find out what caused the overhead. It might be just unnecessary padding. I will be happy to take a look. On the other hand, see my comment in another thread regarding the use of |
@fangq thank you for offering to take a look at the files; please see attached. Any insights welcome! group_enumeration.zip contains SNIRF-compliant metadata (~200MiB) |
Brief note that creating the HDF5 file with compatibility limited to more recent versions of the library, e.g.,

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
hid_t fid = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

enables compact storage for the group data. This reduces the file size from ~200MiB to ~160MiB, which is not particularly significant. |
@sstucker and I were discussing that one possible solution to the overhead issue is to have the option that each element of measurementList (e.g. sourceIndex, detectorIndex, wavelengthIndex, and so on) is stored as a single vector spanning all channels, rather than as one group per channel. It seems to me that we can allow for the addition of this usage while keeping backward compatibility, i.e. people can still write snirf files with many elements for measurementList. Thoughts? |
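A minimal sketch of the structure-of-arrays option proposed above, in the HDF5 C API: each measurementList field becomes one integer vector whose length is the number of channels. The group name measurementLists and the helper function are illustrative assumptions rather than spec text.

/* Sketch only: one vector per measurementList field, rather than one group
 * per channel. The group name "measurementLists" is an assumption. */
#include "hdf5.h"
#include <stdint.h>

static void write_index_vector(hid_t gid, const char *name,
                               const int32_t *values, hsize_t nchan)
{
    hid_t space = H5Screate_simple(1, &nchan, NULL);
    hid_t dset  = H5Dcreate2(gid, name, H5T_NATIVE_INT32, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT32, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);
    H5Dclose(dset);
    H5Sclose(space);
}

void write_measurement_vectors(hid_t data_gid,
                               const int32_t *src, const int32_t *det,
                               const int32_t *wl, hsize_t nchan)
{
    /* A single child group replaces nchan per-channel groups */
    hid_t gid = H5Gcreate2(data_gid, "measurementLists",
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    write_index_vector(gid, "sourceIndex",     src, nchan);
    write_index_vector(gid, "detectorIndex",   det, nchan);
    write_index_vector(gid, "wavelengthIndex", wl,  nchan);
    /* remaining fields (dataType, dataTypeIndex, ...) follow the same pattern */
    H5Gclose(gid);
}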
@dboas this would certainly fix the problem I have described, and is probably the simplest / least intrusive way to do so. One thought. Perhaps when using the structure of arrays style you suggest, the index But in any case, I’d be very pleased with this change. |
This change sounds good to me. It would be a bit annoying programmatically (to recognize which format is in use), but no more strained than the notion of Indexed Groups already is. |
The reason that we coded the snirf file like this (despite knowing it would be more overhead) is that I saw file I/O and RAM usage to be more of a bottleneck than disk storage. Specifically, I was thinking in the context of two scenarios that will likely come up in future data mining methods.

1) If you put the data into one large chunk (e.g. a multidimensional data array/matrix instead of a series of single channels), then you need to unpack the whole thing to access the data. Since the data fields can be compressed (filtered, in HDF5 notation), this would prevent extraction of channel subsets. I was specifically thinking of the (e.g.) 70k channel case in this context, envisioning data mining/machine learning methods.

2) This self-contained single channel notation allows you to add/remove data inside the HDF5 file without having to rewrite the file. So you can dynamically prune channels from the file without having to unpack the file.

Putting the metadata into a complex array and having the measurement link per channel refer to the index in the array might be a solution. I wonder if you can use symbolic links to do this (since HDF5 has symbolic links/memory maps as a class). It's possible that if you did it with symbolic links, you wouldn't even need to worry about different formats because the HDF5 backend would take care of it.

I think if you added compression to the actual data (per channel), you would probably be able to reduce the file size by more than the 500MB, and might appreciate not having to unpack/decompress the whole 70k by timepoints matrix just to access one channel. |
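For reference, a sketch of how a single channel could still be extracted from one 2D dataTimeSeries matrix using an HDF5 hyperslab selection; with a chunked (and optionally compressed) layout, the library reads and decompresses only the chunks that intersect the selection. The dataset path and the (channels × time points) orientation follow the wording of the surrounding comments and are assumptions, not spec citations.

/* Sketch: read all time points of channel k from a 2-D dataTimeSeries
 * dataset laid out as (channels x time points), per the discussion here.
 * Swap the hyperslab start/count if the file stores (time x channels). */
#include "hdf5.h"
#include <stdlib.h>

double *read_channel(hid_t fid, hsize_t k, hsize_t *ntime_out)
{
    hid_t dset   = H5Dopen2(fid, "/nirs/data1/dataTimeSeries", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t dims[2];
    H5Sget_simple_extent_dims(fspace, dims, NULL);  /* dims = {channels, time} */

    /* Select row k: one channel, every time point */
    hsize_t start[2] = {k, 0};
    hsize_t count[2] = {1, dims[1]};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* Matching in-memory dataspace and buffer */
    hid_t mspace = H5Screate_simple(1, &dims[1], NULL);
    double *buf  = malloc((size_t)dims[1] * sizeof(double));
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    *ntime_out = dims[1];
    return buf;
}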
@huppertt are you saying that the choice of the current design is such that one might choose to prune a particular channel k by deleting its entry in the metadata, e.g. /nirs(i)/data(j)/measurementList(k)? I understand that this is certainly a low cost operation in the current specification. What I don’t understand is that as soon as you actually want to use the data, you still have to access the time series, which the specification states is a 2D (channels x time points) matrix. This cannot necessarily be done efficiently in terms of I/O (e.g. through memory mapping the array) because the SNIRF specification doesn’t prohibit the use of chunked and/or compressed datasets (which preclude this approach). |
Going ahead and drafting an implementation of our solution. We can benchmark the proposed format and compare to your experience @samuelpowell |
Splendid, thank you @sstucker . We're about to release a MATLAB package for conversion of LUMO data - I'll make sure we include the proposed solution as an option and keep it up to date with whatever happens. Ping me here if you need anything. |
See above for an early draft fix |
It’s been a while since I looked at the spec. I could have sworn we had put the data vector as a sub field to each channel rather than as a matrix entry. I think I am just losing my mind because that is not how we did it. Everything I said was based on the assumption that the data for each channel was stored as a child of that channel’s definition. But, since I am clearly forgetting how we actually did it in the end and that the data is not separated by channel, just forget everything I wrote before 🤦♂️🤦♂️🤦♂️
|
We have been internally evaluating the use of SNIRF as a native output format for Gowerlabs' Lumo system.
Lumo is a high density system, and our full head adult caps contain 54 modules, each with 3 dual-wavelength sources and 4 detectors. We are able to provide a dense output, which results in (54 × 4 detectors) × (54 × 6 source-wavelengths) = circa 70k channels.
The use of an HDF5 group per channel descriptor (e.g. /data1/measurementList{i}) appears to incur significant overhead. For example, a SNIRF file containing only metadata (no channel data) for a full head system amounts to ~200MiB, or ~3KiB per channel. The actual information content of each descriptor (containing only the required fields plus module indices) amounts to only (7 × 4 =) 28 bytes, so this is an overhead of approximately 99%. Our results appear vaguely consistent with this analysis.
Evidently the size of the metadata grows linearly with the number of channels, as does the data rate of the channel time series, and hence for longer recordings the size of the metadata becomes proportionally smaller. However in absolute terms we find that (with appropriate chunking and online compression) the metadata corresponds to around four minutes of compressed raw channel data. Given the length of a typical measurement session, the overhead remains significant.
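As an illustration of the chunking and online compression referred to above, here is a sketch of creating an extendible, deflate-compressed dataTimeSeries dataset to which frames can be appended during acquisition. The dimension ordering, chunk shape, and compression level are arbitrary choices for the example, not values taken from our implementation or from the SNIRF specification.

/* Sketch: an extendible, chunked, deflate-compressed time series dataset.
 * Frames are appended along the (unlimited) time axis during acquisition.
 * Orientation, chunk shape and compression level are illustrative only. */
#include "hdf5.h"

hid_t create_timeseries(hid_t data_gid, hsize_t nchan)
{
    hsize_t dims[2]    = {0, nchan};             /* no frames written yet      */
    hsize_t maxdims[2] = {H5S_UNLIMITED, nchan}; /* time axis may grow         */
    hsize_t chunk[2]   = {256, 64};              /* tune for the access pattern;
                                                    assumes nchan >= 64        */

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);                /* chunking enables...        */
    H5Pset_deflate(dcpl, 4);                     /* ...the online gzip filter  */

    hid_t dset = H5Dcreate2(data_gid, "dataTimeSeries", H5T_NATIVE_DOUBLE,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;  /* grow with H5Dset_extent() as new frames arrive */
}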
I appreciate that the majority of systems (such as those of the manufacturers listed on the SNIRF specification page) are of a much lower density than Lumo, and that even high density systems often produce sparse data, but evidently the trend is towards increasing channel density and numbers of wavelengths. Our future products would, based on the current SNIRF specification, generate over 0.5GiB of metadata.