lazy loading of raw data #43
Comments
What's actually the problem you're describing:
In the first case, your suggestion is valid.
Let me first add that I tend to deal with large (>100 MB, <2 GB) TDMS files. Typically only one or two groups (consisting of maybe 5 channels) account for most of the size. In data processing, often only two channels are used at a time. I'd guess that loading only the metadata is very quick. If you imagine a user interface where you can select between many TDMS files (that may not have the most descriptive filenames), you could access the TDMS comment and the available channels without loading the data. (Update: strictly speaking, this does not require selective/lazy loading.) Update: Don't get the question wrong. I posted this here because I'm not aware of any other place to discuss these things for npTDMS. I want to see if there's a need for this, or if I have any major misconceptions about the file format, which (judging by your answer) might be the case.
You're saying you're using two channels "at a time", but that doesn't mean you're not using all of them ;-) I can see that lazy loading can be really useful. I'm not planning to implement this (I more or less quit the engineering world: not enough free software, too much Windows bloat), but I'm willing to finish my C++ implementation. If this issue really is about you wanting things faster, I think it's easier to finish the C++ implementation than to implement lazy loading. My tests on ~1GB files gave a performance increase of a factor of 5, without even optimizing.
You might, for example, compare the same two channels across different measurement files and not bother with the other ones. (Or when you open a file, you remember "Oh yeah, this measurement did not go so well" after looking at one channel :p ) In any case, I'm looking forward to your C++ implementation. And I'll probably have a look at loading metadata separately (but not loading channels selectively). It looks quite easy to implement and the performance cost is probably low.
I looked into implementing this a while back but never got around to it. I also found it wasn't that straightforward, due to the way the metadata is stored in segments and the need to support interleaved data (although it might be best to force loading all data if it is interleaved, as performance would take a big hit in this case). It should definitely be doable though, and would be useful. npTDMS already loads the metadata before loading the actual data, so it can allocate a complete array once for a channel and then read directly into that array. For certain TDMS files with many very small segments this turns out not to work that well, due to the large amount of memory taken up by the metadata of each segment (e.g. see #19), but for most files it seems to work fairly well. If you run tdmsinfo with the
Add support for lazily loading data for a channel after opening a TDMS file:
* Adds an explicit read_data method to TdmsObject to make it obvious this is an expensive operation, and allow extending this later to support loading a subset of the data (#39)
* Supports DAQmx and interleaved data by reading whole chunks and discarding the unneeded data
* More optimised implementation for contiguous data

This also introduces some static helper methods to make initialising a TdmsFile object more straightforward. There is now TdmsFile.read to read all data, TdmsFile.open to read metadata and keep the file open for reading, and TdmsFile.read_metadata to read metadata only.

Fixes #43
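To make the idea concrete, here is a minimal sketch of metadata-first, per-channel lazy reading. It is not npTDMS code and does not parse real TDMS files: the segment layout (a 9-byte header followed by float64 values) and the helper names `write_segment`, `read_metadata` and `read_channel` are assumptions made up for illustration. It only mirrors the two strategies described above: seek straight to the channel for contiguous data, and read the whole chunk and discard the rest for interleaved data.

```python
import io
import struct

# Hypothetical toy header: channel count, values per channel, interleaved flag.
HEADER = struct.Struct("<IIB")
VALUE_SIZE = 8  # every value is a little-endian float64


def write_segment(buf, channels, interleaved):
    """Append one toy segment; `channels` is a list of equal-length float lists."""
    n_values = len(channels[0])
    buf.write(HEADER.pack(len(channels), n_values, int(interleaved)))
    if interleaved:
        flat = [ch[i] for i in range(n_values) for ch in channels]
    else:
        flat = [value for ch in channels for value in ch]
    buf.write(struct.pack(f"<{len(flat)}d", *flat))


def read_metadata(buf):
    """First pass: record where each segment's data starts, skipping the data."""
    buf.seek(0)
    segments = []
    while True:
        raw = buf.read(HEADER.size)
        if len(raw) < HEADER.size:
            break
        n_channels, n_values, interleaved = HEADER.unpack(raw)
        segments.append((buf.tell(), n_channels, n_values, bool(interleaved)))
        buf.seek(n_channels * n_values * VALUE_SIZE, io.SEEK_CUR)
    return segments


def read_channel(buf, segments, index):
    """On demand: load the values of a single channel from every segment."""
    values = []
    for data_start, n_channels, n_values, interleaved in segments:
        if index >= n_channels:
            continue
        if interleaved:
            # Read the whole chunk, then keep every n_channels-th value.
            buf.seek(data_start)
            count = n_channels * n_values
            chunk = struct.unpack(f"<{count}d", buf.read(count * VALUE_SIZE))
            values.extend(chunk[index::n_channels])
        else:
            # Contiguous: seek directly to this channel's block.
            buf.seek(data_start + index * n_values * VALUE_SIZE)
            values.extend(struct.unpack(f"<{n_values}d", buf.read(n_values * VALUE_SIZE)))
    return values


buf = io.BytesIO()
write_segment(buf, [[1.0, 2.0], [10.0, 20.0]], interleaved=False)
write_segment(buf, [[3.0, 4.0], [30.0, 40.0]], interleaved=True)
segments = read_metadata(buf)
second_channel = read_channel(buf, segments, 1)
print(second_channel)
```

In this sketch the metadata pass never touches the value bytes, so listing channels stays cheap, while the interleaved branch still pays for reading the full chunk, which matches the performance concern raised earlier in the thread.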
Is it possible to implement "lazy" loading of the TDMS (segment) raw data? I imagine reading just the metadata first (potentially using the tdms_index if present) and on calling channel_data only load the raw data of this channel. Any thoughts?