Analysing the performance of different methods to get windows #62
Cool, thanks for the clear info! Yes, diving a bit deeper may be helpful. Keep in mind: we will need fast access mainly during the training loop, so directly before returning some tensor/ndarray (in the usual case) that will be passed to the deep network. You could additionally do the following on a reasonable GPU to get a better idea of what kind of times we may need to reach in the end: forward one dummy batch of size (64, 22, 1000) through the deep and shallow networks, compute a classification loss with dummy targets, do the backward pass, and measure the wall-clock time (don't use profilers here for now, they may not work well with GPU). Then we have a rough time we want to reach...
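A minimal sketch of that kind of wall-clock measurement, assuming a CUDA device is available and using a small placeholder convolutional network (the actual shallow/deep braindecode architectures and their constructor arguments would have to be substituted):

```python
import time
import torch
import torch.nn as nn

# Placeholder model standing in for the shallow/deep braindecode networks.
model = nn.Sequential(
    nn.Conv1d(22, 40, kernel_size=25),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(40, 2),
).cuda()

x = torch.randn(64, 22, 1000, device="cuda")   # dummy batch of size (64, 22, 1000)
y = torch.randint(0, 2, (64,), device="cuda")  # dummy binary targets
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

torch.cuda.synchronize()
start = time.time()
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
print(f"one forward/backward pass: {time.time() - start:.4f} s")
```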
Yes, I agree that's the interesting case here. It looks like we could get something pretty close to HDF5 lazy loading. Here's what the method looks like:

```python
def get_single_epoch(self, idx, postprocess=False):
    """Get a single epoch.

    Parameters
    ----------
    idx : int
        Index of the epoch to extract.
    postprocess : bool
        If True, apply detrending + offset + decim when loading a new epoch
        from raw; also, apply projection if configured to do so.

    Returns
    -------
    epoch : array of shape (n_channels, n_times)
        The specific window that was extracted.
    """
    assert isinstance(idx, int)
    if self.preload:
        # Data already in memory: plain array indexing.
        epoch = self._data[idx]
    else:
        # Lazy path: read only this epoch from the raw object.
        epoch = self._get_epoch_from_raw(idx)
        if postprocess:
            epoch = self._detrend_offset_decim(epoch)
        if postprocess and not self._do_delayed_proj:
            epoch = self._project_epoch(epoch)
    return epoch
```
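For context, a minimal sketch of how such a method could be consumed from a PyTorch Dataset during training; the wrapper class and label handling here are hypothetical, not existing MNE or braindecode API:

```python
import numpy as np
from torch.utils.data import Dataset


class EpochsDataset(Dataset):
    """Hypothetical wrapper exposing one MNE epoch per training example."""

    def __init__(self, epochs, labels):
        self.epochs = epochs  # mne.Epochs, possibly created with preload=False
        self.labels = labels  # array of shape (n_epochs,)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Load a single window, without materializing all epochs in memory.
        X = self.epochs.get_single_epoch(int(idx), postprocess=True)
        return X.astype(np.float32), int(self.labels[idx])
```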
Also, I'm looking at the test you suggested @robintibor, will give an update soon.
@hubertjb would you prefer if we fixed this at the MNE end? I enjoy (at least making an attempt at) solving these problems.
@larsoner that'd be amazing! :) Out of curiosity, what kind of optimization do you have in mind? Would it go through a special method as above, or something else?
@robintibor I've made a script to test what you suggested: https://github.com/hubertjb/braindecode/blob/profiling-mne-epochs/test/others/time_training_iteration.py
I haven't tested the Deep4 architecture yet, but with the shallow one I get about 35 ms per minibatch of 256 examples on a Tesla V100. From the previous test, we know a conservative lower bound for getting windows is about 1 ms per window. If we need to apply some transforms (e.g., filtering, normalization, augmentation, etc.) on the fly, this will likely go up quite a bit. This means that with the shallow architecture, loading the windows would be too slow to keep the GPU busy at this batch size. Does this fit with the numbers you had previously?
Why 256 examples? In the script it is 64, right?
Sorry, I meant 64! And yep, good point concerning multithreading. I agree with you: if we can get that close to HDF5 performance with a few modifications to MNE, then we're on a good path.
No, I will try to optimize the
Ok, cool. If that can be useful, on my end the
Ahh yes, that would make sense. Then it would scale with the number of epochs, etc. I'll see how bad it will be to fix this.
One possible API would be a new argument
To get epoch index 0. You could pass to If I
I get this:
Thanks @larsoner, very elegant solution! I think this should satisfy our requirements. I have one question: I understand that you need to call
Not quite. It loads each epoch one at a time and then discards it from memory (does not keep it around). Is that okay?
It's also based on things like whether or not the entire epoch is actually extractable from raw (e.g., a 10-second epoch from an event occurring one second from the end of the raw instance would be discarded as TOO_SHORT or so). If it's desirable to just check these limits without actually loading the data, we might be able to do that, but it would again be a larger refactoring.
Ok, got it. So in our case, I guess we would run I think this should be fine for now. If calling
Yes, I agree with @hubertjb that that case (loading into and dropping from memory once at the start of training) may be fine for us for now; we will see if it causes any practically relevant problems once we have more fully implemented examples.
Okay great, let us know if it does indeed end up being a bottleneck.
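For context, a minimal sketch of doing that check once up front, under the assumption that the call discussed above corresponds to MNE's Epochs.drop_bad() (which matches the behavior described: with preload=False, each epoch is loaded once, checked, and then discarded again):

```python
import numpy as np
import mne

mne.set_log_level("error")

# Dummy raw + events just to have an Epochs object to illustrate with.
info = mne.create_info(ch_names=22, sfreq=100.0, ch_types="eeg")
raw = mne.io.RawArray(np.random.randn(22, 3000), info)
events = np.array([[100, 0, 1], [2950, 0, 1]])  # second event too close to the end
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=2.0, baseline=None, preload=False)

# One-time consistency check before training: loads each epoch once,
# drops those that cannot be fully extracted, then discards the data again.
epochs.drop_bad()
print(epochs.drop_log)  # dropped epochs show a reason such as ('TOO_SHORT',)
print(len(epochs))      # only the extractable epochs remain indexable
```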
I started working on an example that shows how to do lazy vs eager loading on real data, and that compares the running times of both approaches (https://github.com/hubertjb/braindecode/blob/lazy-loading-example/examples/plot_lazy_vs_eager_loading.py). It's still not a perfectly realistic scenario - for instance, no preprocessing/transforms are applied - but it should be better than my previous tests. Here are some preliminary results (they are subject to change)! So far, using the first 10 subjects of the TUH Abnormal dataset, and training a shallow net for 5 epochs on a normal/abnormal classification task, the breakdown is (average of 15 runs, in seconds):
Observations:
Next steps:
Great @hubertjb. Seems we are getting into a reasonable training time range. It would also be interesting to see how big the difference is for Deep4. And as you said, maybe num_workers would already close the gap enough to consider it finished. I would say a gap of 1.5x for Deep4 is acceptable to me.
Some good news, using Without (
With (
The two types of loading are much closer now. Also, I noticed GPU usage stayed at 100% during training, which means the workers did what they are supposed to do. I'm not sure why data preparation time is so much longer for lazy loading with
Also, @robintibor, I tried with Deep4, keeping all other parameters equal:
Surprisingly, it seems more efficient than the shallow net... Is that expected? I haven't played with the arguments much, maybe I did something weird.
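A minimal sketch of the num_workers setup being discussed, with an in-memory stand-in dataset (the dataset and the specific batch_size/num_workers values are placeholders, not the ones from the linked example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this would be the lazy, one-window-per-__getitem__
# dataset discussed in this thread.
dataset = TensorDataset(torch.randn(1024, 22, 1000), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # background worker processes doing the loading
    pin_memory=True,   # faster host-to-GPU copies
)

for X, y in loader:
    X, y = X.cuda(non_blocking=True), y.cuda(non_blocking=True)
    # ... forward / backward as usual ...
```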
No, the behavior is very unexpected: the time spent on the GPU forward/backward pass should be longer for deep, so the difference should be smaller. It is also very strange that deep is faster than shallow for you; that should not be the case. Before investigating further, please add the following line somewhere before your main loop: torch.backends.cudnn.benchmark = True (see https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936 for why it should always be set to True in our cases). Numbers should improve overall. However, it does not explain any of the current behavior; as said, deep should be slower than shallow, not ~3 times faster (in eager mode).
Latest results can be found in #75.
@hubertjb I have rerun your perf code after the fix in #89 (which improves
As can be seen for eager
I don't know which of these win_len settings is realistic; what do they correspond to? It would be great if we could get all of the ratios below 2 at some point, but it's good that we have a running implementation in any case.
I've started looking at the performance of various ways of getting windows:
1- MNE: `epochs.get_data(ind)[0]` with lazy loading (`preload=False`)
2- MNE: `epochs.get_data(ind)[0]` with eager loading (`preload=True`)
3- MNE: direct access to the internal numpy array with `epochs._data[index]` (requires eager loading)
4- HDF5: using h5py (lazy loading)
The script that I used to run the comparison is here:
https://github.com/hubertjb/braindecode/blob/profiling-mne-epochs/test/others/profiling_mne_epochs.py
Also, I ran the comparison on a single CPU using:
>>> taskset -c 0 python profiling_mne_epochs.py
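A minimal sketch of this kind of per-window timing comparison on dummy data (the access patterns in the linked script may differ slightly; here each MNE window is pulled via indexing followed by get_data()):

```python
import timeit
import h5py
import mne
import numpy as np

mne.set_log_level("error")

# Dummy 10-minute continuous recording with fixed-length events.
info = mne.create_info(ch_names=22, sfreq=100.0, ch_types="eeg")
raw = mne.io.RawArray(np.random.randn(22, 60000), info)
events = mne.make_fixed_length_events(raw, duration=10.0)
epochs_lazy = mne.Epochs(raw, events, tmin=0.0, tmax=5.0, baseline=None, preload=False)
epochs_eager = mne.Epochs(raw, events, tmin=0.0, tmax=5.0, baseline=None, preload=True)

# The same windows stored in an HDF5 file, read one window at a time.
with h5py.File("windows.h5", "w") as f:
    f.create_dataset("windows", data=epochs_eager.get_data())
h5 = h5py.File("windows.h5", "r")

idx = 3
timers = {
    "MNE lazy (preload=False)": lambda: epochs_lazy[idx].get_data()[0],
    "MNE eager (preload=True)": lambda: epochs_eager[idx].get_data()[0],
    "MNE _data[index]": lambda: epochs_eager._data[idx],
    "h5py": lambda: h5["windows"][idx],
}
for name, fn in timers.items():
    print(f"{name}: {timeit.timeit(fn, number=20) / 20 * 1000:.2f} ms per window")
```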
Here's the resulting figure, where the x-axis is the number of time samples in the continuous recording:
For the moment, it looks like:
1- `._data[index]` is unsurprisingly the fastest; however, it requires loading the entire data into memory.
2- hdf5 is very close, at around 0.5 ms per loop, which is great knowing it's able to load only one window at a time.
3- `get_data(index)` is much slower, but this is expected as we know it creates a new `mne.Epochs` object every time it's called. Also, the gap between `preload=True` and `preload=False` is about 1.5 ms, which might be OK. The main issue though seems to be the linear increase of execution time as the continuous data gets bigger and bigger.
Next steps

Considering the benefits of using MNE for handling the EEG data inside the Dataset classes, I think it would be important to dive deeper into the inner workings of `get_data()` to see whether simple changes could make this more efficient. I can do some actual profiling on that. What do you think @agramfort @robintibor @gemeinl?

Note: I haven't included the extraction of labels in this test.