
Adding a new parameter in the class AugmentedDataLoader to send transformations to the device. #406

Merged · 6 commits · Sep 14, 2022

Conversation

bruAristimunha
Collaborator

As suggested by @martinwimpff, this PR tries to close issue #404.
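To make the change concrete, here is a minimal sketch of what the new `device` parameter does, under the assumption that `AugmentedDataLoader` applies its transforms in the collate step. This is a simplified stand-in, not the actual braindecode implementation; signatures may differ from the merged code.

```python
import torch

class AugmentedDataLoader(torch.utils.data.DataLoader):
    """Simplified stand-in: applies `transforms` to each batch, optionally
    after moving the batch to `device` (the parameter added in this PR)."""
    def __init__(self, dataset, transforms=None, device=None, **kwargs):
        self.transforms = transforms or []
        self.device = device
        super().__init__(dataset, collate_fn=self._collate, **kwargs)

    def _collate(self, batch):
        X = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x, _ in batch])
        y = torch.as_tensor([label for _, label in batch])
        if self.device is not None:
            # Move the batch first so the augmentation runs on the device.
            X, y = X.to(self.device), y.to(self.device)
        for tf in self.transforms:
            X, y = tf(X, y)
        return X, y
```

With `device="cuda"`, the augmentation kernels then execute on the GPU instead of in the CPU worker processes.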

@codecov

codecov bot commented Sep 12, 2022

Codecov Report

Merging #406 (2118c23) into master (990d574) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
+ Coverage   82.81%   82.83%   +0.01%     
==========================================
  Files          54       54              
  Lines        3789     3792       +3     
==========================================
+ Hits         3138     3141       +3     
  Misses        651      651              

@bruAristimunha
Collaborator Author

I agree with you @robintibor! I incorporated this suggestion into the code and improved the documentation a bit. See if it's good for the merge.

@agramfort
Collaborator

@bruAristimunha from your experience how much speed up do you observe with this? can you give us some details on practical impact? 🙏

@cedricrommel
Collaborator

cedricrommel commented Sep 13, 2022

Thanks for the PR @bruAristimunha !! :)
From what I recall, I tried GPU-hosted augmentation when implementing this and did not see speed ups. This could make sense since not all augmentation operations are GPU-optimized. But I might have been doing something wrong as commented by @martinwimpff in the issue, so it would be great if we could have a quick impact analysis again as proposed by @agramfort :)

Also, when adding a new functionality like this you should try to add the corresponding test somewhere to avoid degrading the test coverage ☝️

@cedricrommel
Collaborator

Also tested this on GPU on my side (since GHA does not support that). Works fine

@martinwimpff
Collaborator

@bruAristimunha from your experience how much speed up do you observe with this? can you give us some details on practical impact? 🙏

One additional thing that I observed for the GaussianNoise augmentation is that there is no speed-up on the GPU if we keep using the numpy rng (from sklearn's check_random_state). This rng always generates the random numbers on the CPU and then transfers them to the GPU.

rng = check_random_state(random_state)
if isinstance(std, torch.Tensor):
    std = std.to(X.device)
# noise is generated on the CPU by the numpy rng, then copied to X.device
noise = torch.from_numpy(
    rng.normal(
        loc=np.zeros(X.shape),
        scale=1,
    ),
).float().to(X.device) * std
transformed_X = X + noise
return transformed_X, y

Using something like

noise = torch.normal(mean=torch.zeros_like(X, device=X.device), std=torch.as_tensor(std, device=X.device), generator=rng)

instead would result in a substantial speed up. The rng would have to be a torch.Generator.
So it might be good to also change the rng class to benefit from this change in augmentations with large random tensors.
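The two RNG strategies can be sketched side by side; the function names here are illustrative, not braindecode's actual helpers, and `np.random.RandomState` stands in for sklearn's `check_random_state`:

```python
import numpy as np
import torch

def noise_numpy(X, std, random_state=None):
    # Current pattern: the numpy rng draws on the CPU and the result is
    # then copied to X.device, which costs a host-to-device transfer.
    rng = np.random.RandomState(random_state)
    noise = torch.from_numpy(rng.normal(loc=np.zeros(X.shape), scale=1)).float()
    return X + noise.to(X.device) * std

def noise_torch(X, std, generator=None):
    # Suggested pattern: sample directly on X.device, seeded by a
    # torch.Generator, avoiding the CPU-to-GPU copy entirely.
    noise = torch.empty_like(X).normal_(mean=0.0, std=1.0, generator=generator)
    return X + noise * std
```

On GPU, the generator would need to be created on the same device, e.g. `torch.Generator(device="cuda")`, for the sampling to stay on the device.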

@cedricrommel
Collaborator

cedricrommel commented Sep 13, 2022

(quoting @martinwimpff's comment above on using a torch.Generator instead of the numpy rng)

Good point @martinwimpff!

But we are potentially touching a point involving way more modifications to the code, because this remark involves all augmentations :)
Indeed, the torch.from_numpy is there because, when these augmentations were created, torch did not have the torch.Generator class yet and we wanted to be able to control the RNG object passed to the augmentation. But of course, pytorch has evolved and we should take advantage of it.

However, it is still probably worth seeing whether there are indeed any speed ups compared to the native CPU-multi-threading happening in DataLoaders before doing the change.

@bruAristimunha
Collaborator Author

bruAristimunha commented Sep 13, 2022

Hello @agramfort, @martinwimpff and @cedricrommel,

I conducted a little analysis for each augmentation method listed below, performing a runtime analysis with five repetitions, passing either the GPU or the CPU as the device to the AugmentedDataLoader class.

  • BandstopFilter,
  • ChannelsDropout,
  • ChannelsShuffle,
  • ChannelsSymmetry,
  • FrequencyShift,
  • FTSurrogate,
  • GaussianNoise,
  • SensorsXRotation,
  • SignFlip,
  • SmoothTimeMask,
  • TimeReverse

The code used for the analysis is at the end. I used the entire BCI competition 4a dataset to stress memory consumption and see how the model behaves in a scenario closer to a real one. I ran on my research computer with an i7-9700K CPU @ 3.60GHz and a GeForce RTX 2080 Ti (11 GB) GPU. The expectation is that runtime decreases when the augmentation is sent to the GPU.
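The timing logic can be sketched roughly as follows; this is a simplified stand-in for the gist linked at the end (loader construction and the per-augmentation loop are omitted):

```python
import time
import torch

def time_epochs(loader, n_reps=5):
    """Average wall-clock time of n_reps full passes over a data loader
    whose collate step applies the augmentation."""
    times = []
    for _ in range(n_reps):
        t0 = time.perf_counter()
        for X, y in loader:
            pass  # iterating triggers the augmentation
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # flush pending GPU kernels before stopping the clock
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Improvement ratio for one augmentation, comparing the two devices:
# ratio = time_epochs(cpu_loader) / time_epochs(gpu_loader)
```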

The improvement ratio can be seen below:
[Figure: improvement ratio per augmentation]

It seems that there is a good gain in general, except for Sensor*Rotation. I think this is related to Martin's discussion with Cedric: some methods still need to be GPU-optimized, possibly in another PR.

Code:
https://gist.github.com/bruAristimunha/f56aecd62e660f292263a2bed2e18e98

@bruAristimunha
Collaborator Author

It was not possible to test Mixup: it has slightly different logic, so I could not run it with this code.

@agramfort
Collaborator

thx @bruAristimunha

@cedricrommel I let you merge if you're happy

@cedricrommel
Collaborator

I'm happy :) Thanks @bruAristimunha
Indeed, not only does SensorsRotation not benefit from GPU, it is also the worst transform currently in terms of computation. One day I'll take some time and optimize it haha

LGTM, merging!

Let's leave the rng stuff for another PR as proposed. It's not obvious either whether the proposed change is worth it in terms of computation.

@cedricrommel cedricrommel merged commit 9bbf1c2 into braindecode:master Sep 14, 2022