GPU-Compatible Batch Versions of Existing Transforms #85
Conversation
Items are automatically converted to batches for the transform
Includes & requires a port of the numpy colored noise generation to PyTorch.
Small changes to adhere to the style guide so as not to block git hooks
Somewhat clearer that it's a range for the random multiplier
I can't reproduce the CI failure. It passes fine locally.
Wow, this is great @jcaw. I'd definitely agree with your conclusion that you should get this upstream somehow. I think the
I have pulled the code and run the following commands to test it:

However, I get the same issue as the CI.
Initially I could not reproduce the error locally, but after debugging I found what causes the test to break only on some machines. The problem is how one of the tests is defined: simply comparing two floating point numbers is error prone (`0.1 + 0.2 == 0.3` is usually false).

`_test_ne(out[0], out[1])`

The correct way would be:

`assert not torch.allclose(out[0], out[1])`

With this change the test breaks consistently on my machine. I haven't yet investigated why it's breaking.
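To illustrate why exact float equality is unreliable and why a tolerance-based comparison like `torch.allclose` fixes it (a minimal sketch, not the project's test code):

```python
import torch

# Exact float equality is fragile: rounding error makes mathematically
# equal values compare unequal.
print(0.1 + 0.2 == 0.3)  # False: the left side is 0.30000000000000004

# The same applies to tensors, so tests should compare within a tolerance.
a = torch.tensor([0.1], dtype=torch.float64) + torch.tensor([0.2], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)
assert not bool(a == b)       # exact comparison fails
assert torch.allclose(a, b)   # tolerance-based comparison passes

# A "not equal" test should therefore assert the values differ beyond
# tolerance, not merely that they aren't bit-identical:
assert not torch.allclose(a, b + 1.0)
```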
So, the test starts with two-channel audio. The audio gets converted into a spectrogram. Finally, the delta is computed, and here's the mismatch between the tests and the implementation. The final step is to concatenate the different deltas: `torch.cat([sg, delta, delta2], dim=1)`. This creates an output with three times the channels of the input spectrogram.
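The channel-tripling effect of that concatenation can be demonstrated with assumed shapes (the exact sizes from this discussion aren't shown here; these are illustrative only):

```python
import torch

# Illustrative shapes only: a batch of two-channel spectrograms.
sg = torch.randn(4, 2, 128, 100)   # (batch, channels, freq, time)
delta = torch.randn_like(sg)
delta2 = torch.randn_like(sg)

# Concatenating along dim=1 stacks the deltas as extra channels,
# so a 2-channel input produces a 6-channel output.
out = torch.cat([sg, delta, delta2], dim=1)
assert out.shape == (4, 6, 128, 100)
```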
Figured it was probably a floating point issue. Thanks for digging up the problem. I've updated that test to check the correct indices and tolerate float errors. This seems like a generic issue. Should we be using
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #85      +/-   ##
==========================================
- Coverage   90.60%   86.21%    -4.39%
==========================================
  Files          12       13       +1
  Lines         649      827     +178
  Branches       70       82      +12
==========================================
+ Hits          588      713     +125
- Misses         34       83      +49
- Partials       27       31       +4
```

Continue to review full report at Codecov.
Yes we should, but in a separate PR :)
Hey @jcaw, we're going to get to this soon. Sorry about the long wait!
This file should be inside the `augment/` folder, as it contains the functional implementation of the data augmentations.
`class TestCreateRegionMask:`
This unittest-style class can probably be transformed to the pytest style of simple functions.
Are we preferring simple function tests instead of the class style?
Yes
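For illustration, a pytest-style version of such a test might look like this. Note the `region_mask` below is a toy stand-in, since the real signature isn't shown here:

```python
import torch

def region_mask(size, start, length):
    """Toy stand-in for fastaudio's region_mask (the real signature
    may differ); marks a contiguous region of a boolean mask."""
    mask = torch.zeros(size, dtype=torch.bool)
    mask[start:start + length] = True
    return mask

# pytest style: a plain top-level function, no class wrapper required.
# pytest discovers any function named test_*.
def test_region_mask_marks_requested_region():
    mask = region_mask(10, start=2, length=3)
    assert mask.sum() == 3
    assert mask[2:5].all()

test_region_mask_marks_requested_region()
```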
As far as I've seen, the new transforms do not require a GPU to be used; they only have a batched implementation that respects the device the input came from. Considering this, I'm not sure the `*GPU` suffix is correct to use when naming the transforms.
I agree with those comments. I think it would also be nice to have the benchmark code as part of this PR. Great job @jcaw. I'd definitely consider getting this into torch-audiomentations as well if you want to.
GPU-compatible batch transforms BREAKING CHANGE: GPU-Compatible batch transforms (#85)
🎉 This PR is included in version 1.0.0 🎉 The release is available on GitHub. Your semantic-release bot 📦🚀
Introduction
This PR adds GPU implementations for the following transforms:

- `AddNoise`, `ChangeVolume`, `SignalCutout`, `SignalLoss` (for `AudioTensor`)
- `TfmResize`, `Delta`, `MaskFreq`, `MaskTime` (for `AudioSpectrogram`)

The GPU implementations are currently added alongside the originals, e.g. `Delta` vs. `DeltaGPU`. I propose replacing the originals outright where possible, but I've done a more thorough analysis with benchmarking below.

Demos
I set up a Colab notebook with demos of the new transforms here. (It might make sense to turn this into documentation for all transforms at some point.)
Automatic Batching
I've added a wrapper, `@auto_batch`, that can be added to the `encodes` method of any batch transform to make it compatible with single items too. You just need to specify the number of dimensions in a single item; when an item tensor is received, a dummy batch dimension is added for the transform. As a result, all of these transforms work on both batches and items, with no user intervention.

(The overhead of the wrapper is measured below.)
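The idea of the wrapper can be sketched roughly like this. This is a simplified stand-in on a plain function, not the PR's actual implementation:

```python
import functools
import torch

def auto_batch(item_dims):
    """Sketch of the @auto_batch idea (the PR's real wrapper may differ):
    if the input has `item_dims` dimensions it is a single item, so add a
    dummy batch dimension, run the batch transform, then remove it again."""
    def decorator(encodes):
        @functools.wraps(encodes)
        def wrapper(x, *args, **kwargs):
            if x.dim() == item_dims:  # a single item, not a batch
                return encodes(x.unsqueeze(0), *args, **kwargs).squeeze(0)
            return encodes(x, *args, **kwargs)
        return wrapper
    return decorator

@auto_batch(item_dims=2)  # one audio item is (channels, samples)
def halve_volume(x):      # stand-in for a batch transform's encodes
    return x * 0.5

item = torch.ones(1, 16)      # item: no batch dimension
batch = torch.ones(4, 1, 16)  # batch of 4
assert halve_volume(item).shape == (1, 16)
assert halve_volume(batch).shape == (4, 1, 16)
```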
Changes in Behaviour
Some methods have had their behaviour expanded/altered in the port.
AudioTensor
- `AddNoise` - the GPU version exposes a minimum value for the noise, and allows the transform to be applied to a random subset of items. I have seen certain nets degrade when noise is added to all samples, seemingly because they learn to always expect background noise and don't know how to deal with clean samples. Adding noise to only a subset of items fixes this, so it seems a sensible addition.

  Noise values are also changed so the max & min are relative to one standard deviation of the original tensor, rather than the range `-1` to `1`, so the noise level is consistent relative to the average level of the signal. This has its own drawbacks (e.g. samples that are mostly silence get much less noise), so I'm not sure which method is better. Let me know which you prefer; I can change it easily.

  (Noise can also now be added directly to spectrograms too.)
- `ChangeVolume` - no change.

- `SignalCutout` - a minimum cutout percentage is now exposed.

- `SignalLoss` - no change.

AudioSpectrogram
- `TfmResize` - no change.

- `Delta` - the GPU version now uses `torchaudio` to compute deltas, and exposes the padding mode.

- `MaskFreq` & `MaskTime` - the GPU versions are modified to more closely match the original SpecAugment paper, with a couple of other additions:

  A. Masks are now assigned a random width (within a specified range).

  B. The replacement value is now the mean of the masked area, to avoid changing the spectrogram's overall mean (although the standard deviation will still be affected). This can be overridden by specifying a `mask_val`.

  C. You can no longer specify where you want the mask(s) to start. It's always random.

  D. One objective of SpecAugment masking seems to be encouraging the network to learn how to look at parts of the spectrogram it would otherwise ignore. The same mask will now span all channels, to ensure the net does not avoid this by inferring the missing information from the same region in another channel (which is likely to be quite similar).
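The masking behaviour in points A, B, and D can be sketched like this. This is illustrative only, not the PR's code, and names and defaults are assumptions:

```python
import torch

def mask_freq(sg, min_width=5, max_width=15):
    """Sketch of the described masking: a random mask width (A), filled
    with the mean of the masked area (B), at a random position (C),
    spanning all channels (D)."""
    sg = sg.clone()
    n_freq = sg.shape[-2]
    width = int(torch.randint(min_width, max_width + 1, (1,)))
    start = int(torch.randint(0, n_freq - width + 1, (1,)))
    region = sg[..., start:start + width, :]
    # One mean over every channel, and the same rows masked in each
    # channel, so one channel can't reveal what another is hiding.
    sg[..., start:start + width, :] = region.mean()
    return sg

sg = torch.randn(2, 128, 100)  # (channels, freq, time)
out = mask_freq(sg)
assert out.shape == sg.shape
assert not torch.equal(out, sg)  # some region was replaced
```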
Benchmarks
I benchmarked the new transforms on two boxes: a Colab instance (Tesla T4 GPU, Xeon CPU) and my local machine (GTX 970, i5-4590).

Results are presented for both. Benchmarks are repeated 1000 times on the Colab box and 100 times locally, except for the `batch_size=64` tests, which are repeated 250 times on the Colab box and 25 times locally. These benchmarks are on the plain Python versions of the transforms; I haven't compiled them to torchscript. Let me know how you'd like me to interact with torchscript and whether you'd like me to benchmark that too.

Some results for the replacement delta transforms are missing on GPU due to an upstream bug affecting large tensors. It gets tripped by the way `torchaudio` packs spectrograms for the delta method. This bug might not fire on newer cards (it may be related to the maximum CUDA block size). I can pull `DeltaGPU` out into a separate PR if you'd like to wait until the upstream issues are fixed.

Old vs. New Implementations
I compared the execution speed on CPU between the old and new methods to establish which of the new methods add unacceptable overhead and which should replace the old implementations. Benchmarking script here.
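As a rough illustration of the shape of such a CPU comparison (the real benchmarking script is linked above; the transforms here are just placeholders):

```python
import timeit
import torch

# Minimal sketch of a CPU speed comparison between two implementations.
x = torch.randn(1, 16000)

def old_transform(t):
    return t * 0.5          # stand-in for an original implementation

def new_transform(t):
    return t.mul(0.5)       # stand-in for a GPU-compatible version

results = {}
for name, fn in [("old", old_transform), ("new", new_transform)]:
    # timeit returns the total elapsed seconds for `number` calls
    results[name] = timeit.timeit(lambda: fn(x), number=100)

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f}s for 100 runs")
```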
Results are split between `AudioTensor` and `AudioSpectrogram` objects. These operations are performed on single items with no batch dimension.

AudioTensor
Colab (Xeon, x2 @ 2.20 GHz)
Local (i5-4590, x4 @ 3.30 GHz)
Conclusion
Based on these results I propose:
- `AddNoise` - replace. The GPU-compatible version seems to have similar overhead, but does more.

- `ChangeVolume` - replace. This method is so fast to begin with that the loss of efficiency is unlikely to be significant relative to the entire pipeline. Auto-batching may also be responsible for a chunk of it.

- `SignalCutout` - undecided. The GPU-compatible version is slower, but also allows a minimum cut percentage to be specified. If the original is kept, I think it should also add that.

- `SignalLoss` - replace. The additional overhead of the GPU version appears minimal; I propose replacing for cleanliness.

AudioSpectrogram
Colab (Xeon, 2 Cores @ 2.20 GHz)
Local (i5-4590, 4 Cores @ 3.30 GHz)
Conclusion
- `TfmResize` - replace. The new implementation is comparable.

- `Delta` - replace. The new implementation is much faster.

- `MaskTime` - replace. The new implementation is comparable but does a lot more.

- `MaskFreq` - replace. The new implementation is much slower but does a lot more. Interestingly, this shows the overhead of the conversion operations in the original `MaskTime` implementation: they dwarf the underlying transform.

GPU Performance
I've also benchmarked the performance on GPU, to give an idea of how the new implementations scale and the relative overhead of the different operations.
I'm suspicious of the CPU vs. GPU results on the GTX 970 box. I think the GPU should be performing dramatically better (those that I've used in a real training loop have negligible overhead compared to their CPU counterparts), so there may be a problem in the benchmarking script. I believe the 970 also has strange memory characteristics that mean the upper portion of its VRAM is slow compared to the rest - I don't know if that would be affecting things.
GPU Only, Various Batch Sizes
This is just to illustrate how each transform scales. Some transforms at larger batch sizes are missing due to CUDA memory errors.
Tesla T4
GTX 970
GPU vs. CPU, Batch Size = 32
Tesla T4
GTX 970
GPU vs. CPU, Batch Size = 64
Tesla T4
GTX 970
Automatic Batching
I've measured two dummy transforms that do nothing: one with the `@auto_batch` wrapper and one without. The wrapper is very cheap, adding minimal overhead.

Colab (Tesla, Xeon)
Local (970, i5-4590)
Tests
Tests are not final. I've currently just switched them over to the GPU versions of the transforms. Once a final set of transforms has been decided, I can concretize the tests.
Docstrings
Docstrings aren't final. I'll write them once the code is finalised. Let me know
what format you'd prefer.
FastAI API Integration
I've currently set the transforms up as subclasses of the basic `Transform` class, but this might not be ideal. I'm not familiar enough with the subclasses of `Transform` to know if this is correct. Perhaps `DisplayedTransform` would be preferable?

Conclusion
Let me know whether these implementations are acceptable and which of the original transforms you would like to keep or discard. I think it makes sense to merge these here initially; then I can look at upstreaming relevant transforms (e.g. the new SpecAugment masking implementation) into `torchaudio` or `torch-audiomentations`.