GPU-Compatible Batch Versions of Existing Transforms #85
Conversation
Items are automatically converted to batches for the transform
Includes & requires a port of the numpy colored noise generation to PyTorch.
Small changes to adhere to the style guide so as not to block git hooks
Somewhat clearer that it's a range for the random multiplier
I can't reproduce the CI failure. It passes fine locally.
Wow, this is great @jcaw. I'd definitely agree with your conclusion that you should get this upstream somehow. I think the
I have pulled the code and run the following commands to test it:

However, I get the same issue as the CI.
Initially I could not reproduce the error locally, but after debugging I found what causes the test to break only on some machines. The problem is how one of the tests is defined: simply comparing two floating point numbers is error prone (`0.1 + 0.2 == 0.3` is usually false).

`_test_ne(out[0], out[1])`

The correct way would be:

`assert not torch.allclose(out[0], out[1])`

With this change the test breaks consistently on my machine. I haven't yet investigated why it's breaking.
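To illustrate why exact float equality is unreliable and why a tolerance-based comparison like `torch.allclose` fixes it (a minimal sketch, not the project's test code):

```python
import torch

# Exact float equality is fragile: rounding error makes mathematically
# equal values compare unequal.
print(0.1 + 0.2 == 0.3)  # False: the left side is 0.30000000000000004

# The same applies to tensors, so tests should compare within a tolerance.
a = torch.tensor([0.1], dtype=torch.float64) + torch.tensor([0.2], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)
assert not bool(a == b)       # exact comparison fails
assert torch.allclose(a, b)   # tolerance-based comparison passes

# A "not equal" test should therefore assert the values differ beyond
# tolerance, not merely that they aren't bit-identical:
assert not torch.allclose(a, b + 1.0)
```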
So, the test starts with two-channel audio. The audio gets converted into a spectrogram. Finally, the delta is computed, and here's the mismatch between the tests and the implementation. The final step is to concatenate the different deltas: `torch.cat([sg, delta, delta2], dim=1)`. This creates an output with three times the channels of the input spectrogram.
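The channel-tripling effect of that concatenation can be demonstrated with assumed shapes (the exact sizes from this discussion aren't shown here; these are illustrative only):

```python
import torch

# Illustrative shapes only: a batch of two-channel spectrograms.
sg = torch.randn(4, 2, 128, 100)   # (batch, channels, freq, time)
delta = torch.randn_like(sg)
delta2 = torch.randn_like(sg)

# Concatenating along dim=1 stacks the deltas as extra channels,
# so a 2-channel input produces a 6-channel output.
out = torch.cat([sg, delta, delta2], dim=1)
assert out.shape == (4, 6, 128, 100)
```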
Figured it was probably a floating point issue. Thanks for digging up the problem. I've updated that test to check the correct indices and tolerate float errors. This seems like a generic issue. Should we be using
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #85      +/-   ##
==========================================
- Coverage   90.60%   86.21%    -4.39%
==========================================
  Files          12       13       +1
  Lines         649      827     +178
  Branches       70       82      +12
==========================================
+ Hits          588      713     +125
- Misses         34       83      +49
- Partials       27       31       +4
```

Continue to review full report at Codecov.
Yes we should, but in a separate PR :)
Hey @jcaw, we're going to get to this soon. Sorry about the long wait!
This file should be inside the `augment/` folder, as it contains the functional implementation of the data augmentations.
`class TestCreateRegionMask:`
This unittest-style class can probably be transformed to the pytest style of simple functions.
Are we preferring simple function tests instead of the class style?
Yes
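For illustration, a pytest-style version of such a test might look like this. Note the `region_mask` below is a toy stand-in, since the real signature isn't shown here:

```python
import torch

def region_mask(size, start, length):
    """Toy stand-in for fastaudio's region_mask (the real signature
    may differ); marks a contiguous region of a boolean mask."""
    mask = torch.zeros(size, dtype=torch.bool)
    mask[start:start + length] = True
    return mask

# pytest style: a plain top-level function, no class wrapper required.
# pytest discovers any function named test_*.
def test_region_mask_marks_requested_region():
    mask = region_mask(10, start=2, length=3)
    assert mask.sum() == 3
    assert mask[2:5].all()

test_region_mask_marks_requested_region()
```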
As far as I've seen, the new transforms do not require a GPU to be used; they only have a batched implementation that respects the device the input came from. Considering this, I'm not sure the `*GPU` suffix is correct to use when naming the transforms.
I agree with those comments. I think it would also be nice to have the benchmark code as part of this PR. Great job @jcaw. I'd definitely consider getting this into torch-audiomentations as well if you want to.
GPU-compatible batch transforms BREAKING CHANGE: GPU-Compatible batch transforms (#85)
🎉 This PR is included in version 1.0.0 🎉 The release is available on GitHub. Your semantic-release bot 📦🚀
Introduction
This PR adds GPU implementations for the following transforms:

- `AddNoise`, `ChangeVolume`, `SignalCutout`, `SignalLoss` (for `AudioTensor`)
- `TfmResize`, `Delta`, `MaskFreq`, `MaskTime` (for `AudioSpectrogram`)

The GPU implementations are currently added alongside the originals, e.g. `Delta` vs. `DeltaGPU`. I propose replacing the originals outright where possible, but I've done a more thorough analysis with benchmarking below.

Demos
I set up a Colab notebook with demos of the new transforms here. (It might make sense to turn this into documentation for all transforms at some point.)
Automatic Batching
I've added a wrapper, `@auto_batch`, that can be added to the `encodes` method of any batch transform to make it compatible with single items too. You just need to specify the number of dimensions in a single item; when an item tensor is received, a dummy batch dimension is added for the transform. As a result, all of these transforms work on both batches and items, with no user intervention.

(The overhead of the wrapper is measured below.)
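The idea of the wrapper can be sketched roughly like this. This is a simplified stand-in on a plain function, not the PR's actual implementation:

```python
import functools
import torch

def auto_batch(item_dims):
    """Sketch of the @auto_batch idea (the PR's real wrapper may differ):
    if the input has `item_dims` dimensions it is a single item, so add a
    dummy batch dimension, run the batch transform, then remove it again."""
    def decorator(encodes):
        @functools.wraps(encodes)
        def wrapper(x, *args, **kwargs):
            if x.dim() == item_dims:  # a single item, not a batch
                return encodes(x.unsqueeze(0), *args, **kwargs).squeeze(0)
            return encodes(x, *args, **kwargs)
        return wrapper
    return decorator

@auto_batch(item_dims=2)  # one audio item is (channels, samples)
def halve_volume(x):      # stand-in for a batch transform's encodes
    return x * 0.5

item = torch.ones(1, 16)      # item: no batch dimension
batch = torch.ones(4, 1, 16)  # batch of 4
assert halve_volume(item).shape == (1, 16)
assert halve_volume(batch).shape == (4, 1, 16)
```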
Changes in Behaviour
Some methods have had their behaviour expanded/altered in the port.
AudioTensor
- `AddNoise` - the GPU version exposes a minimum value for the noise, and allows the transform to be applied to a random subset of items. I have seen certain nets degrade when noise is added to all samples, seemingly because they learn to always expect background noise and don't know how to deal with clean samples. Adding noise to only a subset of items fixes this, so it seems a sensible addition.

  Noise values are also changed so the max & min are relative to one standard deviation of the original tensor, rather than the range `-1` to `1`, so the noise level is consistent relative to the average level of the signal. This has its own drawbacks (e.g. samples that are mostly silence get much less noise), so I'm not sure which method is better. Let me know which you prefer; I can change it easily.

  (Noise can also now be added directly to spectrograms too.)
- `ChangeVolume` - no change.

- `SignalCutout` - a minimum cutout percentage is now exposed.

- `SignalLoss` - no change.

AudioSpectrogram
- `TfmResize` - no change.

- `Delta` - the GPU version now uses `torchaudio` to compute deltas, and exposes the padding mode.

- `MaskFreq` & `MaskTime` - the GPU versions are modified to more closely match the original SpecAugment paper, with a couple of other additions:

  A. Masks are now assigned a random width (within a specified range).

  B. The replacement value is now the mean of the masked area, to avoid changing the spectrogram's overall mean (although the standard deviation will still be affected). This can be overridden by specifying a `mask_val`.

  C. You can no longer specify where you want the mask(s) to start. It's always random.

  D. One objective of SpecAugment masking seems to be encouraging the network to learn how to look at parts of the spectrogram it would otherwise ignore. The same mask will now span all channels, to ensure the net does not avoid this by inferring the missing information from the same region in another channel (which is likely to be quite similar).
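The masking behaviour in points A, B, and D can be sketched like this. This is illustrative only, not the PR's code, and names and defaults are assumptions:

```python
import torch

def mask_freq(sg, min_width=5, max_width=15):
    """Sketch of the described masking: a random mask width (A), filled
    with the mean of the masked area (B), at a random position (C),
    spanning all channels (D)."""
    sg = sg.clone()
    n_freq = sg.shape[-2]
    width = int(torch.randint(min_width, max_width + 1, (1,)))
    start = int(torch.randint(0, n_freq - width + 1, (1,)))
    region = sg[..., start:start + width, :]
    # One mean over every channel, and the same rows masked in each
    # channel, so one channel can't reveal what another is hiding.
    sg[..., start:start + width, :] = region.mean()
    return sg

sg = torch.randn(2, 128, 100)  # (channels, freq, time)
out = mask_freq(sg)
assert out.shape == sg.shape
assert not torch.equal(out, sg)  # some region was replaced
```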
Benchmarks
I benchmarked the new transforms on two boxes: a Colab instance (Tesla T4 GPU, Xeon CPU) and my local machine (GTX 970, i5-4590).

Results are presented for both. Benchmarks are repeated 1000 times on the Colab box and 100 times locally, except for the `batch_size=64` tests, which are repeated 250 times on the Colab box and 25 times locally. These benchmarks are on the plain Python versions of the transforms; I haven't compiled them to torchscript. Let me know how you'd like me to interact with torchscript and whether you'd like me to benchmark that too.

Some results for the replacement delta transforms are missing on GPU due to an upstream bug affecting large tensors. It gets tripped by the way `torchaudio` packs spectrograms for the delta method. This bug might not fire on newer cards (it may be related to the maximum CUDA block size). I can pull `DeltaGPU` out into a separate PR if you'd like to wait until the upstream issues are fixed.

Old vs. New Implementations
I compared the execution speed on CPU between the old and new methods to establish which of the new methods add unacceptable overhead and which should replace the old implementations. Benchmarking script here.
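As a rough illustration of the shape of such a CPU comparison (the real benchmarking script is linked above; the transforms here are just placeholders):

```python
import timeit
import torch

# Minimal sketch of a CPU speed comparison between two implementations.
x = torch.randn(1, 16000)

def old_transform(t):
    return t * 0.5          # stand-in for an original implementation

def new_transform(t):
    return t.mul(0.5)       # stand-in for a GPU-compatible version

results = {}
for name, fn in [("old", old_transform), ("new", new_transform)]:
    # timeit returns the total elapsed seconds for `number` calls
    results[name] = timeit.timeit(lambda: fn(x), number=100)

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f}s for 100 runs")
```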
Results are split between `AudioTensor` and `AudioSpectrogram` objects. These operations are performed on single items with no batch dimension.

AudioTensor
Colab (Xeon, x2 @ 2.20 GHz)
Local (i5-4590, x4 @ 3.30 GHz)
Conclusion
Based on these results I propose:
- `AddNoise` - replace. The GPU-compatible version seems to have similar overhead, but does more.

- `ChangeVolume` - replace. This method is so fast to begin with that the loss of efficiency is unlikely to be significant relative to the entire pipeline. Auto-batching may also be responsible for a chunk of it.

- `SignalCutout` - undecided. The GPU-compatible version is slower, but also allows a minimum cut percentage to be specified. If the original is kept, I think it should also add that.

- `SignalLoss` - replace. The additional overhead of the GPU version appears minimal; I propose replacing for cleanliness.

AudioSpectrogram
Colab (Xeon, 2 Cores @ 2.20 GHz)
Local (i5-4590, 4 Cores @ 3.30 GHz)
Conclusion
- `TfmResize` - replace. The new implementation is comparable.

- `Delta` - replace. The new implementation is much faster.

- `MaskTime` - replace. The new implementation is comparable but does a lot more.

- `MaskFreq` - replace. The new implementation is much slower but does a lot more. Interestingly, this shows the overhead of the conversion operations in the original `MaskTime` implementation: they dwarf the underlying transform.

GPU Performance
I've also benchmarked the performance on GPU, to give an idea of how the new implementations scale and the relative overhead of the different operations.
I'm suspicious of the CPU vs. GPU results on the GTX 970 box. I think the GPU should be performing dramatically better (those that I've used in a real training loop have negligible overhead compared to their CPU counterparts), so there may be a problem in the benchmarking script. I believe the 970 also has strange memory characteristics that mean the upper portion of its VRAM is slow compared to the rest - I don't know if that would be affecting things.
GPU Only, Various Batch Sizes
This is just to illustrate how each transform scales. Some transforms at larger batch sizes are missing due to CUDA memory errors.
Tesla T4
GTX 970
GPU vs. CPU, Batch Size = 32
Tesla T4
GTX 970
GPU vs. CPU, Batch Size = 64
Tesla T4
GTX 970
Automatic Batching
I've measured two dummy transforms that do nothing: one with the `@auto_batch` wrapper and one without. The wrapper is very cheap, adding minimal overhead.

Colab (Tesla, Xeon)
Local (970, i5-4590)
Tests
Tests are not final. I've currently just switched them over to the GPU versions of the transforms. Once a final set of transforms has been decided, I can concretize the tests.
Docstrings
Docstrings aren't final. I'll write them once the code is finalised. Let me know
what format you'd prefer.
FastAI API Integration
I've currently set the transforms up as subclasses of the basic `Transform` class, but this might not be ideal. I'm not familiar enough with the subclasses of `Transform` to know if this is correct. Perhaps `DisplayedTransform` would be preferable?

Conclusion
Let me know whether these implementations are acceptable and which of the original transforms you would like to keep or discard. I think it makes sense to merge these here initially; then I can look at upstreaming relevant transforms (e.g. the new SpecAugment masking implementation) into `torchaudio` or `torch-audiomentations`.