
GPU-Compatible Batch Versions of Existing Transforms #85

Merged (13 commits into fastaudio:master on May 23, 2021)

Conversation

@jcaw (Contributor) commented Jan 14, 2021

Introduction

This PR adds GPU implementations for the following transforms:

  • Signal: AddNoise, ChangeVolume, SignalCutout, SignalLoss
  • Spectrogram: TfmResize, Delta, MaskFreq, MaskTime

The GPU implementations are currently added alongside the originals, e.g. Delta vs. DeltaGPU. I propose replacing the originals outright where possible; a more thorough, benchmarked analysis follows below.

Demos

I set up a Colab notebook with demos of the new transforms here. (It might make sense to turn this into documentation for all transforms at some point.)

Automatic Batching

I've added a wrapper, @auto_batch, that can be applied to the encodes method of any batch transform to make it compatible with single items too. You just specify the number of dimensions in a single item; when an item tensor is received, a dummy batch dimension is added for the duration of the transform.

As a result, all of these transforms work on both batches and items, with no user intervention.

(The overhead of the wrapper is measured below.)

Changes in Behaviour

Some methods have had their behaviour expanded/altered in the port.

AudioTensor

  1. AddNoise - the GPU version exposes a minimum value for the noise, and allows the transform to be applied to a random subset of items. I have seen certain nets degrade when noise is added to all samples, seemingly because they learn to always expect background noise and can't deal with clean samples. Adding noise to only a subset of items fixes this, so it seems a sensible addition.

    Noise values are also changed so the max & min are relative to one standard deviation of the original tensor, rather than the range -1 to 1, so the noise level is consistent relative to the average level of the signal. This has its own drawbacks (e.g. samples that are mostly silence get much less noise), so I'm not sure which method is better. Let me know which you prefer - I can change it easily. (A sketch of this scheme follows the list below.)

    (Noise can now be added directly to spectrograms too.)

  2. ChangeVolume - no change.

  3. SignalCutout - a minimum cutout percentage is now exposed.

  4. SignalLoss - no change.

AudioSpectrogram

  1. TfmResize - no change.
  2. Delta - the GPU version now uses torchaudio to compute deltas, and exposes the padding mode.
  3. MaskFreq & MaskTime - the GPU version is modified to more closely match the original SpecAugment paper, with a couple of other additions.
    A. Masks are now assigned a random width (within a specified range).
    B. The replacement value is now the mean of the masked area, to avoid changing the spectrogram's overall mean (although the standard deviation will still be affected). This can be overridden by specifying a mask_val.
    C. You can no longer specify where you want the mask(s) to start. It's always random.
    D. One objective of SpecAugment masking seems to be encouraging the network to learn to look at parts of the spectrogram it would otherwise ignore. The same mask now spans all channels, to ensure the net can't sidestep this by inferring the missing information from the same region of another channel (which is likely to be very similar). (A sketch of the masking scheme follows this list.)

Benchmarks

I benchmarked the new transforms on two boxes:

  1. Local - Nvidia GTX 970, i5-4590 (4 cores, 3.30 GHz)
  2. Colab - Tesla T4, Intel Xeon (2 cores, 2.20 GHz)

Results are presented for both. Benchmarks are repeated 1000 times on the Colab box and 100 times locally, except for the batch_size=64 tests, which are repeated 250 times on Colab and 25 times locally. These benchmarks use the plain Python versions of the transforms; I haven't compiled them to TorchScript. Let me know how you'd like me to handle TorchScript and whether you'd like it benchmarked too.

Some results for the replacement delta transforms are missing on GPU due to an upstream bug affecting large tensors. It gets triggered by the way torchaudio packs spectrograms for the delta computation. The bug might not fire on newer cards (it may be related to the maximum CUDA block size). I can pull DeltaGPU out into a separate PR if you'd like to wait until the upstream issue is fixed.

Old vs. New Implementations

I compared CPU execution speed between the old and new implementations, to establish which of the new methods add unacceptable overhead and which should replace the originals. Benchmarking script here.

Results are split between AudioTensor and AudioSpectrogram objects. These operations are performed on single items with no batch dimension.

AudioTensor

Colab (Xeon, 2 cores @ 2.20 GHz)

[benchmark plot]

Local (i5-4590, 4 cores @ 3.30 GHz)

[benchmark plot]

Conclusion

Based on these results I propose:

  1. AddNoise - replace. The GPU-compatible version seems to have similar overhead, but does more.
  2. ChangeVolume - replace. This method is so fast to begin with that the loss of efficiency is not likely to be significant relative to the entire pipeline. Auto-batching may also be responsible for a chunk of this.
  3. SignalCutout - undecided. The GPU-compatible version is slower, but also allows a minimum cut percentage to be specified. If the original is kept, I think it should gain that option too.
  4. SignalLoss - replace. The additional overhead of the GPU version appears minimal. I propose replacing for cleanliness.

AudioSpectrogram

Colab (Xeon, 2 cores @ 2.20 GHz)

[benchmark plot]

Local (i5-4590, 4 cores @ 3.30 GHz)

[benchmark plot]

Conclusion

  1. TfmResize - replace. The new implementation is comparable.
  2. Delta - replace. The new implementation is much faster. (A short compute_deltas example follows this list.)
  3. MaskTime - replace. The new implementation is comparable but does a lot more.
  4. MaskFreq - replace. The new implementation is much slower but does a lot more. Interestingly, this shows the overhead of the conversion operations in the original MaskTime implementation - they dwarf the underlying transform.

GPU Performance

I've also benchmarked the performance on GPU, to give an idea of how the new implementations scale and the relative overhead of the different operations.

I'm suspicious of the CPU vs. GPU results on the GTX 970 box. I think the GPU should be performing dramatically better (the transforms I've used in a real training loop have negligible overhead compared to their CPU counterparts), so there may be a problem in the benchmarking script. The 970 also has unusual memory segmentation - the last 0.5 GB of its 4 GB VRAM is much slower than the rest - and I don't know whether that affects things.

GPU Only, Various Batch Sizes

This is just to illustrate how each transform scales. Results for some transforms at larger batch sizes are missing due to CUDA out-of-memory errors.

Tesla T4

[batch-size scaling plots]

GTX 970

[batch-size scaling plots]

GPU vs. CPU, Batch Size = 32

Tesla T4

[GPU vs. CPU benchmark plots]

GTX 970

[GPU vs. CPU benchmark plots]

GPU vs. CPU, Batch Size = 64

Tesla T4

[GPU vs. CPU benchmark plots]

GTX 970

[GPU vs. CPU benchmark plots]

Automatic Batching

I've measured two dummy transforms that do nothing: one with the @auto_batch wrapper and one without. The wrapper is very cheap, adding minimal overhead. (A rough stand-in for this measurement follows the plots.)

Colab (Tesla T4, Xeon)

[overhead benchmark plot]

Local (GTX 970, i5-4590)

[overhead benchmark plot]

Tests

Tests are not final. I've currently just switched them over to the GPU versions of the transforms. Once the final set of transforms has been decided, I can finalise the tests.

Docstrings

Docstrings aren't final. I'll write them once the code is finalised. Let me know what format you'd prefer.

FastAI API Integration

I've currently set the transforms up as subclasses of the basic Transform class, but this might not be ideal. I'm not familiar enough with the subclasses of Transform to know if this is correct. Perhaps DisplayedTransform would be preferable?

Conclusion

Let me know whether these implementations are acceptable and which of the original transforms you'd like to keep or discard. I think it makes sense to merge these here initially; I can then look at upstreaming relevant transforms (e.g. the new SpecAugment masking implementation) into torchaudio or torch-audiomentations.

@scart97 scart97 added the enhancement New feature or request label Jan 14, 2021
@scart97 scart97 changed the title GPU-Compatible Batch Versions of Existing Transforms BREAKING CHANGE: GPU-Compatible Batch Versions of Existing Transforms Jan 14, 2021
@scart97 scart97 changed the title BREAKING CHANGE: GPU-Compatible Batch Versions of Existing Transforms FEATURE: GPU-Compatible Batch Versions of Existing Transforms Jan 14, 2021
@jcaw (Contributor, Author) commented Jan 16, 2021

I can't reproduce the DeltaGPU test that's failing in the CI on my machine.

=========================== short test summary info ============================
FAILED tests/test_spectrogram_augment.py::test_delta_channels - AssertionErro...
============ 1 failed, 77 passed, 15 warnings in 136.70s (0:02:16) =============

It passes fine locally:

$ pytest tests/test_spectrogram_augment.py::test_delta_channels
=================================== test session starts ===================================
platform linux -- Python 3.8.5, pytest-6.2.1, py-1.9.0, pluggy-0.13.1 -- /home/jcaw/opt/miniconda3/envs/fastaudio-python38/bin/python

...

============================== 1 passed, 3 warnings in 2.47s ==============================

@mogwai mogwai linked an issue Jan 17, 2021 that may be closed by this pull request
@mogwai (Member) commented Jan 17, 2021

Wow, this is great @jcaw. I definitely agree with your conclusion that you should get this upstream somehow. I think the @auto_batch magic is very clever.

> Can't reproduce the DeltaGPU test that's failing in the CI on my machine.

I have pulled the code and run the following commands to test the code:

# Removes any straggling files (be warned: this will remove files you might be working on, including untracked files)
git clean -xdf
pip install -e .[dev,testing]
pytest -k delta

However, I get the same issue as the CI:

AssertionError: !=:
E       AudioSpectrogram([[ 22.0069,  20.6277,  15.9740,  ...,  15.9218,  20.5427,  21.9010],
E               [ 22.2190,  20.8464,  16.1923,  ...,  16.1378,  20.7625,  22.1109],
E               [ 22.5125,  21.1315,  16.4866,  ...,  16.4314,  21.0446,  22.4059],
E               ...,
E               [-12.0008, -13.3763, -18.0215,  ..., -18.0753, -13.4640, -12.1110],
E               [-11.9635, -13.3389, -17.9841,  ..., -18.0514, -13.4395, -12.0864],
E               [-11.9141, -13.2894, -17.9346,  ..., -18.0018, -13.3879, -12.0346]])
E       AudioSpectrogram([[ 22.0069,  20.6277,  15.9740,  ...,  15.9218,  20.5427,  21.9010],
E               [ 22.2190,  20.8464,  16.1923,  ...,  16.1378,  20.7625,  22.1109],
E               [ 22.5125,  21.1315,  16.4866,  ...,  16.4314,  21.0446,  22.4059],
E               ...,
E               [-12.0008, -13.3763, -18.0215,  ..., -18.0753, -13.4640, -12.1110],
E               [-11.9635, -13.3389, -17.9841,  ..., -18.0514, -13.4395, -12.0864],
E               [-11.9141, -13.2894, -17.9346,  ..., -18.0018, -13.3879, -12.0346]])

@scart97 (Collaborator) commented Jan 18, 2021

Initially I could not reproduce the error locally, but after debugging I found what causes the test to break only on some machines. The problem is how one of the tests is defined: naively comparing two floating point numbers is error-prone (0.1 + 0.2 == 0.3 evaluates to False).
So, instead of comparing:

_test_ne(out[0], out[1])

The correct way would be:

assert not torch.allclose(out[0], out[1])

With this change, the test breaks consistently on my machine. I haven't yet investigated why it's breaking.

@scart97 (Collaborator) commented Jan 18, 2021

So, the test starts with two-channel audio (shape 2, 32000), where both channels contain the same information (torch.allclose(audio[0], audio[1]) is true).

The audio gets converted into a spectrogram (shape 2, 128, 251), where again the two channels have the same information.

Finally, the delta is computed, and here lies the mismatch between the test and the implementation. The final step is to concatenate the different deltas:

torch.cat([sg, delta, delta2], dim=1)

This creates an output of shape (6, 128, 251). As I understand it, the breaking test is trying to assert that the delta differs from the original spectrogram, so it should compare out[0] and out[2]. After the concatenation, out[0] and out[1] both contain the original spectrogram, out[2] and out[3] the first delta, and out[4] and out[5] the second delta. Because the two input channels are identical, out[0] equals out[1], and the test breaks.

@jcaw (Contributor, Author) commented Jan 20, 2021

Figured it was probably a floating-point issue. Thanks for digging up the problem - I've updated that test to check the correct indices and tolerate float errors.

This seems like a generic issue. Should we be using allclose (or similar) on all tests?

codecov bot commented Jan 20, 2021

Codecov Report

Merging #85 (50fecc5) into master (449a0b4) will decrease coverage by 4.38%.
The diff coverage is 94.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #85      +/-   ##
==========================================
- Coverage   90.60%   86.21%   -4.39%     
==========================================
  Files          12       13       +1     
  Lines         649      827     +178     
  Branches       70       82      +12     
==========================================
+ Hits          588      713     +125     
- Misses         34       83      +49     
- Partials       27       31       +4     
Impacted Files Coverage Δ
src/fastaudio/util.py 91.17% <80.00%> (-8.83%) ⬇️
src/fastaudio/augment/spectrogram.py 85.61% <90.47%> (-5.73%) ⬇️
src/fastaudio/augment/signal.py 82.02% <96.49%> (-14.05%) ⬇️
src/fastaudio/functional.py 97.14% <97.14%> (ø)
src/fastaudio/core/signal.py 71.87% <0.00%> (-9.38%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 449a0b4...50fecc5.

@mogwai (Member) commented Jan 22, 2021

Yes we should but in a separate PR :)

@mogwai mogwai changed the title FEATURE: GPU-Compatible Batch Versions of Existing Transforms GPU-Compatible Batch Versions of Existing Transforms Mar 4, 2021
@mogwai (Member) commented Mar 12, 2021

Hey @jcaw, we're going to get to this soon, sorry about the long wait!

@@ -0,0 +1,251 @@
import torch
Review comment (Collaborator):
This file should be inside the augment/ folder, as it contains the functional implementation of the data augmentations

from fastaudio.functional import region_mask


class TestCreateRegionMask:
Review comment (Collaborator):
This unittest style class can probably be transformed to the pytest style of simple functions

@mogwai (Member), Mar 12, 2021:
Are we preferring simple function tests instead of the class style?

Review comment (Collaborator):
Yes

@scart97 (Collaborator) commented Mar 12, 2021

As far as I've seen, the new transforms do not require a GPU: they simply have a batched implementation that respects the device the input came from. Given that, I'm not sure the *GPU suffix is the right naming for the transforms.
Any suggestions @mogwai ?

@mogwai (Member) commented Mar 12, 2021

I agree with those comments.

I think it would also be nice to have the benchmark code as part of this PR.

Great job @jcaw. I'd definitely consider getting this into torch-audiomentations as well if you want to.

@scart97 scart97 merged commit 50fecc5 into fastaudio:master May 23, 2021
scart97 added a commit that referenced this pull request May 23, 2021
GPU-compatible batch transforms

BREAKING CHANGE: GPU-Compatible batch transforms (#85)
@scart97 (Collaborator) commented May 23, 2021

🎉 This PR is included in version 1.0.0 🎉

The release is available on GitHub.

Your semantic-release bot 📦🚀
