Adding general data augmentation methods for speech preprocessing #5370
This pull request is now in conflict :(
Codecov Report
@@ Coverage Diff @@
## master #5370 +/- ##
==========================================
+ Coverage 77.13% 77.19% +0.05%
==========================================
Files 678 679 +1
Lines 61537 61703 +166
==========================================
+ Hits 47465 47630 +165
- Misses 14072 14073 +1
@Jungjee, can you review this PR?
espnet2/layers/augmentation.py
-4 for shifting pitch down by 4/`bins_per_octave` octaves
4 for shifting pitch up by 4/`bins_per_octave` octaves
bins_per_octave (int): number of steps per octave
n_fft (int): length of FFT (in second)
float?
Thanks for noticing that!
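For reference, the docstring's "4/`bins_per_octave` octaves" corresponds to a frequency ratio of 2 raised to `n_steps / bins_per_octave`. The helper below is a minimal sketch of that arithmetic only; the name `pitch_ratio` is hypothetical and does not appear in the PR.

```python
def pitch_ratio(n_steps: int, bins_per_octave: int = 12) -> float:
    """Frequency ratio for a shift of n_steps / bins_per_octave octaves.

    Hypothetical helper for illustration; not part of the PR.
    """
    # One octave doubles the frequency, so a fraction of an octave
    # scales the frequency by 2 ** fraction.
    return 2.0 ** (n_steps / bins_per_octave)


# With 12 bins per octave, 4 steps up is a third of an octave.
up = pitch_ratio(4, 12)
down = pitch_ratio(-4, 12)
print(f"up: {up:.4f}, down: {down:.4f}")
```

Shifting up and then down by the same number of steps gives ratios whose product is 1, so the two examples in the docstring are inverse operations.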
source_sample_rate = source_sample_rate // gcd
target_sample_rate = target_sample_rate // gcd

ret = torchaudio.functional.resample(
Just questions.
Did you consider applying one without a pitch shift? Would it require significantly more computation?
Also, how is the training speed with this augmentation (any bottlenecks in data loading)?
Would time stretch be equivalent to `speed_perturb` with factor > 1, except for the pitch?
`speed_perturb` and `time_stretch` are two different time-scaling methods. The former changes the pitch while the latter does not. I think the choice depends on the use case, so I provide both and let the user choose.
I haven't rigorously tested the speed difference yet. I will run some tests later.
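To illustrate why speed perturbation changes both duration and pitch, here is a minimal sketch of the rate arithmetic, assuming the perturbation is done by treating the signal as if it were sampled at `factor * sample_rate` and resampling back to `sample_rate`, with both rates reduced by their gcd as in the hunk above. The function names are hypothetical, and the ceiling used for the output length is an assumption about the resampler's rounding.

```python
import math
from typing import Tuple


def reduced_resample_rates(sample_rate: int, factor: float) -> Tuple[int, int]:
    """Return (source_rate, target_rate) divided by their gcd.

    Hypothetical helper; assumes speed perturbation by resampling.
    """
    source = int(factor * sample_rate)
    target = int(sample_rate)
    gcd = math.gcd(source, target)
    return source // gcd, target // gcd


def approx_output_length(num_samples: int, sample_rate: int, factor: float) -> int:
    """Approximate length after perturbation (rounding is an assumption)."""
    source, target = reduced_resample_rates(sample_rate, factor)
    return math.ceil(num_samples * target / source)


# One second of 16 kHz audio sped up by 10%: fewer samples, higher pitch.
print(reduced_resample_rates(16000, 1.1))
print(approx_output_length(16000, 16000, 1.1))
```

A time stretch, by contrast, changes the number of frames in the STFT domain and resynthesizes the waveform, which is why it can leave the pitch unchanged.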
waveform, n_fft, hop_length, win_length, window=window, return_complex=True
)
freq = spec.size(-2)
phase_advance = torch.linspace(0, math.pi * hop_length, freq)[..., None]
Is `[..., None]` equivalent to `.unsqueeze(-1)` here?
Yes. These are just the same operations.
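A quick check of that equivalence, as a minimal sketch: the values of `hop_length` and `freq` below are illustrative only and are not taken from the PR.

```python
import math

import torch

hop_length, freq = 128, 257  # illustrative values, not from the PR

base = torch.linspace(0, math.pi * hop_length, freq)  # shape: (freq,)
indexed = base[..., None]      # append a trailing axis via indexing
unsqueezed = base.unsqueeze(-1)  # append a trailing axis via unsqueeze

# Both forms give a (freq, 1) tensor with identical values.
assert indexed.shape == unsqueezed.shape == (freq, 1)
assert torch.equal(indexed, unsqueezed)
```

The trailing axis lets `phase_advance` broadcast against the time dimension of the spectrogram.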
Returns:
ret (torch.Tensor): compressed signal (..., time)
"""
ret = torchaudio.functional.apply_codec(
How about adding a warning or an exception so that this is not called with an unsupported torch version? (Possibly in a different place, because if you put the check here, it may be triggered too often.)
For now, I think I can just raise NotImplementedError for this function.
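If a version gate is added later, one option is to run it once at setup time rather than on every call. The sketch below is only an illustration: the helper names and the version bounds are placeholders, since the thread does not say which versions are affected.

```python
from typing import Tuple

Version = Tuple[int, int]


def parse_major_minor(version: str) -> Version:
    """Turn a version string such as "2.0.1+cu118" into (2, 0)."""
    release = version.split("+")[0]  # drop any local build suffix
    major, minor = release.split(".")[:2]
    return int(major), int(minor)


def ensure_supported(installed: str, low: Version, high: Version, feature: str) -> None:
    """Raise NotImplementedError if `installed` is outside [low, high].

    Hypothetical helper; the bounds are supplied by the caller.
    """
    if not (low <= parse_major_minor(installed) <= high):
        raise NotImplementedError(
            f"{feature} is not supported with version {installed}"
        )


# Placeholder bounds for illustration only; this call passes silently.
ensure_supported("2.0.1+cu118", (1, 12), (2, 0), "codec compression")
```

Calling the check from the preprocessor's constructor would surface the problem once, before any data is loaded.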
rir_path = np.random.choice(rirs)
rir = None
if rir_path is not None:
rir, _ = soundfile.read(rir_path, dtype=np.float64, always_2d=True)
rir, fs = soundfile.read(rir_path, dtype=np.float64, always_2d=True)
if tgt_fs and fs != tgt_fs:
Maybe it is better to warn or raise something here, because a sample rate mismatch may not be intended.
Done
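A minimal sketch of what such a warning could look like, assuming the check mirrors the `if tgt_fs and fs != tgt_fs:` condition in the hunk above. The function name `needs_resampling` is hypothetical, and this is not necessarily how the PR implements it.

```python
import warnings
from typing import Optional


def needs_resampling(fs: int, tgt_fs: Optional[int], rir_path: str) -> bool:
    """Warn on a sample rate mismatch and report whether to resample.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if tgt_fs and fs != tgt_fs:
        warnings.warn(
            f"Sample rate mismatch for {rir_path}: "
            f"file has {fs} Hz but {tgt_fs} Hz is expected; resampling."
        )
        return True
    return False


# Matching rates produce no warning and no resampling.
print(needs_resampling(16000, 16000, "rir.wav"))
```

Raising an exception instead would be the stricter alternative if mismatched RIR files should never be accepted silently.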
LGTM, mostly added suggestions/questions, not mandatory.
Thanks a lot!
Strangely, I can locally pass the test in
Thanks, @Emrys365!
What?
This PR adds a series of data augmentation techniques for preprocessing speech data in various tasks:
The supported data augmentation techniques include:
The data augmentation methods can be easily configured via the yaml file:
Why?
The current preprocessors are not flexible enough to support applying multiple data augmentation methods at the same time.