Add diffusion-based SE model to ESPnet-SE #5572
Conversation
for more information, see https://pre-commit.ci
from torch_complex.tensor import ComplexTensor
from typeguard import check_argument_types

from espnet2.enh.layers.complex_utils import to_complex
from espnet2.layers.inversible_interface import InversibleInterface
from espnet.nets.pytorch_backend.nets_utils import make_pad_mask

is_torch_1_10_plus = V(torch.__version__) >= V("1.10.0")
You don't need to add it, because current ESPnet has already dropped support for PyTorch versions before 1.12.1.
See line 93 of espnet2/layers/stft.py. Add this check to use native support for FFT and STFT on all CPU targets, including ARM.
Done.
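For readers following the thread, here is a minimal sketch of the kind of version guard being discussed (a simplified stand-in for illustration only; the actual ESPnet code compares `torch.__version__` with a proper version-parsing library rather than this hand-rolled tuple comparison):

```python
# Simplified sketch of a guard like `is_torch_1_10_plus`.
# The real code parses torch.__version__ with a version library;
# this stand-in uses a plain tuple comparison for illustration.

def parse_version(v: str) -> tuple:
    """Turn a string like '1.10.0+cu113' into (1, 10, 0), ignoring suffixes."""
    core = v.split("+")[0]
    parts = []
    for p in core.split(".")[:3]:
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def is_at_least(installed: str, required: str = "1.10.0") -> bool:
    """True if the installed version meets the stated minimum."""
    return parse_version(installed) >= parse_version(required)
```

For example, `is_at_least("1.12.1")` is true, so on current ESPnet (which requires PyTorch >= 1.12.1) such a 1.10 guard is always satisfied, which is consistent with the reviewer's point that the check is unnecessary.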
Co-authored-by: Wangyou Zhang <C0me_On@163.com>
Hi @sw005320, I also asked @popcornell to help review this PR.
I have some high-level comments.
It seems that a lot of lines are not tested, according to Codecov. Can you double-check it?
It may be OK, as this code is taken mostly as-is from sgmse.
The files in ncsnpp_utils are taken from the sgmse repo. Some of them are not used, and thus not tested. Should I add tests for those unused code lines, or just remove them? @sw005320
I see.
The discussion is whether we should keep the unused functions or not.
- If we will use them in the future, we can add tests.
- If not, we can either keep them as they are or remove them.
I added unit tests for the unused NCSNpp functions as much as I could.
spec_factor: float = 0.15,
spec_abs_exponent: float = 0.5,
Where do these values come from?
Can you explain them?
I think they come largely from SGMSE (https://arxiv.org/pdf/2203.17004.pdf), but for the spec factor they used 1/3 there.
If the waveform is normalized to unit standard deviation, using 1/3 as the spec factor bounds the STFT max value to about ±1.5.
I think using 0.15 instead bounds it to about ±0.75, which may actually be better.
Do you normalize the waveform std, @LiChenda?
Diffusion is super sensitive to the input min/max range.
@sw005320, these numbers come from Section V-D in [1]; I added comments to the code. @popcornell, in their journal paper [1] they use 0.15 and 0.5. I did not normalize the waveform std.
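To make the role of these two hyperparameters concrete, here is a sketch of the SGMSE-style amplitude compression, X̃ = c · |X|^a · e^{i∠X} with c = spec_factor and a = spec_abs_exponent. This reflects my reading of the transformation described in [1], written with NumPy for brevity; it is not the exact ESPnet code, and the function names are illustrative:

```python
import numpy as np

def compress_spec(spec: np.ndarray, spec_factor: float = 0.15,
                  spec_abs_exponent: float = 0.5) -> np.ndarray:
    """Amplitude-compress a complex STFT: c * |X|**a * exp(i * angle(X)).

    The exponent a < 1 flattens the heavy-tailed magnitude distribution,
    and the factor c shrinks the overall range that the diffusion model sees.
    """
    return spec_factor * np.abs(spec) ** spec_abs_exponent * np.exp(1j * np.angle(spec))

def uncompress_spec(spec: np.ndarray, spec_factor: float = 0.15,
                    spec_abs_exponent: float = 0.5) -> np.ndarray:
    """Invert the compression: (|X| / c)**(1/a) * exp(i * angle(X))."""
    return (np.abs(spec) / spec_factor) ** (1.0 / spec_abs_exponent) * np.exp(1j * np.angle(spec))
```

The phase is kept untouched; only magnitudes are remapped, so `uncompress_spec(compress_spec(X))` recovers `X` up to floating-point error.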
[1] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech Enhancement and Dereverberation With Diffusion-Based Generative Models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
Do you normalize by the `.abs().amax()` so the input is always within -1 and 1?
https://github.com/sp-uhh/sgmse/blob/c6e3291ee56b07792c9d8c7d7d49487b3042e01b/sgmse/data_module.py#L72
I added the normalization option but did not enable it by default in my model training, because I feel it is not reasonable for causal models.
espnet/espnet2/enh/diffusion_enh.py, line 113 in 12c0eb2:
normfac = speech_mix.abs().max() * 1.1 + 1e-5
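A minimal sketch of how such a normalization factor could be applied and later undone (shown with NumPy for brevity; the actual code operates on torch tensors, and the helper name here is hypothetical):

```python
import numpy as np

def normalize_mix(speech_mix: np.ndarray):
    """Scale the mixture so its peak amplitude is safely below 1.

    Hypothetical helper mirroring the quoted line: the 1.1 margin keeps the
    peak at ~0.91 rather than exactly 1, and the 1e-5 floor avoids dividing
    by zero on silent input. The factor is returned so the enhanced output
    can be rescaled back to the original level.
    """
    normfac = np.abs(speech_mix).max() * 1.1 + 1e-5
    return speech_mix / normfac, normfac
```

Note this uses a single utterance-level statistic (the global peak), which is what makes it awkward for causal/streaming models: the peak of the full signal is not known at the first frame.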
I agree
LGTM.
Hi @sw005320, I added unit tests for the unused NCSNpp functions as much as I could. Please merge it when the CI tests pass.
What?
Extend ESPnet-SE to support diffusion-based generative enhancement models.
Why?
Others