Input Feature to TRUNET #5
I think the input tensor in the sample code is a one-frame feature. If you want to feed a wav into the model, the input dimension might be (B, 4, frames, 257), but I'm not sure. Please email me (cch_amos@qq.com) if you have any insight.
@yugeshav Hey! Any progress on this? I am also confused about the input shape.
@amirpashamobinitehrani The input shape for 1D conv is: (T, C, F)
Thanks for your reply. Interesting! Yes, I had some presumptions. What still remains a mystery to me is how to inject the batch dimension into play: (Batch, Time frames, Channels (4 features), Frequency bins). Which I assume we should refrain from, right? We are simply processing 4 different features of 1 audio file in time-frame steps, so the time-frame dimension is fulfilling the batch dimension's role.
Correct! Each frame is a data sample here. If you want to use the (Batch, Time, Features, Frequency) layout, you should use a 2D convolution and set the filters' dimension to (n, 1).
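A minimal sketch of the two layouts (the layer widths are placeholders, and putting frequency on the first spatial axis is one way to realize the (n, 1) filters, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

T, C, F = 100, 4, 257  # time frames, feature channels, frequency bins

# 1D case: each frame is a data sample, so time fills the batch dimension.
x_1d = torch.randn(T, C, F)
conv1d = nn.Conv1d(C, 64, kernel_size=3, padding=1)
y_1d = conv1d(x_1d)  # -> (T, 64, F)

# 2D case with an explicit batch dimension: put frequency on the first
# spatial axis so an (n, 1) kernel convolves along frequency only and
# never mixes information across time frames.
B = 8
x_2d = torch.randn(B, C, F, T)
conv2d = nn.Conv2d(C, 64, kernel_size=(3, 1), padding=(1, 0))
y_2d = conv2d(x_2d)  # -> (B, 64, F, T)
```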
Hi, I had the same question. Has anyone been able to successfully train this network? I think that, as @atabakp mentioned, the input has to have shape (T, C, F). Thanks!
Hi Esteban, I am able to train this model.
Thanks, @atabakp! As a follow-up question: how are you obtaining the "demodulated phase"?
There are a few methods to do this, but I don't know what the authors exactly mean; for example: https://arxiv.org/pdf/1608.01953.pdf. But for my training, I used log magnitude and normalized real/imag as inputs.
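A rough sketch of those input features (the epsilon and the exact normalization are my own choices, not from the paper):

```python
import torch

eps = 1e-8
n_fft, hop = 512, 128
waveform = torch.randn(16000)  # placeholder: 1 s of audio at 16 kHz

spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft),
                  return_complex=True)  # (257, frames)

magnitude = spec.abs()
log_magnitude = torch.log(magnitude + eps)
# Dividing by the magnitude leaves only phase information,
# i.e. cos(phi) and sin(phi).
real_norm = spec.real / (magnitude + eps)
imag_norm = spec.imag / (magnitude + eps)

features = torch.stack([log_magnitude, real_norm, imag_norm])  # (3, 257, frames)
```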
I managed to implement the demodulated phase, using (log_magnitude, demod_real, demod_imag) as inputs to train the model. For some reason, I am not witnessing the model doing anything useful. It would be nice to get some insights regarding the implementation if anyone has made promising progress on this!
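For reference, my implementation followed the idea of removing the per-hop linear phase advance, which is how I read the references above (the exact convention may differ from what the authors did):

```python
import torch

n_fft, hop = 512, 128
waveform = torch.randn(16000)  # placeholder signal
spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)

n_freq, n_frames = spec.shape
phase = torch.angle(spec)

# Bin k advances by 2*pi*k*hop/n_fft per hop; subtracting its accumulation
# over frames removes this "carrier" and leaves the demodulated phase.
k = torch.arange(n_freq).unsqueeze(1)    # (freq, 1)
n = torch.arange(n_frames).unsqueeze(0)  # (1, time)
demod_phase = phase - 2.0 * torch.pi * k * hop * n / n_fft

demod_real = torch.cos(demod_phase)
demod_imag = torch.sin(demod_phase)
```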
Thanks once again @atabakp!
One last question: how are you using the outputs, @atabakp? I think it has 5 channels initially, but there is no explicit mention of what they exactly are. I was assuming two of them are magnitude masks (target and residual), two others are phase terms, and the last one is used to estimate the phase's sign, but I was not sure.
Yes, you are right.
Thanks a lot once again, @atabakp! I'll report back my progress as I manage to allocate time to work on it.
Section 3 of this paper also has some information about phase demodulation: https://www.isca-speech.org/archive_v0/Interspeech_2018/pdfs/1773.pdf
Hi again @atabakp, how are you getting the 10 channels? I looked again into the model's code and I'm getting only 5 channels. Here I'm attaching the I/O of each layer:
I also have a question about the TGRU along the same lines. According to the paper, it should aggregate information along the time axis, but the input this layer receives here doesn't seem consistent with that. To my understanding (please correct me if I'm wrong), the channel count doesn't add up either.
I'll answer this one myself. The paper's config listing for the decoder gives the layer dimensions, where the last number is the number of channels, therefore you're right: they should be 10 instead.
Hi @atabakp, not sure if my interpretation of the outputs is correct, but I'm trying to follow the paper, and even when the model trains, it may become unstable after some epochs. Here is my interpretation in code:

```python
import torch
import torch.nn.functional as F

# Control the random seed
torch.manual_seed(0)

# Let's assume the output has shape (1, 5, 257) (the expected output for a
# single source). Since the activation function is ReLU, values can be
# equal to or greater than 0.
x_features = torch.rand((1, 5, 257), dtype=torch.float32)

# Extract z_tf for target and residual
z_tf = x_features[:, 0:1, :]
z_tf_residual = x_features[:, 1:2, :]

# Extract phi
phi = x_features[:, 2:3, :]

# Estimate beta (due to the softplus it will be one or greater)
beta = 1.0 + F.softplus(phi)

# Estimate the sigmoids of target and residual
sigmoid_tf = torch.sigmoid(z_tf - z_tf_residual)
sigmoid_tf_residual = torch.sigmoid(z_tf_residual - z_tf)

# Estimate the upper bound for beta
beta_upper_bound = 1.0 / torch.abs(sigmoid_tf - sigmoid_tf_residual)

# Because of the absolute value in the denominator, the same upper bound
# can be applied to both betas
beta = torch.clip(beta, max=beta_upper_bound)

# Compute both target and residual masks using eq. (1)
mask_tf = beta * sigmoid_tf
mask_tf_residual = beta * sigmoid_tf_residual

# Now that we have both masks, compute the triangle cosine law
cos_phase = (
    (1.0 + mask_tf.square() - mask_tf_residual.square())
    / (2.0 * mask_tf))

# Use the trigonometric identity to obtain the sine
sin_phase = torch.sqrt(1.0 - cos_phase.square())

# Now estimate the sign
q0 = x_features[:, 3:4, :]
q1 = x_features[:, 4:5, :]
gamma0 = F.gumbel_softmax(q0, tau=1.0)
gamma1 = F.gumbel_softmax(q1, tau=1.0)
sign = torch.where(gamma0 > gamma1, -1.0, 1.0)

# Finally, estimate the complex mask
complex_mask = mask_tf * (cos_phase + sign * 1j * sin_phase)

# Then it should be applied to the STFT and inverted using the iSTFT
...
```
1- I didn't apply ReLU to the last layer (x_features).
5- Estimating the sign is a bit confusing, and I believe there is a typo in the formula in the paper (I believe the sign does not matter much for the performance): `sign = torch.where(gamma_0 > gamma_1, -1.0, 1.0)`
Thank you very much @atabakp!
Hi again @atabakp, when training the model, are you using 2 s audio as the paper claims, or are you using gradient accumulation or something like that to pass more data between steps? I'm currently trying to train the model for dereverberation only, but 2 s per audio in all cases is very slow to train. So far I haven't reached the point of evaluating how successful the model is at the task, but it doesn't seem to be learning quickly.
I am using random-length sequences, a single sequence per iteration (batch size = 1).
Sorry for necroposting here, but I'm trying to train this model with no luck yet. I managed to add trainable PCEN (as described in the paper) and to train on spectrograms. I construct the input feature from PCEN (the output of the trainable layer), log magnitude, and the real and imag parts of the STFT, and feed it to the rest of the model described here. I also implemented 2D convs since I wanted to train on batches. The losses are the same as in the paper: multires cosine similarity + multires spectrum MSE.
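In case it helps anyone, my trainable PCEN is roughly along these lines (a sketch of the standard PCEN recursion with learnable alpha/delta/r; the init values and the fixed smoothing coefficient are my own choices):

```python
import torch
import torch.nn as nn

class TrainablePCEN(nn.Module):
    """Per-channel energy normalization with learnable alpha, delta, r.

    Follows the common PCEN formulation; the smoothing coefficient s is
    fixed here, though it could be learned as well.
    """

    def __init__(self, n_freq: int, s: float = 0.025, eps: float = 1e-6):
        super().__init__()
        self.s, self.eps = s, eps
        self.alpha = nn.Parameter(torch.full((n_freq, 1), 0.98))
        self.delta = nn.Parameter(torch.full((n_freq, 1), 2.0))
        self.r = nn.Parameter(torch.full((n_freq, 1), 0.5))

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: magnitude (or energy) spectrogram of shape (freq, time)
        m, smoothed = mag[:, 0], []
        for t in range(mag.shape[1]):
            m = (1.0 - self.s) * m + self.s * mag[:, t]
            smoothed.append(m)
        m = torch.stack(smoothed, dim=1)  # IIR-smoothed spectrogram
        return (mag / (self.eps + m) ** self.alpha + self.delta) ** self.r \
            - self.delta ** self.r

# Usage: pcen = TrainablePCEN(257); pcen_feature = pcen(spec.abs())
```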
Hi @JBloodless, in my case I am using Conv1d, but I decided to swap the original losses for a GAN since they weren't quite working for me (the model was converging, but not with the expected quality, and sometimes exploding after that).
Can you elaborate a bit on what you mean by a GAN loss?
Check section 2 of this paper: https://arxiv.org/pdf/2010.10677.pdf
@eagomez2 I'm assuming that you're using your code above to calculate the complex mask and then multiplying it with the complex input to produce the output waveform, is this correct?
Yes. You will need to repeat the process to obtain both the direct speech waveform and the residual waveform.
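Roughly like this, reusing the masks from the snippet earlier in the thread (random placeholders here instead of real network outputs, just to keep the sketch runnable):

```python
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
noisy = torch.randn(16000)  # placeholder noisy/reverberant waveform
noisy_spec = torch.stft(noisy, n_fft=n_fft, hop_length=hop,
                        window=window, return_complex=True)

# In practice these come from the PHM computation above.
complex_mask = torch.randn_like(noisy_spec)
complex_mask_residual = torch.randn_like(noisy_spec)

direct_wav = torch.istft(noisy_spec * complex_mask, n_fft=n_fft,
                         hop_length=hop, window=window)
residual_wav = torch.istft(noisy_spec * complex_mask_residual, n_fft=n_fft,
                           hop_length=hop, window=window)
```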
Seems fine to me. Maybe the problem really is the PHM computation. For now I settled with the following: since you mentioned that order doesn't matter, I assumed that in the second pair the non-noise component comes first, so I'm directly calculating the masks for the direct path and the non-noise signal, and then obtaining the reverberation mask as in fig. 2 of the paper.
So is it working now, @JBloodless? I double-checked the code I used and it is very similar to my initial post. The only changes I see are that I got rid of the randomness of the gumbel softmax and simply replaced it with a softmax with temperature, plus some minor additions. All in all, I'm inclined to think that even though the sign prediction math in the paper makes sense, in practice it is not as crucial for the network's performance.
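For completeness, the deterministic sign I ended up with looks roughly like this (the temperature value is arbitrary):

```python
import torch
import torch.nn.functional as F

q0 = torch.randn(1, 1, 257)  # placeholder network outputs
q1 = torch.randn(1, 1, 257)

tau = 1.0
# A temperature softmax over the two logits replaces gumbel_softmax:
# it keeps the soft competition but drops the sampling noise.
gamma = F.softmax(torch.cat([q0, q1], dim=1) / tau, dim=1)
sign = torch.where(gamma[:, 0:1, :] > gamma[:, 1:2, :], -1.0, 1.0)
```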
I totally agree; even the PHM is not very crucial, the network can directly output the clean speech mask.
@atabakp thanks for bringing this up. Have you tried training without the PHM? I was also curious about doing this, but I haven't found the time so far.
Nope, still doesn't work. The only thing that "worked" is skipping the PHM and multiplying one channel of the last output with the input, but I haven't waited for it to converge yet. I'll try these fixes, thanks. One more question for @atabakp: I mentioned that my losses were wrong. By that I meant that in the paper the loss is the sum of the losses for the direct source, the noise, and the reverberant path (last equation of section 3.3). How did you calculate them: did you use this sum of 3 or something else? Because I don't see a good way to compute a target for the reverberant path, I just subtract the clean signal from the reverbed signal, and use a tensor of 1e-6 as the noise target (since I only train for dereverberation).
I tried different variations, but I found that only using the loss on the direct path is good enough.
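A sketch of what a direct-path-only loss could look like (the FFT sizes, the single-resolution cosine term, and the equal weighting are my assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def direct_path_loss(estimate: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine similarity + multi-resolution magnitude MSE on the direct path."""
    # 1 - cosine similarity on the waveforms, so 0 means a perfect match.
    loss = 1.0 - F.cosine_similarity(estimate, target, dim=-1).mean()
    # Magnitude-spectrum MSE at several resolutions.
    for n_fft in (256, 512, 1024):
        window = torch.hann_window(n_fft, device=estimate.device)
        est_mag = torch.stft(estimate, n_fft, hop_length=n_fft // 4,
                             window=window, return_complex=True).abs()
        tgt_mag = torch.stft(target, n_fft, hop_length=n_fft // 4,
                             window=window, return_complex=True).abs()
        loss = loss + F.mse_loss(est_mag, tgt_mag)
    return loss
```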
Yes, I tried; I ended up using a single channel for masking the magnitude.
I've managed to train the model without the PHM and with a single loss on direct (same as the paper: multires cosine similarity + multires spectrum MSE). The model converges, but the result is strange. Maybe it's because of the PCEN feature (my implementation may not be ideal), but the voice harmonics in the spectrum seem to be "dereverbed", so I'll try to locate the source of the noisiness.
@atabakp did you extract the magnitude of the input spectrum as torch.abs() and not as torch.real?
I calculated it for both direct and residual, but as mentioned by @atabakp, I'm inclined to think that direct alone should be good enough, although I haven't tried this.
Hi @JBloodless, in my case at least I didn't observe such problems using the GAN setup previously described.
In the process, I discard the DC bin, substituting it with zeros. This situation may also arise from the use of padding in transposed convolutions, leading to a constant output value for the low-frequency bins.
Do you discard it in the input only, or also in the target?
I discard it in the input and replace it with zero in the output; it means the model is not predicting the mask for the DC bin.
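In code, something along these lines:

```python
import torch

def zero_dc(spec: torch.Tensor) -> torch.Tensor:
    """Zero out the DC (lowest-frequency) bin.

    Assumes frequency is the first dimension, e.g. (freq, time).
    """
    out = spec.clone()
    out[0] = 0.0
    return out

# Apply to the network input and again to the predicted mask/output,
# so the model never has to predict the DC bin.
```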
In my case the model stopped dereverbing the lower frequencies at all, and overall it sounds like it's not dereverbed at all (psychoacoustics and all). Am I getting it correctly that you zeroed out the lowest bin in the input and the model output, but not in the target (for the loss calculation)? And if so, which n_fft did you use? For context: the first is the reverbed input, the second is the output of the model without zeros in DC, and the last is the output of the model with zeros in DC.
@JBloodless yes, I've trained it at 48 kHz.
Can you share what modifications to the architecture you made? For now I've just made the convolutions and the GRU bigger, but it seems that's not enough.
I didn't make any changes. I just disabled any resampling algorithms (my data was originally at 48 kHz) and trained it normally.
You mentioned the paper on GAN losses that you use. Did you use the same setup as in that paper (discriminator on the waveform, and adversarial + reconstruction losses)?
@JBloodless yes, it's the same setup.
Hi @atabakp, you mean we need to modify the network, right? The shape of the TGRU's input (x9) is (Time, 16, 64). Since it should aggregate the information along the time axis, and batch_first=True in your implementation, "Time" should be the second dimension of the TGRU's input. I just add a transpose before the GRU (see the sketch below). Then the GRUBlock's input shape is (16, T, 64) and its output shape is (16, 64, T). Next, FirstTrCNN's output shape will be (16, 64, 2(T-1)+1). @atabakp @JBloodless @eagomez2 All responses are welcome.
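A minimal sketch of what I mean, with the shapes from above (the permute is my own addition, not from the original repo):

```python
import torch
import torch.nn as nn

T = 100
x9 = torch.randn(T, 16, 64)  # TGRU input: (Time, channels, features)

# Move time to the second dimension so a batch_first GRU aggregates along
# the time axis, with the 16 channels acting as the batch.
x = x9.permute(1, 0, 2)  # (16, T, 64)
tgru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
y, _ = tgru(x)           # (16, T, 64)
y = y.transpose(1, 2)    # (16, 64, T), ready for the following 1D TrCNN
```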
@caihaoran-00 I modified the whole model to work with a separate batch dimension, so my answer is probably not that relevant, but I didn't do anything out of the ordinary with the TrCNNs; they already have all the necessary logic for such a concat (mostly paddings).
OK, thank you very much. I'm not sure how exactly to synthesize a reverberated signal that has a similar amplitude to the clean signal (maybe this is causing the strange problems in my network training). @JBloodless
The convolved signal doesn't need to be normalized again; changes in amplitude are technically part of the reverberation. Just generate an IR and convolve it with the clean input.
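i.e. something like this (the decaying-noise IR is just a stand-in for a measured or simulated RIR):

```python
import torch

sr = 48000
clean = torch.randn(sr * 2)  # placeholder 2 s clean signal

# Stand-in impulse response: exponentially decaying noise.
rir_len = sr // 2
rir = torch.randn(rir_len) * torch.exp(-torch.arange(rir_len) / (0.1 * sr))

# Full convolution via conv1d (flip the kernel: conv1d is cross-correlation),
# then trim to the clean signal's length. No re-normalization afterwards:
# the level change is part of the reverberation.
reverbed = torch.nn.functional.conv1d(
    clean.view(1, 1, -1),
    rir.flip(0).view(1, 1, -1),
    padding=rir_len - 1,
).view(-1)[: clean.numel()]
```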
OK, do you use n or (n, 1) for your 2D kernel?
Hi
As per the paper, 4 features must be concatenated as the input to TRUNET,
so the input will become (Batchsize, 4 features, No. of frames in STFT, No. of STFT bins), i.e. a 4-dimensional tensor.
But in the sample code you are showing the input as 3-dimensional (1, 4, 257), since the first layer is conv1d.
I'm confused whether the input to TRUNET is 3-dimensional or 4-dimensional.
Regards
Yugesh