In the network architecture specification at https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/seq2seq_tds/librispeech/network.arch, what does a padding of -1 mean for the C2 layers?
Also, I don't understand the expected shape of the speech features fed to this network. Say I have a batch of size "b" of MFCC features with dimension "d" and "t" time steps. How is the tensor of shape b x t x d viewed as the input to the first convolutional layer? I also don't understand why 2D convolutions are needed here. Sorry, it was not clear from the paper.
The first layer of the encoder, `V -1 NFEAT 1 0`, reshapes your input to (t, d, 1, b), which is the input shape required by the second Conv2D layer (0 means keep the original dimension, -1 means infer the dimension from the total size). We recommend feeding (t, d, 1, b) directly as the input tensor shape, even though the first layer allows some flexibility.
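A minimal sketch of the reshaping described above, using NumPy purely for illustration (the actual pipeline is wav2letter's C++ code, and the sizes below are hypothetical): it rearranges a batch of MFCC features of shape (b, t, d) into the (t, d, 1, b) layout expected by the Conv2D layers.

```python
import numpy as np

# Hypothetical sizes for illustration only.
b, t, d = 4, 200, 80                   # batch, time steps, MFCC dimension

features = np.random.randn(b, t, d).astype(np.float32)   # (b, t, d)

# Move to (t, d, b), then add a singleton channel axis to get (t, d, 1, b),
# the layout the second Conv2D layer of the encoder expects.
x = np.transpose(features, (1, 2, 0))  # (t, d, b)
x = np.expand_dims(x, axis=2)          # (t, d, 1, b)

assert x.shape == (t, d, 1, b)
print(x.shape)                         # (200, 80, 1, 4)
```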