In the network architecture specification at https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/seq2seq_tds/librispeech/network.arch, what does a padding of -1 mean for the C2 layers?
Also, I don't understand the expected shape of the speech features fed to this network. Say I have a batch of size "b" of MFCC features with dimension "d" and "t" time steps. How is the tensor of shape b x t x d viewed as the input to the first convolutional layer? I also don't understand why 2D convolutions are needed here. Sorry, it was not clear from the paper.
The first layer of the encoder, `V -1 NFEAT 1 0`, reshapes your input to (t, d, 1, b), which is the input shape required by the second Conv2D layer (0 means keep the original dimension, -1 means infer the dimension from the total size). We recommend feeding (t, d, 1, b) directly as the input tensor shape, even though the first layer allows some flexibility.
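A minimal sketch of the reshaping described above, using NumPy purely for illustration (the actual pipeline is wav2letter's C++ code, and the sizes below are hypothetical): it rearranges a batch of MFCC features of shape (b, t, d) into the (t, d, 1, b) layout expected by the Conv2D layers.

```python
import numpy as np

# Hypothetical sizes for illustration only.
b, t, d = 4, 200, 80                   # batch, time steps, MFCC dimension

features = np.random.randn(b, t, d).astype(np.float32)   # (b, t, d)

# Move to (t, d, b), then add a singleton channel axis to get (t, d, 1, b),
# the layout the second Conv2D layer of the encoder expects.
x = np.transpose(features, (1, 2, 0))  # (t, d, b)
x = np.expand_dims(x, axis=2)          # (t, d, 1, b)

assert x.shape == (t, d, 1, b)
print(x.shape)                         # (200, 80, 1, 4)
```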