decoding in the Transformer-based model #119
Hi @lovodkin93, I think there are two issues to keep in mind:
Best,
Amazing, thank you!
In fact, I will explain what my purpose is, as the current time step might be redundant: or rather, would it be generated thrice: 1,0,0 1,0,0? Thanks!
attn_weights will only be generated once per attention head and timestep. If you want to propagate this to the loss, you'll need to properly pack this information (a base Transformer has 48 attention heads), and you can look at the layer_memories for inspiration (they're currently only used for caching at inference time). (@demelin is the original author of that code, and might help with more detailed questions.)
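A rough sketch of what "packing" the per-layer, per-head weights for a custom loss term might look like. This is purely illustrative and not Nematus code: the `per_layer_weights` list, the stacking, and the uniform reference distribution are all assumptions for the example (the 48 heads of a base Transformer come from 6 layers x 8 heads).

```python
import numpy as np

num_layers, batch_size, num_heads, T = 6, 2, 8, 16

# Suppose each decoder layer exposes its attention weights as a
# [batch, heads, time, time] tensor, analogous in spirit to what
# layer_memories caches at inference time (hypothetical collection).
per_layer_weights = [np.random.rand(batch_size, num_heads, T, T)
                     for _ in range(num_layers)]

# Stack them into one tensor so a loss term can see all
# 6 x 8 = 48 attention heads at once.
packed = np.stack(per_layer_weights, axis=0)  # [layers, batch, heads, time, time]

# Example auxiliary loss: penalize deviation from some reference
# distribution (here just a uniform placeholder).
reference = np.full_like(packed, 1.0 / T)
attention_loss = np.mean((packed - reference) ** 2)
print(packed.shape, attention_loss)
```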
@rsennrich So let me make sure I understand clearly:
@rsennrich In other words: are all the words of a sentence processed in the same single timestep, or is each processed in a separate timestep?
I use timestep to refer to the position in the sentence (it could be a word, subword, or character). So I'm saying that attn_weights will only be generated once per attention head and subword, and that all subwords are processed in parallel as a single tensor (for example, the key tensor has the shape [batch_size, time_steps, num_heads, num_features]).
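A minimal NumPy sketch of this parallelism (illustrative shapes and names only, not the actual implementation): queries and keys for every subword live in one tensor, so the attention weights for all positions come out of a single batched matrix product rather than a per-timestep loop.

```python
import numpy as np

batch_size, time_steps, num_heads, num_features = 2, 16, 8, 64

# One tensor holds the queries/keys for every subword in the batch.
queries = np.random.randn(batch_size, time_steps, num_heads, num_features)
keys = np.random.randn(batch_size, time_steps, num_heads, num_features)

# Rearrange to [batch, heads, time, features] and take scaled dot products
# for all query/key pairs at once.
q = queries.transpose(0, 2, 1, 3)
k = keys.transpose(0, 2, 1, 3)
logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(num_features)

# Softmax over the last axis gives one weight per (query, key) pair.
attn_weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn_weights /= attn_weights.sum(axis=-1, keepdims=True)

print(attn_weights.shape)  # (2, 8, 16, 16): [batch, heads, time, time]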
@rsennrich @demelin In fact, I have also printed attn_weights and its size, and looked at the output. The size of attn_weights was [2, 8, 16, 16] (namely a batch of 2 sentences, the longer of which has length 16, so in fact there was a 16x16 matrix for each of the sentences and for each of the 8 heads). I also printed the matrix of the first sentence for one of those heads, and got the following output (each row contains 16 elements):
So to ask a more practical question: for every timestep, is this whole matrix generated, so that for a sentence of length 16, 16 matrices of size 16x16 will be generated over 16 timesteps, per head (one matrix per timestep)? Or alternatively, is just one row of the matrix generated per timestep, with all rows concatenated at the end (so for a sentence of length 16, 16 rows of size 1x16 are generated, one row per timestep, rather than 16 matrices of size 16x16)?
Hey there,
During training, all attention weights are computed in parallel for all time-steps (i.e. encoder self-attention, decoder self-attention, and decoder-to-encoder attention).
During inference, encoder self-attention is computed in parallel for all time-steps, as we have access to the full source sentence. Decoder self-attention is re-computed for all target-side tokens each time we expand the generated sequence, so as to update the latent representations at positions 0, ..., t-1 with the information contained in the token t. Decoder-to-encoder attention, on the other hand, is only computed between the target token at the current position, i.e. t, and all hidden states in the final layer of the encoder. The result is concatenated with the previous time-step-specific attention weights, which are cached in layer_memories to avoid redundant computation.
To answer your question regarding the attention matrix in your example: during inference, at t == 1 the attention matrix has size 1x1, at t == 2 it has size 2x2, ..., and at t == 16 it has size 16x16. During training, it's 16x16 right away, and we apply a mask to prevent the model from attending to future positions, as the decoder is given the full (shifted) target sequence as input. This enables the efficient, non-recurrent training process.
Hope that helps. All the best,
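A small NumPy sketch of the two regimes described above (illustrative only, not the Nematus code): during training one 16x16 weight matrix is produced at once and a causal mask zeroes out attention to future positions, while during inference the decoder self-attention matrix grows from 1x1 up to 16x16 as tokens are generated.

```python
import numpy as np

T = 16
logits = np.random.randn(T, T)  # decoder self-attention logits for one head

def masked_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Training: one 16x16 matrix, with future positions masked out so that
# position t can only attend to positions 0..t.
causal_mask = np.triu(np.ones((T, T)), k=1) * -1e9
train_weights = masked_softmax(logits + causal_mask)
print(train_weights.shape)                             # (16, 16)
print(np.allclose(np.triu(train_weights, k=1), 0.0))   # True: no attention to the future

# Inference: at step t the self-attention matrix only covers the tokens
# generated so far, i.e. 1x1, then 2x2, ..., up to 16x16.
for t in range(1, T + 1):
    step_weights = masked_softmax(logits[:t, :t] + causal_mask[:t, :t])
    # step_weights has shape (t, t)
```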
@demelin
Yes, the training step produces a single attention matrix (or, more accurately, a 4D attention tensor) per mini-batch per attention type, i.e. one for encoder self-attention, one for decoder self-attention, and one for decoder-to-encoder attention.
Yep :). 16 matrices would be way too many. Also, the training is not sequential (e.g. as in RNNs), but parallel (e.g. as in FFNs).
@demelin
Hello,
I was going through your Transformer-based model, and something didn't seem to add up; I would appreciate your help with clarifying it:
In transformer.py line 366, the whole output sentence is passed as input to the self_attention layer (which I also double-checked by printing), whereas from what I know, the decoder should receive, at each step, only the last word it has computed.
So I went deeper into the code, and got to the loss function in the end (at transformer.py line 86), where again it appears the loss is computed based on the whole sentence, rather than just the currently computed word (which, again, I printed to double-check).
So my question is: does your model generate the target-language sentence sequentially? If so, where exactly does the sequential decoding come into play? And lastly, if there is indeed sequential decoding, is there a parameter of the transformer that indicates which word it is currently decoding (namely, first word, second word, third word, etc.)?
Thanks!