
decoding in the Transformer-based model #119

Closed
lovodkin93 opened this issue Feb 8, 2021 · 12 comments

@lovodkin93

lovodkin93 commented Feb 8, 2021

Hello,
I was going through your Transformer-based model, and something didn't seem to add up; I would appreciate your help in clarifying it:
In transformer.py line 366, the whole output sentence is passed as input to the self-attention layer (which I also double-checked by printing), whereas, from what I know, the decoder should only receive, at each step, the last word it has computed.
So I went deeper into the code and got to the loss function at the end (at transformer.py line 86), where again it appears the loss is computed from the whole sentence rather than just the currently computed word (which, again, I printed to double-check).
So my questions are: does your model generate the target-language sentence sequentially? If so, where exactly does the sequential decoding come into play? And lastly, if there is indeed sequential decoding, is there a parameter of the Transformer that indicates which word is currently being decoded (namely, first word, second word, third word, etc.)?
Thanks!

@rsennrich
Collaborator

Hi @lovodkin93,

I think there are two issues to keep in mind:

  • both functions you point to are active at training time. During training, all time steps are processed in parallel (the fact that this is possible with Transformers is a big advantage over RNNs, where you need to process time steps sequentially).
  • even during inference, the Transformer needs access to all decoder states from previous time steps for self-attention (unlike RNNs, where access to the hidden state(s) from the last time step is enough). You can find the inference code here. Yes, this function has a variable `current_time_step` that indicates which word is currently being processed.

best,
Rico
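
To make the contrast concrete, here is a minimal sketch (not the Nematus code itself) of the two regimes Rico describes. `decoder_fn` is a hypothetical stand-in for the decoder stack that returns logits of shape [batch, length, vocab]; the real inference loop lives in transformer_inference.py.

```python
import numpy as np

def greedy_decode(decoder_fn, source, bos_id=1, eos_id=2, max_len=50):
    """Sequential inference: one token per step, tracked by current_time_step."""
    output = [bos_id]
    for current_time_step in range(1, max_len + 1):
        # The whole prefix is passed in: self-attention at step t needs the
        # decoder states of all previous positions, not just the last one.
        logits = decoder_fn(source, np.array([output]))  # [1, current_time_step, vocab]
        next_token = int(logits[0, -1].argmax())         # only the newest position is read
        output.append(next_token)
        if next_token == eos_id:
            break
    return output

# At training time there is no such loop: the decoder is called once on the full
# (shifted) target sequence, and a causal mask enforces "no attention to the future".
```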

@lovodkin93
Author

amazing, thank you!
I do have a follow-up question: I have searched for the current_time_step variable, but I see it appears only in the transformer_inference.py code. I actually need it in the training process rather than the inference process, so I was wondering whether there is something similar in the transformer.py code? Namely, is there a way to know which word is being processed by the decoder at a given training step (of course, it is all done simultaneously, but I mean within a given thread at a given time step)?

@lovodkin93
Author

lovodkin93 commented Feb 9, 2021

In fact, let me explain my purpose, since the current time step might turn out to be redundant:
I am trying to propagate the attn_weights of the decoder layers' self-attention layer from transformer_attention_modules.py line 168 all the way to transformer.py line 75 (by adding the appropriate outputs to all the relevant functions along the way). My goal is to incorporate a certain manipulation of the attn_weights into the loss function. The thing is, I don't want redundancy in the loss, so for each word in the target sentence I want just the row in attn_weights relevant to it (namely, for the word generated at time step i, I want just the row containing that word's attention weights over the first i-1 words). So my question is: is the generated triangular matrix attn_weights computed once per sentence, or as many times as there are words in the sentence?
For example, for the generated target sentence "I eat pizza", will attn_weights be generated once (namely, for each word the corresponding row will be computed once, and since it is all done simultaneously the matrix will be generated once):
1,0,0
x, 1-x,0
x, y, 1-x-y

or rather would it be generated thrice:
1,0,0
nonsense, nonsense, nonsense
nonsense, nonsense, nonsense

1,0,0
x, 1-x,0
nonsense, nonsense, nonsense

1,0,0
x, 1-x,0
x, y, 1-x-y

Thanks!

@rsennrich
Collaborator

attn_weights will only be generated once per attention head and timestep. If you want to propagate this to the loss, you'll need to properly pack this information (a base Transformer has 48 attention heads), and you can look at the layer_memories for inspiration (they're currently only used for caching at inference time). @demelin is the original author of that code and might help with more detailed questions.
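
For what the "packing" could look like in practice, here is a hedged sketch under the assumption that each decoder layer exposes its self-attention weights as a [batch, heads, target_len, target_len] array; the names `pack_attention` and `attention_row` are illustrative and not taken from the Nematus code.

```python
import numpy as np

def pack_attention(attn_per_layer):
    """Stack per-layer weights [batch, heads, T, T] into one [batch, layers*heads, T, T] tensor."""
    return np.concatenate(attn_per_layer, axis=1)

def attention_row(attn, position):
    """Attention distribution used for the token at `position`:
    shape [batch, layers*heads, position + 1] (later entries are masked out anyway)."""
    return attn[:, :, position, :position + 1]

# Toy check with a base-Transformer layout: 6 decoder layers x 8 heads = 48 heads.
attn = pack_attention([np.random.rand(2, 8, 16, 16) for _ in range(6)])
print(attn.shape)                    # (2, 48, 16, 16)
print(attention_row(attn, 3).shape)  # (2, 48, 4)
```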

@lovodkin93
Author

@rsennrich So let me make sure I understand clearly:
given the sentence "I eat pizza", all the words of the sentence ("I", "eat", "pizza") will be processed in the same timestep (as they are processed in parallel), so attn_weights will be computed once for the whole sentence for this timestep, right?

@lovodkin93
Author

@rsennrich In other words: are all the words of a sentence processed in the same single timestep, or is each processed in a separate timestep?

@rsennrich
Collaborator

I use timestep to refer to the position in the sentence (it could be a word, subword, or character). So I'm saying that attn_weights will only be generated once per attention head and subword, and that all subwords are processed in parallel as a single tensor (for example, the key tensor has the shape [batch_size, time_steps, num_heads, num_features]).
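
As a toy shape check (NumPy only, illustrative rather than the actual code), scoring all positions at once produces exactly one matrix per sentence and per head:

```python
import numpy as np

batch_size, time_steps, num_heads, num_features = 2, 16, 8, 64
queries = np.random.randn(batch_size, time_steps, num_heads, num_features)
keys    = np.random.randn(batch_size, time_steps, num_heads, num_features)

# scores[b, h, i, j] = <query at position i, key at position j> / sqrt(num_features)
scores = np.einsum('bihf,bjhf->bhij', queries, keys) / np.sqrt(num_features)
print(scores.shape)  # (2, 8, 16, 16): computed in a single pass, no per-word loop
```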

@lovodkin93
Author

lovodkin93 commented Feb 10, 2021

@rsennrich @demelin
From what I know, there are two attention layers in the decoder: the cross-attention layer, which receives as input the output of the encoder block (and whose implementation in the code is quite clear), and the self-attention layer, which receives all the words generated up to the current timestep. So what I don't seem to understand is: how can the word at timestep t be computed without knowledge of the first t-1 words (which are being computed in parallel to it and are not necessarily known at timestep t)?

In fact, I have also printed attn_weights and its size and looked at the output. The size of attn_weights was [2, 8, 16, 16] (namely a batch of 2 sentences, the longer of which has length 16, so in fact there was a 16x16 matrix for each of the sentences and for each of the 8 heads). I also printed the matrix of the first sentence for one of those heads, and got the following output (each row contains 16 elements):
[screenshot of the printed 16x16 attention matrix]

So to ask a more practical question: for every timestep, is this whole matrix generated, so that for a sentence of length 16, 16 matrices of size 16x16 will be generated over 16 timesteps, per head (one matrix per timestep)? Or alternatively, for every timestep, is just one of the rows of the matrix generated, with all of them concatenated at the end (so that for a sentence of length 16, 16 rows of size 1x16 are generated, one row per timestep, and not 16 matrices of size 16x16)?

@demelin

demelin commented Feb 10, 2021

Hey there,

During training, all attention weights are computed in parallel for all time-steps (i.e. encoder self-attention, decoder self-attention, and decoder-to-encoder attention). During inference, encoder self-attention is computed in parallel for all time-steps, as we have access to the full source sentence. Decoder self-attention is re-computed for all target-side tokens each time we expand the generated sequence, so as to update the latent representations at positions 0, ..., t-1 with the information contained in the token t. Decoder-to-encoder attention, on the other hand, is only computed between the target token at the current position, i.e. t, and all hidden states in the final layer of the encoder. The result is concatenated with previous time-step-specific attention weights which are cached in layer_memories to avoid redundant computation.

To answer your question regarding the attention matrix in your example: during inference at t == 1, the attention matrix has size 1x1, at t == 2 it has size 2x2, ..., at t == 16 it has size 16x16. During training, it's 16x16 right away and we apply a mask to prevent the model from attending to future positions, as the decoder is given the full (shifted) target sequence as input. This enables the efficient, non-recurrent training process. Hope that helps.

All the best,
Denis
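
A minimal NumPy sketch of the training-time masking Denis describes (illustrative only; the real masking happens inside the attention modules):

```python
import numpy as np

def causal_softmax(scores):
    """Mask out future positions in [batch, heads, T, T] scores, then softmax over the last axis."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -1e9, scores)             # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

attn = causal_softmax(np.random.randn(2, 8, 16, 16))
print(attn.shape)                                  # (2, 8, 16, 16), produced once, in parallel
print(np.allclose(np.triu(attn[0, 0], k=1), 0.0))  # True: no weight on future positions
```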

@lovodkin93
Author

lovodkin93 commented Feb 10, 2021

@demelin
ok, I think I understand.
So just to make sure I understand:
For the example I included, what is done in the inference phase in 16 sequential steps, yielding 16 matrices (1x1, 2x2, ...), is equivalent to a single sequential step in the training, yielding a single matrix of size 16x16 (and not 16 matrices of size 16x16)?
My main concern is the number of matrices yielded during a single forward pass of all the words in a sentence in the training phase.

@demelin

demelin commented Feb 10, 2021

Yes, the training step produces a single attention matrix (or, more accurately, a 4D attention tensor) per mini-batch per attention type, i.e. one for encoder self-attention, one for decoder self-attention, and one for decoder-to-encoder attention.

what is done in the inference phase in 16 sequential steps yielding 16 matrices (1X1,2X2,...), is equivalent to a single sequential step in the training yielding a single matrix of size 16X16 (and not 16 matrices of size 16X16)

Yep :). 16 matrices would be way too many. Also, the training is not sequential (e.g. as in RNNs), but parallel (e.g. as in FFNs).
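
In terms of shapes, one way to picture what a single training step yields per attention type (shapes are illustrative, assuming a source length of 20 and a target length of 16; per-layer tensors would each carry the per-layer head dimension shown here):

```python
import numpy as np

B, H, S, T = 2, 8, 20, 16  # batch, heads (per layer), source length, target length

# One attention tensor per attention type per mini-batch:
attention_per_type = {
    "encoder_self_attention": np.zeros((B, H, S, S)),
    "decoder_self_attention": np.zeros((B, H, T, T)),  # causally masked
    "decoder_to_encoder":     np.zeros((B, H, T, S)),
}
for name, tensor in attention_per_type.items():
    print(name, tensor.shape)
```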

@lovodkin93
Author

@demelin
Great, now I understand.
Thank you very much!
