
decoding in the Transformer-based model #119

Closed
lovodkin93 opened this issue Feb 8, 2021 · 12 comments

@lovodkin93

lovodkin93 commented Feb 8, 2021

Hello,
I was going through your Transformer-based model, and something didn't seem to add up; I would appreciate your help in clarifying it:
In transformer.py line 366, the whole output sentence is passed as input to the self-attention layer (which I also double-checked by printing), whereas, from what I know, the decoder should only receive, at each step, the last word it has computed.
So I went deeper into the code and got to the loss function at the end (at transformer.py line 86), where again it appears the loss is computed from the whole sentence rather than just the currently computed word (which, again, I printed to double-check).
So my questions are: does your model generate the target-language sentence sequentially? If so, where exactly does the sequential decoding come into play? And lastly, if there is indeed sequential decoding, is there a parameter of the Transformer that indicates which word is currently being decoded (namely, first word, second word, third word, etc.)?
Thanks!

@rsennrich
Collaborator

Hi @lovodkin93,

I think there are two issues to keep in mind:

  • both functions you point to are active at training time. During training, all time steps are processed in parallel (the fact that this is possible with Transformers is a big advantage over RNNs, where you need to process time steps sequentially).
  • even during inference, the Transformer needs access to all decoder states from previous time steps for self-attention (unlike RNNs, where access to the hidden state(s) from the last time step is enough). You can find the inference code here. Yes, this function has a variable `current_time_step` that indicates which word is currently being processed.

best,
Rico
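
To make the contrast concrete, here is a minimal sketch (not the Nematus code itself) of the two regimes Rico describes. `decoder_fn` is a hypothetical stand-in for the decoder stack that returns logits of shape [batch, length, vocab]; the real inference loop lives in transformer_inference.py.

```python
import numpy as np

def greedy_decode(decoder_fn, source, bos_id=1, eos_id=2, max_len=50):
    """Sequential inference: one token per step, tracked by current_time_step."""
    output = [bos_id]
    for current_time_step in range(1, max_len + 1):
        # The whole prefix is passed in: self-attention at step t needs the
        # decoder states of all previous positions, not just the last one.
        logits = decoder_fn(source, np.array([output]))  # [1, current_time_step, vocab]
        next_token = int(logits[0, -1].argmax())         # only the newest position is read
        output.append(next_token)
        if next_token == eos_id:
            break
    return output

# At training time there is no such loop: the decoder is called once on the full
# (shifted) target sequence, and a causal mask enforces "no attention to the future".
```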

@lovodkin93
Author

amazing, thank you!
I do have a follow-up question: I have searched for the current_time_step variable, but I see it appears only in the transformer_inference.py code. I actually need it in the training process rather than the inference process, so I was wondering whether there is something similar in the transformer.py code? Namely, is there a way to know which word is being processed by the decoder at a given training step (of course, it is all done simultaneously, but I mean within a given thread at a given time step)?

@lovodkin93
Author

lovodkin93 commented Feb 9, 2021

In fact, let me explain my purpose, since the current time step might turn out to be redundant:
I am trying to propagate the attn_weights of the decoder layers' self-attention layer from transformer_attention_modules.py line 168 all the way to transformer.py line 75 (by adding the appropriate outputs to all the relevant functions along the way). My goal is to incorporate a certain manipulation of the attn_weights into the loss function. The thing is, I don't want redundancy in the loss, so for each word in the target sentence I want just the row in attn_weights relevant to it (namely, for the word generated at time step i, I want just the row containing that word's attention weights over the first i-1 words). So my question is: is the generated triangular matrix attn_weights computed once per sentence, or as many times as there are words in the sentence?
For example, for the generated target sentence "I eat pizza", will attn_weights be generated once (namely, for each word the corresponding row will be computed once, and since it is all done simultaneously the matrix will be generated once):
1,0,0
x, 1-x,0
x, y, 1-x-y

or rather would it be generated thrice:
1,0,0
nonsense, nonsense, nonsense
nonsense, nonsense, nonsense

1,0,0
x, 1-x,0
nonsense, nonsense, nonsense

1,0,0
x, 1-x,0
x, y, 1-x-y

Thanks!

@rsennrich
Collaborator

attn_weights will only be generated once per attention head and timestep. If you want to propagate this to the loss, you'll need to properly pack this information (a base Transformer has 48 attention heads), and you can look at the layer_memories for inspiration (they're currently only used for caching at inference time). @demelin is the original author of that code and might help with more detailed questions.
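
For what the "packing" could look like in practice, here is a hedged sketch under the assumption that each decoder layer exposes its self-attention weights as a [batch, heads, target_len, target_len] array; the names `pack_attention` and `attention_row` are illustrative and not taken from the Nematus code.

```python
import numpy as np

def pack_attention(attn_per_layer):
    """Stack per-layer weights [batch, heads, T, T] into one [batch, layers*heads, T, T] tensor."""
    return np.concatenate(attn_per_layer, axis=1)

def attention_row(attn, position):
    """Attention distribution used for the token at `position`:
    shape [batch, layers*heads, position + 1] (later entries are masked out anyway)."""
    return attn[:, :, position, :position + 1]

# Toy check with a base-Transformer layout: 6 decoder layers x 8 heads = 48 heads.
attn = pack_attention([np.random.rand(2, 8, 16, 16) for _ in range(6)])
print(attn.shape)                    # (2, 48, 16, 16)
print(attention_row(attn, 3).shape)  # (2, 48, 4)
```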

@lovodkin93
Author

@rsennrich So let me make sure I understand clearly:
given the sentence "I eat pizza", all the words of the sentence ("I", "eat", "pizza") will be processed in the same timestep (as they are processed in parallel), so attn_weights will be computed once for the whole sentence for this timestep, right?

@lovodkin93
Author

@rsennrich In other words: are all the words of a sentence processed in the same single timestep, or is each processed in a separate timestep?

@rsennrich
Collaborator

I use timestep to refer to the position in the sentence (it could be a word, subword, or character). So I'm saying that attn_weights will only be generated once per attention head and subword, and that all subwords are processed in parallel as a single tensor (for example, the key tensor has the shape [batch_size, time_steps, num_heads, num_features]).
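
As a toy shape check (NumPy only, illustrative rather than the actual code), scoring all positions at once produces exactly one matrix per sentence and per head:

```python
import numpy as np

batch_size, time_steps, num_heads, num_features = 2, 16, 8, 64
queries = np.random.randn(batch_size, time_steps, num_heads, num_features)
keys    = np.random.randn(batch_size, time_steps, num_heads, num_features)

# scores[b, h, i, j] = <query at position i, key at position j> / sqrt(num_features)
scores = np.einsum('bihf,bjhf->bhij', queries, keys) / np.sqrt(num_features)
print(scores.shape)  # (2, 8, 16, 16): computed in a single pass, no per-word loop
```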

@lovodkin93
Author

lovodkin93 commented Feb 10, 2021

@rsennrich @demelin
From what I know, there are two attention layers in the decoder: the cross-attention layer, which receives as input the output of the encoder block (and whose implementation in the code is quite clear), and the self-attention layer, which receives all the words generated up to the current timestep. So what I don't seem to understand is: how can the word at timestep t be computed without knowledge of the first t-1 words (which are being computed in parallel to it and are not necessarily known at timestep t)?

In fact, I have also printed attn_weights and its size and looked at the output. The size of attn_weights was [2, 8, 16, 16] (namely a batch of 2 sentences, the longer of which has length 16, so in fact there was a 16x16 matrix for each of the sentences and for each of the 8 heads). I also printed the matrix of the first sentence for one of those heads, and got the following output (each row contains 16 elements):
[screenshot of the printed 16x16 attention matrix]

So to ask a more practical question: for every timestep, is this whole matrix generated, so that for a sentence of length 16, 16 matrices of size 16x16 will be generated over 16 timesteps, per head (one matrix per timestep)? Or alternatively, for every timestep, is just one of the rows of the matrix generated, with all of them concatenated at the end (so that for a sentence of length 16, 16 rows of size 1x16 are generated, one row per timestep, and not 16 matrices of size 16x16)?

@demelin

demelin commented Feb 10, 2021

Hey there,

During training, all attention weights are computed in parallel for all time-steps (i.e. encoder self-attention, decoder self-attention, and decoder-to-encoder attention). During inference, encoder self-attention is computed in parallel for all time-steps, as we have access to the full source sentence. Decoder self-attention is re-computed for all target-side tokens each time we expand the generated sequence, so as to update the latent representations at positions 0, ..., t-1 with the information contained in the token t. Decoder-to-encoder attention, on the other hand, is only computed between the target token at the current position, i.e. t, and all hidden states in the final layer of the encoder. The result is concatenated with previous time-step-specific attention weights which are cached in layer_memories to avoid redundant computation.

To answer your question regarding the attention matrix in your example: during inference at t == 1, the attention matrix has size 1x1, at t == 2 it has size 2x2, ..., at t == 16 it has size 16x16. During training, it's 16x16 right away and we apply a mask to prevent the model from attending to future positions, as the decoder is given the full (shifted) target sequence as input. This enables the efficient, non-recurrent training process. Hope that helps.

All the best,
Denis
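
A minimal NumPy sketch of the training-time masking Denis describes (illustrative only; the real masking happens inside the attention modules):

```python
import numpy as np

def causal_softmax(scores):
    """Mask out future positions in [batch, heads, T, T] scores, then softmax over the last axis."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -1e9, scores)             # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

attn = causal_softmax(np.random.randn(2, 8, 16, 16))
print(attn.shape)                                  # (2, 8, 16, 16), produced once, in parallel
print(np.allclose(np.triu(attn[0, 0], k=1), 0.0))  # True: no weight on future positions
```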

@lovodkin93
Author

lovodkin93 commented Feb 10, 2021

@demelin
ok, I think I understand.
So just to make sure I understand:
For the example I included, what is done in the inference phase in 16 sequential steps, yielding 16 matrices (1x1, 2x2, ...), is equivalent to a single sequential step in the training, yielding a single matrix of size 16x16 (and not 16 matrices of size 16x16)?
My main concern is the number of matrices yielded during a single forward pass of all the words in a sentence in the training phase.

@demelin

demelin commented Feb 10, 2021

Yes, the training step produces a single attention matrix (or, more accurately, a 4D attention tensor) per mini-batch per attention type, i.e. one for encoder self-attention, one for decoder self-attention, and one for decoder-to-encoder attention.

what is done in the inference phase in 16 sequential steps yielding 16 matrices (1X1,2X2,...), is equivalent to a single sequential step in the training yielding a single matrix of size 16X16 (and not 16 matrices of size 16X16)

Yep :). 16 matrices would be way too many. Also, the training is not sequential (e.g. as in RNNs), but parallel (e.g. as in FFNs).
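
In terms of shapes, one way to picture what a single training step yields per attention type (shapes are illustrative, assuming a source length of 20 and a target length of 16; per-layer tensors would each carry the per-layer head dimension shown here):

```python
import numpy as np

B, H, S, T = 2, 8, 20, 16  # batch, heads (per layer), source length, target length

# One attention tensor per attention type per mini-batch:
attention_per_type = {
    "encoder_self_attention": np.zeros((B, H, S, S)),
    "decoder_self_attention": np.zeros((B, H, T, T)),  # causally masked
    "decoder_to_encoder":     np.zeros((B, H, T, S)),
}
for name, tensor in attention_per_type.items():
    print(name, tensor.shape)
```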

@lovodkin93
Author

@demelin
Great, now I understand.
Thank you very much!
