Using trg[:,:-1] during training #136

Closed
wajihullahbaig opened this issue Sep 4, 2020 · 7 comments

Comments

@wajihullahbaig

wajihullahbaig commented Sep 4, 2020

Thank you for this awesome repo you have made public. I had one question: during the training loop, you perform the following step:
output, _ = model(src, trg[:,:-1])

I was wondering: why are we doing the trg[:,:-1] step?

Kind regards
Wajih

@bentrevett
Owner

This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the predicted target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).

Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.

Let me know if this needs clarifying.
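A minimal sketch of that shift (not the repo's exact training code; the token ids, shapes, and the model/criterion calls here are made up for illustration), assuming trg is a batch-first LongTensor of token indices:

```python
import torch

# assume <sos>=2, <eos>=3, and A/B/C map to 10/11/12 (made-up ids)
trg = torch.tensor([[2, 10, 11, 12, 3]])  # [<sos>, A, B, C, <eos>], shape [1, 5]

decoder_input = trg[:, :-1]   # [<sos>, A, B, C]   -> fed into the decoder
decoder_target = trg[:, 1:]   # [A, B, C, <eos>]   -> what the decoder should predict

# output, _ = model(src, decoder_input)   # output: [batch, trg_len - 1, vocab_size]
# loss = criterion(output.reshape(-1, output.shape[-1]), decoder_target.reshape(-1))
```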

@wajihullahbaig
Author

> This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the predicted target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).
>
> Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.
>
> Let me know if this needs clarifying.

Oh, I understand now. Thanks indeed for the detailed reply.

Wajih

@fabio-deep

fabio-deep commented Nov 20, 2020

> This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the predicted target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).
>
> Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.
>
> Let me know if this needs clarifying.

Hi, how does this work when the trg sentence is padded? In that case I imagine the <eos> token would no longer be in the last position, right? Or am I missing something?

EDIT: never mind, I figured it out. In case anyone else is wondering: it works with padded inputs anyway because of ignore_index in the loss function.

@bentrevett
Owner

> This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the predicted target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).
> Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.
> Let me know if this needs clarifying.
>
> Hi, how does this work when the trg sentence is padded? In that case I imagine the <eos> token would no longer be in the last position, right? Or am I missing something?
>
> EDIT: never mind, I figured it out. In case anyone else is wondering: it works with padded inputs anyway because of ignore_index in the loss function.

Sorry for the late reply - it seems like you've figured it out now, but just in case someone else is reading this, I'll explain.

When we have padding, our trg sequence will be something like [<sos>, A, B, C, <eos>, <pad>, <pad>]. So the sequence input into the decoder is [<sos>, A, B, C, <eos>, <pad>] (trg[:,:-1]) and our decoder will be trying to predict the sequence [A, B, C, <eos>, <pad>, <pad>] (trg[:,1:]).

This means that yes, the <eos> token is input into the model even though it shouldn't be - because why should you predict something after the end of the sequence? - but there is no way to avoid this when padding sequences. However, because we set the ignore_index of our CrossEntropyLoss to be the index of the padding token, whenever the decoder's target token is a <pad> token we don't calculate losses over that token.

So in the above example, we only calculate the losses where the decoder's input is [<sos>, A, B, C], because the <eos> and <pad> input tokens both have a target token of <pad>. This means we calculate our losses (and thus update our parameters) as if the padding tokens didn't exist (sort of - we still waste some computation on them, but this is offset by the fact that we can use batches instead of feeding in examples one at a time or only making batches where every sequence is exactly the same length).
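A small standalone check of that ignore_index behaviour (the token ids and vocab size below are made up; only the <pad> index matters): positions whose target is <pad> contribute nothing to the loss, so the loss over the full padded sequence equals the loss computed over just the non-pad positions.

```python
import torch
import torch.nn as nn

PAD_IDX = 1
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# targets = trg[:,1:] for trg = [<sos>, A, B, C, <eos>, <pad>, <pad>]
# with assumed ids A=4, B=5, C=6, <eos>=3
targets = torch.tensor([4, 5, 6, 3, PAD_IDX, PAD_IDX])
logits = torch.randn(6, 8)  # 6 decoder outputs over a made-up vocab of size 8

loss_full = criterion(logits, targets)                         # <pad> targets ignored
loss_nonpad = nn.CrossEntropyLoss()(logits[:4], targets[:4])   # drop pad positions by hand
print(torch.allclose(loss_full, loss_nonpad))  # True
```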

@liuxiaoqun

liuxiaoqun commented Jun 16, 2021

> This is because we have a target sequence, trg, of something like [<sos>, A, B, C, <eos>]. We want our decoder to predict what the next item in the predicted target sequence should be, given the previously predicted target tokens. So, we input a sequence of [<sos>, A, B, C] (which is trg[:,:-1]) and want our decoder to predict [A, B, C, <eos>] (which is trg[:,1:]).
>
> Thus, we input trg[:,:-1] and compare the predictions against trg[:,1:] to calculate our losses.
>
> Let me know if this needs clarifying.

I have a question. The sentences are padded after <eos>, so they look like:

sos y1 y2 eos pad pad pad
sos y1 y2 y3 y4 y5 eos
sos y1 y2 y3 y4 eos pad

The size of trg is [3, 7].
If we take trg[:,:-1], the sentences are cut like:

sos y1 y2 eos pad pad
sos y1 y2 y3 y4 y5
sos y1 y2 y3 y4 eos

so it does not cut off all of the eos tokens.

I checked torchtext: the sentences are concatenated as
sos sentence eos pad
so trg[:,:-1] will not cut off all of the eos tokens.

If the sentences were instead concatenated as
sos sentence pad eos
then it would cut off all of the eos tokens.
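A quick illustration of that point (made-up indices: <sos>=2, <eos>=3, <pad>=1, y1..y5 = 10..14): trg[:,:-1] only drops the last column, so rows that end in padding keep their <eos> in the decoder input. As explained above, this is fine because the targets at those positions are <pad> and are skipped by ignore_index.

```python
import torch

trg = torch.tensor([
    [2, 10, 11, 3,  1,  1,  1],   # <sos> y1 y2 <eos> <pad> <pad> <pad>
    [2, 10, 11, 12, 13, 14, 3],   # <sos> y1 y2 y3 y4 y5 <eos>
    [2, 10, 11, 12, 13, 3,  1],   # <sos> y1 y2 y3 y4 <eos> <pad>
])  # shape [3, 7]

decoder_input = trg[:, :-1]   # rows 1 and 3 still contain an <eos> as input
decoder_target = trg[:, 1:]   # but wherever a row's input token is <eos> or <pad>,
                              # the corresponding target is <pad>, so that position
                              # is masked out of the loss
```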

@ProxJ

ProxJ commented Feb 27, 2022

For anyone who finds this in the future: output, _ = model(src, trg[:,:-1]) seems to no longer be there; instead, the decoder loop in the Seq2Seq class only feeds trg[0] through trg[trg_len - 2] as inputs. It's currently written as for t in range(1, trg_len):, where the input at the start of each iteration is always the token at t - 1 (it is updated at the end of the loop).
Took me a minute to figure out where the [:,:-1] went.

#182 # more in-depth explanation of trg[:,:-1] and how it interacts with padding.
#43 (comment) # impact of <sos> and <eos> tokens on src -> the model learns to ignore them.
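A hedged sketch of the loop shape described above (not necessarily the repo's exact code; decoder, hidden, and outputs are placeholders for an RNN decoder, its state, and a preallocated output tensor, and the teacher-forcing-ratio branch is omitted): decoding starts at t = 1 and the decoder is always fed the token at position t - 1, so the final token of trg is never used as an input, which has the same effect as slicing with [:,:-1].

```python
import torch

def decode_with_teacher_forcing(decoder, hidden, trg, outputs):
    # trg: [trg_len, batch_size] of token indices, trg[0] being the <sos> row
    trg_len = trg.shape[0]
    input_token = trg[0]                 # first decoder input is <sos>
    for t in range(1, trg_len):
        prediction, hidden = decoder(input_token, hidden)
        outputs[t] = prediction          # prediction for position t
        input_token = trg[t]             # next iteration feeds trg[t], i.e. index (t+1) - 1
    return outputs
```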

@wajihullahbaig
Author

> For anyone who finds this in the future: output, _ = model(src, trg[:,:-1]) seems to no longer be there; instead, the decoder loop in the Seq2Seq class only feeds trg[0] through trg[trg_len - 2] as inputs. It's currently written as for t in range(1, trg_len):, where the input at the start of each iteration is always the token at t - 1 (it is updated at the end of the loop). Took me a minute to figure out where the [:,:-1] went.
>
> #182 # more in-depth explanation of trg[:,:-1] and how it interacts with padding. #43 (comment) # impact of <sos> and <eos> tokens on src -> the model learns to ignore them.

You are correct. It seems to have been updated now.
