WIP: Fim #9
Conversation
megatron/data/gpt_dataset.py (outdated)

    # pass
    curr_start_position = loc + 1  # jump over the EOD token
    # TODO: Check that we are not skipping the last subsequence (after the last eod)?
It seems to me that we might be skipping the last segment, after the last EOD token. If there are N EOD tokens separating the sequence, then these define N+1 segments that should be permuted, IIUC.
Ah, I think you're right!
I had been under the impression that the Megatron dataloaders append an EOD token to the end of each data sample (since they return a sequence of length 2049 at this step when the sequence length is set to 2048), but on a quick check of the final tokens in `sample` here, it doesn't seem to be EOD.
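To make the off-by-one concrete, here is a minimal sketch of the segment-splitting logic under discussion. The helper name `split_on_eod`, the EOD token value, and the loop structure are illustrative assumptions, not the PR's exact code; the point is only that the loop body alone would drop the slice after the last EOD.

```python
import numpy as np

def split_on_eod(sample, eod_token):
    """Split a token sequence into segments separated by EOD tokens.

    N EOD tokens define N + 1 segments; the trailing slice after the
    last EOD must be appended explicitly, since the loop alone skips it.
    """
    segments = []
    curr_start_position = 0
    for loc in np.where(sample == eod_token)[0]:
        segments.append(sample[curr_start_position:loc])
        curr_start_position = loc + 1  # jump over the EOD token
    # The final segment after the last EOD (possibly empty):
    segments.append(sample[curr_start_position:])
    return segments

sample = np.array([5, 6, 0, 7, 8, 0, 9])  # 0 = EOD; 2 EODs -> 3 segments
print([s.tolist() for s in split_on_eod(sample, eod_token=0)])
# → [[5, 6], [7, 8], [9]]
```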
megatron/data/gpt_dataset.py (outdated)

    # print(loc - curr_start_position, flush=True)
    # permute {prefix, suffix, middle} or {suffix, prefix, middle}
    # try:
    if loc - curr_start_position > 10:  # sometimes examples start with EOD or are too short, so avoid this case
Is there a specific reason for the minimum length 10?
What happens for examples smaller than that?
In this case, I'm just not applying the FIM transformation to examples of fewer than 10 characters in length. The intuition was twofold: either (a) documents under 10 characters would be 10 or fewer tokens, and so not valuable to apply an infilling transformation to, or (b) in the process of applying the FIM transformation and adding 3 sentinel tokens, essentially all of the document (or all of the suffix portion, which is padded/truncated as needed here) would be lost, so keeping the document would end up harming FIM performance.
The actual choice of 10 is arbitrary, and was not motivated by any tokens-to-characters ratio w.r.t. the GPT-NeoX-20b tokenizer... but I hope this is helpful in deciding whether you want to keep it!
(Actually, this is a check for > 10 tokens in the doc, my bad! But the reasoning is the same.)
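The guard and the permutation it protects can be sketched as follows. This is only an illustration of the reasoning above, not the PR's implementation: the sentinel token ids, the split-point sampling, and the PSM (prefix-suffix-middle) ordering are all assumptions made for the example.

```python
import numpy as np

# Illustrative sentinel token ids -- placeholders, not real vocab ids.
PREFIX_TOK, SUFFIX_TOK, MIDDLE_TOK = 50253, 50254, 50255
MIN_FIM_LEN = 10  # skip very short segments, as in the check under discussion

def maybe_fim(segment, rng):
    """Apply a PSM-style fill-in-the-middle permutation to one segment.

    Segments of <= MIN_FIM_LEN tokens are returned unchanged: after
    inserting 3 sentinel tokens, almost nothing of the original document
    (or of the truncated suffix) would survive the transformation.
    """
    if len(segment) <= MIN_FIM_LEN:
        return segment
    # Two random split points give prefix / middle / suffix.
    lo, hi = sorted(rng.integers(0, len(segment), size=2))
    prefix, middle, suffix = segment[:lo], segment[lo:hi], segment[hi:]
    # PSM layout: <PRE> prefix <SUF> suffix <MID> middle
    return np.concatenate([[PREFIX_TOK], prefix, [SUFFIX_TOK], suffix,
                           [MIDDLE_TOK], middle])

rng = np.random.default_rng(0)
short = np.arange(5)
assert maybe_fim(short, rng) is short  # too short: left untouched
out = maybe_fim(np.arange(20), rng)
print(len(out))  # original 20 tokens + 3 sentinels -> 23
```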
Fill-in-the-middle: #2, #8
TODO: