
WIP: Fim #9

Merged: 7 commits merged into multi-query-attention from fim on Nov 22, 2022

Conversation


@RaymondLi0 (Collaborator) commented on Nov 3, 2022

Fill-in-the-middle: #2, #8

TODO:

  • Add sentinel tokens to the tokenizer
  • Add SPM+PSM mode (see the sketch after this list)
  • Check that FIM does not hurt training speed
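
For context, a minimal sketch of the fill-in-the-middle permutation this PR adds, assuming sentinel token ids `prefix_tok_id`, `suffix_tok_id`, and `middle_tok_id` have already been added to the tokenizer. The names, the SPM layout, and the 50/50 mode split are illustrative (following the FIM paper), not necessarily this PR's exact code:

```python
import numpy as np

def fim_permute(sample, np_rng, prefix_tok_id, suffix_tok_id, middle_tok_id,
                spm_rate=0.5):
    """Split a token sequence into (prefix, middle, suffix) at two random
    cut points and reorder it for fill-in-the-middle training.

    Sketch only: sentinel names and SPM layout follow the FIM paper,
    not necessarily this PR's implementation.
    """
    # Two random cut points define prefix / middle / suffix.
    boundaries = sorted(np_rng.randint(low=0, high=len(sample) + 1, size=2))
    prefix = sample[:boundaries[0]]
    middle = sample[boundaries[0]:boundaries[1]]
    suffix = sample[boundaries[1]:]

    if np_rng.binomial(1, spm_rate):
        # SPM: the suffix is presented before the prefix.
        return np.concatenate(
            [[prefix_tok_id, suffix_tok_id], suffix, [middle_tok_id], prefix, middle])
    # PSM: prefix, then suffix, then middle, each introduced by its sentinel.
    return np.concatenate(
        [[prefix_tok_id], prefix, [suffix_tok_id], suffix, [middle_tok_id], middle])
```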

# pass

curr_start_position = loc + 1 # jump over the EOD token
# TODO: Check that we are not skipping the last subsequence (after the last eod)?
@RaymondLi0 (Collaborator, Author) commented:

It seems to me that we might be skipping the last segment, after the last EOD token. If there are N EOD tokens separating the sequence, they define N+1 segments that should each be permuted, IIUC.

Ah I think you’re right!

I had been under the impression that Megatron dataloaders append an EOD token to the end of each data sample (since they return a sequence of length 2049 at this step when the sequence length is set to 2048), but a quick check of the final tokens in `sample` here shows they don't seem to be EOD.
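
A minimal sketch of the segmentation being discussed, assuming `sample` is a 1-D numpy array of token ids and `eod_tok_id` marks document boundaries (both names are illustrative). The trailing segment after the last EOD is yielded explicitly, so N EOD tokens produce N+1 segments:

```python
import numpy as np

def split_at_eod(sample, eod_tok_id):
    """Yield the EOD-delimited sub-sequences of `sample`, including the
    trailing segment after the last EOD (N EOD tokens -> N + 1 segments)."""
    eod_locs = np.where(sample == eod_tok_id)[0]
    start = 0
    for loc in eod_locs:
        yield sample[start:loc]
        start = loc + 1  # jump over the EOD token
    # Trailing segment: everything after the last EOD (possibly empty).
    yield sample[start:]
```

Each yielded segment could then be permuted independently and re-joined with EOD tokens in between.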

# print(loc - curr_start_position, flush=True)
# permute {prefix, suffix, middle} or {suffix, prefix, middle}
# try:
if loc - curr_start_position > 10: # sometimes examples start with EOD or are too short. so avoid this case
@RaymondLi0 (Collaborator, Author) commented:

Is there a specific reason for the minimum length of 10?
What happens to examples shorter than that?

In this case, I'm simply not applying the FIM transformation to examples shorter than 10 characters. The intuition was that either such documents would be 10 or fewer tokens and so not worth applying an infilling transformation to, or that in the process of applying the FIM transformation and adding the 3 sentinel tokens, essentially all of the document (or all of the suffix portion, which is padded/truncated as needed here) would be lost, so such documents would end up harming FIM performance.

The actual choice of 10 is arbitrary and was not motivated by any tokens-per-character ratio of the GPT-NeoX-20B tokenizer... but I hope this is helpful in deciding whether you want to keep it!

(Actually, this is a check for > 10 tokens in the document, my bad! But the reasoning is the same.)
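
For reference, a sketch of that guard applied per EOD-delimited segment, reusing the hypothetical `fim_permute` from the sketch above. The threshold constant, `fim_rate`, and function names are illustrative, and the check counts tokens as clarified here:

```python
MIN_FIM_TOKENS = 10  # arbitrary threshold discussed above (tokens, not characters)

def maybe_fim(segment, np_rng, fim_rate=0.5, **sentinel_ids):
    """Apply the FIM permutation to an EOD-delimited segment only if it has
    more than MIN_FIM_TOKENS tokens; otherwise return it unchanged."""
    if len(segment) <= MIN_FIM_TOKENS:
        return segment  # too short: the 3 sentinels would swallow most of it
    if np_rng.binomial(1, fim_rate):
        return fim_permute(segment, np_rng, **sentinel_ids)
    return segment
```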

@RaymondLi0 merged commit 8169dec into multi-query-attention on Nov 22, 2022
@RaymondLi0 deleted the fim branch on Nov 22, 2022 at 19:34