WIP: Fim #9
Conversation
megatron/data/gpt_dataset.py (outdated)

    # pass
    curr_start_position = loc + 1  # jump over the EOD token
    # TODO: Check that we are not skipping the last subsequence (after the last eod)?
It seems to me that we might be skipping the last segment, after the last EOD token. If there are N EOD tokens separating the sequence, then these define N+1 segments that should be permuted, IIUC.
Ah, I think you're right!
I had been under the impression that the Megatron dataloaders append an EOD token to the end of each data sample (since they return a sequence of length 2049 at this step when the sequence length is set to 2048), but on a quick check of the final tokens in `sample` here, it doesn't seem to be EOD.
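To make the off-by-one concrete, here is a minimal sketch of the segment-splitting logic under discussion. The helper name `split_on_eod`, the EOD token value, and the loop structure are illustrative assumptions, not the PR's exact code; the point is only that the loop body alone would drop the slice after the last EOD.

```python
import numpy as np

def split_on_eod(sample, eod_token):
    """Split a token sequence into segments separated by EOD tokens.

    N EOD tokens define N + 1 segments; the trailing slice after the
    last EOD must be appended explicitly, since the loop alone skips it.
    """
    segments = []
    curr_start_position = 0
    for loc in np.where(sample == eod_token)[0]:
        segments.append(sample[curr_start_position:loc])
        curr_start_position = loc + 1  # jump over the EOD token
    # The final segment after the last EOD (possibly empty):
    segments.append(sample[curr_start_position:])
    return segments

sample = np.array([5, 6, 0, 7, 8, 0, 9])  # 0 = EOD; 2 EODs -> 3 segments
print([s.tolist() for s in split_on_eod(sample, eod_token=0)])
# → [[5, 6], [7, 8], [9]]
```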
megatron/data/gpt_dataset.py (outdated)

    # print(loc - curr_start_position, flush=True)
    # permute {prefix, suffix, middle} or {suffix, prefix, middle}
    # try:
    if loc - curr_start_position > 10:  # sometimes examples start with EOD or are too short, so avoid this case
Is there a specific reason for the minimum length 10?
What happens for examples smaller than that?
In this case, I'm just not applying the FIM transformation to examples of fewer than 10 characters in length. The intuition was twofold: either (a) documents under 10 characters would be 10 or fewer tokens, and so not valuable to apply an infilling transformation to, or (b) in the process of applying the FIM transformation and adding 3 sentinel tokens, essentially all of the document (or all of the suffix portion, which is padded/truncated as needed here) would be lost, so keeping the document would end up harming FIM performance.
The actual choice of 10 is arbitrary, and was not motivated by any tokens-to-characters ratio w.r.t. the GPT-NeoX-20b tokenizer... but I hope this is helpful in deciding whether you want to keep it!
(Actually, this is a check for > 10 tokens in the doc, my bad! But the reasoning is the same.)
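The guard and the permutation it protects can be sketched as follows. This is only an illustration of the reasoning above, not the PR's implementation: the sentinel token ids, the split-point sampling, and the PSM (prefix-suffix-middle) ordering are all assumptions made for the example.

```python
import numpy as np

# Illustrative sentinel token ids -- placeholders, not real vocab ids.
PREFIX_TOK, SUFFIX_TOK, MIDDLE_TOK = 50253, 50254, 50255
MIN_FIM_LEN = 10  # skip very short segments, as in the check under discussion

def maybe_fim(segment, rng):
    """Apply a PSM-style fill-in-the-middle permutation to one segment.

    Segments of <= MIN_FIM_LEN tokens are returned unchanged: after
    inserting 3 sentinel tokens, almost nothing of the original document
    (or of the truncated suffix) would survive the transformation.
    """
    if len(segment) <= MIN_FIM_LEN:
        return segment
    # Two random split points give prefix / middle / suffix.
    lo, hi = sorted(rng.integers(0, len(segment), size=2))
    prefix, middle, suffix = segment[:lo], segment[lo:hi], segment[hi:]
    # PSM layout: <PRE> prefix <SUF> suffix <MID> middle
    return np.concatenate([[PREFIX_TOK], prefix, [SUFFIX_TOK], suffix,
                           [MIDDLE_TOK], middle])

rng = np.random.default_rng(0)
short = np.arange(5)
assert maybe_fim(short, rng) is short  # too short: left untouched
out = maybe_fim(np.arange(20), rng)
print(len(out))  # original 20 tokens + 3 sentinels -> 23
```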
Fill-in-the-middle: #2, #8
TODO: