
[Feature] Implement sample-ids-to-text extractor #116

Merged
stas00 merged 11 commits into bigscience-workshop:main from wade3han:feature/sample-ids-to-text on Oct 1, 2021

Conversation

@wade3han (Collaborator) commented Sep 26, 2021

Hello, I tried to tackle issue #56; here is the sample script I used:

Example usage:


source $six_ALL_CCFRWORK/code/tr8-104B/bigscience/train/tr8-104B-wide/start-tr8-104B
MEGATRON_DEEPSPEED_REPO=$six_ALL_CCFRWORK/code/tr8-104B/Megatron-DeepSpeed-tr8-104B

cd $MEGATRON_DEEPSPEED_REPO

VOCAB_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-vocab.json
MERGE_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-merges.txt
DATA_PATH=$six_ALL_CCFRWORK/datasets-custom/oscar-en/meg-gpt2_text_document

SEQ_LEN=2048
python ~/bin/sample_idxs_to_text.py \
    --print-text \
    --sample-id-range 260912 261872 \
    --seed 42 \
    --tokenizer-type GPT2BPETokenizer \
    --seq-length $SEQ_LEN \
    --train-samples 300_000_000 \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --output-file tr8-104B-glitch1-1st-time.txt

If you want tokens instead of text, remove --print-text and add --print-tokens (you can also have both). For full token dumps, add --all-tokens.

Full documentation is in the script's preamble.

(edited to bring the OP up to date)

Fixes #56

@stas00 (Contributor) commented Sep 28, 2021

I started validating that the data matches, and it does for the first samples, but then it no longer matches.

I'm trying to find out where the indices start to diverge. I think the problem is that the script in this PR indexes into the dataset rather than the dataloader, so the two aren't the same: the dataloader shuffles the data (though it saves a cache once it has done so).
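To illustrate the suspicion, here is a minimal PyTorch sketch (plain torch, not the actual Megatron code) of how direct dataset indexing and a shuffling dataloader can disagree:

# minimal sketch of the suspected mismatch: indexing a dataset directly
# vs. consuming it through a shuffling dataloader (illustrative only)
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.arange(10).unsqueeze(1)
ds = TensorDataset(data)
loader = DataLoader(ds, batch_size=1, shuffle=True,
                    generator=torch.Generator().manual_seed(42))

print(ds[3])            # direct indexing: always the 4th record
print(list(loader)[3])  # dataloader order: depends on the shuffle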

To compare, I'm dumping the sequence indices on the deepspeed side:

diff --git a/deepspeed/runtime/pipe/engine.py b/deepspeed/runtime/pipe/engine.py
index cadb5b8..215ea96 100644
--- a/deepspeed/runtime/pipe/engine.py
+++ b/deepspeed/runtime/pipe/engine.py
@@ -88,6 +88,7 @@ class PipelineEngine(DeepSpeedEngine):
         self.prev_stage = self.stage_id - 1
         self.next_stage = self.stage_id + 1

+        self.data_sample_id = 0
         self.data_iterator = None
         self.batch_fn = None

@@ -545,6 +546,9 @@ class PipelineEngine(DeepSpeedEngine):
         if self.data_iterator is not None:
             batch = next(self.data_iterator)

+        print(f"{self.data_sample_id} {batch}")
+        self.data_sample_id += 1
+
         # Any post-processing, like broadcasting across a slice-parallel group.
         if self.batch_fn:
             batch = self.batch_fn(batch)

and trying to match the two sides.
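For the matching itself, something like this hypothetical helper does the job (the filenames are illustrative):

# hypothetical helper: report the first index at which two dump files diverge
def first_divergence(file_a, file_b):
    with open(file_a) as a, open(file_b) as b:
        for i, (line_a, line_b) in enumerate(zip(a, b)):
            if line_a != line_b:
                return i
    return None

print(first_divergence("meg_ds_dump.txt", "script_dump.txt"))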

@stas00 (Contributor) commented Sep 29, 2021

Interesting: it's identical at first, but then starts diverging at idx 256:

meg-ds

254 {'text': tensor([[  290,   484,   481,  ..., 35592,  1080,    13]])}
255 {'text': tensor([[ 2393,  1635, 12960,  ...,  3762, 13986,   613]])}
256 {'text': tensor([[18821,  2421,   281,  ..., 10005,   393,  1844]])}
257 {'text': tensor([[11491,   290, 12497,  ..., 18234,   262,  7852]])}
258 {'text': tensor([[  632,   743,   307,  ..., 17144,   661,   423]])}

your script, dumping range 0..300 (modified a bit)

254 {'text': tensor([[  290,   484,   481,  ..., 35592,  1080,    13]])}
255 {'text': tensor([[ 2393,  1635, 12960,  ...,  3762, 13986,   613]])}
256 {'text': tensor([[5752,  286,  262,  ...,  307, 1043,  319]])}
257 {'text': tensor([[1795, 3716, 5699,  ...,  262, 2560, 2685]])}
258 {'text': tensor([[2989,  257, 4731,  ..., 2151, 3586,   13]])}

Something happens at sample 256. Trying to figure out what.

@stas00 (Contributor) commented Sep 29, 2021

I figured it out. My config was set to run for just 100 iterations; it consumed 256 samples and then ran validation. Since I was dumping inside deepspeed from the iterator, it was actually dumping from the validation dataloader from sample 256 on, hence the discrepancy.

So I think it's all good. But will do some more checks.
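In hindsight, labeling the debug print with the module's train/eval mode would have exposed this right away. A hedged tweak to the earlier patch, assuming self.module.training flips to False during validation as usual:

-        print(f"{self.data_sample_id} {batch}")
+        # label each fetched batch with the current mode (assumed to flip
+        # during validation), so train and eval dumps are distinguishable
+        mode = "train" if self.module.training else "eval"
+        print(f"{self.data_sample_id} {mode} {batch}")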

@stas00 (Contributor) commented Sep 29, 2021

I added a few variations on what data we dump and how we dump it.

The newly added args are:

Get text from sample idxs.:
  --sample-id-range SAMPLE_ID_RANGE [SAMPLE_ID_RANGE ...]
                        The number of samples consumed. ex) --sample-id-range 1024 2048
  --all_tokens          Whether to dump all tokens per record
  --print_tokens        Whether to print tokens
  --print_text          Whether to print text

And an example:

DATA_PATH=data/meg-gpt2_oscar-combined_text_document
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt

python tools/sample_idxs_to_text.py \
    --seed 42 \
    --sample-id-range 5 20 \
    --print_tokens \
    --print_text \
    --all_tokens \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --seq-length 1024 \
    --micro-batch-size 1 \
    --global-batch-size 16 \
    --train-samples 100 \
    --eval-interval 100 \
    --eval-iters 100 \
    --num-layers 2 \
    --hidden-size 64 \
    --num-attention-heads 2 \
    --max-position-embeddings 1024 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE

We should think about where to document this new feature.

@stas00 stas00 self-requested a review September 29, 2021 04:25
@stas00 (Contributor) left a comment:

Great work @wade3han - thank you for figuring out the hardest part!

I think it's pretty much ready now; we just need to add a doc.

Let me know if you have any other ideas.

@stas00 (Contributor) commented Sep 29, 2021

I'm thinking we probably don't need some of the args the script currently requires to run, as in the example in the OP.

We need to check which of these don't impact the generated .npy files and give those some defaults; e.g. I don't think --num-layers makes any difference.
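For reference, the index-map cache filenames built in megatron/data/gpt_dataset.py encode exactly those values; this is paraphrased from memory, so the exact format string may differ:

# paraphrased from _build_index_mappings() in megatron/data/gpt_dataset.py;
# the cache name depends only on the number of samples, seq length and seed
_filename = data_prefix
_filename += '_{}_indexmap'.format(name)   # train/valid/test
_filename += '_{}ns'.format(num_samples)   # from --train-samples
_filename += '_{}sl'.format(seq_length)    # from --seq-length
_filename += '_{}s'.format(seed)           # from --seed
doc_idx_filename = _filename + '_doc_idx.npy'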

@wade3han (Collaborator, Author) commented:

You are right that arguments like --num-layers don't impact the result at all; however, the get_args() function requires those model-specific parameters, so I put in dummy values like --num-layers 1. I guess adding a comment about this to the script might be helpful!
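Roughly, the workaround amounts to this hypothetical stub; the values are throwaway placeholders, never used by the dump itself:

# hypothetical illustration: pad sys.argv with dummy model args so that
# megatron's get_args() validation passes; none of these affect the output
import sys
sys.argv += ['--num-layers', '1',
             '--hidden-size', '1',
             '--num-attention-heads', '1']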

@stas00 (Contributor) commented Sep 29, 2021

OK, I managed to change the code so it doesn't require the irrelevant args; we only need 3 of them to get the correct .npy files.

So now it's just:

DATA_PATH=data/meg-gpt2_oscar-combined_text_document
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
SEQ_LEN=2048

python tools/sample_idxs_to_text.py \
    --print_tokens \
    --sample-id-range 5 6 \
    --seed 42 \
    --train-samples 100 \
    --seq-length $SEQ_LEN \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE

Please let me know if the latest incarnation looks good, and then I'll be happy to merge it.

@stas00 (Contributor) commented Sep 30, 2021

Hmm, I think I was using this script incorrectly: the index here is a sample ID, but I used it to match an iteration ID.

The sample ID is very dynamic while we do batch-size ramp-up, so it can be tricky to calculate the right one.

Moreover, the current trainings use GBS=1024, so it'd be pretty much impossible to find which record led to a blow-up: one would have to review up to 10240 records if we log every 10 iterations.

Hmm. I will post on Slack and we'll see if we can find a better way to "zoom in".

But we should probably stress in the documentation that the sample ID is not an iteration ID.
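To illustrate why this is tricky: with a constant global batch size the mapping from iteration to sample-id range would be trivial, but batch-size ramp-up breaks the simple formula (a sketch, not part of the script):

# sketch: iteration -> sample-id range, valid only for a constant
# global batch size; batch-size ramp-up invalidates this simple math
def iteration_to_sample_range(iteration, global_batch_size=1024):
    start = (iteration - 1) * global_batch_size
    end = iteration * global_batch_size  # exclusive
    return start, end

print(iteration_to_sample_range(256))  # (261120, 262144)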

@wade3han (Collaborator, Author) commented:

Yes, I usually looked at consumed_samples, which is displayed in the training log.

> But we should probably stress in the documentation that the sample ID is not an iteration ID

It would be good to stress that!

@wade3han (Collaborator, Author) commented Sep 30, 2021

So IMO arguments like batch size will be relevant to the script since, as you said, the sample ID matters.

Edit: I think this opinion is wrong. Never mind!

@stas00 (Contributor) commented Sep 30, 2021

We have to deal with batch-size ramp-up, so, as you said, consumed_samples is probably what's needed - except with GBS=2048 and logging every 10 iterations, we'd have to read through 20K records!!!


Let's continue the discussion here: https://huggingface.slack.com/archives/C01NHER1JLS/p1632961981146200

we can post a summary after we sort it out.

@stas00 (Contributor) commented Sep 30, 2021

OK, we should probably add an output-file option, otherwise all that logging gets mixed into the output.

OK, done.
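The idea, roughly (the flag name matches the OP example; this implementation is assumed, not the actual code):

import sys

def get_output_stream(output_file=None):
    # write samples to a file when --output-file is given, else stdout,
    # so the dumped text doesn't interleave with megatron's logging
    return open(output_file, 'w') if output_file else sys.stdout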

@wade3han (Collaborator, Author) commented Oct 1, 2021

LGTM! Thanks @stas00

@stas00 stas00 merged commit 2f029fa into bigscience-workshop:main Oct 1, 2021