
[Feature] Implement sample-ids-to-text extractor #116

Merged
stas00 merged 11 commits into bigscience-workshop:main from wade3han:feature/sample-ids-to-text on Oct 1, 2021

Conversation

@wade3han (Collaborator) commented Sep 26, 2021

Hello, I tried to tackle issue #56; here is the sample script I used:

Example usage:


source $six_ALL_CCFRWORK/code/tr8-104B/bigscience/train/tr8-104B-wide/start-tr8-104B
MEGATRON_DEEPSPEED_REPO=$six_ALL_CCFRWORK/code/tr8-104B/Megatron-DeepSpeed-tr8-104B

cd $MEGATRON_DEEPSPEED_REPO

VOCAB_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-vocab.json
MERGE_FILE=$MEGATRON_DEEPSPEED_REPO/data/gpt2-merges.txt
DATA_PATH=$six_ALL_CCFRWORK/datasets-custom/oscar-en/meg-gpt2_text_document

SEQ_LEN=2048
python ~/bin/sample_idxs_to_text.py \
    --print-text \
    --sample-id-range 260912 261872 \
    --seed 42 \
    --tokenizer-type GPT2BPETokenizer \
    --seq-length $SEQ_LEN \
    --train-samples 300_000_000 \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --output-file tr8-104B-glitch1-1st-time.txt

If you want tokens instead of text, remove --print-text and add --print-tokens (you can also have both). For full token dumps, add --all-tokens.

Full documentation is in the script's preamble.

(edited to bring the OP up to date)

Fixes #56

@stas00 (Contributor) commented Sep 28, 2021

I started validating that the data matches, and it does for the first samples, but then it no longer matches.

I'm trying to find out where the indices start to diverge. I think the problem is that the script in this PR indexes into the dataset rather than the dataloader, so the two aren't the same: the dataloader shuffles the data (though it saves a cache once it has done so).
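To illustrate the suspicion, here is a minimal PyTorch sketch (plain torch, not the actual Megatron code) of how direct dataset indexing and a shuffling dataloader can disagree:

# minimal sketch of the suspected mismatch: indexing a dataset directly
# vs. consuming it through a shuffling dataloader (illustrative only)
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.arange(10).unsqueeze(1)
ds = TensorDataset(data)
loader = DataLoader(ds, batch_size=1, shuffle=True,
                    generator=torch.Generator().manual_seed(42))

print(ds[3])            # direct indexing: always the 4th record
print(list(loader)[3])  # dataloader order: depends on the shuffle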

To compare, I'm dumping the sequence indices on the deepspeed side:

diff --git a/deepspeed/runtime/pipe/engine.py b/deepspeed/runtime/pipe/engine.py
index cadb5b8..215ea96 100644
--- a/deepspeed/runtime/pipe/engine.py
+++ b/deepspeed/runtime/pipe/engine.py
@@ -88,6 +88,7 @@ class PipelineEngine(DeepSpeedEngine):
         self.prev_stage = self.stage_id - 1
         self.next_stage = self.stage_id + 1

+        self.data_sample_id = 0
         self.data_iterator = None
         self.batch_fn = None

@@ -545,6 +546,9 @@ class PipelineEngine(DeepSpeedEngine):
         if self.data_iterator is not None:
             batch = next(self.data_iterator)

+        print(f"{self.data_sample_id} {batch}")
+        self.data_sample_id += 1
+
         # Any post-processing, like broadcasting across a slice-parallel group.
         if self.batch_fn:
             batch = self.batch_fn(batch)

and trying to match the two sides.
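For the matching itself, something like this hypothetical helper does the job (the filenames are illustrative):

# hypothetical helper: report the first index at which two dump files diverge
def first_divergence(file_a, file_b):
    with open(file_a) as a, open(file_b) as b:
        for i, (line_a, line_b) in enumerate(zip(a, b)):
            if line_a != line_b:
                return i
    return None

print(first_divergence("meg_ds_dump.txt", "script_dump.txt"))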

@stas00 (Contributor) commented Sep 29, 2021

Interesting: it's identical at first, but then starts diverging at idx 256:

meg-ds

254 {'text': tensor([[  290,   484,   481,  ..., 35592,  1080,    13]])}
255 {'text': tensor([[ 2393,  1635, 12960,  ...,  3762, 13986,   613]])}
256 {'text': tensor([[18821,  2421,   281,  ..., 10005,   393,  1844]])}
257 {'text': tensor([[11491,   290, 12497,  ..., 18234,   262,  7852]])}
258 {'text': tensor([[  632,   743,   307,  ..., 17144,   661,   423]])}

your script, dumping range 0..300 (modified a bit)

254 {'text': tensor([[  290,   484,   481,  ..., 35592,  1080,    13]])}
255 {'text': tensor([[ 2393,  1635, 12960,  ...,  3762, 13986,   613]])}
256 {'text': tensor([[5752,  286,  262,  ...,  307, 1043,  319]])}
257 {'text': tensor([[1795, 3716, 5699,  ...,  262, 2560, 2685]])}
258 {'text': tensor([[2989,  257, 4731,  ..., 2151, 3586,   13]])}

Something happens at sample 256. Trying to figure out what.

@stas00 (Contributor) commented Sep 29, 2021

I figured it out. My config was set to run for just 100 iterations; it consumed 256 samples and then ran validation. Since I was dumping inside deepspeed from the iterator, it was actually dumping from the validation dataloader from sample 256 on, hence the discrepancy.

So I think it's all good. But will do some more checks.
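In hindsight, labeling the debug print with the module's train/eval mode would have exposed this right away. A hedged tweak to the earlier patch, assuming self.module.training flips to False during validation as usual:

-        print(f"{self.data_sample_id} {batch}")
+        # label each fetched batch with the current mode (assumed to flip
+        # during validation), so train and eval dumps are distinguishable
+        mode = "train" if self.module.training else "eval"
+        print(f"{self.data_sample_id} {mode} {batch}")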

@stas00 (Contributor) commented Sep 29, 2021

I added a few variations on what data we dump and how we dump it.

The newly added args are:

Get text from sample idxs.:
  --sample-id-range SAMPLE_ID_RANGE [SAMPLE_ID_RANGE ...]
                        The number of samples consumed. ex) --sample-id-range 1024 2048
  --all_tokens          Whether to dump all tokens per record
  --print_tokens        Whether to print tokens
  --print_text          Whether to print text

And an example:

DATA_PATH=data/meg-gpt2_oscar-combined_text_document
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt

python tools/sample_idxs_to_text.py \
    --seed 42 \
    --sample-id-range 5 20 \
    --print_tokens \
    --print_text \
    --all_tokens \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --seq-length 1024 \
    --micro-batch-size 1 \
    --global-batch-size 16 \
    --train-samples 100 \
    --eval-interval 100 \
    --eval-iters 100 \
    --num-layers 2 \
    --hidden-size 64 \
    --num-attention-heads 2 \
    --max-position-embeddings 1024 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE

We should think about where to document this new feature.

@stas00 stas00 self-requested a review September 29, 2021 04:25
@stas00 (Contributor) left a comment:

Great work @wade3han - thank you for figuring out the hardest part!

I think it's pretty much ready now; we just need to add a doc.

Let me know if you have any other ideas.

@stas00 (Contributor) commented Sep 29, 2021

I'm thinking we probably don't need some of the args the script currently requires to run, as in the example in the OP.

We need to check which of these don't impact the generated .npy files and give those some defaults; e.g. I don't think --num-layers makes any difference.
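For reference, the index-map cache filenames built in megatron/data/gpt_dataset.py encode exactly those values; this is paraphrased from memory, so the exact format string may differ:

# paraphrased from _build_index_mappings() in megatron/data/gpt_dataset.py;
# the cache name depends only on the number of samples, seq length and seed
_filename = data_prefix
_filename += '_{}_indexmap'.format(name)   # train/valid/test
_filename += '_{}ns'.format(num_samples)   # from --train-samples
_filename += '_{}sl'.format(seq_length)    # from --seq-length
_filename += '_{}s'.format(seed)           # from --seed
doc_idx_filename = _filename + '_doc_idx.npy'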

@wade3han (Collaborator, Author) commented:

You are right that arguments like --num-layers don't impact the result at all; however, the get_args() function requires those model-specific parameters, so I put in dummy values like --num-layers 1. I guess adding a comment about this to the script might be helpful!
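Roughly, the workaround amounts to this hypothetical stub; the values are throwaway placeholders, never used by the dump itself:

# hypothetical illustration: pad sys.argv with dummy model args so that
# megatron's get_args() validation passes; none of these affect the output
import sys
sys.argv += ['--num-layers', '1',
             '--hidden-size', '1',
             '--num-attention-heads', '1']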

@stas00 (Contributor) commented Sep 29, 2021

OK, I managed to change the code so it doesn't require the irrelevant args; we only need 3 of them to get the correct .npy files.

So now it's just:

DATA_PATH=data/meg-gpt2_oscar-combined_text_document
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
SEQ_LEN=2048

python tools/sample_idxs_to_text.py \
    --print_tokens \
    --sample-id-range 5 6 \
    --seed 42 \
    --train-samples 100 \
    --seq-length $SEQ_LEN \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE

Please let me know if the latest incarnation looks good, and then I'll be happy to merge it.

@stas00 (Contributor) commented Sep 30, 2021

Hmm, I think I was using this script incorrectly: the index here is a sample ID, but I used it to match an iteration ID.

The sample ID is very dynamic while we do batch-size ramp-up, so it can be tricky to calculate the right one.

Moreover, the current trainings use GBS=1024, so it'd be pretty much impossible to find which record led to a blow-up: one would have to review up to 10240 records if we log every 10 iterations.

Hmm. I will post on Slack and we'll see if we can find a better way to "zoom in".

But we should probably stress in the documentation that the sample ID is not an iteration ID.
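To illustrate why this is tricky: with a constant global batch size the mapping from iteration to sample-id range would be trivial, but batch-size ramp-up breaks the simple formula (a sketch, not part of the script):

# sketch: iteration -> sample-id range, valid only for a constant
# global batch size; batch-size ramp-up invalidates this simple math
def iteration_to_sample_range(iteration, global_batch_size=1024):
    start = (iteration - 1) * global_batch_size
    end = iteration * global_batch_size  # exclusive
    return start, end

print(iteration_to_sample_range(256))  # (261120, 262144)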

@wade3han (Collaborator, Author) commented:

Yes, I usually looked at consumed_samples, which is displayed in the training log.

> But we should probably stress in the documentation that the sample ID is not an iteration ID

It would be good to stress that!

@wade3han (Collaborator, Author) commented Sep 30, 2021

So IMO arguments like batch size will be relevant to the script since, as you said, the sample ID matters.

Edit: I think this opinion is wrong. Never mind!

@stas00 (Contributor) commented Sep 30, 2021

We have to deal with batch-size ramp-up, so, as you said, consumed_samples is probably what's needed - except with GBS=2048 and logging every 10 iterations, we'd have to read through 20K records!!!


Let's continue the discussion here: https://huggingface.slack.com/archives/C01NHER1JLS/p1632961981146200

we can post a summary after we sort it out.

@stas00 (Contributor) commented Sep 30, 2021

OK, we should probably add an output-file option, otherwise all that logging gets mixed into the output.

OK, done.
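The idea, roughly (the flag name matches the OP example; this implementation is assumed, not the actual code):

import sys

def get_output_stream(output_file=None):
    # write samples to a file when --output-file is given, else stdout,
    # so the dumped text doesn't interleave with megatron's logging
    return open(output_file, 'w') if output_file else sys.stdout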

@wade3han (Collaborator, Author) commented Oct 1, 2021

LGTM! Thanks @stas00

@stas00 stas00 merged commit 2f029fa into bigscience-workshop:main Oct 1, 2021