
update merge_preprocessed_data to use distributed merge #82

Merged
thomasw21 merged 9 commits into bigscience-workshop:main from adammoody:mergescript
Sep 21, 2021

Conversation

@adammoody
Contributor

@adammoody adammoody commented Aug 26, 2021

This extends the merge_preprocessed_data.py script to optionally use a parallel merge. It changes that script to assume a distributed launch if given --merge distributed.
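The option described above might be wired up roughly like this. This is a minimal sketch, not the actual script's parser; only the --merge flag and its 'distributed' value come from the PR discussion, and the 'serial' default is an assumption:

```python
import argparse

# Hypothetical sketch of a --merge flag that selects between a serial
# and a distributed merge path (flag value 'distributed' is from the PR;
# the rest is illustrative).
def build_parser():
    parser = argparse.ArgumentParser(description='Merge preprocessed data files.')
    parser.add_argument('--merge', choices=['serial', 'distributed'], default='serial',
                        help="Use 'distributed' when launched under a distributed launcher.")
    return parser

args = build_parser().parse_args(['--merge', 'distributed'])
print(args.merge)  # distributed
```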

@adammoody adammoody changed the title WIP: update merge_preprocessed_data to use parallel merge update merge_preprocessed_data to use parallel merge Sep 1, 2021
@adammoody
Contributor Author

@thomasw21 , would you please also review this one when you get a chance? This smaller PR might be an easier review/merge. Thanks.

Member

@thomasw21 thomasw21 left a comment


This looks good! Some small comments.

  • Can you document this somewhere? That this option is available and how to use it? I'd imagine you'd have to use torch.distributed.run to run this in parallel
  • Can you write a test? I haven't really followed the new test framework, but I'm guessing we can at least use a single node with multiple ranks in CI.

Comment on lines +30 to +31
group.add_argument('--local_rank', type=int, default=None,
                   help='Local rank of calling process on its node (from torch.distributed.launch).')

Contributor Author

@adammoody adammoody Sep 15, 2021


This is in there because the torch.distributed.launch environment invokes the script and specifies this option. If the script does not process --local_rank, argparse exits with an unrecognized-argument error.

Search for "--local_rank" on this page:
https://pytorch.org/docs/stable/distributed.html

That same documentation says that the launcher will not specify --local_rank if $LOCAL_RANK is set in the environment. So we could drop this option if we instruct users to set $LOCAL_RANK in our directions on how to run.

I think it's a bit easier for the user if we continue to handle the option, but either way is fine by me.
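For illustration, handling the flag with the environment-variable fallback discussed above might look like this. A minimal sketch using stdlib argparse; the helper name parse_local_rank and the default of 0 are hypothetical:

```python
import argparse
import os

# Sketch: accept --local_rank from torch.distributed.launch, but fall
# back to the LOCAL_RANK environment variable when the launcher exports
# that instead of passing the flag.
def parse_local_rank(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=None,
                        help='Local rank of calling process on its node.')
    args = parser.parse_args(argv)
    if args.local_rank is None:
        # Newer launchers set LOCAL_RANK rather than passing --local_rank.
        args.local_rank = int(os.environ.get('LOCAL_RANK', 0))
    return args.local_rank
```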

Member


Yeah, let's keep this. My bad, I didn't know torch distributed would add arguments to parse.

@@ -1,8 +1,15 @@
import os
Member


Not needed.

@adammoody
Contributor Author

adammoody commented Sep 15, 2021

This looks good! Some small comments.

  • Can you document this somewhere? That this option is available and how to use it? I'd imagine you'd have to use torch.distributed.run to run this in parallel

  • Can you write a test? I haven't really followed the new test framework, but I'm guessing we can at least use a single node with multiple ranks in CI.

Yes, I'll do both of those.

@adammoody
Contributor Author

I added a test case for both serial and distributed merges.

@adammoody
Contributor Author

@thomasw21 , I took another pass on this one to address your suggestions.

Member

@thomasw21 thomasw21 left a comment


Awesome, thanks! I have one very small nit.

Member


Suggested change:
- dset = dset.select(range(linelimit))
+ dset = dset[:linelimit]

I think dataset supports slicing.

EDIT: it's actually not the same thing, and select returns a datasets.arrow_dataset.Dataset object, so yours is better.

Contributor Author


Yes, I tried the dset = dset[:linelimit] slice approach first and discovered that it does not return a dataset but that dset.select() did.

# initialize distributed environment if distributed merge requested
if args.merge == 'distributed':
    if args.torch_backend is None:
        args.torch_backend = 'gloo'
Member


Can you add a print saying we default to 'gloo'?

Contributor Author


Yes, thanks. That's in there now.
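The defaulting plus print might look roughly like this. A sketch only: the helper name resolve_backend and the exact message wording are hypothetical, and args here is a stand-in namespace rather than the real script's object:

```python
from types import SimpleNamespace

# Sketch of the behavior discussed above: when a distributed merge is
# requested without an explicit torch backend, announce the fallback
# and default to 'gloo'.
def resolve_backend(args):
    if args.merge == 'distributed' and args.torch_backend is None:
        print("torch_backend not set, defaulting to 'gloo'")
        args.torch_backend = 'gloo'
    return args.torch_backend
```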

@adammoody adammoody changed the title update merge_preprocessed_data to use parallel merge update merge_preprocessed_data to use distributed merge Sep 21, 2021
@thomasw21 thomasw21 merged commit 0c82064 into bigscience-workshop:main Sep 21, 2021
@thomasw21
Member

Great work! Thank you!

@adammoody
Contributor Author

Thanks again for your help, @thomasw21 .

ofirpress pushed a commit to ofirpress/Megatron-DeepSpeed that referenced this pull request Sep 23, 2021
…orkshop#82)

* update merge_preprocessed_data to use parallel merge

* indexed_dataset: add docstrings to merge and gather methods

* merge_preprocessed_data: tweak interface, add documentation

* merge: improvements after testing

* tests: serial and distributed merge

* avoid setting pythonpath within script

* merge script: fix typo in usage comments

* print default backend when not set in distributed merge
SaulLu added a commit to SaulLu/Megatron-DeepSpeed that referenced this pull request Sep 24, 2021
@huu4ontocord
Contributor

Hi @adammoody, excellent work on data preprocessing! Would you like to help with a script to do multi-node text preprocessing, including filtering, perplexity sampling, and clustering? I was thinking about using your work as a base. It will be very interesting, I think. Lmk

@adammoody
Contributor Author

Hi @ontocord. Sure, I'd be happy to see how I might help. Do you have pointers on what you're looking to do?

@huu4ontocord
Contributor

huu4ontocord commented Oct 26, 2021 via email
