update merge_preprocessed_data to use distributed merge #82
thomasw21 merged 9 commits into bigscience-workshop:main
Conversation
@thomasw21 , would you please also review this one when you get a chance? This smaller PR might be an easier review/merge. Thanks.
This looks good! Some small comments.
- Can you document this somewhere? That this option is available and how to use it? I'd imagine you'd have to use torch.distributed.run to run this in parallel.
- Can you write a test? I haven't really followed the new test framework, but I'm guessing we can at least use a single node with multiple ranks in CI.
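A launch along these lines might work (a sketch only; apart from `--merge distributed`, which this PR introduces, the flag spellings such as `--torch-backend` and the rank count are assumptions for illustration):

```shell
# Hypothetical 4-rank single-node launch of the merge script
# via torch.distributed.run. Flag names other than --merge are
# assumptions, not confirmed by this thread.
python -m torch.distributed.run --nproc_per_node=4 \
    tools/merge_preprocessed_data.py \
    --merge distributed \
    --torch-backend gloo
```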
    group.add_argument('--local_rank', type=int, default=None,
                       help='Local rank of calling process on its node (from torch.distributed.launch).')
Do you not get this from distctx already? https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/data/distdata.py#L19
This is in there because the torch.distributed.launch environment invokes the script and specifies this option. If the script does not process --local_rank, then argparse kicks out with an unsupported option error.
Search for "--local_rank" on this page:
https://pytorch.org/docs/stable/distributed.html
That same documentation says that the launcher will not specify --local_rank if $LOCAL_RANK is set in the environment. So we could drop this option if we instruct users to set $LOCAL_RANK in our directions on how to run.
I think it's a bit easier for the user if we continue to handle the option, but either way is fine by me.
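The behavior discussed above can be sketched as follows (a minimal sketch; the parser setup and helper name are hypothetical, but the $LOCAL_RANK fallback matches the PyTorch documentation linked above):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Accept --local_rank so torch.distributed.launch can invoke the script
# without argparse rejecting the option as unrecognized.
parser.add_argument('--local_rank', type=int, default=None,
                    help='Local rank of calling process on its node (from torch.distributed.launch).')

def get_local_rank(args):
    """Prefer the command-line option; fall back to $LOCAL_RANK if set."""
    if args.local_rank is not None:
        return args.local_rank
    env = os.environ.get('LOCAL_RANK')
    return int(env) if env is not None else None

args = parser.parse_args(['--local_rank', '2'])
print(get_local_rank(args))  # prints 2
```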
Yeah, let's keep this. My bad, I didn't know torch distributed would add arguments to parse.
tools/merge_preprocessed_data.py (outdated)
@@ -1,8 +1,15 @@
import os
Yes, I'll do both of those.

I added a test case for both serial and distributed merges.

@thomasw21 , I took another pass on this one to address your suggestions.
- dset = dset.select(range(linelimit))
+ dset = dset[:linelimit]
I think dataset supports slicing.
EDIT: it's actually not the same thing, and select returns a datasets.arrow_dataset.Dataset object, so yours is better.
Yes, I tried the dset = dset[:linelimit] slice approach first and discovered that it does not return a Dataset, but that dset.select() does.
    # initialize distributed environment if distributed merge requested
    if args.merge == 'distributed':
        if args.torch_backend is None:
            args.torch_backend = 'gloo'
Can you add a print saying we default to 'gloo'?
Yes, thanks. That's in there now.
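The requested change might look like the following (a minimal sketch; the helper name and the exact message text are assumptions, not the PR's actual code):

```python
# Sketch: default the torch.distributed backend to 'gloo' and
# tell the user when we fall back to that default.
def resolve_backend(torch_backend):
    """Return the requested backend, defaulting to 'gloo' with a notice."""
    if torch_backend is None:
        torch_backend = 'gloo'
        print("No torch backend specified, defaulting to 'gloo'")
    return torch_backend

backend = resolve_backend(None)  # prints the default notice
```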
Great work! Thank you!

Thanks again for your help, @thomasw21 .
…orkshop#82)
* update merge_preprocessed_data to use parallel merge
* indexed_dataset: add docstrings to merge and gather methods
* merge_preprocessed_data: tweak interface, add documentation
* merge: improvements after testing
* tests: serial and distributed merge
* avoid setting pythonpath within script
* merge script: fix typo in usage comments
* print default backend when not set in distributed merge
…cience-workshop#82)" This reverts commit a354dd6.
Hi @adammoody , excellent work on data preprocessing! Would you like to help with a script to do multi-node text preprocessing, including filtering, perplexity sampling, and clustering? I was thinking about using your work as a base. It will be very interesting, I think. Lmk

Hi @ontocord . Sure, I'd be happy to see how I might help. Do you have pointers on what you're looking to do?
Hi Adam,
Check out the discussion on this effort here: https://docs.google.com/document/d/1bx7lzAIWALH2IX5PLAiRfkHr3025dC-ZYkEmq4zB2gI/edit
We have a meeting on Oct 28 at noon ET. Send me an email to Huu at ontocord.ai if you want to join. Really hope you can help. I think it will be very interesting. We will open another repo in BigScience for this effort.
Thank you!
Huu
This extends the merge_preprocessed_data.py script to optionally use a parallel merge. It changes that script to assume a distributed launch when given --merge distributed.